Institutions and consortia looking to understand the publishing trends of their authors, in order to prepare for negotiating transformative and open access publishing agreements, will need to find answers to key questions such as:
- In what journals do our authors most frequently publish their articles?
- What share of their articles is published by a given publisher?
- For what percentage of their articles are they the corresponding author (responsible for payment of the associated open access publishing costs, or APCs)?
- How often do our authors elect to publish their articles immediately open access in a ‘hybrid’ or fully open access journal?
- Can we measure or estimate what our authors are currently spending on APCs “in the wild”?
- How does that article share relate to the proportion of our subscription budget currently paid to that publisher?
- Would a shift to open access with publisher X cost more, less, or the same as what we currently spend on subscriptions (and our authors spend on APCs)?
- What do we consider a fair price for open access publishing services for our articles?
Here are just some of the approaches, data sources and tools that the data analysts behind some of the most impactful transformative agreements to date have used to answer these questions. If you have conducted similar analyses and would like to share your approach, methods or results, please get in touch!
Collecting publication information of your institution/consortium/country
Today, there is no single source of bibliographic data that could serve as a one-stop-shop for all the metadata required for the analyses featured here. Even publishers have limited ability to provide complete, comparable and reliable data on the articles published in their journals. While some bibliographic data sources offer APIs, other data can only be gathered in CSV/TSV files, which must then be formatted. The community therefore relies on a number of sources and, ideally, integrates and reconciles data from more than one.
Here are some of the primary data sources that we have used as a starting point.
National Current Research Information System (CRIS) or publication database
In some cases a national CRIS provides the most complete picture of the research produced in a given country, even though these systems do not always capture all of the metadata elements that are required to carry out analyses for preparing negotiations, such as corresponding author or normalized institution names. While the metadata quality is not always up to standard, using a CRIS can be a good opportunity for greater engagement within the local research community and collaboration on metadata standards and processes.
Web of Science
Many use the commercial data source Web of Science as their starting point because it reliably indexes a number of fields that are critical to their analysis, including corresponding authorship and grant acknowledgement statements. Also, because it is widely used internationally, the resulting datasets can easily be compared with data from other countries and institutions. As WoS is a proprietary database, getting the data can be cumbersome if you do not have a raw-data subscription or InCites API access, as data can only be extracted in batches of a maximum of 500 records at a time. It is important to keep in mind that, as with other sources, WoS does not provide full coverage; indeed, coverage is particularly limited in certain subject areas and geographic regions, creating potential blind spots in analyses. When using WoS data, further data integration is required to address non-standardized publisher names and delays in indexing.
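Batched extraction as described above can be sketched in a few lines. This is a minimal, hypothetical paging loop, not WoS-specific code: the `fetch_batch` callable is a placeholder for whatever export mechanism you have (API client, web export, etc.), and the 500-record limit comes from the description above.

```python
# Sketch: paging through an export limited to 500 records per request.
# fetch_batch(first_record, count) is a placeholder for your actual
# retrieval function; here a toy lambda stands in for it.

BATCH = 500

def fetch_all(total_records, fetch_batch):
    """Collect all records by requesting them in batches of up to 500."""
    records = []
    for start in range(1, total_records + 1, BATCH):
        count = min(BATCH, total_records - start + 1)
        records.extend(fetch_batch(start, count))
    return records

# toy stand-in: pretend each "record" is just its index
got = fetch_all(1200, lambda start, count: list(range(start, start + count)))
print(len(got))  # 1200
```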
Dimensions
Dimensions is a commercial source that brings together data on publications, datasets, grants, citations and altmetrics, clinical trials, patents and policy documents. Some of the modules are freely accessible to the public, while the complete dataset can be accessed through a subscription. Historically, corresponding authorship data has not been included in Dimensions, and while this seems to be changing, gaps may persist. Generally, some find the metadata on publication type to be oversimplified; for example, “articles” include editorial material, letters and other types of publications. Nevertheless, the interface is quite easy to use and offers a lot of functionality.
Scopus
Scopus is a relatively comprehensive commercial source of research publication data and includes most of the commonly used metadata as well as citations and a useful journal-based field and discipline classification (ASJC). As with other, similar databases such as Web of Science and Dimensions, there are gaps in coverage, and these gaps can sometimes be difficult to identify. Scopus provides corresponding author information, but at the time of writing this data can be limited to only one per publication. Some have noted that the data quality in Scopus can be erratic; some metadata fields consistently contain good-quality data, while others do not. Accessing certain metadata fields can be challenging and cumbersome, depending on your form of access (API, XML, SciVal, etc.).
Crossref
Crossref is a very good source in terms of article coverage, but as the publication data in Crossref is submitted by its members, the quality can vary significantly. A useful resource for checking the data quality of Crossref members is the Crossref Participation Reports, which provide details on the metadata submitted by publishers, such as licence URLs, open references, funding information, and so on. Crossref has a very robust API and a public data dump option that can be a valuable resource for bibliometric analyses.
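The kind of completeness check that Participation Reports perform can also be run locally on Crossref records. The sketch below inspects a hand-made sample dict shaped like the `message` object returned by the Crossref REST API (`https://api.crossref.org/works/{DOI}`); the field names follow the API, but the record values are invented for illustration.

```python
# Sketch: checking which metadata elements a publisher deposits in Crossref.
# SAMPLE_RECORD mimics the "message" object of the Crossref REST API;
# the values are invented for illustration.

SAMPLE_RECORD = {
    "DOI": "10.9999/example.123",
    "publisher": "Example Press",
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
    "funder": [{"name": "Example Science Foundation"}],
    "reference": [],  # no open references deposited in this sample
}

def metadata_completeness(record):
    """Flag the metadata elements that matter for OA analyses."""
    return {
        "has_license_url": bool(record.get("license")),
        "has_funding_info": bool(record.get("funder")),
        "has_open_references": bool(record.get("reference")),
    }

print(metadata_completeness(SAMPLE_RECORD))
```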
Some published comparisons of these bibliographic data sources:
- Martín-Martín, A., Thelwall, M., Orduna-Malea, E. et al. Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations. Scientometrics 126, 871–906 (2021). https://doi.org/10.1007/s11192-020-03690-4 and https://albertomartin.shinyapps.io/citation_overlap_2019/
- Visser, M., van Eck, N. J., & Waltman, L. (2021). Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic. Quantitative Science Studies. Advance publication. https://doi.org/10.1162/qss_a_00112
- Huang, C.-K., Neylon, C., Brookes-Kenworthy, C., Hosking, R., Montgomery, L., Wilson, K., & Ozaygen, A. (2020). Comparison of bibliographic data sources: Implications for the robustness of university rankings. Quantitative Science Studies, 1(2), 445–478. https://doi.org/10.1162/qss_a_00031
- Guerrero-Bote, V. P., Chinchilla-Rodríguez, Z., Mendoza, A., & de Moya-Anegón, F. (2021). Comparative Analysis of the Bibliographic Data Sources Dimensions and Scopus: An Approach at the Country and Institutional Levels. Front. Res. Metr. Anal. 5:593494. https://doi.org/10.3389/frma.2020.593494
Clean your data
When using several different sources, an important part of compiling a reliable dataset is cleaning, converting and reorganizing the data into an interoperable format that works for you. This primary cleaning involves evaluating your summary data in general, to make sure that there are no surprises or issues that require further investigation. Often this also means digging into portions of a dataset that are missing an important data point (such as a DOI) and figuring out why this is the case. Additionally, you will have to do some manual work to normalize fields that may not be normalized in the raw data outputs of your sources; for example, publisher name, affiliation of the corresponding author, and grant acknowledgement statements. Other data cleaning tasks include inspection and assessment of any outliers when integrating additional datasets (e.g., why did this particular set of articles from data source A not match data from source B?).
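A first-pass sanity check of this kind can be sketched very simply. The records and field names below are invented for illustration; the point is only to isolate records lacking a key field (here, DOI) for follow-up investigation.

```python
# Sketch: flagging records in a merged dataset that lack a DOI so they
# can be investigated separately. All records are invented.

records = [
    {"title": "Article A", "doi": "10.1234/a", "publisher": "elsevier bv"},
    {"title": "Article B", "doi": None, "publisher": "Elsevier"},
    {"title": "Article C", "doi": "10.1234/c", "publisher": "ELSEVIER B.V."},
]

missing_doi = [r for r in records if not r["doi"]]
print(f"{len(missing_doi)} of {len(records)} records lack a DOI")
for r in missing_doi:
    print("investigate:", r["title"])
```

Note that the three spellings of the publisher name above also hint at the normalization work discussed below.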
Here are some of the approaches we have taken to cleaning metadata fields in our datasets.
Author affiliations / Reprint author
As the affiliation fields of authors (or corresponding authors) can take many forms and are usually free text, you will need to normalize these and match them with the official names of institutions, institutes, etc. One way to do this is to build up a dictionary of all the possible name variations and the corresponding official names. Some data sources may include the email addresses of reprint authors, which can be helpful in validating the outcome of the normalization process, or when the affiliation texts are messy. You might also want to use standard institutional IDs, such as ROR or GRID. Nevertheless, be prepared to invest quite a lot of manual work in the normalization process.
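The dictionary approach can be sketched as follows; the institution name and its variants are invented, and in practice the dictionary grows through manual review of unmatched strings.

```python
# Sketch: normalizing free-text affiliation strings against a hand-built
# dictionary of known variants. Names and variants are invented.

VARIANTS = {
    "univ. of examplestad": "University of Examplestad",
    "university of examplestad": "University of Examplestad",
    "examplestad univ": "University of Examplestad",
}

def normalize_affiliation(raw):
    """Return the official institution name, or None if unrecognized."""
    key = raw.strip().lower().rstrip(".")
    return VARIANTS.get(key)

print(normalize_affiliation("Examplestad Univ"))  # University of Examplestad
print(normalize_affiliation("Some Other Place"))  # None -> manual review pile
```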
Publisher (name) disambiguation
Certain publishers can operate with different imprints, and different bibliographic sources can list these as different publishers, or group them on a higher level. Depending on the source you are using, it might be necessary to normalize the publisher fields and pair these with the highest possible entity (e.g. Pergamon Press – Elsevier, or Routledge – Taylor & Francis). Looking up the DOI in Crossref can also help in disambiguating publisher names.
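A minimal sketch of such imprint-to-publisher pairing is below. The two mappings come from the examples above; you would extend the dictionary as further imprints surface in your data, using Crossref DOI lookups to resolve doubtful cases.

```python
# Sketch: pairing imprints with their parent publisher via a
# hand-maintained mapping, passing unknown names through unchanged.

IMPRINT_TO_PUBLISHER = {
    "Pergamon Press": "Elsevier",
    "Routledge": "Taylor & Francis",
}

def normalize_publisher(name):
    """Map an imprint to its parent publisher; keep unknown names as-is."""
    return IMPRINT_TO_PUBLISHER.get(name, name)

print(normalize_publisher("Pergamon Press"))  # Elsevier
print(normalize_publisher("Springer"))        # Springer
```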
Subject / discipline classifications
Depending on the data source, bibliographic records can have scientific disciplines assigned at the journal level (as in the case of Scopus and Web of Science) or at the article level (Dimensions). If you intend to create subject-based analyses, reports or visualizations, you will need to standardize these in your own data.
Publication dates
Many bibliographic databases determine the publication date of an article based on when it was assigned to a specific journal volume and issue. This, however, can differ significantly from the first online publication date. In the case of transformative agreements, accounting is primarily done based on the acceptance date (i.e. the date on which an author is notified that their article has been accepted for publication), or even the submission date of the articles (as is often the case with fully open access journals). Since APC payments also tend to occur at the point of acceptance, it is important to take into account that these can be earlier than the publication dates derived from the bibliographic databases.
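Picking the accounting-relevant date can be sketched as a simple preference order, accepting the earliest meaningful date available. The field names below are invented; real sources expose similar but differently named date fields.

```python
# Sketch: choose the accounting-relevant date for an article, preferring
# acceptance date over online-first and issue publication dates.
# Field names and dates are invented.

from datetime import date

def accounting_date(record):
    """Return the first available of: accepted, online-first, print date."""
    for field in ("accepted", "published_online", "published_print"):
        if record.get(field):
            return record[field]
    return None

rec = {"accepted": date(2020, 11, 3), "published_print": date(2021, 2, 1)}
print(accounting_date(rec))  # 2020-11-03
```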
When conducting analyses, you should also be aware that journals can move from one publisher to another over time, which, depending on the level of your analysis (publisher-level, package-level, or journal-level), is something that needs to be factored into your consideration, as it can have a significant effect on your article output projections and cost calculations.
Enrich your data
Having gathered an initial dataset to work with, you will need to validate it and enrich it with data from other sources. Here are some data enrichments that we found are often needed, and some sources that we used to integrate our datasets.
Open Access status of articles
Most of the standard bibliographic databases provide the following OA status for articles:
- Gold (published in a fully OA journal; some bibliographic databases limit this to articles published in journals indexed in DOAJ, while Unpaywall does not)
- Hybrid (published in a journal which also contains paywalled articles)
- Bronze (freely accessible article without an OA licence; typically articles made available for a certain period of time in journals that have an open archive or through similar means, such as temporarily free COVID-19 papers)
- Green (with various types of versions: submitted, accepted, published)
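The four statuses above can be expressed as a simple classifier. This is only an illustrative sketch: the boolean flags are invented, and real sources (Unpaywall in particular) apply more nuanced logic and precedence rules.

```python
# Sketch: deriving the OA status labels described above from a few
# article-level flags. Flag names are invented for illustration.

def oa_status(in_fully_oa_journal, is_free_to_read, has_oa_license, in_repository):
    """Return one of the OA status labels described above."""
    if in_fully_oa_journal:
        return "gold"
    if is_free_to_read and has_oa_license:
        return "hybrid"
    if is_free_to_read:
        return "bronze"
    if in_repository:
        return "green"
    return "closed"

print(oa_status(False, True, False, False))  # bronze
```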
Unpaywall
Unpaywall provides the open access status(es) of articles (whether they are green or gold), links to these versions, and offers easy mapping for most datasets and analysis goals. Although many of the listed bibliographic databases contain article-level OA information (usually based on the very same Unpaywall data), Unpaywall can be used to enrich this information with additional data. Unpaywall can also be useful for regularly updating the OA status of previously downloaded data you may have obtained from various sources (including Unpaywall itself); this is important, as the OA status of articles can change over time (such as when the embargo period of a green OA article expires). As a complex dataset, it can be difficult to navigate for more nuanced use cases.
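For per-DOI lookups, Unpaywall exposes a REST endpoint of the form `https://api.unpaywall.org/v2/{doi}?email=...`. The sketch below only constructs the request URL; the email address is a placeholder you must replace with your own, and the response is a JSON object whose `oa_status` field carries the colour labels listed above.

```python
# Sketch: building an Unpaywall v2 API request URL for a DOI.
# The email address is a placeholder; Unpaywall asks you to supply yours.

from urllib.parse import quote

def unpaywall_url(doi, email="you@example.org"):
    return f"https://api.unpaywall.org/v2/{quote(doi)}?email={email}"

print(unpaywall_url("10.1234/example"))
```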
Directory of Open Access Journals (DOAJ)
The Directory of Open Access Journals is an independent source of regularly updated metadata on over 15 000 peer-reviewed open access journals, covering all areas of science. Most bibliographic databases have a flag in the metadata of the articles that are published in journals indexed in DOAJ (usually “DOAJ gold”, or similar) – however, it is still a useful source of reputable OA journals, their respective business models (for example, the majority of journals indexed in DOAJ don’t charge an APC), and various further information.
Fully OA journals
While DOAJ indexes prominent open access journals, its coverage is far from complete. The ISSN-GOLD-OA dataset from Bielefeld University compiles a list of journals from many other sources, and it is likely the most comprehensive source of OA journals available. The dataset is updated regularly; the current version is the ISSN-Matching of Gold OA Journals (ISSN-GOLD-OA) 4.0.
Cost modelling and calculations of scholarly publishing
When considering the costs involved in scholarly publishing, we consider two main expenditures: amounts paid in subscription fees and amounts paid for open access publishing, i.e. article processing charges or article publishing charges (APCs), paid “in the wild” by authors or, centrally, by the institution. [Note: publishers often charge authors additional fees for add-on services such as colour charges and figure charges. As those services are related to print versions of articles and are currently quite difficult to track, such charges have, thus far, usually not been covered in central transformative and open access agreements.]
Article processing charges (APCs)
To estimate the overall annual expenditure for open access publishing, be that in author-facing or institutionally paid article processing charges, some OA types can be grouped together (gold and hybrid), while others are not relevant for these calculations (green and bronze). A good way to double-check the gold OA status of articles is to look up the journal in the ISSN-GOLD-OA dataset, so you can determine whether the OA status from your initial dataset was inaccurate. Working with your subset of APC-eligible OA articles, you can match these with the mean APCs from OpenAPC or the list prices from Delta Think to come up with an estimated value for historical OA expenditure. You can later work with this set to create institution- or subject-specific analyses, calculate hybrid expenditure (author-facing ‘hybrid’ APCs paid “in the wild”), and combine them with your subscription costs and expenditure to calculate your total costs with certain publishers.
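The core of this estimate is a matching step that can be sketched as follows. The journals, statuses and APC amounts are invented; in practice the per-journal means would come from OpenAPC or list prices from Delta Think.

```python
# Sketch: estimating historical OA expenditure by matching gold and hybrid
# articles to per-journal mean APCs. All figures are invented.

mean_apc_eur = {"Journal of Examples": 1800, "Hybrid Letters": 2600}

articles = [
    {"journal": "Journal of Examples", "oa_status": "gold"},
    {"journal": "Hybrid Letters", "oa_status": "hybrid"},
    {"journal": "Hybrid Letters", "oa_status": "green"},  # assumed: no APC paid
]

APC_ELIGIBLE = {"gold", "hybrid"}

estimate = sum(
    mean_apc_eur.get(a["journal"], 0)
    for a in articles
    if a["oa_status"] in APC_ELIGIBLE
)
print(f"Estimated APC expenditure: EUR {estimate}")  # EUR 4400
```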
Delta Think is a proprietary cross-publisher data source which includes historical APC data, although it is not as comprehensive at a journal level as other, more specific sources.
OpenAPC releases institutional, funder or consortium-level datasets of the article processing charge (APC) amounts paid for open access publishing, under an Open Database License. As it collects individual payments for articles, the mean values of APCs can differ from list-price APCs, but it provides a good estimate of the APC levels that can then be used in various analyses and calculations.
Publisher price lists
Publisher-specific journal price lists will give you comprehensive data on the respective publisher’s corpus, but there is no consistent formatting standard across different publishers’ price lists, so each publisher’s dataset requires extra work to incorporate.
Your institution’s local acquisitions database will be a good source of information customized to your situation, but local lists of subscribed journals and their relative costs don’t always match perfectly to publisher-provided journal and price lists.
Value indicators and assessment criteria
The open access transition demands a shift in how institutions assess the value proposition of their business relationship with scholarly publishers. In the domain of subscriptions, value was assessed primarily based on journal downloads or usage/cost analyses. In the open access transition, however, assessment criteria expand to include the value that authors place on journal content, as expressed in where they choose to submit and publish their articles and in the articles that they cite.
COUNTER usage reports quantify usage by publication year so that you can filter the report to assess how much usage occurred with current content (to understand the potential impact of cancelling a subscription) and with backfile content (to which you may have already licensed permanent access). COUNTER 5 usage reports also contain information on downloads of OA articles, which you can exclude from analyses of your subscription expenditure, or which you can use to calculate trends based on historical data about the proportion of downloads that are fulfilled outside of your subscriptions.
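The current-versus-backfile split described above can be sketched as a filter on year of publication (YOP). The rows below are invented stand-ins for COUNTER 5 journal-report rows; the `CURRENT_SINCE` boundary is an assumption you would set to match your licence terms.

```python
# Sketch: splitting COUNTER-style usage rows by year of publication (YOP)
# into current-content usage and backfile usage. Rows are invented.

usage_rows = [
    {"journal": "Hybrid Letters", "yop": 2021, "total_item_requests": 120},
    {"journal": "Hybrid Letters", "yop": 2015, "total_item_requests": 300},
    {"journal": "Hybrid Letters", "yop": 2020, "total_item_requests": 80},
]

CURRENT_SINCE = 2020  # assumed boundary between "current" and backfile content

current = sum(r["total_item_requests"] for r in usage_rows if r["yop"] >= CURRENT_SINCE)
backfile = sum(r["total_item_requests"] for r in usage_rows if r["yop"] < CURRENT_SINCE)
print(current, backfile)  # 200 300
```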
ezPAARSE and publisher platform log files
ezPAARSE and publisher platform log files provide a more granular view of user activity so that you can filter out repeat downloads, anomalies, etc. in order to assess the actual usage and make more informed cost/use assessments.
Citation data
There are various commercial citation databases and freely available sources of citation data, for example Crossref, which, while still limited, has a growing coverage of references, thanks also to the efforts of I4OC. These sources can give you an indication of what your authors cite in their publications and can further enrich your overall picture when you are evaluating your agreements. As absolute values can differ from year to year, it might be reasonable to work with percentages. Combining the ratios of publications, citations, usage and costs, and observing the trends over time, can provide you with a particularly rich picture of the value journals and journal packages hold for readers and authors.
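Turning absolute values into per-publisher shares, as suggested above, is straightforward. All figures below are invented; the point is that comparing, say, a publisher’s share of your spend with its share of your publications and usage gives a simple value indicator.

```python
# Sketch: converting absolute per-publisher figures into percentage shares
# of the institution-wide totals. All figures are invented.

per_publisher = {
    "publications": 250,
    "citations": 4000,
    "downloads": 120000,
    "spend_eur": 300000,
}
totals = {
    "publications": 1000,
    "citations": 20000,
    "downloads": 400000,
    "spend_eur": 900000,
}

shares = {k: round(100 * per_publisher[k] / totals[k], 1) for k in totals}
print(shares)  # e.g. 33.3% of spend vs. 25.0% of publications
```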
Unsub
While primarily aimed at modelling subscription cancellation scenarios, Unsub can help you in the value assessment of subscriptions, and it can also provide you with calculations around OA spend, the OA ratio of your usage (including green OA), cost-effectiveness, and so on.
Organizing, storing and interrogating your data
Once you have compiled a workable dataset, there are a variety of tools you can use to manage, interrogate and store the data. Here are some of our use cases and approaches.
Analyst A: We used a variety of tools for different tasks: Python scripting in Bash/shell, Atom as an editor, Bash/shell to convert BibTeX to CSV, OpenRefine to format data from API output to CSV, Microsoft Excel to handle the data template (mandatory metadata schema), etc. We used VLOOKUP in Excel to match data points with DOI as the primary key. We also used Falcon SQL and Metabase to query and interrogate data in web-based platforms.
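A DOI-keyed lookup of this kind translates directly into a dictionary join in Python. The two toy datasets below are invented; the pattern is the scripted equivalent of a VLOOKUP with DOI as the primary key.

```python
# Sketch: joining two datasets on DOI, the scripted equivalent of a
# DOI-keyed VLOOKUP. Records are invented.

publications = [{"doi": "10.1/x", "corresponding": True}]
oa_by_doi = {"10.1/x": {"oa_status": "hybrid"}}

merged = [{**rec, **oa_by_doi.get(rec["doi"], {})} for rec in publications]
print(merged)
```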
Analyst B: For filtering and faceting, we used OpenRefine. It has a user-friendly interface and is more powerful than Excel for working with large amounts of data. Also, it is possible to connect to an API with OpenRefine – a feature we used to connect to the Unpaywall or Crossref API.
Analyst C: I use R Tidyverse for collecting, processing, cleaning, enriching and analyzing the data.
Analyst D: We use R to manipulate large data sets, for any computationally intensive scripting needed, and for connecting to other datasets through APIs. Excel is used for more basic organizational tasks, for mapping in simpler data sets, and for structuring the data into formats for sharing with other stakeholders; to interrogate the data, it is filtered to create subsets as needed, and functions are used to create outputs ranging from simple data summary tables to complex financial modeling sheets performing calculations based on given parameters.
Analyst E: We mostly use Excel, because so many of our colleagues are familiar with it. We also use a PostgreSQL database, as it is open source and easily deployable locally or on a server. PostgreSQL is also a good interface for visualizing the data.
Analyst F: We store our data in an internal storage space, even though it is not dynamic, and we are still looking for a way to organize our data so that it suits the needs of those doing the primary analysis as well as the needs of those who just want access to specific data points. We also use GitHub/GitLab as a backup and for collaborating.
Communicating data and results to stakeholders
Data formatting and visualizations are very important to effectively communicate and help stakeholders understand the data, trends and insights that can be gleaned from your analysis. Here are some approaches that we have adopted.
Analyst A: We created simple dashboards featuring a drop-down list from which individual institutions or publishers could be selected, and the formulas in Excel would update the tables accordingly. We created graphs and visualizations based on the tables in Excel, so they would also update automatically based on the users’ selections. With this approach you can combine criteria and create very granular queries based on the available raw data. For example, we built queries to show the growth in OA publication ratios over time for a specific institution in specific fields – and we can easily change the variables in the tables, if needed.
Analyst B: We created a series of tables and graphs based on aggregated numbers but are now working on report templates for common use cases and hope to develop an interactive dashboard.
Analyst C: We built outputs for stakeholders in Excel and Tableau. Excel is generally used for simple data visualizations and for building calculation tools which take various parameters as inputs, run the data through a series of functions, and output desired results (such as the financial results of a given business model). Tableau is used for generating complex, interactive data visualizations.
Analyst D: We tried to format our data into easily queryable dashboards in Metabase, but in the end we also produced a final report submitted to the Ministry of Higher Education & Science and to the management of the national library.
Analyst E: While data ownership and disclosure are important considerations, it is also important to share as much as possible to grow awareness and understanding within the local and global communities. An important part of this is creating summary tables from Excel spreadsheets with article-level data to produce a gallery of sample analyses in, for example, a slide deck, and blogging about our analyses and findings.
The Open Access 2020 dataset compiled by Najko Jahn of the Göttingen State and University Library highlights corresponding author country affiliations per publisher, journal and open access publishing model for 2014–2018, using an in-house database from the German Competence Center for Bibliometrics; it is freely available at https://github.com/subugoe/oa2020cadata.
Read more about the work of Najko and his colleagues in their blog series, Scholarly Communication Analytics with R, here.
The US OA2020 Working Group organizes Community of Practice (CoP) calls on Negotiating and Implementing OA and Transformative Agreements, including on how to gather, analyze, and use publication data for negotiating transformative open access agreements. See https://oa2020.us/community-of-practice-2/
In one session, Mat Willmott, Open Access Collections Strategist at the California Digital Library gives an overview of some of the various data sources and methods commonly used and Keith Webster, University Librarian at Carnegie Mellon, offers a case study of Carnegie Mellon’s approach using tools like Digital Science’s Dimensions to analyze their campus publishing: https://keeper.mpdl.mpg.de/f/aa8e0ddcd933417e8414/
The data analysts participating in the ESAC Initiative are happy to share their insights, methods and, where possible, their data. Let us know if you have any queries or additional insight and data to share!