Institutions and consortia looking to understand the publishing trends of their authors in order to prepare for negotiating transformative and open access publishing agreements will need to answer key questions such as:
- In what journals do our authors most frequently publish their articles?
- What share of their articles is published by a given publisher?
- For what percentage of their articles are they the corresponding author (and thus responsible for payment of the associated open access publishing costs, or APCs)?
- How often do our authors elect to publish their articles immediately open access in a ‘hybrid’ or fully open access journal?
- Can we measure or estimate what our authors are currently spending on APCs “in the wild”?
- How does a publisher’s share of our articles relate to the proportion of our subscription budget currently paid to that publisher?
- Would a shift to open access with publisher X cost more, less, or the same as what we currently spend on subscriptions (and our authors spend on APCs)? (A rough arithmetic sketch follows this list.)
- What do we consider a fair price for open access publishing services for our articles?
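To make the cost-comparison questions concrete: once article counts, an assumed average APC and current spend figures are in hand, the comparison is simple arithmetic. Below is a minimal sketch in Python; all figures are hypothetical placeholders, not benchmarks or recommended values.

```python
# Back-of-the-envelope cost comparison for a hypothetical publisher X.
# Every figure below is an illustrative placeholder.
articles_per_year = 400        # corresponding-author articles with publisher X
assumed_avg_apc = 2500         # assumed average APC, in EUR
subscription_spend = 900_000   # current annual subscription spend, in EUR
estimated_wild_apcs = 150_000  # estimated APCs already paid "in the wild"

projected_oa_cost = articles_per_year * assumed_avg_apc
current_total_spend = subscription_spend + estimated_wild_apcs

print(f"Projected OA cost:   {projected_oa_cost:,} EUR")
print(f"Current total spend: {current_total_spend:,} EUR")
print(f"Difference:          {projected_oa_cost - current_total_spend:+,} EUR")
```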
Here are just some of the approaches, data sources and tools that the data analysts behind some of the most impactful transformative agreements to date have used to answer these questions. If you have conducted similar analyses and would like to share your approach, methods or results, please get in touch!
Collecting publication information for your institution, consortium or country
Today, there is no single source of bibliographic data that can serve as a one-stop shop for all the metadata required for the analyses featured here. Even publishers have limited ability to provide complete, comparable and reliable data on the articles published in their journals. While some bibliographic data sources offer APIs, others only provide exports as CSV/TSV files, which must then be parsed and formatted. The community therefore relies on a number of sources and, ideally, integrates and reconciles data from more than one.
Here are some of the primary data sources that we have used as a starting point.
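As one illustration of API-based collection, the Crossref REST API can be queried for works matching an institutional affiliation. The sketch below is a minimal starting point only; the affiliation string and date filters are assumptions, and affiliation matching in Crossref is incomplete, so results will need validation and reconciliation against other sources.

```python
import requests

# Minimal sketch: query the Crossref REST API for works whose author
# affiliation matches an institution name. The institution name and the
# date window are illustrative assumptions.
BASE = "https://api.crossref.org/works"
params = {
    "query.affiliation": "University of Example",  # hypothetical institution
    "filter": "from-pub-date:2014-01-01,until-pub-date:2018-12-31",
    "rows": 100,
}

resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()
items = resp.json()["message"]["items"]

for item in items:
    doi = item.get("DOI")
    journal = (item.get("container-title") or [""])[0]
    publisher = item.get("publisher")
    print(doi, journal, publisher)
```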
Clean your data
When using several different sources, an important part of compiling a reliable dataset is cleaning, converting and reorganizing the data into an interoperable format that works for you. This primary cleaning involves evaluating your summary data in general to make sure there are no surprises or issues that require further investigation. Often this also means digging into portions of a dataset that are missing an important data point (such as the DOI) and figuring out why this is the case. Additionally, you will have to do some manual work to normalize fields that may not be normalized in the raw data outputs of your sources, for example publisher name, affiliation of the corresponding author, and grant acknowledgement statements. Other data cleaning tasks include inspecting and assessing any outliers that surface when integrating additional datasets (e.g. why did this particular set of articles from data source A not match data from source B?).
Here are some of the approaches we have taken to cleaning metadata fields in our datasets.
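As an illustration, the following sketch (using pandas, with assumed file and column names) normalizes publisher-name variants against a small mapping and flags records that lack a DOI for follow-up. In practice such a mapping grows iteratively as new variants are discovered.

```python
import pandas as pd

# Minimal cleaning sketch; "articles.csv" and the column names are
# illustrative assumptions about a raw export from a data source.
df = pd.read_csv("articles.csv")

# Map common publisher-name variants onto a canonical form.
publisher_map = {
    "Springer": "Springer Nature",
    "Nature Publishing Group": "Springer Nature",
    "Wiley-Blackwell": "Wiley",
}
df["publisher"] = df["publisher"].str.strip().replace(publisher_map)

# Flag records without a DOI so the reason for the gap can be investigated.
df["doi"] = df["doi"].astype("string").str.strip()
missing_doi = df[df["doi"].isna() | (df["doi"] == "")]
print(f"{len(missing_doi)} of {len(df)} records lack a DOI")
missing_doi.to_csv("missing_doi_to_review.csv", index=False)
```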
Enrich your data
Having gathered an initial dataset to work with, you will need to validate and enrich it with data from other sources. Here are some enrichments that we have found are often needed, along with some sources we used to integrate our datasets.
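Open access status is one common enrichment. The sketch below queries the Unpaywall API for a DOI's OA status; the email address is a placeholder (Unpaywall requires a real one), and in practice lookups would be batched, rate-limited and cached rather than made one at a time.

```python
import requests

# Minimal enrichment sketch: look up a DOI in the Unpaywall API to add
# its open access status (gold, hybrid, green, bronze, closed) to a record.
def oa_status(doi: str, email: str = "you@example.org") -> str:
    url = f"https://api.unpaywall.org/v2/{doi}"
    resp = requests.get(url, params={"email": email}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("oa_status", "unknown")

# Example lookup for a single DOI.
print(oa_status("10.1038/nature12373"))
```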
Organizing, storing and interrogating your data
Once you have compiled a workable dataset, there are a variety of tools you can use to manage, interrogate and store the data. Here are some of our use cases and approaches.
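For instance, a cleaned dataset can be loaded into a lightweight SQLite database and interrogated with SQL. The sketch below assumes a CSV export with publisher and pub_year columns; the file, table and column names are illustrative.

```python
import sqlite3
import pandas as pd

# Minimal storage sketch: load a cleaned dataset into SQLite so it can
# be interrogated with SQL. Names below are assumptions.
df = pd.read_csv("articles_cleaned.csv")

con = sqlite3.connect("publications.db")
df.to_sql("articles", con, if_exists="replace", index=False)

# Example query: article counts per publisher and publication year.
query = """
    SELECT publisher, pub_year, COUNT(*) AS n_articles
    FROM articles
    GROUP BY publisher, pub_year
    ORDER BY n_articles DESC
"""
print(pd.read_sql_query(query, con))
con.close()
```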
Communicating data and results to stakeholders
Data formatting and visualization are essential for communicating effectively and helping stakeholders understand the data, trends and insights that can be gleaned from your analysis. Here are some approaches that we have adopted.
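As a simple example, a ranked bar chart of article output per publisher is often a useful first visual for negotiation teams. The sketch below (pandas and matplotlib, with assumed file and column names) produces one such chart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Minimal visualization sketch: a horizontal bar chart of article counts
# per publisher. File and column names are illustrative assumptions.
df = pd.read_csv("articles_cleaned.csv")

counts = df["publisher"].value_counts().head(10)
counts.plot(kind="barh", figsize=(8, 5))
plt.xlabel("Articles published")
plt.title("Top 10 publishers by article output")
plt.gca().invert_yaxis()  # largest publisher at the top
plt.tight_layout()
plt.savefig("publisher_output.png", dpi=150)
```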
The Open Access 2020 dataset compiled by Najko Jahn of the Göttingen State and University Library highlights corresponding-author country affiliations per publisher, journal and open access publishing model for 2014–2018, drawing on an in-house database of the German Competence Center for Bibliometrics. It is freely available at https://github.com/subugoe/oa2020cadata.
Read more about the work of Najko and his colleagues in their blog series, Scholarly Communication Analytics with R.
The US OA2020 Working Group organizes Community of Practice (CoP) calls on Negotiating and Implementing OA and Transformative Agreements, including on how to gather, analyze, and use publication data for negotiating transformative open access agreements. See https://oa2020.us/community-of-practice-2/
In one session, Mat Willmott, Open Access Collections Strategist at the California Digital Library, gives an overview of the data sources and methods commonly used, and Keith Webster, University Librarian at Carnegie Mellon University, offers a case study of Carnegie Mellon’s approach, using tools like Digital Science’s Dimensions to analyze campus publishing: https://keeper.mpdl.mpg.de/f/aa8e0ddcd933417e8414/
The data analysts participating in the ESAC Initiative are happy to share their insights, methods and, where possible, their data. Let us know if you have any queries or additional insight and data to share!