Reference guide

** under construction **


Collecting publication information of your institution/consortium/country

There is no single, comprehensive source of bibliographic data on scholarly journal articles, and even publishers themselves have limited ability to provide complete, comparable and reliable data. In addition, while some sources offer APIs, others provide data only as CSV/TSV files, which must then be formatted. The community therefore relies on a number of data sources, ideally integrating and reconciling data from more than one of them, to create a relevant and reliable dataset on which to base transformative agreement and open access strategies.

Here are some of the primary data sources used as a starting point by the data analysts participating in ESAC.


Additional sources to validate, integrate and enrich your data

Once you have gathered an initial dataset to work with, you will need to validate and enrich it with data from other sources. Here are some data enrichments that are often needed and the sources commonly used.
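As a rough illustration of the enrichment step, the sketch below joins an extra attribute onto an initial dataset using the DOI as the shared key. The field names (doi, title, oa_status), the sample values and the enrich helper are all hypothetical, chosen only for this example; real sources will use their own field names and may need DOI normalization before matching.

```python
# Hypothetical records: field names and values are illustrative assumptions,
# not taken from any real data source.
base_records = [
    {"doi": "10.1234/abc", "title": "Article one"},
    {"doi": "10.1234/def", "title": "Article two"},
]
enrichment_records = [
    {"doi": "10.1234/abc", "oa_status": "gold"},
]


def enrich(base, extra, key="doi"):
    """Left-join `extra` onto `base` by `key`, keeping every base record."""
    lookup = {row[key]: row for row in extra}
    enriched = []
    for row in base:
        match = lookup.get(row[key], {})
        merged = {**match, **row}  # base values win on conflicting fields
        merged["matched"] = bool(match)  # flag records with no enrichment
        enriched.append(merged)
    return enriched


enriched = enrich(base_records, enrichment_records)
for record in enriched:
    print(record)
```

Keeping every base record and flagging the ones without a match, rather than dropping them, makes it easier to investigate the gaps afterwards.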


Cleaning your data

When using several different sources, an important part of compiling a reliable dataset is cleaning, converting and reorganizing data into an interoperable format that works for you. This primary cleaning involves reviewing your summary data as a whole to make sure that there are no surprises or issues that require further investigation. Often this also means digging into portions of a dataset that are missing an important data point (such as the DOI) and figuring out why this is the case. Additionally, you will have to do some manual work to normalize fields that may not be normalized in the raw outputs of your sources, for example publisher name, affiliation of the corresponding author, and grant acknowledgement statements. Other data cleaning tasks include inspecting and assessing any outliers when integrating additional datasets (e.g. why did this particular set of articles from data source A not match data from source B?).
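The cleaning tasks described above can be sketched as follows, as a minimal example rather than a recommended workflow. It assumes each record is a dictionary with a doi field; the PUBLISHER_ALIASES mapping, the publisher names in it and the three helper functions are hypothetical and would need to be built from your own data.

```python
import re

# Hypothetical mapping of cleaned name variants to one canonical name.
PUBLISHER_ALIASES = {
    "example press ltd": "Example Press",
    "example press": "Example Press",
}


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None when the value is missing."""
    if not raw or not raw.strip():
        return None
    doi = raw.strip().lower()
    # Strip common resolver and "doi:" prefixes so variants compare equal.
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def normalize_publisher(raw):
    """Map a publisher name variant to its canonical form when known."""
    key = re.sub(r"[.,]", "", raw).strip().lower()
    key = re.sub(r"\s+", " ", key)
    return PUBLISHER_ALIASES.get(key, raw.strip())


def unmatched(source_a, source_b):
    """Return records from source A whose DOI is missing or absent in B."""
    dois_b = {normalize_doi(row.get("doi")) for row in source_b}
    dois_b.discard(None)  # a missing DOI must never count as a match
    return [
        row for row in source_a
        if normalize_doi(row.get("doi")) not in dois_b
    ]


source_a = [
    {"doi": "https://doi.org/10.1234/ABC"},
    {"doi": "10.1234/xyz"},
    {"doi": ""},
]
source_b = [{"doi": "doi:10.1234/abc"}, {"doi": None}]
leftovers = unmatched(source_a, source_b)
print(leftovers)
```

Records returned by unmatched are the ones worth inspecting by hand, since they either lack a DOI or do not appear in the second source.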

Here are a few examples of different cleaning exercises and tools that participants in ESAC have adopted.


Organizing, storing and interrogating your data

Once you have compiled a workable dataset, there are a variety of tools you can use to manage, interrogate and store the data. Here are some examples and use cases.
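One possible way to store and interrogate a compiled dataset, shown here only as a sketch, is a small SQLite database queried with SQL through Python's built-in sqlite3 module. The articles table, its columns and the sample rows are assumptions made for this example and do not reflect any participant's actual setup.

```python
import sqlite3

# Hypothetical sample rows: (doi, publisher, publication year, OA status).
rows = [
    ("10.1234/abc", "Example Press", 2021, "gold"),
    ("10.1234/def", "Example Press", 2021, "closed"),
    ("10.5678/ghi", "Sample House", 2022, "hybrid"),
]

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
conn.execute(
    """
    CREATE TABLE articles (
        doi       TEXT PRIMARY KEY,
        publisher TEXT NOT NULL,
        pub_year  INTEGER NOT NULL,
        oa_status TEXT
    )
    """
)
conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Count total and open access articles per publisher and year.
summary = conn.execute(
    """
    SELECT publisher,
           pub_year,
           COUNT(*) AS articles,
           SUM(CASE WHEN oa_status != 'closed' THEN 1 ELSE 0 END) AS open_articles
    FROM articles
    GROUP BY publisher, pub_year
    ORDER BY publisher, pub_year
    """
).fetchall()

for publisher, year, total, open_count in summary:
    print(publisher, year, total, open_count)
```

Using the DOI as the primary key means that loading the same article twice raises an error, which can help surface duplicates early.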


Communicating data and results to stakeholders

Clear data formatting and visualizations are essential for helping stakeholders understand the data, trends and insights that emerge from your analysis. Here are some approaches that have been adopted.

The data analysts participating in the ESAC Initiative are happy to share their insights, methods and, where possible, their data. Below are some resources you might find helpful. Please do not hesitate to get in touch if you have further queries.

Scholarly Communication Analytics with R
