This project uses Python, Jupyter Notebooks and the OpenAlex API to collect, clean, and examine open data on cited references for Iowa State University in the year 2021.
We would like to better understand how campus researchers use journal content. Analyzing which years our authors cite and how many papers they cite gives us a better feel for how content is being used. We can use this information as we make journal cancellation and renewal decisions. Research questions include:
- How often do researchers from a university cite a particular journal?
- What years are those cited references from, and when was the reference made?
- Does the usage justify paying for backfile access to a journal, or do our researchers tend to use more recent content?
This repository contains two main folders:
- notebooks: contains Jupyter notebooks that gather, process and analyze the input data. These notebooks contain Python code and markdown cells that describe the data processing steps and results.
  - The first notebook, 1-Pull_the_data_OpenAlex-citedreferences.ipynb, demonstrates how to use the OpenAlex API to extract publications which meet user-defined criteria and collect the cited references within.
  - The second notebook, 2-Graph_and_explore_data_OpenAlex-citedreferences.ipynb, provides a standardized set of graphs, data visualizations, and tables to explore and answer questions about cited reference patterns.
- files/ISU_2021_fullyear: contains the data on cited references for Iowa State University in the year 2021 as a case study. It was gathered in January 2023 using the first notebook and subsequently used in the data exploration of the second notebook.
  - publications.csv stores the metadata about all publications from Iowa State University in 2021
  - references.parquet stores the metadata about all references listed in the publications' bibliographies
  - pub2ref.csv stores the connections between the publications and their references
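The three files form a simple relational layout: two metadata tables and one link table. The following is a minimal sketch of how they could be joined to count citations per journal and to measure the age of cited content. It uses tiny in-memory stand-ins instead of the real files, and every column name (pub_id, ref_id, journal, pub_year, ref_year) is a hypothetical placeholder, not the actual schema.

```python
import pandas as pd

# Tiny stand-ins for publications.csv, references.parquet and pub2ref.csv.
# All column names are hypothetical; check the real files for the actual schema.
publications = pd.DataFrame({
    "pub_id": ["P1", "P2"],
    "pub_year": [2021, 2021],
})
references = pd.DataFrame({
    "ref_id": ["R1", "R2", "R3"],
    "journal": ["Nature", "Science", "Nature"],
    "ref_year": [2015, 1998, 2020],
})
pub2ref = pd.DataFrame({
    "pub_id": ["P1", "P1", "P2", "P2"],
    "ref_id": ["R1", "R2", "R1", "R3"],
})

# With the real data, the frames would instead be loaded along these lines:
#   publications = pd.read_csv("files/ISU_2021_fullyear/publications.csv")
#   references = pd.read_parquet("files/ISU_2021_fullyear/references.parquet")
#   pub2ref = pd.read_csv("files/ISU_2021_fullyear/pub2ref.csv")

# One row per citation: the link table joined to both metadata tables.
citations = (
    pub2ref
    .merge(publications, on="pub_id", how="left")
    .merge(references, on="ref_id", how="left")
)

# How often is each journal cited?
cites_per_journal = citations.groupby("journal").size().sort_values(ascending=False)

# How old was the cited content when it was cited?
citations["citation_age"] = citations["pub_year"] - citations["ref_year"]

print(cites_per_journal)
print(citations[["pub_id", "ref_id", "journal", "citation_age"]])
```

Keeping the link table separate means a reference cited by many publications is stored only once, while each citation still counts individually after the join.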
Note: You can browse through the notebooks right here on GitHub. However, the code snippets won't be executable.
The easiest way to run Jupyter notebooks is via cloud services like Binder. They provide you with a free execution environment that you can access directly in your browser - no setup needed. Just click on the Binder badge at the top of this README!
If you are familiar with the command line and Python, you can also set up a local environment on your computer to run the notebooks.
Clone this repository and change into its folder
git clone https://github.com/eschares/OpenAlex-CitedReferences.git
cd OpenAlex-CitedReferences
Create a virtual environment and activate it
python3 -m venv jupyterenv
source jupyterenv/bin/activate
Install the Python packages jupyter and jupyterlab
python3 -m pip install jupyter jupyterlab
Install the Python packages specified in requirements.txt
python3 -m pip install -r requirements.txt
Start the Jupyter server
jupyter lab
If everything went well, a new tab in your browser should pop up showing you the contents of this repo in a Jupyter environment.
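If a notebook fails on import, a quick check from inside the activated environment can show which packages are missing. This is a minimal sketch; the package names in REQUIRED are assumptions for illustration, and requirements.txt remains the authoritative list.

```python
import importlib.util

# Hypothetical package list; the authoritative list is requirements.txt.
REQUIRED = ["jupyterlab", "pandas", "pyarrow"]


def find_missing(packages):
    """Return the names of packages that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```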
Many integrated development environments also support running Jupyter notebooks out of the box or via a plugin. If you have one installed, you may want to consult its docs or marketplace.
OpenAlex is a very useful open scholarly database, but it is still evolving. As such, executing the notebooks at different points in time may yield different results, and bugs may appear which were not present in January 2023 when this data was downloaded and processed. To be as transparent as possible, we will list problems that arise when executing the notebooks here as we become aware of them:
The field host_venue of an OpenAlex work object is deprecated. We use it to extract the publisher, journal and ISSN of each reference and publication. The notebook will have to be adapted to use the new field primary_location instead.
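One possible adaptation is sketched below: read from primary_location first and fall back to host_venue for records saved before the change. This is not the code used in the notebook. The nested keys (source, display_name, host_organization_name, issn_l, publisher) reflect our reading of the OpenAlex work object and should be verified against the current OpenAlex documentation. The helper name extract_source_info and the sample records are made up for illustration.

```python
def extract_source_info(work: dict) -> dict:
    """Return journal, publisher and ISSN-L for a work, preferring primary_location."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    if source:
        return {
            "journal": source.get("display_name"),
            # Assumed key for the publisher name in the newer structure.
            "publisher": source.get("host_organization_name"),
            "issn_l": source.get("issn_l"),
        }
    # Fall back to the deprecated field, e.g. for records saved in January 2023.
    venue = work.get("host_venue") or {}
    return {
        "journal": venue.get("display_name"),
        "publisher": venue.get("publisher"),
        "issn_l": venue.get("issn_l"),
    }


# Made-up records in the newer and the older shape.
new_style = {
    "primary_location": {
        "source": {
            "display_name": "Example Journal",
            "host_organization_name": "Example Press",
            "issn_l": "0000-0000",
        }
    }
}
old_style = {
    "host_venue": {
        "display_name": "Example Journal",
        "publisher": "Example Press",
        "issn_l": "0000-0000",
    }
}

print(extract_source_info(new_style))
print(extract_source_info(old_style))
```

Both shapes map to the same output dictionary, so downstream code in the second notebook would not need to know which field a record came from.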
We noticed that some references no longer exist in OpenAlex. For example, the OpenAlex ID "W4362225795" is referenced in other works, but querying OpenAlex for that entity returns a "404 Not Found" error. We notified OpenAlex support about it (support ID #241).
We noticed a difference between the number of references expected and the number retrieved. Some requests to the OpenAlex API that fetched 50 references using the approach outlined in the OurResearch blog returned only 49 or sometimes 48 entities instead of 50. We notified OpenAlex support about it (support ID #238).
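To make such shortfalls visible, each batch of requested IDs can be compared with the IDs that actually came back. The sketch below makes no network calls and is not the notebook's code. The filter name openalex_id and the per-page parameter are assumptions about the request format, and the helper names and sample IDs are hypothetical.

```python
from urllib.parse import urlencode

BATCH_SIZE = 50  # batch size used in the requests described above


def chunked(ids, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_batch_url(ids):
    """Build one request URL that asks for all given works at once."""
    # The filter name "openalex_id" is an assumption; check the OpenAlex docs.
    params = {"filter": "openalex_id:" + "|".join(ids), "per-page": len(ids)}
    return "https://api.openalex.org/works?" + urlencode(params)


def missing_ids(requested, results):
    """Return requested IDs that are absent from the returned work records."""
    # Work IDs are returned as URLs such as https://openalex.org/W123.
    returned = {record["id"].rsplit("/", 1)[-1] for record in results}
    return [work_id for work_id in requested if work_id not in returned]


# Made-up example: three IDs requested, two records returned.
requested = ["W1", "W2", "W3"]
results = [
    {"id": "https://openalex.org/W1"},
    {"id": "https://openalex.org/W3"},
]

print(build_batch_url(requested))
print(missing_ids(requested, results))
```

Logging the missing IDs per batch gives a list that can be checked individually, which also helps to separate deleted entities from other causes of a short response.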