
Tutorial: Systematic Literature Review (Train the Trainees for Librarians)

Filter and find relevant publications to support a systematic review around your research question, in a fully open and reproducible way.

SETUP

Please visit installation.md to learn how to install the needed software.

Additional requirements

  • Disk space: The downloaded data needs around 700 MB on your hard drive.

In addition, as preparation we recommend having a look at the resources list in installation.md.

TUTORIAL

General

A systematic literature review normally consists of these steps:

  1. Define research question
  2. Get data
  3. Extract relevant data
  4. Assess quality of data
  5. Analyse and combine data

Text and data mining (TDM) can normally support steps 2 and 3, but not more. TDM can decrease the human workload enormously, but it cannot do magic either. This means that the research design and the quality assessment, in particular, must still be done carefully by humans.

Our approach, using Open Source tools and Open Access literature, makes the process fully transparent and reproducible. Besides, and this is not a small point, it creates no legal issues.

Download the data

Go into the systematic-literature-review folder:

cd tutorials/systematic-literature-review/

The first step is always to get the needed data from the APIs. For this, we use getpapers, the ContentMine tool for fetching papers via different publisher APIs. In this tutorial, we will only use Open Access literature from Europe PMC. We can search within their database of 3.5 million full-text papers from the life sciences; about one million of these are Open Access. Please refer to Europe PMC-Data for details. This download will take less than 200 MB of disk space.

Find the right query terms

This is the most crucial step of the whole tutorial. If you choose the wrong query term for the download of the publications, the whole dataset is wrong. This means that all the methods applied later will not deliver the outcome you want or expect.

An important point is to find the balance between a very sensitive query, which captures a lot of publications but possibly also many false positives (e.g. virus), and a very specific query, which yields only a few hits but may produce many false negatives (e.g. origin zika virus).
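To compare candidate queries before committing to a download, you can ask getpapers for hit counts only. A minimal sketch, assuming your getpapers version supports the -n (--noexecute) flag; check getpapers --help if it differs:

getpapers -q zika -n                          # broad, sensitive query: count only
getpapers -q "zika virus transmission" -n     # narrow, specific query: count only

Comparing the two counts gives a first feeling for how much each additional term narrows the result set.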

Get publications from Europe PubMedCentral (EUPMC)

Then, we have a look at how many results we find for the query term. For further information on how to create more complex queries for the EUPMC API, read here or here.
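For example, Europe PMC supports fielded search terms such as TITLE: and PUB_YEAR:; a hedged sketch (verify the field names against the linked documentation):

getpapers -q 'TITLE:"zika" AND PUB_YEAR:2016' -n   # count papers with "zika" in the title, published in 2016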

One important thing to keep in mind: the query only retrieves the publications accessible through the EUPMC API. So this is never the full list of literature available for a query.

Check how many results are found for the query "zika":

getpapers -q zika -o zika

1465 papers were found in this case (5 June 2017).

If your query consists of more than one word, encapsulate it in quotes: "TERM1 TERM2".
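For example (the output directory name is an arbitrary choice):

getpapers -q "zika virus" -o zika-virus

The quotes ensure the two words are passed to the API as one query instead of being split by the shell.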

Then, we download the fulltext.xml files for each publication. For this, add the -x flag to the query. The results can then be viewed again with the tree command.

getpapers -q zika -o zika -x
tree zika

An important step after downloading the publications is to have a look at some of the papers and check whether they fit the requirements. If not, adapt your query and try again, as often as needed, until the outcome fits your needs.
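One quick way to skim individual papers from the command line (the <PMCID> below is a placeholder; pick a real directory name from the ls output):

ls zika | head -5                # list the first few PMCID directories
less zika/<PMCID>/fulltext.xml   # replace <PMCID> with one of the listed IDs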

Normalize the data

Before we can start with the extraction and analysis, we have to normalize the raw data and convert it to Scholarly HTML, which makes it easier to process further on. For this, we use norma to convert the fulltext.xml files to scholarly.html files, and then view the results.

norma --project zika -i fulltext.xml -o scholarly.html --transform nlm2html
tree zika

Normalizing is a very important step which rarely gets attention. The input data is normally heterogeneous, in structure as well as in content, and needs to be transformed into one central standard. This standard then allows other tools to connect to the data and acts as a central interface to data coming from diverse sources.
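A simple sanity check is to compare the number of downloaded papers with the number that were successfully converted:

ls zika/*/fulltext.xml | wc -l     # papers downloaded
ls zika/*/scholarly.html | wc -l   # papers successfully normalized

If the counts differ, some papers failed to convert and may need a closer look.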

Extract the needed facts

The prepared data can now be used to extract the facts via ami's different plugins. Besides the metadata of the publications, this will be the main data source for the further analysis later on.

This is where the actual text and data mining takes place. Everything before was just preparation, which normally takes most of the time. Here, all kinds of entities can be extracted, from species to drugs to genes. Via regexes and dictionaries, you can also search for your own terms.
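As a hedged sketch of what another extraction can look like (the exact flags vary per plugin and version, so verify with each command's --help), the gene plugin follows the same pattern as the species plugin used below:

ami2-gene --project zika -i scholarly.html --g.gene --g.type human   # extract human gene names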

Extract Species

First, we use the species plugin to extract genus names and binomial nomenclature.

ami2-species --project zika -i scholarly.html --sp.species --sp.type genus
tree zika
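The extracted facts are written as results.xml files inside each paper's directory. Assuming the usual ContentMine layout (e.g. zika/<PMCID>/results/species/genus/results.xml), you can locate and inspect them like this:

find zika -name results.xml | head             # locate the result files
cat $(find zika -name results.xml | head -1)   # look at one of them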

Analyse the data with Jupyter Notebook

The analysis of the extracted data is done with Python in a Jupyter Notebook. Several methods are applied: some are descriptive and show the wanted outcome directly, while others are explorative, and conclusions must be drawn by a domain expert exploring the data and its presentation her-/himself. The analysis steps are documented in the notebook itself.

Get the Jupyter Notebook: tutorial-systematic-literature-review.ipynb.

Use the Jupyter Notebook:

If you are not already in it, go to the tutorials/systematic-literature-review/ folder and start jupyter via:

cd tutorials/systematic-literature-review/
jupyter notebook

This should make your browser open a new tab showing the current directory. Click on the tutorial-systematic-literature-review.ipynb file to open the notebook. You can then execute it cell by cell and adapt it to your needs. A more detailed description of the functionality and the analysis can be found in the Jupyter notebook itself.
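If you prefer to run the whole analysis non-interactively, for example as a reproducibility check, the notebook can also be executed from the command line (a sketch using Jupyter's nbconvert tool; the output file name is arbitrary):

jupyter nbconvert --to notebook --execute tutorial-systematic-literature-review.ipynb --output executed-review.ipynb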

Look at OpenKnowledgeMaps

A good way to get a basic understanding of the topics within a certain research field and of the relations between publications and topics is our partner project OpenKnowledgeMaps.

Let's have a look at the knowledge map for the term zika.

FOLLOW UPS

Systematic Review

RESOURCES

All materials in this repository were produced within FutureTDM - The Future of Text and Data Mining, an EU Horizon 2020 research project with participation of Open Knowledge International and ContentMine.


All content and data are licensed under the Creative Commons Attribution 4.0 International License. All code is under the MIT license.
