Filter and find relevant publications to support a systematic review of your research question, in a fully open and reproducible way.
Please visit installation.md to learn how to install the needed software.
Additional requirements
- Disk space: The downloaded data needs around 700 MB on your hard drive.
In addition, for preparation we recommend having a look at the resources list in installation.md.
A systematic literature review normally consists of these steps:
- Define research question
- Get data
- Extract relevant data
- Assess quality of data
- Analyse and combine data
Text and data mining (TDM) can normally support steps 2 and 3 (getting the data and extracting the relevant data), but not more. TDM can greatly reduce the human workload, but it cannot do magic either. This means that the research design and the quality assessment in particular must still be done carefully by humans.
Our approach with Open Source and Open Access makes the process fully transparent and reproducible. Just as importantly, it creates no legal issues.
Go into the systematic literature review folder
cd tutorials/systematic-literature-review/
The first step is always to get the needed data from the APIs. For this, we use getpapers, the ContentMine tool for getting papers via different publisher APIs. In this tutorial, we will only use open access literature from Europe PMC. We can search within their database of 3.5 million full-text papers from the life sciences. About one million of these are Open Access. Please refer to Europe PMC-Data for details. This will take less than 200 MB of disk space.
Find the right query terms
This is the most crucial step of the whole tutorial. If you choose the wrong query term for downloading the publications, the whole dataset will be wrong, and none of the methods applied later will deliver the outcome you want or expect.
It is important to find the balance between a very sensitive query, which returns many publications but possibly also many false positives (e.g. virus), and a very specific query, which returns only a few hits but may produce many false negatives (e.g. origin zika virus).
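One way to make this trade-off concrete is to screen a small sample by hand and compute precision and recall for each candidate query. The sketch below is only an illustration: the function name and all counts are hypothetical and not part of the tutorial's tooling.

```python
def screening_metrics(relevant_retrieved, irrelevant_retrieved, relevant_missed):
    """Return (precision, recall) for a query judged against a hand-screened sample."""
    retrieved = relevant_retrieved + irrelevant_retrieved
    relevant = relevant_retrieved + relevant_missed
    # Precision: share of retrieved papers that are relevant (few false positives).
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    # Recall: share of relevant papers that were retrieved (few false negatives).
    recall = relevant_retrieved / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
broad = screening_metrics(relevant_retrieved=90, irrelevant_retrieved=910, relevant_missed=10)
narrow = screening_metrics(relevant_retrieved=40, irrelevant_retrieved=10, relevant_missed=60)
print("broad query:  precision=%.2f recall=%.2f" % broad)
print("narrow query: precision=%.2f recall=%.2f" % narrow)
```

A very sensitive query corresponds to high recall with low precision, a very specific query to the opposite.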
Get publications from Europe PubMedCentral (EUPMC)
Then, we have a look at how many results we find for the query term. For further information on how to create more complex queries for the EUPMC API, read here or here.
One important thing to keep in mind: the query only gets the publications accessible through the EUPMC API. So this is never the full list of literature available for a query.
Check how many results are found for the query "zika":
getpapers -q zika -o zika
1465 papers were found in this case (5. 6. 2017).
If your query consists of more than one word, enclose it in quotes: "TERM1 TERM2".
Then, we download the fulltext.xml files for each publication. For this, add the -x flag to the query. The results can then be viewed again with the tree command.
getpapers -q zika -o zika -x
tree zika
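If you prefer to script the download from Python, the call can be assembled as an argument list. This is a minimal sketch that only uses the flags shown above (-q, -o, -x); the helper name `getpapers_command` is hypothetical, and the actual run is left commented out because it requires getpapers to be installed.

```python
import shlex


def getpapers_command(query, output_dir, fulltext_xml=False):
    """Build the argument list for a getpapers call (flags as used in this tutorial)."""
    command = ["getpapers", "-q", query, "-o", output_dir]
    if fulltext_xml:
        command.append("-x")
    return command


command = getpapers_command("zika virus", "zika", fulltext_xml=True)
# shlex.join shows the equivalent shell line, with the multi-word query quoted.
print(shlex.join(command))

# To actually run it (requires getpapers to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the query as a single list element means a multi-word query needs no extra quoting on the Python side.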
An important step after downloading the publications is to look at some of the papers and check whether they fit your requirements. If not, adapt your query and try again until the outcome fits your needs.
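To pick papers for such a spot check without bias, you could draw a reproducible random sample. The sketch below assumes that the output folder contains one subfolder per paper with a fulltext.xml inside; the function name, sample size and seed are hypothetical.

```python
import random
from pathlib import Path


def sample_papers(project_dir, sample_size=5, seed=0):
    """Pick a reproducible random sample of paper folders that contain fulltext.xml."""
    # ASSUMPTION: layout is <project_dir>/<paper id>/fulltext.xml
    papers = sorted(path.parent for path in Path(project_dir).glob("*/fulltext.xml"))
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    return rng.sample(papers, min(sample_size, len(papers)))


if __name__ == "__main__":
    for paper in sample_papers("zika"):
        print(paper / "fulltext.xml")
```

Recording the seed alongside your query keeps the manual check itself reproducible.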
Before we can start with the extraction and analysis, we have to normalize the raw data and convert it to Scholarly HTML, so that it is easier to process later on. For this, we use norma to convert the fulltext.xml files to scholarly.html files, and then view the results.
norma --project zika -i fulltext.xml -o scholarly.html --transform nlm2html
tree zika
Normalizing is a very important step that rarely gets attention. The input data is normally heterogeneous, both in structure and in content, and needs to be transformed into one central standard. This standard then acts as a common interface to data coming from diverse sources, so that other tools can connect to it.
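After the conversion, it can be worth checking that every downloaded paper actually has a scholarly.html next to its fulltext.xml. The following sketch assumes one subfolder per paper inside the project folder; the function name is hypothetical.

```python
from pathlib import Path


def missing_outputs(project_dir, source="fulltext.xml", target="scholarly.html"):
    """List paper folders that have the source file but lack the converted file."""
    missing = []
    # ASSUMPTION: layout is <project_dir>/<paper id>/<source or target file>
    for source_file in sorted(Path(project_dir).glob("*/" + source)):
        if not (source_file.parent / target).exists():
            missing.append(source_file.parent.name)
    return missing


if __name__ == "__main__":
    for paper_id in missing_outputs("zika"):
        print("not converted:", paper_id)
```

An empty result means all papers were converted; anything listed would be silently absent from the later extraction step.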
The prepared data can now be used to extract facts via ami's different plugins. Besides the metadata of the publications, these facts will be the main data source for the later analysis.
This is where the actual text and data mining takes place. Everything before was just preparation, which normally takes most of the time. Here, all kinds of entities can be extracted, from species to drugs to genus. You can also search for your own terms via regular expressions and dictionaries.
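To illustrate the general idea of a dictionary search, here is a minimal Python sketch that counts whole-word, case-insensitive matches for a list of terms. It is not how ami implements its plugins; the function name, the term list and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count case-insensitive whole-word matches for each dictionary term."""
    counts = Counter()
    for term in terms:
        # re.escape keeps terms with special characters from being read as regex syntax.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


# Made-up sample text and dictionary, for illustration only.
sample = "Zika virus is transmitted by Aedes aegypti. Aedes albopictus may also carry Zika."
print(count_terms(sample, ["zika", "aedes", "dengue"]))
```

The word boundaries (`\b`) avoid counting a term that only appears inside a longer word.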
First, we use the species plugin to get the genus and binomial nomenclature.
ami2-species --project zika -i scholarly.html --sp.species --sp.type genus
tree zika
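Once the plugin has run, the extracted terms can be tallied across all papers. The sketch below rests on two assumptions that you should verify against your own `tree zika` output: that each paper folder contains results/species/genus/results.xml, and that each `<result>` element stores the matched text in an `exact` attribute. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def tally_matches(project_dir, plugin="species", result_type="genus"):
    """Count extracted terms across all papers in a project folder.

    ASSUMPTION: results live at <paper>/results/<plugin>/<type>/results.xml and
    each <result> element carries the matched text in an "exact" attribute.
    """
    counts = Counter()
    pattern = "*/results/%s/%s/results.xml" % (plugin, result_type)
    for results_file in Path(project_dir).glob(pattern):
        for result in ET.parse(results_file).getroot().iter("result"):
            term = result.get("exact")
            if term:
                counts[term] += 1
    return counts


if __name__ == "__main__":
    for term, count in tally_matches("zika").most_common(10):
        print(count, term)
```

If your results files use a different layout or attribute name, adjust the glob pattern and the `result.get(...)` call accordingly.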
The analysis of the extracted data is done with Python in a Jupyter Notebook. Several methods are applied. Some of them are descriptive and directly show the desired outcome, but others are exploratory, and the conclusions must be drawn by a domain expert who explores the data and its presentation. The following analysis is done:
Get the Jupyter Notebook: tutorial-systematic-literature-review.ipynb.
Use the Jupyter Notebook:
If you are not already in it, go to the tutorials/systematic-literature-review/ folder and start Jupyter via:
cd tutorials/systematic-literature-review/
jupyter notebook
This should make your browser open a new tab showing the current directory. Click on the tutorial-systematic-literature-review.ipynb file to open the Jupyter notebook. Then you can execute it cell by cell and adapt the notebook to your needs. A more detailed description of the functionality and the analysis can be found in the Jupyter notebook itself.
A good way to get a basic understanding of the topics within a research field, and of the relations between publications and topics, is offered by our friends at OpenKnowledgeMaps.
Let's have a look at the knowledge map for the term zika.
- Do another tutorial from the FutureTDM project
- Learn more about the tools used with our software tutorials
- Contribute to this repository
- Send us your results on Discourse
- Share the tutorial with others in your department or social network.
- Send us your questions on Discourse, via email (contact@contentmine.org) or on Twitter (@TheContentMine)
Systematic Review
- Cochrane Handbook for Systematic Reviews of Interventions
- How to do a systematic literature review and meta-analysis
- Five steps to conducting a systematic review
- Writing a systematic literature review: Resources for students and trainees
All materials in this repository were developed within FutureTDM - The Future of Text and Data Mining, an EU Horizon2020 research project with participation of Open Knowledge International and ContentMine.
All content and data are licensed under the Creative Commons Attribution 4.0 International License. All code is under the MIT license.