
docanalysis Tutorial

Shweata N. Hegde edited this page Jul 9, 2022 · 16 revisions

docanalysis documentation

Tester

  • Shweata N. Hegde
  • Windows 10

Setting up (skip this step if you are running this in a Jupyter Notebook or Google Colab)

Useful; not essential.

  • Make a directory
mkdir docanalysis_tutorial
cd docanalysis_tutorial
  • Create a virtual environment
python -m venv venv
  • Activate the virtual environment (on macOS/Linux, run source venv/bin/activate instead)
venv\Scripts\activate.bat

Installing docanalysis

  • Run pip install docanalysis
  • Once installed, run docanalysis --help. The help message should appear.
(venv) C:\Users\shweata\docanalysis_tutorial>docanalysis --help
c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\_distutils_hack\__init__.py:36: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
usage: docanalysis [-h] [--run_pygetpapers] [--make_section] [-q QUERY] [-k HITS] [--project_name PROJECT_NAME]
                   [-d [DICTIONARY [DICTIONARY ...]]] [-o OUTPUT] [--make_ami_dict MAKE_AMI_DICT]
                   [--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]] [--entities [ENTITIES [ENTITIES ...]]]
                   [--spacy_model SPACY_MODEL] [--html HTML] [--synonyms SYNONYMS] [--make_json MAKE_JSON]
                   [--search_html] [--extract_abb EXTRACT_ABB] [-l LOGLEVEL] [-f LOGFILE]

Welcome to docanalysis version 0.1.9. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  --run_pygetpapers     [Command] downloads papers from EuropePMC via pygetpapers
  --make_section        [Command] makes sections; requires a fulltext.xml in CTree directories
  -q QUERY, --query QUERY
                        [pygetpapers] query string
  -k HITS, --hits HITS  [pygetpapers] number of papers to download
  --project_name PROJECT_NAME
                        CProject directory name
  -d [DICTIONARY [DICTIONARY ...]], --dictionary [DICTIONARY [DICTIONARY ...]]
                        [file name/url] existing ami dictionary to annotate sentences or support supervised entity
                        extraction
  -o OUTPUT, --output OUTPUT
                        outputs csv with sentences/terms
  --make_ami_dict MAKE_AMI_DICT
                        [Command] title for ami-dict. Makes ami-dict of all extracted entities; works only with spacy
  --search_section [SEARCH_SECTION [SEARCH_SECTION ...]]
                        [NER/dictionary search] section(s) to annotate. Choose from: ALL, ACK, AFF, AUT, CON, DIS,
                        ETH, FIG, INT, KEY, MET, RES, TAB, TIL. Defaults to ALL
  --entities [ENTITIES [ENTITIES ...]]
                        [NER] entities to extract. Default (ALL). Common entities SpaCy: GPE, LANGUAGE, ORG, PERSON
                        (for additional ones check: ); SciSpaCy: CHEMICAL, DISEASE
  --spacy_model SPACY_MODEL
                        [NER] optional. Choose between spacy or scispacy models. Defaults to spacy
  --html HTML           outputs html with sentences/terms
  --synonyms SYNONYMS   annotate the corpus/sections with synonyms from ami-dict
  --make_json MAKE_JSON
                        outputs json with sentences/terms
  --search_html         searches html documents (mainly IPCC)
  --extract_abb EXTRACT_ABB
                        [Command] title for abb-ami-dict. Extracts abbreviations and expansions; makes ami-dict of all
                        extracted entities
  -l LOGLEVEL, --loglevel LOGLEVEL
                        provide logging level. Example --log warning <<info,warning,debug,error,critical>>,
                        default='info'
  -f LOGFILE, --logfile LOGFILE
                        saves log to specified file in output directory as well as printing to terminal

As you can see, docanalysis does a lot of things. Let's test them one by one.

Running docanalysis

1. Download papers from EPMC (skip if you are analysing reports like IPCC)

You can call pygetpapers (a tool to automatically download papers) from docanalysis using:

docanalysis --run_pygetpapers -q "terpenes" -k 20 --project_name terpene_20

--run_pygetpapers tells docanalysis to use pygetpapers to download -k 20 papers on -q "terpenes" into --project_name terpene_20.

c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\_distutils_hack\__init__.py:36: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
INFO: Total Hits are 35508
20it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\docanalysis_tutorial\terpene_20\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:58<00:00,  2.91s/it]
INFO: making CProject C:\Users\shweata\docanalysis_tutorial\terpene_20 with 20 papers on terpenes
ERROR: section papers using --make_sections before search

The error message at the end indicates that docanalysis is not meant to run just pygetpapers: it expects you to section the downloaded papers next. Maybe I should make it say something more useful.
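A quick way to confirm the download worked is to count the fulltext.xml files in the CProject. A minimal sketch (count_fulltexts is my helper, not part of docanalysis; the directory name terpene_20 comes from the command above):

```python
from pathlib import Path

def count_fulltexts(cproject):
    """Each downloaded paper sits in its own PMC* subdirectory
    of the CProject with a fulltext.xml inside; count them."""
    return len(list(Path(cproject).glob("*/fulltext.xml")))

# e.g. count_fulltexts("terpene_20") should match the -k value you asked for
```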

2. Section the downloaded papers

For docanalysis to ingest papers, they need to be sectioned. We do that by running:

docanalysis --project_name terpene_20 --make_section

Notice that you only have to reference the folder name using --project_name; you don't need --run_pygetpapers again. Once run, you will have sectioned papers:

...
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9228083\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9228083\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9228083\sections
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230113\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230113\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230113\sections
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230896\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230896\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230896\sections
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9236214\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9236214\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9236214\sections

Here's how the tree looks for a paper:

+---PMC9222602
|   \---sections
|       +---0_processing-meta
|       +---1_front
|       |   +---0_journal-meta
|       |   \---1_article-meta
|       |       \---19_funding-group
|       |           \---0_award-group
|       +---2_body
|       |   +---0_1._introduction
|       |   +---1_2._materials_and_methods
|       |   |   +---1_2.1._grape_variety_and_winemak
|       |   |   +---2_2.2._characterization_of_the_w
|       |   |   +---3_2.3._determination_of_the_arom
|       |   |   \---4_2.4._statistical_analyses
|       |   +---2_3._results_and_discussion
|       |   |   +---1_3.1._characterization_of_must_
|       |   |   +---2_3.2._effects_from_the_applicat
|       |   |   +---3_3.3._effects_from_the_applicat
|       |   |   \---4_3.4._specific_effects_on_the_l
|       |   \---3_4._conclusions
|       +---3_back
|       |   +---0_fn-group
|       |   |   \---0_fn
|       |   \---6_ref-list
|       \---4_floats-group
|           +---4_table-wrap
|           |   \---4_table-wrap-foot
|           |       \---0_fn
|           \---5_table-wrap
|               \---4_table-wrap-foot
|                   \---0_fn

Now, we are ready to analyse papers in interesting ways!

3. Search papers...

3.1. For specific terms using dictionaries

A dictionary is a set of terms -- plant names, country names, organization names, drugs -- with links to Wikidata. docanalysis ships with default dictionaries that you can use for searching. They are:

  • EO_ACTIVITY
  • EO_COMPOUND
  • EO_ANALYSIS
  • EO_EXTRACTION
  • EO_PLANT
  • PLANT_GENUS
  • EO_PLANT_PART
  • EO_TARGET
  • COUNTRY
  • DISEASE
  • ORGANIZATION
  • DRUG
  • TEST_TRACE

You can use either the default dictionaries or custom ones. If you have a custom dictionary, point docanalysis to it by giving its absolute path.
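An ami dictionary is a small XML file: a dictionary root element with one entry per term (entries can also carry a wikidataID attribute). As a sketch -- treat the exact element and attribute names as an assumption to check against the ami-dict documentation -- you could generate one like this:

```python
import xml.etree.ElementTree as ET

def make_ami_dict(title, terms, path):
    """Write a minimal ami-dict: a <dictionary> root with one <entry term="..."> per term.
    (Assumed shape; real ami-dicts often add wikidataID and other attributes.)"""
    root = ET.Element("dictionary", title=title)
    for term in terms:
        ET.SubElement(root, "entry", term=term)
    ET.ElementTree(root).write(path, encoding="unicode")

# hypothetical example: a tiny plant dictionary
# make_ami_dict("my_plants", ["Ocimum basilicum", "Mentha piperita"], "my_plants.xml")
```

You would then pass the absolute path of my_plants.xml to --dictionary.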

Default dictionary
docanalysis --project_name terpene_20 --dictionary EO_PLANT --output plant.csv --make_json plant.json

You can output the results either in .csv or .json format.

This task might take anywhere from a few seconds to more than 15 minutes, depending on the number of papers in the folder.

c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\_distutils_hack\__init__.py:36: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 65.98it/s]
0it [00:00, ?it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 56/56 [00:00<00:00, 107.05it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 38/38 [00:00<00:00, 87.25it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:00<00:00, 124.78it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 202/202 [00:02<00:00, 88.29it/s]
0it [00:00, ?it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 124/124 [00:01<00:00, 103.37it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 113/113 [00:01<00:00, 97.51it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 115.82it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 147.96it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 263/263 [00:02<00:00, 125.15it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 34/34 [00:00<00:00, 70.70it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 100.15it/s]
0it [00:00, ?it/s]
INFO: Found 4260 sentences in the section(s).
INFO: getting terms from EO_PLANT
100%|█████████████████████████████████████████████████████████████████████████████| 4260/4260 [00:34<00:00, 122.46it/s]
c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\docanalysis\entity_extraction.py:452: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  df[col] = df[col].astype(str).str.replace(
INFO: wrote output to C:\Users\shweata\docanalysis_tutorial\terpene_20\plant.csv
INFO: wrote JSON output to C:\Users\shweata\docanalysis_tutorial\terpene_20\plant.json

Let's look at the results in more detail. Here's one entry from the output (in .json).

{
    "5": {
        "file_path": "C:\\Users\\shweata\\docanalysis_tutorial\\terpene_20\\PMC8886108\\sections\\1_front\\1_article-meta\\18_abstract.xml",
        "sentence": "Our group settled six formulations combining CBD and terpenes purified from  Cannabis sativa  L,  Origanum vulgare , and  Thymus mastichina .",
        "section": "ABS",
        "0": [
            [
                "Cannabis sativa",
                "Origanum vulgare",
                "Thymus mastichina"
            ]
        ],
        "0_span": [
            [
                [
                    77,
                    92
                ],
                [
                    98,
                    114
                ],
                [
                    122,
                    139
                ]
            ]
        ],
        "weight_0": 
...

docanalysis has pulled out sentences that mention terms from the dictionary -- plant species names, for example. For each sentence it also records the section it came from (ABS, the abstract) and the span (start and end character positions) of each match. You can check out the full results here.
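The span pairs index straight into the sentence string, so you can recover each matched term by slicing. A minimal sketch using a shortened copy of the entry above (terms_from_entry is my helper, not part of docanalysis):

```python
def terms_from_entry(entry):
    """Slice the sentence with each (start, end) span to recover the matched terms."""
    sentence = entry["sentence"]
    return [sentence[start:end] for start, end in entry["0_span"][0]]

entry = {
    "sentence": ("Our group settled six formulations combining CBD and terpenes purified"
                 " from  Cannabis sativa  L,  Origanum vulgare , and  Thymus mastichina ."),
    "0_span": [[[77, 92], [98, 114], [122, 139]]],
}

terms_from_entry(entry)
# → ['Cannabis sativa', 'Origanum vulgare', 'Thymus mastichina']
```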