
docanalysis README suggested edits to early version of "Running docanalysis" section


I've rewritten some of the README with the intent of writing for a non-academic / non-programmer audience. My comments are added throughout, usually highlighted by emoji.


running `docanalysis`
=====================

🔔🔔

It would be great to have a startup command that would work something like this...

docanalysis --start

If you would like to first create a new virtual environment (venv) [do this…]

If you would like to activate an existing venv [type this] ….

🔔🔔
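In the meantime, the standard Python venv commands are roughly as follows (a sketch, assuming Python 3 is installed; "venv" is just a placeholder folder name):

    python -m venv venv              # create a new virtual environment (one-time)
    source venv/bin/activate         # activate it on macOS/Linux
    venv\Scripts\activate            # activate it on Windows
    deactivate                       # quit the venv when you're finished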

help menu

Once docanalysis is installed, typing `docanalysis --help` (followed by enter/return) into your terminal will display the help menu (see below).

the usage message:

[As is customary](https://en.wikipedia.org/wiki/Usage_message), near the top of the help menu is the menu section title “usage:”.

On the left, the word “docanalysis” is displayed. This is the command that actually launches the program.

On the right is a list of all the argument options, or "flags" (displayed here in square brackets to indicate the syntax by which they may be used). Flags operate as sub-commands by which you will operate the program and customize its use to suit your particular purposes.

[(Note that square brackets "[]" are used here in the usage message solely to facilitate ease of reading. To actually use the argument options you will use either a single or double dash as shown in the "optional arguments:" section of the help menu)](https://en.wikipedia.org/wiki/Command_line_argument).

the argument options/flags (and explanations)

In this section of the help menu, a list of argument options (also known as "flags") is displayed along with descriptions of their purpose and/or use. Flags can be specified with either a single dash (-) or a double dash (--), and sometimes both. When building docanalysis commands, use one or the other, but not both.

Rather than listing them alphabetically, we've chosen to display them in the help menu in the order in which they would most likely be used in a command, grouping together sub-options that are similar in function. For example, besides defining the directory on your computer where you would like an export to be saved, you must also define the filetype(s) you wish to export (html, json, or csv), and it makes sense to write those together in your command.

!!⛔️⛔️ Help Menu Suggestions:
🔔Top of help should begin with "Welcome to docanalysis version x.x.x. To check for and install updates, type docanalysis --update."
🔔Use lines of dashes to visually separate different parts/categories of information in the help dialog
🔔Standardize single and double dash use. Why do some (e.g. --html HTML) not have the single-dash version? Is this a PC/macOS thing?
🔔Remember to activate (launch) the required venv every time you run docanalysis and deactivate (quit) it thereafter.
⛔️⛔️!!

Welcome to docanalysis version 0.1.1 

🔔New versions: https://pypi.org/project/docanalysis/

🔔To upgrade on Windows:  pip install --force-reinstall --no-cache-dir docanalysis
🔔To upgrade on Mac:      pip3 install --force-reinstall --no-cache-dir docanalysis

🔔For detailed setup, usage and background information, see the docanalysis README: https://github.com/petermr/docanalysis/blob/main/README.md

---------

🔔docanalysis           initializes the program and precedes the launch of all 
                        other sub-programs and customizes their operation via the 
                        argument options (also called “flags”) displayed in square brackets below.❓

usage: "docanalysis [options]"

-----------------------------------------------------------------------------------
docanalysis [options]  [-h] 🔔[-V] [--run_pygetpapers] [--make_section] [-q QUERY]
                       [-k HITS] [--project_name PROJECT_NAME] [-d DICTIONARY]
                       [-o OUTPUT] [--make_ami_dict MAKE_AMI_DICT]
⛔explain the use of sub-brackets such as these below this paragraph⛔
                       [--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]]
                       [--entities [ENTITIES [ENTITIES ...]]]
                       [--spacy_model SPACY_MODEL] [--html HTML]
                       [--synonyms SYNONYMS] [--make_json MAKE_JSON] [-l LOGLEVEL]
                       [-f LOGFILE]
-----------------------------------------------------------------------------------



options:
------------------

-h, --help              ⛔️display this help menu and usage information❓Dialog??❓ ❓and exit❓⛔️

-V, --version           display the currently installed version number of docanalysis 
                        ⛔️and it's sub-programs??⛔️




========= GETPAPERS ARGUMENT OPTIONS ========= 
⛔️⛔️ Is docanalysis the “program” and the other tools, such as “pygetpapers” sub-programs? If so, distinguishing this will make the part about building command-line queries easier to explain and understand.⛔️⛔️

--run_pygetpapers       launches pygetpapers, the sub-program within docanalysis
                        that downloads papers from europepmc.org, subject to the 
                        user’s QUERY parameters

-q <query>, --query <query>
                        replace <query> with the boolean search parameters that 
                        pygetpapers will use to download the desired articles from
                        europepmc.org. NOTE:⛔️ specified queries must begin and 
                        end with quotation marks ("").⛔️
                        Example: docanalysis --run_pygetpapers -q "terpene"
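                        Boolean operators can be combined inside the quoted
                        query; a hypothetical example (see the
                        [EuropePMC search syntax reference](https://europepmc.org/searchsyntax)
                        for the full grammar):
                        docanalysis --run_pygetpapers -q "terpene AND lavender"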



========== GETPAPERS EUPMC DOWNLOAD OPTIONS ========= 
-k <hits>, --hits <hits>    replace <hits> with the numerical value specifying the
                        maximum number of papers you wish to find and download
                        Example: docanalysis --run_pygetpapers -q "terpene" -k 10


⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️

What happened to the other options that were available in the original version of getpapers?

-n, --noexecute         reports how many results match the query, without actually downloading anything.

There are over 39 million articles, preprints and more in EUPMC; we don't want to download them all by mistake, so it's worth running a query with -n to test, and perhaps -k 200 to download a first trial set. You can download thousands, but the connection may break, and it's worth being able to develop the analysis anyway.

-a, --all                 search all papers, not just open access

--api <name>            API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml               download fulltext XMLs if available
-p, --pdf               download fulltext PDFs if available
-s, --supp              download supplementary files if available
-t, --minedterms        download text-mined terms if available
--filter <filter object>  filter by key value pair, passed straight to the crossref api only
-r, --restart             restart file downloads after failure

we need --INPUTTEXTLOC and --OUTPUTTEXTDIR

⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️

========= ANNOTATION OPTIONS =========   

-d <dictionary>, --dictionary <dictionary>
                        Replace "DICTIONARY" with the name⛔️path??⛔️ of an ⛔️ami 
                        dictionary by which to annotate sentences or
support 
                        supervised entity extraction.
🔔How do I point at dictionaries? Can I point at a directory full of them and have them all discovered automatically?🔔

--spacy_model SPACY_MODEL
                        optional. Choose between spacy or scispacy models.
                        Defaults to spacy
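                        A sketch of how a SciSpaCy run might look, assuming a
                        scispacy model is installed alongside docanalysis (see
                        the SciSpaCy question under --entities below; names are
                        illustrative):
                        docanalysis --project_name terpene_10 --make_section --spacy_model scispacy --entities CHEMICAL --output chemicals.csv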

--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]
                        provide section(s) to annotate. Choose from: ✍️ALL, ACK,
                        AFF, AUT, CON, DIS, ETH, FIG, INT, KEY, MET, RES, TAB,
                        TIL. Defaults to ALL✍️

--entities [ENTITIES [ENTITIES ...]]
                        provide entities to extract. Default(ALL), or choose from
                        SpaCy: ✍️CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW,
                        LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON,
                        PRODUCT, QUANTITY, TIME, WORK_OF_ART; SciSpaCy:
                        CHEMICAL, DISEASE✍️
                        ⛔️⛔️What about SciSpacy? I think we should include 
                        SciSpacy in the installation and provide instructions for 
                        using SpaCy, SciSpacy or both simultaneously. This would 
                        also show us whether the SciSpacy installation is
                        incompatible with or breaks the docanalysis installation⛔️⛔️

--synonyms SYNONYMS     searches the corpus/sections with synonyms from ami-dict



========= MAKE/EXPORT OPTIONS =========  
-o OUTPUT, --output OUTPUT
                        outputs csv file ⚠️csv only, or is there a list of options?
                        ⚠️ ⁉️wouldn't tsv be "safer" for chemical 
                        names, etc.?⁉️

⛔️--html HTML           saves output in html format ⁉️to given path⁉️ (can user 
                        choose path?)

--make_json MAKE_JSON   output in json format ⁉️To what end?⁉️

--make_section          makes sections ⁉️ALL? or can these be specified?⁉️

--make_ami_dict MAKE_AMI_DICT
                        provide title for ami-dict. Makes ami-dict of all
                        extracted entities
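
                        For example, several export flags can be combined in
                        one command (file and dictionary names here are
                        illustrative, not prescribed):
                        docanalysis --project_name terpene_10 --make_section --entities ORG --output org.csv --make_json org.json --make_ami_dict org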


========= EXPORT FOLDER/PATH OPTIONS =========  
--project_name <project_name> ⛔️replaced capitalization with lower case in “<>"⛔️

⁉️Suggest that we combine project_name with output_directory (-o <path>, --⛔️outdir⛔️ <path>), as was used in the original version of getpapers, to avoid confusion about naming a folder and deciding where it goes⁉️
                        Replace "PROJECT_NAME" with your choice of name for the
                        folder/directory that will be created ⁉️in your venv? is
                        the file path chosen here?⁉️ to store/contain the papers
                        you download for further docanalysis processing.
                        ⁉️(I think --project_folder would be more
                        "for Dummies" user-friendly)⁉️


========= LOG DISPLAY AND EXPORT ========= 
⛔️⛔️-l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)⛔️⛔️

-l LOGLEVEL, --loglevel LOGLEVEL
                        provide logging level. Example: --loglevel warning
                        ⛔️choose one? let's add descriptions for each level⛔️<<info,warning,debug,error,critical>>, default='info'

-f LOGFILE, --logfile LOGFILE
                        saves log to specified file in output directory as
                        well as printing to terminal
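
🔔If these map onto Python's standard logging levels (an assumption worth confirming), the descriptions would be: debug (detailed diagnostic output), info (confirmation that things are working as expected; the default), warning (something unexpected happened, but the program keeps running), error (an operation failed), critical (the program may be unable to continue). For example, to log only warnings and worse to a file as well as the terminal:

docanalysis --project_name terpene_10 --make_section -l warning -f docanalysis.log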

⁉️(-x -s -t -p and -n)  ⛔️⛔️⛔️ What happened to the other options that were 
                        available in the original version of getpapers?⛔️⛔️⛔️

--api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml             download fulltext XMLs if available
-p, --pdf             download fulltext PDFs if available
-s, --supp            download supplementary files if available
-t, --minedterms          download text-mined terms if available

Example commands

| Purpose/Category | Command     | Sub-Command/Program | Option | Sub-Option | Description |
|------------------|-------------|---------------------|--------|------------|-------------|
| Run Program      | docanalysis |                     |        |            |             |
| Run Sub-Program  |             | --run_pygetpapers   |        |            |             |

Downloading articles from [EUPMC](https://europepmc.org/)

In the example below, we build a docanalysis "command" to perform a simple task. (Note: For help building more advanced search queries, see this [EuropePMC search syntax reference](https://europepmc.org/searchsyntax).)

We begin our command with "docanalysis" (to launch the program), followed by the sub-command "--run_pygetpapers" (to invoke pygetpapers, the docanalysis sub-program that downloads papers from EUPMC), followed by the argument option "-q", which precedes our search term(s) enclosed in quotation marks ("terpene"). To specify how many papers we want to download, we use the argument option "-k" followed by the number of papers we desire (in this case, 10). Finally, we use the argument option "--project_name" followed by the name we have chosen for our project's directory/folder (in this case, "terpene_10"). (See example below.)

Example

We want to use docanalysis to run pygetpapers to search for papers containing the term "terpene" and then download 10 of them into a directory named "terpene_10".

COMMAND (Input)

docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10

Running this command will display this output in our terminal window:

LOGS (Displayed Output)

⛔️Somewhere (preferably following the log output itself), we should include a key to decipher the log output⛔️

INFO: making project/searching terpene for 10 hits into C:\Users\MY_COMPUTER\docanalysis\terpene_10
INFO: Total Hits are 13935
1it [00:00, 936.44it/s]
INFO: Saving XML files to C:\Users\MY_COMPUTER\docanalysis\terpene_10\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:30<00:00,  3.10s/it]

… and export the downloaded files into sub-folders (named by their PMC identification numbers) within the directory we've named "TERPENE_10" on our machine:

CPROJ (Downloaded output)

C:\USERS\MY_COMPUTER\DOCANALYSIS\TERPENE_10
│   eupmc_results.json
│
├───PMC8625850
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8727598
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8747377
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8771452
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8775117
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8801761
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8831285
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8839294
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8840323
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8879232
        eupmc_result.json
        fulltext.xml

Section the papers

⛔️⛔️⛔️Why and when do we want to do this??⛔️⛔️⛔️

COMMAND

docanalysis --project_name terpene_10 --make_section

LOGS

WARNING: Making sections in /content/terpene_10/PMC9095633/fulltext.xml
INFO: dict_keys: dict_keys(['abstract', 'acknowledge', 'affiliation', 'author', 'conclusion', 'discussion', 'ethics', 'fig_caption', 'front', 'introduction', 'jrnl_title', 'keyword', 'method', 'octree', 'pdfimage', 'pub_date', 'publisher', 'reference', 'results_discuss', 'search_results', 'sections', 'svg', 'table', 'title'])
WARNING: loading templates.json
INFO: wrote XML sections for /content/terpene_10/PMC9095633/fulltext.xml /content/terpene_10/PMC9095633/sections
WARNING: Making sections in /content/terpene_10/PMC9120863/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9120863/fulltext.xml /content/terpene_10/PMC9120863/sections
WARNING: Making sections in /content/terpene_10/PMC8982386/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC8982386/fulltext.xml /content/terpene_10/PMC8982386/sections
WARNING: Making sections in /content/terpene_10/PMC9069239/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9069239/fulltext.xml /content/terpene_10/PMC9069239/sections
WARNING: Making sections in /content/terpene_10/PMC9165828/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9165828/fulltext.xml /content/terpene_10/PMC9165828/sections
WARNING: Making sections in /content/terpene_10/PMC9119530/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9119530/fulltext.xml /content/terpene_10/PMC9119530/sections
WARNING: Making sections in /content/terpene_10/PMC8982077/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC8982077/fulltext.xml /content/terpene_10/PMC8982077/sections
WARNING: Making sections in /content/terpene_10/PMC9067962/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9067962/fulltext.xml /content/terpene_10/PMC9067962/sections
WARNING: Making sections in /content/terpene_10/PMC9154778/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9154778/fulltext.xml /content/terpene_10/PMC9154778/sections
WARNING: Making sections in /content/terpene_10/PMC9164016/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9164016/fulltext.xml /content/terpene_10/PMC9164016/sections

⛔️⛔️⛔️Can we <SNIP> this with an explanation? We're going to have to explain this to the user, preferably at the bottom of this log⛔️⛔️⛔️

 47% 1056/2258 [00:01<00:01, 1003.31it/s]ERROR: cannot parse /content/terpene_10/PMC9165828/sections/1_front/1_article-meta/26_custom-meta-group/0_custom-meta/1_meta-value/0_xref.xml
 67% 1516/2258 [00:01<00:00, 1047.68it/s]ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/7_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/14_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/3_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/6_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/9_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/10_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/4_xref.xml
...
⛔️⛔️⛔️We're going to have to explain log warnings, errors, etc., to the user —  preferably at the bottom of this log⛔️⛔️⛔️
100% 2258/2258 [00:02<00:00, 949.43it/s] 

CTREE of sectioned papers (Visualisation of folders, sub-folders, and files created/saved in the specified PROJECT_NAME folder.)

⛔️Is this actually shown in the log display, or is it a representation? Can we use a screenshot instead? Shouldn't we start from the PROJECT_NAME folder and stop after the first or second PMC folder?⛔️

├───PMC8625850
│   └───sections
│       ├───0_processing-meta
│       ├───1_front
│       │   ├───0_journal-meta
│       │   └───1_article-meta
│       ├───2_body
│       │   ├───0_1._introduction
│       │   ├───1_2._materials_and_methods
│       │   │   ├───1_2.1._materials
│       │   │   ├───2_2.2._bacterial_strains
│       │   │   ├───3_2.3._preparation_and_character
│       │   │   ├───4_2.4._evaluation_of_the_effect_
│       │   │   ├───5_2.5._time-kill_studies
│       │   │   ├───6_2.6._propidium_iodide_uptake-e
│       │   │   └───7_2.7._hemolysis_test_from_human
│       │   ├───2_3._results
│       │   │   ├───1_3.1._encapsulation_of_terpene_
│       │   │   ├───2_3.2._both_terpene_alcohol-load
│       │   │   ├───3_3.3._farnesol_and_geraniol-loa
│       │   │   └───4_3.4._farnesol_and_geraniol-loa
│       │   ├───3_4._discussion
│       │   ├───4_5._conclusions
│       │   └───5_6._patents
│       ├───3_back
│       │   ├───0_ack⛔️rename for clarity?⛔️
│       │   ├───1_fn-group⛔️rename for clarity?⛔️
│       │   │   └───0_fn⛔️rename for clarity?⛔️
│       │   ├───2_app-group
│       │   │   └───0_app
│       │   │       └───2_supplementary-material
│       │   │           └───0_media
│       │   └───9_ref-list
│       └───4_floats-group
│           ├───4_table-wrap⛔️rename for clarity?⛔️
│           ├───5_table-wrap⛔️rename for clarity?⛔️
│           ├───6_table-wrap⛔️rename for clarity?⛔️
│           │   └───4_table-wrap-foot⛔️rename for clarity?⛔️
│           │       └───0_fn⛔️rename for clarity?⛔️
│           ├───7_table-wrap⛔️rename for clarity?⛔️
│           └───8_table-wrap⛔️rename for clarity?⛔️
...

Search sections using a dictionary

In ami's terminology, a “dictionary” is a set of terms/phrases in XML format.

Dictionaries related to ethics and acknowledgments are available in [Ethics Dictionary](https://github.com/petermr/docanalysis/tree/main/ethics_dictionary) folder

If you'd like to create a custom dictionary, you can find the steps [here].
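
For illustration, a minimal ami dictionary has this shape (the terms here are hypothetical; a real snippet appears later on this page):

<?xml version="1.0"?>
<dictionary title="ethics">
  <entry term="informed consent"/>
  <entry term="ethics committee"/>
</dictionary>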

Example

COMMAND

docanalysis --project_name terpene_10 --output entities.csv --make_ami_dict entities.xml

LOGS

INFO: Found 7134 sentences in the section(s).
INFO: getting terms from /content/activity.xml
100% 7134/7134 [00:02<00:00, 3172.14it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: 
⛔️FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.⛔️
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/activity.csv

Extract Named Entities

The argument options --spacy_model spacy --entities invoke spaCy (a free, open-source Natural Language Processing library included in docanalysis) to [extract Named Entities](https://spacy.io/) from the corpus of text downloaded when we run pygetpapers via docanalysis (docanalysis --run_pygetpapers).

Below is the list of Named Entities supported by spaCy:

⛔️This information is duplicated in the end credits.

⛔️Suggest we use the full, spelled-out terms for entities, rather than use contractions.

⛔️Are all of these entities found, or can individual ones be selected via flag options?

| Named Entity | Description | Examples |
|--------------|-------------|----------|
| CARDINAL | Numerals that do not fall under another type | 2, Two, Fifty-two |
| DATE | Absolute or relative dates or periods | 9th May 1987, 4 AUG |
| EVENT | Named hurricanes, battles, wars, sports events, etc. | Olympic Games |
| FAC | FACILITY: Buildings, airports, highways, bridges, etc. | Logan International Airport, The Golden Gate |
| GPE | GEO-POLITICAL ENTITIES: Countries, cities, states | India, Australia, South East Asia |
| LANGUAGE | Any named language | English, Portuguese, French |
| LAW | Named documents made into laws | Roe v. Wade |
| LOC | LOCATION: Non-GPE locations, mountain ranges, bodies of water | Mount Everest, River Ganga |
| MONEY | Monetary values, including unit | million dollars, INR 4 Crore |
| NORP | Nationalities or religious or political groups | The Republican Party |
| ORDINAL | first, second, etc. | 9th, Ninth |
| ORG | Companies, agencies, institutions, etc. | Microsoft, Facebook, FBI, MIT |
| PERCENT | Percentage, including "%" | Eighty percent |
| PERSON | People, including fictional | Bill Clinton, Fred Flintstone |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) | Formula 1 |
| QUANTITY | Measurements, as of weight or distance | Several kilometers, 55kg |
| TIME | Times smaller than a day | 7:23 A.M., three-forty am, Four hours |
| WORK_OF_ART | Titles of books, songs, etc. | The Mona Lisa |
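
The nested brackets in the usage line ([ENTITIES [ENTITIES ...]]) suggest that several entity types can be listed at once, separated by spaces (an assumption to verify; the selection below is illustrative):

docanalysis --project_name terpene_10 --make_section --entities ORG GPE PERSON --output org_gpe_person.csv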

Example

INPUT

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --entities ORG --output org.csv

LOGS

INFO: Found 7134 sentences in the section(s).
INFO: Loading spacy
100% 7134/7134 [01:08<00:00, 104.16it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org.csv

Extract information from specific section(s)

You can choose to extract entities from specific sections.

Example

COMMAND

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csv

LOG

INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 106.66it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csv

Create dictionary of extracted entities

COMMAND

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csvv --make_ami_dict org

LOG

INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 96.56it/s] 
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csvv
INFO: Wrote all the entities extracted to ami dict

Snippet of the dictionary

<?xml version="1.0"?>
<dictionary title="/content/terpene_10/org.xml">
<entry count="2" term="Department of Biochemistry"/>
<entry count="2" term="Chinese Academy of Agricultural Sciences"/>
<entry count="2" term="Tianjin University"/>
<entry count="2" term="Desert Research Center"/>
<entry count="2" term="Chinese Academy of Sciences"/>
<entry count="2" term="University of Colorado Boulder"/>
<entry count="2" term="Department of Neurology"/>
<entry count="1" term="Max Planck Institute for Chemical Ecology"/>
<entry count="1" term="College of Forest Resources and Environmental Science"/>
<entry count="1" term="Michigan Technological University"/>https://github.com/petermr/docanalysis/blob/main/README.md#what-is-a-dictionary

 

All at one go!

docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10 --make_section --output entities_20220209.csv --make_ami_dict entities_20220209.xml
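
This single command chains the whole pipeline: pygetpapers downloads 10 "terpene" papers into terpene_10, --make_section splits them into sections, and --output / --make_ami_dict export the extracted results as a CSV and an ami dictionary respectively.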

credits:

developers

special thanks

  • if any

technologies

[pygetpapers](https://github.com/petermr/pygetpapers) — searches for and downloads papers from [europepmc.org ("EUPMC")](https://europepmc.org/) (.html, .xml, .pdf, and/or .json)

docanalysis ingests [CProjects](https://github.com/petermr/tigr2ess/blob/master/getpapers/TUTORIAL.md#cproject-and-ctrees) and carries out text-analysis of documents, including sectioning, NLP/text-mining, and vocabulary generation. It uses [NLTK](https://www.nltk.org/) and other Python tools for many operations, and [spaCy](https://spacy.io/) or [scispaCy](https://allenai.github.io/scispacy/) for extraction and annotation of entities. It outputs summary data and word-dictionaries.

docanalysis integrates and leverages the power of the following open-source technologies:

  • py4ami

    • spaCy

      • Here's the list of NER labels spaCy's English model provides:
        CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

    • sciSpaCy

    • NLTK

  • pygetpapers — scrapes open repositories to download papers of interest

    • EUPMC

  • pyamiimage

    • EasyOCR

    • Tesseract

    • NLTK (splits sentences)