docanalysis README suggested edits to early version of "Running docanalysis" section
I've re-written some of the README with the intent of writing to a non-academic / non-programmer audience. My comments are added throughout, usually highlighted by emoji.
🔔🔔
It would be great to have a startup command that would work something like this:
docanalysis --start
If you would like to first create a new virtual environment (venv) [do this…]
If you would like to activate an existing venv [type this] ….
🔔🔔
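A minimal sketch of that workflow as it stands today, assuming a Unix-like shell (the venv name is a placeholder; on Windows the activate script lives at docanalysis_venv\Scripts\activate instead):

python3 -m venv docanalysis_venv   # create a new virtual environment
source docanalysis_venv/bin/activate   # activate it
pip install docanalysis   # install docanalysis from PyPI into the venv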
Once docanalysis is installed, typing `docanalysis --help` (followed by enter/return) into your terminal will display the help menu (see below).
[As is customary](https://en.wikipedia.org/wiki/Usage_message), near the top of the help menu is the menu section title “usage:”.
On the left, the word "docanalysis" is displayed. This is the command that actually launches the program.
On the right is a list of all the argument options, or "flags" (displayed here in square brackets "[ ]" to indicate the syntax by which they may be used). Flags operate as sub-commands by which you operate the program and customize its use to suit your particular purposes.
[(Note that square brackets "[ ]" are used here in the usage message solely to facilitate ease of reading. To actually use the argument options you will type either a single or double dash as shown in the "optional arguments:" section of the help menu.)](https://en.wikipedia.org/wiki/Command_line_argument)
In this section of the help menu, a list of argument options (also known as "flags") is displayed along with descriptions of their purpose and/or use. Flags can be specified with either a single dash (-) or a double dash (--), and sometimes both. When building docanalysis commands, use one or the other, but not both.
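For example, the version flag listed below accepts both forms, so these two commands are equivalent:

docanalysis -V
docanalysis --version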
Rather than listing them alphabetically, in our help menu we've chosen to display the flags in the relative syntax order in which they would most likely be used, grouped together with any sub-options that are similar in function. For example, besides defining the directory on your computer where you would like an export to be saved, you must also define the filetype(s) you wish to export (HTML, JSON, or CSV), and it makes sense to write those together in your command.
!!⛔️⛔️ Help Menu Suggestions:
🔔Top of help should begin with "Welcome to docanalysis version x.x.x. To check for and install updates, type docanalysis --update"
🔔Use lines of dashes to visually separate different parts/categories of information in the help dialog
🔔Standardize single- and double-dash use. Why do some (e.g., --html HTML) not have a single-dash version? Is this a PC/macOS thing?
🔔Remember to activate (launch) the required venv every time you run docanalysis and deactivate (quit) it afterward.
⛔️⛔️!!
Welcome to docanalysis version 0.1.1
🔔New versions: https://pypi.org/project/docanalysis/
🔔To upgrade on Windows: pip install --force-reinstall --no-cache-dir docanalysis
🔔To upgrade on Mac: pip3 install --force-reinstall --no-cache-dir docanalysis
🔔For detailed setup, usage and background information, see the [docanalysis README](https://github.com/petermr/docanalysis/blob/main/README.md)
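Until something like --update exists (it is only a suggestion above, not a current flag), the installed version can be checked with the documented version flag or with standard pip:

docanalysis --version
pip show docanalysis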
---------
🔔docanalysis initializes the program and precedes the launch of all
other sub-programs and customizes their operation via the
argument options (also called “flags”) displayed in square brackets below.❓
usage: "docanalysis [options]"
-----------------------------------------------------------------------------------
docanalysis [options] [-h] 🔔[-V] [--run_pygetpapers] [--make_section] [-q QUERY]
[-k HITS] [--project_name PROJECT_NAME] [-d DICTIONARY]
[-o OUTPUT] [--make_ami_dict MAKE_AMI_DICT]
⛔explain the use of sub-brackets such as these below this paragraph (see the note after this usage block)⛔
[--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]]
[--entities [ENTITIES [ENTITIES ...]]]
[--spacy_model SPACY_MODEL] [--html HTML]
[--synonyms SYNONYMS] [--make_json MAKE_JSON] [-l LOGLEVEL]
[-f LOGFILE]
-----------------------------------------------------------------------------------
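(On the nested brackets queried above: this is standard argparse usage notation. [--entities [ENTITIES [ENTITIES ...]]] means the flag itself is optional and, when given, accepts zero or more values. A sketch, assuming the values are space-separated as in standard argparse; the filename is a placeholder:)

docanalysis --project_name terpene_10 --make_section --entities ORG GPE --output org_gpe.csv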
options:
------------------
-h, --help ⛔️display this help menu and usage information❓Dialog??❓ ❓and exit❓⛔️
-V, --version display the currently installed version number of docanalysis
⛔️and its sub-programs??⛔️
========= GETPAPERS ARGUMENT OPTIONS =========
⛔️⛔️ Is docanalysis the “program” and the other tools, such as “pygetpapers” sub-programs? If so, distinguishing this will make the part about building command-line queries easier to explain and understand.⛔️⛔️
--run_pygetpapers launches pygetpapers, the sub-program within docanalysis
that downloads papers from europepmc.org, subject to the
user’s QUERY parameters
-q <query>, --query <query>
replace <query> with the boolean search parameters that
pygetpapers will use to download the desired articles from
europepmc.org. NOTE:⛔️ specified queries must begin and
end with quotation marks ("").⛔️
Example: docanalysis --run_pygetpapers -q "terpene"
========== GETPAPERS EUPMC DOWNLOAD OPTIONS =========
-k <hits>, --hits <hits> replace <hits> with the numerical value specifying the
maximum number of papers you wish to find and download
Example: docanalysis --run_pygetpapers -q "terpene" -k 10
⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️
What happened to the other options that were available in the original version of getpapers?
-n, --noexecute reports how many results match the query, without actually downloading anything.
There are over 39 million articles, preprints and more in EUPMC; we don't want to download them all by mistake, so it's worth running a query with -n to test, and perhaps -k 200 to download a first trial set. You can download thousands, but the connection may break, and it's worth being able to develop the analysis anyway.
-a, --all search all papers, not just open access
--api <name> API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml download fulltext XMLs if available
-p, --pdf download fulltext PDFs if available
-s, --supp download supplementary files if available
-t, --minedterms download text-mined terms if available
--filter <filter object> filter by key value pair, passed straight to the crossref api only
-r, --restart restart file downloads after failure
we need --INPUTTEXTLOC and --OUTPUTTEXTDIR
⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️
========= ANNOTATION OPTIONS =========
-d <dictionary>, --dictionary <dictionary>
Replace "DICTIONARY" with the name⛔️path??⛔️ of an ⛔️ami
dictionary by which to annotate sentences or
support
supervised entity extraction.
🔔How do I point at dictionaries? Can I point at a directory full of them and have them all discovered automatically? (See the sketch at the end of this section.)🔔
--spacy_model SPACY_MODEL
optional. Choose between spacy or scispacy models.
Defaults to spacy
--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]
provide section(s) to annotate. Choose from: ✍️ALL, ACK,
AFF, AUT, CON, DIS, ETH, FIG, INT, KEY, MET, RES, TAB,
TIL. Defaults to ALL✍️
--entities [ENTITIES [ENTITIES ...]]
provide entities to extract. Default(ALL), or choose from
SpaCy: ✍️CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW,
LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON,
PRODUCT, QUANTITY, TIME, WORK_OF_ART; SciSpaCy:
CHEMICAL, DISEASE✍️
⛔️⛔️What about SciSpacy? I think we should include SciSpacy in the installation and provide instructions for using SpaCy, SciSpacy, or both simultaneously. This would also show us whether the SciSpacy installation is incompatible with, or breaks, the docanalysis installation⛔️⛔️
--synonyms SYNONYMS searches the corpus/sections with synonyms from ami-dict
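A sketch pulling the annotation options above into one command (the dictionary path and output filename are placeholders, and passing a dictionary by file path rather than by name is the open question flagged above):

docanalysis --project_name terpene_10 --make_section -d /path/to/my_dictionary.xml --search_section MET RES --entities ORG --output org_methods.csv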
========= MAKE/EXPORT OPTIONS =========
-o OUTPUT, --output OUTPUT
outputs csv file ⚠️csv only, or is there a list of options?⚠️ ⁉️wouldn't tsv be "safer" for chemical names, etc.?⁉️
⛔️--html HTML saves output in html format ⁉️to given path⁉️ (can user choose path?)⛔️
--make_json MAKE_JSON output in json format ⁉️To what end?⁉️
--make_section makes sections ⁉️ALL? or can these be specified?⁉️
--make_ami_dict MAKE_AMI_DICT
provide title for ami-dict. Makes ami-dict of all
extracted entities
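A sketch combining the make/export options above (the filenames and dictionary title are placeholders; whether --html and --make_json take a full path is one of the open questions flagged above):

docanalysis --project_name terpene_10 --make_section --entities ORG --output org.csv --make_json org.json --make_ami_dict org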
========= EXPORT FOLDER/PATH OPTIONS =========
--project_name <project_name> ⛔️replaced capitalization with lower case in "<>"⛔️
⁉️Suggest that we combine project_name with output_directory (-o <path>, --⛔️outdir⛔️ <path>), as was used in the original version of getpapers, to avoid confusion about naming a folder and deciding where it goes⁉️
Replace <project_name> with your choice of name for the folder/directory that will be created ⁉️in your venv? is the file path chosen here?⁉️ to store/contain the papers you download for further docanalysis processing. ⁉️(I think --project_folder would be more "for Dummies" user-friendly)⁉️
========= LOG DISPLAY AND EXPORT =========
⛔️⛔️-l, --loglevel <level> amount of information to log (silent, verbose, info*, data, warn, error, or debug)⛔️⛔️
-l LOGLEVEL, --loglevel LOGLEVEL
provide logging level. Example: --loglevel warning
⛔️choose one? let's add descriptions for each level⛔️ <<info, warning, debug, error, critical>>, default='info'
-f LOGFILE, --logfile LOGFILE
saves log to specified file in output directory as
well as printing to terminal
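A sketch using the two logging options together (the log filename is a placeholder; debug is one of the levels listed above):

docanalysis --project_name terpene_10 --make_section --output entities.csv -l debug -f docanalysis.log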
⁉️(-x -s -t -p and -n)⁉️ ⛔️⛔️⛔️What happened to the other options that were available in the original version of getpapers?⛔️⛔️⛔️
--api <name> API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml download fulltext XMLs if available
-p, --pdf download fulltext PDFs if available
-s, --supp download supplementary files if available
-t, --minedterms download text-mined terms if available
| Purpose/Category | Command | Sub-Command/Program | Option | Sub-Option | Description |
|---|---|---|---|---|---|
| Run Program | docanalysis | | | | |
| Run Sub-Program | | --run_pygetpapers | | | |
Downloading articles from [EUPMC](https://europepmc.org/)
In the example below, we build a docanalysis "command" to perform a simple task. (Note: For help building more advanced search queries, see this [EuropePMC Search syntax reference](https://europepmc.org/searchsyntax).)
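For instance, a sketch of a narrower query, assuming EuropePMC's Boolean syntax passes through pygetpapers unchanged (the project name is a placeholder):

docanalysis --run_pygetpapers -q "terpene AND antimicrobial" -k 10 --project_name terpene_amr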
We begin our command with "docanalysis" (to launch our program); followed by the sub-command "--run_pygetpapers" (to invoke pygetpapers, the docanalysis sub-program that downloads papers from EUPMC); followed by the argument option "-q", which precedes our search term(s) enclosed in quotation marks ("terpene"). To specify how many papers we want to download, we use the argument option "-k" followed by the number of papers we desire (in this case, 10). Finally, we use the argument option "--project_name" followed by the name we have chosen for our project's directory/folder (in this case, "terpene_10"). (See example below.)
We want to use docanalysis to run pygetpapers to search for papers containing the term "terpene" and then download 10 of them into a directory we want to be named "terpene_10".
COMMAND (Input)
docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10
Running this command will display this output in our terminal window:
LOGS (Displayed Output)
⛔️Somewhere (preferably following the log output itself), we should include a key to decipher the log output⛔️
INFO: making project/searching terpene for 10 hits into C:\Users\MY_COMPUTER\docanalysis\terpene_10
INFO: Total Hits are 13935
1it [00:00, 936.44it/s]
INFO: Saving XML files to C:\Users\MY_COMPUTER\docanalysis\terpene_10\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:30<00:00, 3.10s/it]
… and export the downloaded files into sub-folders (named by their PMC identification numbers) inside the directory we've specified, named "terpene_10", on our machine:
CPROJ (Downloaded output)
C:\USERS\MY_COMPUTER\DOCANALYSIS\TERPENE_10
│ eupmc_results.json
│
├───PMC8625850
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8727598
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8747377
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8771452
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8775117
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8801761
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8831285
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8839294
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8840323
│ eupmc_result.json
│ fulltext.xml
│
└───PMC8879232
eupmc_result.json
fulltext.xml
⛔️⛔️⛔️Why and when do we want to do this??⛔️⛔️⛔️
COMMAND
docanalysis --project_name terpene_10 --make_section
LOGS
WARNING: Making sections in /content/terpene_10/PMC9095633/fulltext.xml
INFO: dict_keys: dict_keys(['abstract', 'acknowledge', 'affiliation', 'author', 'conclusion', 'discussion', 'ethics', 'fig_caption', 'front', 'introduction', 'jrnl_title', 'keyword', 'method', 'octree', 'pdfimage', 'pub_date', 'publisher', 'reference', 'results_discuss', 'search_results', 'sections', 'svg', 'table', 'title'])
WARNING: loading templates.json
INFO: wrote XML sections for /content/terpene_10/PMC9095633/fulltext.xml /content/terpene_10/PMC9095633/sections
WARNING: Making sections in /content/terpene_10/PMC9120863/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9120863/fulltext.xml /content/terpene_10/PMC9120863/sections
WARNING: Making sections in /content/terpene_10/PMC8982386/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC8982386/fulltext.xml /content/terpene_10/PMC8982386/sections
WARNING: Making sections in /content/terpene_10/PMC9069239/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9069239/fulltext.xml /content/terpene_10/PMC9069239/sections
WARNING: Making sections in /content/terpene_10/PMC9165828/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9165828/fulltext.xml /content/terpene_10/PMC9165828/sections
WARNING: Making sections in /content/terpene_10/PMC9119530/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9119530/fulltext.xml /content/terpene_10/PMC9119530/sections
WARNING: Making sections in /content/terpene_10/PMC8982077/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC8982077/fulltext.xml /content/terpene_10/PMC8982077/sections
WARNING: Making sections in /content/terpene_10/PMC9067962/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9067962/fulltext.xml /content/terpene_10/PMC9067962/sections
WARNING: Making sections in /content/terpene_10/PMC9154778/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9154778/fulltext.xml /content/terpene_10/PMC9154778/sections
WARNING: Making sections in /content/terpene_10/PMC9164016/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9164016/fulltext.xml /content/terpene_10/PMC9164016/sections
⛔️⛔️⛔️Can we <SNIP> this with an explanation? We're going to have to explain this to the user, preferably at the bottom of this log⛔️⛔️⛔️
47% 1056/2258 [00:01<00:01, 1003.31it/s]ERROR: cannot parse /content/terpene_10/PMC9165828/sections/1_front/1_article-meta/26_custom-meta-group/0_custom-meta/1_meta-value/0_xref.xml
67% 1516/2258 [00:01<00:00, 1047.68it/s]ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/7_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/14_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/3_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/6_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/9_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/10_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/4_xref.xml
...
⛔️⛔️⛔️We're going to have to explain log warnings, errors, etc., to the user — preferably at the bottom of this log⛔️⛔️⛔️
100% 2258/2258 [00:02<00:00, 949.43it/s]
CTREE of sectioned papers (Visualisation of folders, sub-folders, and files created/saved in the specified PROJECT_NAME folder.)
⛔️Is this actually shown in the log display, or is it a representation? Can we use a screenshot instead? Shouldn't we start from the PROJECT_NAME folder and stop after the first or second PMC folder?⛔️
├───PMC8625850
│ └───sections
│ ├───0_processing-meta
│ ├───1_front
│ │ ├───0_journal-meta
│ │ └───1_article-meta
│ ├───2_body
│ │ ├───0_1._introduction
│ │ ├───1_2._materials_and_methods
│ │ │ ├───1_2.1._materials
│ │ │ ├───2_2.2._bacterial_strains
│ │ │ ├───3_2.3._preparation_and_character
│ │ │ ├───4_2.4._evaluation_of_the_effect_
│ │ │ ├───5_2.5._time-kill_studies
│ │ │ ├───6_2.6._propidium_iodide_uptake-e
│ │ │ └───7_2.7._hemolysis_test_from_human
│ │ ├───2_3._results
│ │ │ ├───1_3.1._encapsulation_of_terpene_
│ │ │ ├───2_3.2._both_terpene_alcohol-load
│ │ │ ├───3_3.3._farnesol_and_geraniol-loa
│ │ │ └───4_3.4._farnesol_and_geraniol-loa
│ │ ├───3_4._discussion
│ │ ├───4_5._conclusions
│ │ └───5_6._patents
│ ├───3_back
│ │ ├───0_ack⛔️rename for clarity?⛔️
│ │ ├───1_fn-group⛔️rename for clarity?⛔️
│ │ │ └───0_fn⛔️rename for clarity?⛔️
│ │ ├───2_app-group
│ │ │ └───0_app
│ │ │ └───2_supplementary-material
│ │ │ └───0_media
│ │ └───9_ref-list
│ └───4_floats-group
│ ├───4_table-wrap⛔️rename for clarity?⛔️
│ ├───5_table-wrap⛔️rename for clarity?⛔️
│ ├───6_table-wrap⛔️rename for clarity?⛔️
│ │ └───4_table-wrap-foot⛔️rename for clarity?⛔️
│ │ └───0_fn⛔️rename for clarity?⛔️
│ ├───7_table-wrap⛔️rename for clarity?⛔️
│ └───8_table-wrap⛔️rename for clarity?⛔️
...
⛔️https://github.com/petermr/docanalysis/blob/main/README.md#search-sections-using-dictionary⛔️ How do I link to anchors in the README?
In ami's terminology, a "dictionary" is a set of terms/phrases in XML format.
Dictionaries related to ethics and acknowledgments are available in the [Ethics Dictionary](https://github.com/petermr/docanalysis/tree/main/ethics_dictionary) folder.
If you'd like to create a custom dictionary, you can find the steps [here].
COMMAND
docanalysis --project_name terpene_10 --output entities.csv --make_ami_dict entities.xml
LOGS
INFO: Found 7134 sentences in the section(s).
INFO: getting terms from /content/activity.xml
100% 7134/7134 [00:02<00:00, 3172.14it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352:
⛔️FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.⛔️
"[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/activity.csv
The argument option --spacy_model spacy --entities invokes spacy (a free, open-source Natural Language Processing library included in docanalysis) to [extract Named Entities](https://spacy.io/) from the corpus of text downloaded when we use pygetpapers via docanalysis (docanalysis --run_pygetpapers).
Below is the list of Named Entities supported by spacy:
⛔️This information is duplicated in the end credits.
⛔️Suggest we use the full, spelled-out terms for entities, rather than use contractions.
⛔️Are all of these entities found, or can individual ones be selected via flag options?
| Named Entity | Description | Examples |
|---|---|---|
| CARDINAL | Numerals that do not fall under another type | 2, Two, Fifty-two |
| DATE | Absolute or relative dates or periods | 9th May 1987, 4 AUG |
| EVENT | Named hurricanes, battles, wars, sports events, etc. | Olympic Games |
| FAC | FACILITY: buildings, airports, highways, bridges, etc. | Logan International Airport, The Golden Gate |
| GPE | GEO-POLITICAL ENTITIES: countries, cities, states | India, Australia, South East Asia |
| LANGUAGE | Any named language | English, Portuguese, French |
| LAW | Named documents made into laws | Roe v. Wade |
| LOC | LOCATION: non-GPE locations, mountain ranges, bodies of water | Mount Everest, River Ganga |
| MONEY | Monetary values, including unit | million dollars, INR 4 Crore |
| NORP | Nationalities or religious or political groups | The Republican Party |
| ORDINAL | first, second, etc. | 9th, Ninth |
| ORG | Companies, agencies, institutions, etc. | Microsoft, Facebook, FBI, MIT |
| PERCENT | Percentage, including "%" | Eighty percent |
| PERSON | People, including fictional | Bill Clinton, Fred Flintstone |
| PRODUCT | Objects, vehicles, foods, etc. (not services) | Formula 1 |
| QUANTITY | Measurements, as of weight or distance | Several kilometers, 55kg |
| TIME | Times smaller than a day | 7:23 A.M., three-forty am, Four hours |
| WORK_OF_ART | Titles of books, songs, etc. | The Mona Lisa |
INPUT
docanalysis --project_name terpene_10 --make_section --spacy_model spacy --entities ORG --output org.csv
LOGS
INFO: Found 7134 sentences in the section(s).
INFO: Loading spacy
100% 7134/7134 [01:08<00:00, 104.16it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
"[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org.csv
[Extract information from specific sections](https://github.com/petermr/docanalysis/blob/main/README.md#extract-information-from-specific-sections)
You can choose to extract entities from specific sections.
COMMAND
docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csv
LOG
INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 106.66it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
"[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csv
[Create dictionary of extracted entities](https://github.com/petermr/docanalysis/blob/main/README.md#create-dictionary-of-extracted-entities)
COMMAND
docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csvv --make_ami_dict org
LOG
INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 96.56it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
"[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csvv
INFO: Wrote all the entities extracted to ami dict
Snippet of the dictionary
<?xml version="1.0"?>
<dictionary title="/content/terpene_10/org.xml">
<entry count="2" term="Department of Biochemistry"/>
<entry count="2" term="Chinese Academy of Agricultural Sciences"/>
<entry count="2" term="Tianjin University"/>
<entry count="2" term="Desert Research Center"/>
<entry count="2" term="Chinese Academy of Sciences"/>
<entry count="2" term="University of Colorado Boulder"/>
<entry count="2" term="Department of Neurology"/>
<entry count="1" term="Max Planck Institute for Chemical Ecology"/>
<entry count="1" term="College of Forest Resources and Environmental Science"/>
<entry count="1" term="Michigan Technological University"/>https://github.com/petermr/docanalysis/blob/main/README.md#what-is-a-dictionary
docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10 --make_section --output entities_202202019.csv --make_ami_dict entities_20220209.xml
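A sketch of feeding the generated ami-dict back into docanalysis as an annotation dictionary (this assumes -d accepts a file path, which is still an open question above; the output filename is a placeholder):

docanalysis --project_name terpene_10 -d /content/terpene_10/entities_20220209.xml --output annotated_entities.csv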
- [pygetpapers](https://github.com/petermr/pygetpapers) — searches for and downloads papers from [europepmc.org ("EUPMC")](https://europepmc.org) (.html, .xml, .pdf, and/or .json)
- docanalysis — ingests [CProjects](https://github.com/petermr/tigr2ess/blob/master/getpapers/TUTORIAL.md#cproject-and-ctrees) and carries out text-analysis of documents, including sectioning, NLP/text-mining, and vocabulary generation. Uses [NLTK](https://www.nltk.org/) and other Python tools for many operations, and [spaCy](https://spacy.io/) or [scispaCy](https://allenai.github.io/scispacy/) for extraction and annotation of entities. Outputs summary data and word-dictionaries.

docanalysis integrates and leverages the power of the following open-source technologies:
- py4ami
- [spaCy](https://spacy.io/)
  - Here's the list of NER labels [SpaCy's English model](https://spacy.io/models/en) provides: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
- [sciSpaCy](https://allenai.github.io/scispacy/) - recognize Named-Entities and label them
- [NLTK](https://www.nltk.org/) - splits sentences
- [pygetpapers](https://github.com/petermr/pygetpapers) - scrape open repositories to download papers of interest
  - EUPMC
- pyamiimage
  - EasyOCR
  - Tesseract