docanalysis Tutorial
- docanalysis documentation
- Shweata N. Hegde
- Windows 10
- Python version: Python 3.8.10
Useful; not essential.
- Make a directory
mkdir docanalysis_tutorial
cd docanalysis_tutorial
- Create a virtual environment
python -m venv venv
- Activate the virtual environment. Windows:
venv\Scripts\activate.bat
Mac:
source venv/bin/activate
- Run
pip install docanalysis
- Once installed, you can run docanalysis --help. The help message should show up.
(venv) C:\Users\shweata\docanalysis_tutorial>docanalysis --help
c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\_distutils_hack\__init__.py:36: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
usage: docanalysis [-h] [--run_pygetpapers] [--make_section] [-q QUERY] [-k HITS] [--project_name PROJECT_NAME]
[-d [DICTIONARY [DICTIONARY ...]]] [-o OUTPUT] [--make_ami_dict MAKE_AMI_DICT]
[--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]] [--entities [ENTITIES [ENTITIES ...]]]
[--spacy_model SPACY_MODEL] [--html HTML] [--synonyms SYNONYMS] [--make_json MAKE_JSON]
[--search_html] [--extract_abb EXTRACT_ABB] [-l LOGLEVEL] [-f LOGFILE]
Welcome to docanalysis version 0.1.9. -h or --help for help
optional arguments:
-h, --help show this help message and exit
--run_pygetpapers [Command] downloads papers from EuropePMC via pygetpapers
--make_section [Command] makes sections; requires a fulltext.xml in CTree directories
-q QUERY, --query QUERY
[pygetpapers] query string
-k HITS, --hits HITS [pygetpapers] number of papers to download
--project_name PROJECT_NAME
CProject directory name
-d [DICTIONARY [DICTIONARY ...]], --dictionary [DICTIONARY [DICTIONARY ...]]
[file name/url] existing ami dictionary to annotate sentences or support supervised entity
extraction
-o OUTPUT, --output OUTPUT
outputs csv with sentences/terms
--make_ami_dict MAKE_AMI_DICT
[Command] title for ami-dict. Makes ami-dict of all extracted entities; works only with spacy
--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]
[NER/dictionary search] section(s) to annotate. Choose from: ALL, ACK, AFF, AUT, CON, DIS,
ETH, FIG, INT, KEY, MET, RES, TAB, TIL. Defaults to ALL
--entities [ENTITIES [ENTITIES ...]]
[NER] entities to extract. Default (ALL). Common entities SpaCy: GPE, LANGUAGE, ORG, PERSON
(for additional ones check: ); SciSpaCy: CHEMICAL, DISEASE
--spacy_model SPACY_MODEL
[NER] optional. Choose between spacy or scispacy models. Defaults to spacy
--html HTML outputs html with sentences/terms
--synonyms SYNONYMS annotate the corpus/sections with synonyms from ami-dict
--make_json MAKE_JSON
outputs json with sentences/terms
--search_html searches html documents (mainly IPCC)
--extract_abb EXTRACT_ABB
[Command] title for abb-ami-dict. Extracts abbreviations and expansions; makes ami-dict of all
extracted entities
-l LOGLEVEL, --loglevel LOGLEVEL
provide logging level. Example --log warning <<info,warning,debug,error,critical>>,
default='info'
-f LOGFILE, --logfile LOGFILE
saves log to specified file in output directory as well as printing to terminal
As you can see, docanalysis does a lot of things. Let's test them one by one.
You can call pygetpapers (a tool to automatically download papers) from docanalysis using:
docanalysis --run_pygetpapers -q "terpenes" -k 20 --project_name terpene_20
--run_pygetpapers tells docanalysis to use pygetpapers to download -k 20 papers on -q "terpenes" into --project_name terpene_20.
c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\_distutils_hack\__init__.py:36: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INFO: Total Hits are 35508
20it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\docanalysis_tutorial\terpene_20\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:58<00:00, 2.91s/it]
INFO: making CProject C:\Users\shweata\docanalysis_tutorial\terpene_20 with 20 papers on terpenes
ERROR: section papers using --make_sections before search
The last error message indicates that docanalysis is not meant to run just pygetpapers. Maybe I should make it say something useful.
For docanalysis to ingest papers, they need to be sectioned. We do that by running:
docanalysis --project_name terpene_20 --make_section
Notice that you only have to reference the folder name using --project_name and don't have to use --run_pygetpapers again. Once run, you will have sectioned papers.
...
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9228083\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9228083\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9228083\sections
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230113\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230113\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230113\sections
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230896\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230896\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9230896\sections
WARNING: Making sections in C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9236214\fulltext.xml
INFO: wrote XML sections for C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9236214\fulltext.xml C:\Users\shweata\docanalysis_tutorial\terpene_20\PMC9236214\sections
Here's how the tree looks for a paper:
+---PMC9222602
| \---sections
| +---0_processing-meta
| +---1_front
| | +---0_journal-meta
| | \---1_article-meta
| | \---19_funding-group
| | \---0_award-group
| +---2_body
| | +---0_1._introduction
| | +---1_2._materials_and_methods
| | | +---1_2.1._grape_variety_and_winemak
| | | +---2_2.2._characterization_of_the_w
| | | +---3_2.3._determination_of_the_arom
| | | \---4_2.4._statistical_analyses
| | +---2_3._results_and_discussion
| | | +---1_3.1._characterization_of_must_
| | | +---2_3.2._effects_from_the_applicat
| | | +---3_3.3._effects_from_the_applicat
| | | \---4_3.4._specific_effects_on_the_l
| | \---3_4._conclusions
| +---3_back
| | +---0_fn-group
| | | \---0_fn
| | \---6_ref-list
| \---4_floats-group
| +---4_table-wrap
| | \---4_table-wrap-foot
| | \---0_fn
| \---5_table-wrap
| \---4_table-wrap-foot
| \---0_fn
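If you want to check from Python which papers have been sectioned, a short walk over the project folder is enough. The sketch below assumes only the layout shown above (one folder per paper, each with a sections subfolder). It builds a tiny stand-in project in a temporary directory so it runs without real papers; the paper IDs and function names are made up.

```python
from pathlib import Path
import tempfile


def summarize_cproject(project_dir):
    """Return {paper_id: number of files under its sections/ folder}."""
    summary = {}
    for ctree in sorted(Path(project_dir).iterdir()):
        if not ctree.is_dir():
            continue  # skip result files such as .csv or .json
        sections = ctree / "sections"
        if not sections.is_dir():
            continue  # downloaded but not sectioned yet
        summary[ctree.name] = sum(1 for p in sections.rglob("*") if p.is_file())
    return summary


def _make_demo_project(root):
    """Create a small fake project: one sectioned paper, one unsectioned."""
    meta = root / "PMC0000001" / "sections" / "1_front" / "1_article-meta"
    meta.mkdir(parents=True)
    (meta / "18_abstract.xml").write_text("<abstract/>", encoding="utf-8")
    intro = root / "PMC0000001" / "sections" / "2_body" / "0_1._introduction"
    intro.mkdir(parents=True)
    (intro / "1_p.xml").write_text("<p/>", encoding="utf-8")
    (root / "PMC0000002").mkdir()
    (root / "PMC0000002" / "fulltext.xml").write_text("<article/>", encoding="utf-8")
    return root


def demo_summary():
    with tempfile.TemporaryDirectory() as tmp:
        return summarize_cproject(_make_demo_project(Path(tmp)))


if __name__ == "__main__":
    print(demo_summary())
```

On a real project you would call summarize_cproject("terpene_20") instead of the demo helper.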
Now, we are ready to analyse papers in interesting ways!
A dictionary is a set of terms -- plant names, country names, organization names, drugs -- with links to Wikidata. docanalysis has default dictionaries that you can use for searching. They are:
- EO_ACTIVITY
- EO_COMPOUND
- EO_ANALYSIS
- EO_EXTRACTION
- EO_PLANT
- PLANT_GENUS
- EO_PLANT_PART
- EO_TARGET
- COUNTRY
- DISEASE
- ORGANIZATION
- DRUG
- TEST_TRACE
You can either use the default dictionaries or custom ones. If you have a custom dictionary, you can point docanalysis to it by giving its absolute path.
docanalysis --project_name terpene_20 --dictionary EO_PLANT --output plant.csv --make_json plant.json
You can output any results from docanalysis in .csv or .json format (or both, as above).
This task might take anywhere from a few seconds to more than 15 minutes, depending on the number of papers in the folder.
c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\_distutils_hack\__init__.py:36: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 65.98it/s]
0it [00:00, ?it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 56/56 [00:00<00:00, 107.05it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 38/38 [00:00<00:00, 87.25it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:00<00:00, 124.78it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 202/202 [00:02<00:00, 88.29it/s]
0it [00:00, ?it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 124/124 [00:01<00:00, 103.37it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 113/113 [00:01<00:00, 97.51it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 115.82it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 147.96it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 263/263 [00:02<00:00, 125.15it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 34/34 [00:00<00:00, 70.70it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 100.15it/s]
0it [00:00, ?it/s]
INFO: Found 4260 sentences in the section(s).
INFO: getting terms from EO_PLANT
100%|█████████████████████████████████████████████████████████████████████████████| 4260/4260 [00:34<00:00, 122.46it/s]
c:\users\shweata\docanalysis_tutorial\venv\lib\site-packages\docanalysis\entity_extraction.py:452: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df[col] = df[col].astype(str).str.replace(
INFO: wrote output to C:\Users\shweata\docanalysis_tutorial\terpene_20\plant.csv
INFO: wrote JSON output to C:\Users\shweata\docanalysis_tutorial\terpene_20\plant.json
Let's look at the results in more detail. Here's one entry from the output (in .json).
{
"5": {
"file_path": "C:\\Users\\shweata\\docanalysis_tutorial\\terpene_20\\PMC8886108\\sections\\1_front\\1_article-meta\\18_abstract.xml",
"sentence": "Our group settled six formulations combining CBD and terpenes purified from Cannabis sativa L, Origanum vulgare , and Thymus mastichina .",
"section": "ABS",
"0": [
[
"Cannabis sativa",
"Origanum vulgare",
"Thymus mastichina"
]
],
"0_span": [
[
[
77,
92
],
[
98,
114
],
[
122,
139
]
]
],
"weight_0":
...
docanalysis has pulled sentences that mention terms in the dictionary -- plant species names, in this case. It also tells us that the sentence comes from the ABS (abstract) section and gives the span (starting and ending character positions) of each term. You can check out the full results, here.
You can also search using multiple dictionaries. For example:
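A result file like this is easy to post-process. The sketch below counts how often each dictionary term was matched. It assumes the layout of the entry shown above, with the hits for the first dictionary stored under the key "0" as a list of lists; the file path in the sample is shortened, and the function name is made up for illustration.

```python
import json
from collections import Counter

# One entry copied from the output above (file path shortened for the example).
SAMPLE = """
{
  "5": {
    "file_path": "terpene_20/PMC8886108/sections/1_front/1_article-meta/18_abstract.xml",
    "sentence": "Our group settled six formulations combining CBD and terpenes purified from Cannabis sativa L, Origanum vulgare , and Thymus mastichina .",
    "section": "ABS",
    "0": [["Cannabis sativa", "Origanum vulgare", "Thymus mastichina"]]
  }
}
"""


def count_terms(results, key="0"):
    """Count matched terms across all sentences for one dictionary key."""
    counts = Counter()
    for entry in results.values():
        # Assumption: hits are stored as a list of lists of term strings.
        for group in entry.get(key, []):
            counts.update(group)
    return counts


if __name__ == "__main__":
    # For a real run, replace json.loads(SAMPLE) with
    # json.load(open("terpene_20/plant.json", encoding="utf-8")).
    for term, n in count_terms(json.loads(SAMPLE)).most_common():
        print(n, term)
```

With the full plant.json, the same function would give a quick ranking of the most frequently mentioned plants in the corpus.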
docanalysis --project_name terpene_20 --dictionary EO_PLANT EO_COMPOUND --output plant_compound.csv --make_json plant_compound.json
You can find the results, here.
From Wikipedia: "In information extraction, a named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name. It can be abstract or have a physical existence." docanalysis uses spacy and scispacy models to extract named entities from our papers. Each model extracts a specific set of entity types.
- spacy extracts (source):
2.6 Entity Names Annotation
Names (often referred to as “Named Entities”) are annotated according to the following
set of types:
PERSON People, including fictional
NORP Nationalities or religious or political groups
FACILITY Buildings, airports, highways, bridges, etc.
ORGANIZATION Companies, agencies, institutions, etc.
GPE Countries, cities, states
LOCATION Non-GPE locations, mountain ranges, bodies of water
PRODUCT Vehicles, weapons, foods, etc. (Not services)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK OF ART Titles of books, songs, etc.
LAW Named documents made into laws
LANGUAGE Any named language
DATE Absolute or relative dates or periods
TIME Times smaller than a day
PERCENT Percentage (including “%”)
MONEY Monetary values, including unit
QUANTITY Measurements, as of weight or distance
ORDINAL “first”, “second”
CARDINAL Numerals that do not fall under another type
- scispacy extracts:
- CHEMICAL
- DISEASE ...
To use them from docanalysis, you specify the model using --spacy_model. Your command would look like this:
docanalysis --project_name terpene_20 --spacy_model spacy --output all_entities.csv --make_json all_entities.json
Here's the output.
"64": {
"file_path": "C:\\Users\\shweata\\docanalysis_tutorial\\terpene_20\\PMC9182305\\sections\\1_front\\1_article-meta\\18_abstract.xml",
"sentence": "Interestingly, floral fragrance compounds such as 3-carene, valencene, aromandendrene, menogene, and (+)- \u03b3 -gurjunene were first reported in the flowers of P. notoginseng .",
"section": "ABS",
"entities": [
"3",
"first"
],
"labels": [
"CARDINAL",
"ORDINAL"
],
"position_start": [
50,
124
],
"position_end": [
51,
129
],
"abbreviations": [],
"abbreviations_longform": [],
"abbreviation_start": [],
"abbreviation_end": []
},
"65": {
"file_path": "C:\\Users\\shweata\\docanalysis_tutorial\\terpene_20\\PMC9182305\\sections\\1_front\\1_article-meta\\18_abstract.xml",
"sentence": "Cluster analysis showed that P. notoginseng with four-forked and three-forked leaves clustered into two subgroups, respectively.",
"section": "ABS",
"entities": [
"four",
"three",
"two"
],
"labels": [
"CARDINAL",
"CARDINAL",
"CARDINAL"
],
"position_start": [
51,
67,
102
],
"position_end": [
55,
72,
105
],
"abbreviations": [],
"abbreviations_longform": [],
"abbreviation_start": [],
"abbreviation_end": []
},
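To get a feel for what the model returned, the entities can be grouped by label. This is a minimal sketch, assuming the "entities" and "labels" lists are parallel, as they appear to be in the two entries above; the sample keeps only those two fields, and the function name is made up.

```python
from collections import Counter, defaultdict

# The two entries shown above, trimmed to the fields used here.
SAMPLE = {
    "64": {"section": "ABS",
           "entities": ["3", "first"],
           "labels": ["CARDINAL", "ORDINAL"]},
    "65": {"section": "ABS",
           "entities": ["four", "three", "two"],
           "labels": ["CARDINAL", "CARDINAL", "CARDINAL"]},
}


def entities_by_label(results):
    """Group extracted entity strings by their label and count them."""
    grouped = defaultdict(Counter)
    for entry in results.values():
        # Assumption: entities[i] carries the label labels[i].
        for text, label in zip(entry.get("entities", []), entry.get("labels", [])):
            grouped[label][text] += 1
    return grouped


if __name__ == "__main__":
    for label, counts in entities_by_label(SAMPLE).items():
        print(label, dict(counts))
```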
Looking at the results, you might say that's not very useful: different kinds of entities are mixed up. But with docanalysis, you can specify the entities you want using --entities, for example ORG (organization), and extract only them. Here's how:
docanalysis --project_name terpene_20 --spacy_model spacy --entities ORG --output all_org.csv --make_json all_org.json
The results:
"2823": {
"file_path": "C:\\Users\\shweata\\docanalysis_tutorial\\terpene_20\\PMC9104620\\sections\\2_body\\1_2._materials_and_methods\\3_2.3._plant_material_and_terpen\\3_p.xml",
"sentence": "The identification of terpenes was performed by using the National Institute of Standards and Technology (NIST) library based on mass spectrums and retention times (RT).",
"section": "MET",
"entities": [
"the National Institute of Standards and Technology",
"RT"
],
"labels": [
"ORG",
"ORG"
],
"position_start": [
54,
165
],
"position_end": [
104,
167
],
"abbreviations": [],
"abbreviations_longform": [],
"abbreviation_start": [],
"abbreviation_end": []
},
That's much better -- only one type of entity. But you may not want organizations from every section of the papers. In the hit above, for instance, the organizations are mentioned in the Methods (MET) section. What if you only wanted to know about the affiliations of the authors? With docanalysis, you can specify which section of the papers you want to extract entities from.
You can choose from the following section options using --search_section:
- ACK - Acknowledgements
- AFF - Affiliations
- AUT - Authors
- CON - Conclusion
- DIS - Discussion
- ETH - Ethics
- FIG - Figure
- INT - Introduction
- KEY - Keywords
- MET - Materials and Methods
- RES - Results
- TAB - Tabular column
- TIL - Title group
We can look into just the Affiliations by running:
docanalysis --project_name terpene_20 --spacy_model spacy --entities ORG --output aff_org.csv --make_json aff_org.json --search_section AFF
The results now look so much better. They are more useful, too.
"6": {
"file_path": "C:\\Users\\shweata\\docanalysis_tutorial\\terpene_20\\PMC8982386\\sections\\1_front\\1_article-meta\\7_aff.xml",
"sentence": "1 Department of Biochemistry, Genetics and Microbiology, Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria , Pretoria 0028, South Africa",
"section": "AFF",
"entities": [
"Department of Biochemistry",
"Genetics and Microbiology, Forestry",
"Agricultural Biotechnology Institute",
"University of Pretoria"
],
"labels": [
"ORG",
"ORG",
"ORG",
"ORG"
],
"position_start": [
4,
32,
72,
117
],
"position_end": [
30,
67,
108,
139
],
"abbreviations": [],
"abbreviations_longform": [],
"abbreviation_start": [],
"abbreviation_end": []
},
"7": {
"file_path": "C:\\Users\\shweata\\docanalysis_tutorial\\terpene_20\\PMC8982386\\sections\\1_front\\1_article-meta\\8_aff.xml",
"sentence": "2 College of Forest Resources and Environmental Science, Michigan Technological University , Houghton, MI 49931-1295, USA",
"section": "AFF",
"entities": [
"College of Forest Resources and Environmental Science",
"Michigan Technological University"
],
"labels": [
"ORG",
"ORG"
],
"position_start": [
4,
59
],
"position_end": [
57,
92
],
"abbreviations": [],
"abbreviations_longform": [],
"abbreviation_start": [],
"abbreviation_end": []
The full results of this extraction are here.
You can specify sections, even when you are searching papers using dictionaries.
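As an illustration, the sketch below builds such a command in Python, combining --dictionary with --search_section, both of which appear in the help message. The choice of the MET and RES sections and the output file name plant_met_res.csv are made up for the example, and the helper functions are hypothetical wrappers, not part of docanalysis.

```python
import shlex
import subprocess


def build_search_command(project, dictionaries, sections, csv_name):
    """Assemble a docanalysis dictionary search restricted to some sections."""
    cmd = ["docanalysis", "--project_name", project, "--dictionary", *dictionaries]
    cmd += ["--search_section", *sections, "--output", csv_name]
    return cmd


def run_search(cmd):
    """Run the command; requires docanalysis to be installed and on PATH."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_search_command(
        "terpene_20", ["EO_PLANT"], ["MET", "RES"], "plant_met_res.csv"
    )
    # Print the command as you would type it in a terminal.
    print(shlex.join(cmd))
    # run_search(cmd)  # uncomment inside the activated virtual environment
```

Building the argument list separately from running it makes it easy to loop over several dictionaries or sections and to inspect each command before it runs.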
We previously used a dictionary to search the literature (say, EO_PLANT and EO_COMPOUND). docanalysis can also create such dictionaries from the extracted entities.
For example, we can create our own Organization dictionary that contains all the extracted organization names from the Affiliations section. Here's how you do it:
docanalysis --project_name terpene_20 --make_ami_dict aff_org --spacy_model spacy --entities ORG --search_section AFF
Snippet of the dictionary we created:
<?xml version="1.0"?>
<dictionary title="aff_org">
<entry count="3" term="Yunnan Agricultural University"/>
<entry count="2" term="University of Colorado Boulder"/>
<entry count="2" term="Shanghai University"/>
<entry count="2" term="Shanghai 200444"/>
<entry count="2" term="Hangzhou Normal University"/>
<entry count="2" term="FI-33520 Tampere"/>
<entry count="1" term="R&D&Innovation Department"/>
<entry count="1" term="Applied Management and Space"/>
<entry count="1" term="Laboratório de Biotecnologia Médica"/>
<entry count="1" term="Innovation Center"/>
<entry count="1" term="Instituto de Investigação e Inovação"/>
<entry count="1" term="Escola Superior de Saúde"/>
You can find the full dictionary, here
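A generated dictionary usually needs some cleaning before reuse: in the snippet above, "Shanghai 200444" and "FI-33520 Tampere" look like address fragments rather than organizations. Here is a minimal sketch of one possible filter, assuming the entry/count/term layout shown above. The digit heuristic and the function name are illustrative choices, not something docanalysis does.

```python
import xml.etree.ElementTree as ET

# A few entries from the snippet above, wrapped in a complete document.
SAMPLE = """<?xml version="1.0"?>
<dictionary title="aff_org">
<entry count="3" term="Yunnan Agricultural University"/>
<entry count="2" term="Shanghai University"/>
<entry count="2" term="Shanghai 200444"/>
<entry count="2" term="FI-33520 Tampere"/>
<entry count="1" term="Innovation Center"/>
</dictionary>
"""


def clean_terms(xml_text, min_count=1):
    """Return (term, count) pairs, dropping rare terms and likely addresses."""
    kept = []
    for entry in ET.fromstring(xml_text).iter("entry"):
        term, count = entry.get("term"), int(entry.get("count"))
        if count < min_count:
            continue
        # Heuristic (assumption): terms with digits are postal codes, not names.
        if any(ch.isdigit() for ch in term):
            continue
        kept.append((term, count))
    return kept


if __name__ == "__main__":
    for term, count in clean_terms(SAMPLE, min_count=2):
        print(count, term)
```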
docanalysis uses the Schwartz-Hearst algorithm to extract abbreviations and their expansions.
docanalysis --project_name terpene_20 --extract_abb all_abb
docanalysis also gives us potential Wikidata IDs for the expansions.
<dictionary title="all_abb">
<entry name="SNP" term="single nucleotide polymorphism" wikidataID="['//www.wikidata.org/wiki/Q501128', '//www.wikidata.org/wiki/Q59307391', '//www.wikidata.org/wiki/Q5243761', '//www.wikidata.org/wiki/Q65372898', '//www.wikidata.org/wiki/Q65379379', '//www.wikidata.org/wiki/Q62042429', '//www.wikidata.org/wiki/Q112890801']"/>
<entry name="GC-MS" term="gas chromatography–mass spectrometry" wikidataID="['//www.wikidata.org/wiki/Q873009', '//www.wikidata.org/wiki/Q105195015', '//www.wikidata.org/wiki/Q60448872', '//www.wikidata.org/wiki/Q56814273', '//www.wikidata.org/wiki/Q55593793']"/>
<entry name="GC-IMS" term="gas chromatography–ion mobility spectrometry" wikidataID="[]"/>
<entry name="OPLS-DA" term="Orthogonal partial least square discriminant analysis" wikidataID="[]"/>
<entry name="MIC" term="minimum inhibitory concentration" wikidataID="['//www.wikidata.org/wiki/Q597889']"/>
<entry name="MBC" term="minimum bactericidal concentration" wikidataID="['//www.wikidata.org/wiki/Q1158816']"/>
<entry name="YPs" term="Yeast particles" wikidataID="[]"/>
<entry name="ROS" term="reactive oxygen species" wikidataID="['//www.wikidata.org/wiki/Q424361', '//www.wikidata.org/wiki/Q96319527', '//www.wikidata.org/wiki/Q12377931', '//www.wikidata.org/wiki/Q14863432', '//www.wikidata.org/wiki/Q18050996', '//www.wikidata.org/wiki/Q21111970']"/>
<entry name="DPPH" term="diphenyl-2-picrylhydrazyl" wikidataID="[]"/>
<entry name="OTs" term="odor thresholds" wikidataID="[]"/>
<entry name="VOC" term="Volatile organic compounds" wikidataID="['//www.wikidata.org/wiki/Q910267', '//www.wikidata.org/wiki/Q112189644', '//www.wikidata.org/wiki/Q7939989', '//www.wikidata.org/wiki/Q21761670', '//www.wikidata.org/wiki/Q26152844', '//www.wikidata.org/wiki/Q66070215', '//www.wikidata.org/wiki/Q87076997']"/>
<entry name="HCE" term="human corneal epithelial" wikidataID="['//www.wikidata.org/wiki/Q54881856', '//www.wikidata.org/wiki/Q54399847']"/>
<entry name="ORAC" term="oxygen radical absorbance capacity" wikidataID="['//www.wikidata.org/wiki/Q902552']"/>
<entry name="MIC" term="minimum inhibitory concentration" wikidataID="['//www.wikidata.org/wiki/Q597889']"/>
...
We get false hits like:
<entry name="CBD" term="cannabidiol" wikidataID="['//www.wikidata.org/wiki/Q422917', '//www.wikidata.org/wiki/Q105221018',
<entry name="AEOs" term="Amomun tsao-ko essential oils" wikidataID="[]"/>
<entry name="Lab FAS" term="Laboratorio de Fitoquímica y alimentos saludables" wikidataID="[]"/>
<entry name="H.C." term="hector.carrasco@uatotonoma.cl" wikidataID="[]"/>
<entry name="S.C." term="Southwest Forestry University, Kunming 650224, China; 15846027621@163.com" wikidataID="[]"/>
<entry name="R.R." term="ruiruiswfu@163.com" wikidataID="[]"/>
Again, being specific about the section you want to extract information from can help!
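Some of this noise can also be filtered after the fact. The sketch below drops entries whose expansion contains an email address and removes exact repeats (MIC appears twice above). It assumes the name/term/wikidataID attributes shown above, with wikidataID holding a Python-style list written as a string. The filtering rules and the function name are illustrative, not part of docanalysis.

```python
import ast
import xml.etree.ElementTree as ET

# Entries copied from the output above, wrapped in a complete document.
SAMPLE = """<dictionary title="all_abb">
<entry name="MIC" term="minimum inhibitory concentration" wikidataID="['//www.wikidata.org/wiki/Q597889']"/>
<entry name="YPs" term="Yeast particles" wikidataID="[]"/>
<entry name="R.R." term="ruiruiswfu@163.com" wikidataID="[]"/>
<entry name="MIC" term="minimum inhibitory concentration" wikidataID="['//www.wikidata.org/wiki/Q597889']"/>
</dictionary>
"""


def clean_abbreviations(xml_text):
    """Drop email-like expansions and duplicates; parse the Wikidata list."""
    seen = set()
    kept = []
    for entry in ET.fromstring(xml_text).iter("entry"):
        name, term = entry.get("name"), entry.get("term")
        if "@" in term:
            continue  # author initials matched to an email address
        key = (name, term.lower())
        if key in seen:
            continue  # same abbreviation and expansion already kept
        seen.add(key)
        # Assumption: the attribute is a list literal stored as text.
        ids = ast.literal_eval(entry.get("wikidataID", "[]"))
        kept.append({"name": name, "term": term, "wikidata": ids})
    return kept


if __name__ == "__main__":
    for item in clean_abbreviations(SAMPLE):
        print(item["name"], "->", item["term"], item["wikidata"])
```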
So far, we have worked with scientific papers that come in XML format. docanalysis also analyzes HTML documents in the same way. To tell docanalysis that your files are HTML, add the --search_html flag to the command you use for searching, extracting, etc. For example,
docanalysis --project_name C:\Users\shweata\ipcc_not_sectioned --extract_abb all_abb --search_html
And the result:
<dictionary title="all_abb">
<entry name="solar PV" term="solar photovoltaic" wikidataID="['//www.wikidata.org/wiki/Q217941', '//www.wikidata.org/wiki/Q28368735', '//www.wikidata.org/wiki/Q112956013', '//www.wikidata.org/wiki/Q112891840', '//www.wikidata.org/wiki/Q59264829', '//www.wikidata.org/wiki/Q57683955', '//www.wikidata.org/wiki/Q59261248']"/>
<entry name="SIDS" term="small island developing states" wikidataID="['//www.wikidata.org/wiki/Q1434887', '//www.wikidata.org/wiki/Q58260016', '//www.wikidata.org/wiki/Q56410642']"/>
<entry name="IPR" term="intellectual property rights" wikidataID="['//www.wikidata.org/wiki/Q108855835', '//www.wikidata.org/wiki/Q56049567', '//www.wikidata.org/wiki/Q47285236', '//www.wikidata.org/wiki/Q47458162', '//www.wikidata.org/wiki/Q47483563']"/>
<entry name="SDGs" term="Sustainable Development Goals" wikidataID="[]"/>
<entry name="IAMs" term="Integrated Assessment Models" wikidataID="[]"/>
<entry name="TRLs" term="technology readiness levels" wikidataID="['//www.wikidata.org/wiki/Q1478071']"/>
<entry name="TRLs" term="Technology Readiness Levels" wikidataID="['//www.wikidata.org/wiki/Q1478071']"/>
<entry name="TRA" term="Technology Readiness Assessment" wikidataID="[]"/>
<entry name="GPTs" term="General purpose technologies" wikidataID="[]"/>
<entry name="SM" term="smart manufacturing" wikidataID="['//www.wikidata.org/wiki/Q25112020', '//www.wikidata.org/wiki/Q97170530', '//www.wikidata.org/wiki/Q56807313', '//www.wikidata.org/wiki/Q58279820']"/>
<entry name="IoT" term="internet of things" wikidataID="['//www.wikidata.org/wiki/Q251212', '//www.wikidata.org/wiki/Q96708225', '//www.wikidata.org/wiki/Q96325266', '//www.wikidata.org/wiki/Q110262629', '//www.wikidata.org/wiki/Q59408925', '//www.wikidata.org/wiki/Q105700358', '//www.wikidata.org/wiki/Q106087667']"/>
<entry name="AI" term="artificial intelligence" wikidataID="['//www.wikidata.org/wiki/Q11660', '//www.wikidata.org/wiki/Q4801030', '//www.wikidata.org/wiki/Q221113', '//www.wikidata.org/wiki/Q128447', '//www.wikidata.org/wiki/Q107307291', '//www.wikidata.org/wiki/Q2865784', '//www.wikidata.org/wiki/Q4801033']"/>
<entry name="IAMs" term="Integrated Assessment Models" wikidataID="[]"/>
<entry name="SIS" term="Sectoral innovation systems" wikidataID="[]"/>
<entry name="MIS" term="Mission-oriented innovation systems" wikidataID="[]"/>
<entry name="MLP" term="multilevel perspective" wikidataID="['//www.wikidata.org/wiki/Q56806571']"/>
One caveat: make sure you have your HTML files in the folder hierarchy that docanalysis expects. Here's the structure to follow:
C:ipcc_not_sectioned
| all_abb.xml
+---chap16
| \---sections
| fulltext.flow.html
|
\---chap6
\---sections
fulltext.flow.html
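Before running a search, you can check a folder against this layout. The sketch below lists chapter folders that lack sections/fulltext.flow.html. It assumes only the structure shown above and builds a throwaway example in a temporary directory; the function names are made up.

```python
from pathlib import Path
import tempfile


def missing_html(project_dir, filename="fulltext.flow.html"):
    """Return names of subfolders without sections/<filename>."""
    missing = []
    for chapter in sorted(Path(project_dir).iterdir()):
        # Plain files at the top level (e.g. all_abb.xml) are ignored.
        if chapter.is_dir() and not (chapter / "sections" / filename).is_file():
            missing.append(chapter.name)
    return missing


def demo():
    """Build one well-formed and one incomplete chapter folder, then check."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        good = root / "chap16" / "sections"
        good.mkdir(parents=True)
        (good / "fulltext.flow.html").write_text("<html></html>", encoding="utf-8")
        (root / "chap6").mkdir()  # no sections/fulltext.flow.html inside
        (root / "all_abb.xml").write_text("<dictionary/>", encoding="utf-8")
        return missing_html(root)


if __name__ == "__main__":
    print(demo())
```

An empty list means every chapter folder has the file where this layout puts it.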