This repository contains Python scripts for downloading Semantic Scholar computer science paper abstracts and constructing a knowledge graph from them. This code is in the ontology_generation/ folder.
First we train a GloVe model on the abstracts using sense2vec. Then we extract keywords from the abstracts. We create the knowledge graph by drawing an edge between keywords whose word vectors have a cosine similarity above a given threshold.
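For intuition, here is a minimal sketch of that edge rule; the keywords, vectors, and 0.6 threshold are made up for the example, and the real logic lives in 05_initial_ontology.py.

```python
# Minimal sketch of the edge rule; the keywords, vectors, and the 0.6
# threshold are made up for illustration.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = {
    "machine_learning": np.array([0.98, 0.12, 0.45]),
    "deep_learning":    np.array([0.91, 0.20, 0.40]),
    "basket_weaving":   np.array([-0.70, 0.60, 0.10]),
}

threshold = 0.6
keywords = list(vectors)
edges = [
    (a, b, round(cosine_similarity(vectors[a], vectors[b]), 3))
    for i, a in enumerate(keywords)
    for b in keywords[i + 1:]
    if cosine_similarity(vectors[a], vectors[b]) > threshold
]
print(edges)  # only the similar pair becomes an edge
```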
The graph can then be viewed and edited with a local web application. This code is in the ui/ folder.
If you are using Windows, it is recommended to use the Windows Subsystem for Linux, as the dependencies are easier to compile there.
Install the Python dependencies from the conda environment file, then install the remaining packages with pip:

```
conda env create -f ontovec.yml
pip install sense2vec
pip install git+https://github.com/LIAAD/yake
```
Clone and build the GloVe source code by running make in the GloVe directory:

```
git clone https://github.com/stanfordnlp/GloVe
cd ./GloVe
make
```
Clone the sense2vec source code (you will need the scripts in the repository in addition to the pip package installed above). This fork contains the same version we used for the project:

```
git clone https://github.com/cerules/sense2vec.git
```
If you would like to skip the graph creation steps, you can proceed to step 6 and simply use the existing edges.csv and words.csv files.
Downloads academic paper metadata from Semantic Scholar's Open Research Corpus
Waleed Ammar et al. 2018. Construction of the Literature Graph in Semantic Scholar. NAACL. https://www.semanticscholar.org/paper/09e3cf5704bcb16e6657f6ceed70e93373a54618
Can be run with default arguments to download all computer science papers. For more information run the script with --help.
```
python ./01_download_data.py
```
To download a smaller example dataset use the --fileLimit parameter:

```
python ./01_download_data.py --fileLimit 5
```
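The corpus is distributed as gzipped JSON-lines files. If you want to peek at what was downloaded, a sketch like the one below works; the file name and the field names (title, year, fieldsOfStudy) are assumptions based on the Open Research Corpus schema, so adjust them to match what the script actually wrote.

```python
# Peek at a few downloaded records. The file name and field names are
# assumptions based on the Open Research Corpus schema; adjust them to
# match the actual output of 01_download_data.py.
import gzip
import json

with gzip.open("data/s2-corpus-000.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        paper = json.loads(line)
        print(paper.get("year"), paper.get("fieldsOfStudy"), paper.get("title"))
        if i >= 4:
            break
```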
Extracts sentences from paper abstracts
Can be run with default arguments to extract sentences from 1000 papers. For more information run the script with --help.
```
python ./02_extract_sentences.py
```
In practice we set the limit argument to 100000:

```
python ./02_extract_sentences.py --limit 100000
```
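Conceptually, this step turns each abstract into one sentence per line for the GloVe training corpus. A naive illustration (not the script's actual method) is shown below.

```python
# Naive illustration of splitting an abstract into sentences. The real
# extraction is done by 02_extract_sentences.py and may differ.
import re

abstract = ("We present a new model for keyword extraction. "
            "It improves on prior work. Experiments confirm the gains.")
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]
for sentence in sentences:
    print(sentence)
```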
Uses YAKE! to extract keywords from paper abstracts. Make sure you install it first:

```
pip install git+https://github.com/LIAAD/yake
```
Can be run with default arguments. For more information run the script with --help.
```
python ./03_extract_keywords.py
```
In practice we extracted keywords from all papers with year >= 2010:

```
python ./03_extract_keywords.py --yearCutOff 2010
```
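If you want to try YAKE! on a single abstract first, its API looks like the sketch below; the n and top parameters are illustrative, not the defaults used by the script.

```python
# Try YAKE! on one abstract. Lower scores indicate more relevant
# keywords; the n/top parameters here are illustrative.
import yake

text = ("We construct a knowledge graph from computer science paper "
        "abstracts using word vectors and keyword extraction.")
extractor = yake.KeywordExtractor(lan="en", n=2, top=5)
for keyword, score in extractor.extract_keywords(text):
    print(f"{score:.4f}  {keyword}")
```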
Trains a sense2vec GloVe model on the paper abstracts using the sentences extracted in step 2.

Use the forked version of sense2vec cloned above, which fixes a formatting issue, until the upstream issue is fixed and the pull request is merged.

The input arguments depend on the previous steps' output locations. The GloVe build directory and the sense2vec scripts directory are required as input. If you used the default arguments, the invocation should look something like this:

```
./04_sense2vec_train.sh ../../GloVe/build/ ../../sense2vec/scripts/ ../data/sense2vec_train/
```
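Once training finishes you can spot-check the vectors. The sketch below assumes the run left a GloVe-format text file (one "token v0 v1 ..." line per token) in the output directory; the exact file name is an assumption.

```python
# Spot-check the trained vectors. The vectors.txt file name is an
# assumption; adjust it to the actual output of the training script.
import numpy as np

vectors = {}
with open("../data/sense2vec_train/vectors.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=float)

print(len(vectors), "tokens,", len(next(iter(vectors.values()))), "dimensions")
```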
Uses the word vectors generated in step 4 and the keywords generated in step 3 to output a node file and an edge file. The nodes consist of keywords that have word vectors. An edge exists between two nodes if the cosine similarity of their word vectors is greater than the given threshold.

Outputs an edges.csv and a words.csv file.
Can be run with default arguments. For more information run the script with --help.
```
python ./05_initial_ontology.py
```
This step outputs the words.csv and edges.csv files needed for the visualization step. The files should look something like the examples below.
words.csv
id | word | v0 | v1 | v2 | v3 | ... | v127 |
---|---|---|---|---|---|---|---|
0 | computer_science | 0.23 | 0.123 | -0.32 | 0.832 | ... | 0.044123 |
1 | machine_learning | 0.98 | 0.123 | 0.45 | -0.32 | ... | 0.132 |
... |
edges.csv
source | target | similarity |
---|---|---|
0 | 1 | 0.6 |
... |
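For a quick sanity check of the two files, you can join the edge endpoints back to their keywords with pandas; the paths below assume the default output locations.

```python
# Sanity-check the step 5 outputs; paths assume the default locations.
import pandas as pd

words = pd.read_csv("data/words.csv")
edges = pd.read_csv("data/edges.csv")

# Map edge endpoint ids back to keywords via the words table.
id_to_word = words.set_index("id")["word"]
edges["source_word"] = edges["source"].map(id_to_word)
edges["target_word"] = edges["target"].map(id_to_word)
print(edges[["source_word", "target_word", "similarity"]].head())
```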
Visualize the words and edges as a graph. Edges can be added and removed as you see fit.
The visualization step requires a words.csv and an edges.csv file in the data/ directory. These should be present if you followed the previous steps with default arguments.
We also included demo words.csv and edges.csv files (these will be overwritten if you follow steps 1-5).
Additionally, our final graph is located in data/graph_0.6_threshold. Copying those CSV files into the data/ directory will allow the visualization to display them.
The easiest way to run the visualization is with a local Python web server started from the root of this git repo, for example:

```
python -m http.server 8000
```
Finally, navigate to localhost:8000/ui/ontology_graph.html in your browser.