OntoVec

Description

This repository contains Python scripts for downloading Semantic Scholar Computer Science paper abstracts and constructing a knowledge graph from them. This code is in the ontology_generation/ folder.

First, we train a GloVe model on the abstracts using sense2vec. Then we extract keywords from the abstracts. We create the knowledge graph by drawing an edge between keywords whose word vector similarity exceeds a given threshold.
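At its core, the edge criterion is a thresholded cosine similarity between keyword vectors. A minimal sketch (the vectors here are illustrative; the real model produces 128-dimensional GloVe vectors, and our final graph used a 0.6 threshold):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative low-dimensional vectors for two keywords.
v_machine_learning = np.array([0.98, 0.123, 0.45, -0.32])
v_computer_science = np.array([0.23, 0.123, -0.32, 0.832])

THRESHOLD = 0.6
if cosine_similarity(v_machine_learning, v_computer_science) > THRESHOLD:
    print("draw an edge: machine_learning -- computer_science")
```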

The graph can then be viewed and edited with a local web application. This code is in the ui/ folder.

Installation

If you are using Windows, we recommend the Windows Subsystem for Linux, as the dependencies are easier to compile there.

Install the Python dependencies from the conda environment file and pip:

conda env create -f ontovec.yml
pip install sense2vec
pip install git+https://github.com/LIAAD/yake

Clone the GloVe source code and build it by running make in the GloVe directory:

git clone https://github.com/stanfordnlp/GloVe
cd ./GloVe
make

Clone the sense2vec source code (the scripts in the repository are needed in addition to the pip package):

This fork contains the same version we used for the project.

git clone https://github.com/cerules/sense2vec.git

Execution

If you would like to skip the graph creation steps, you can proceed directly to step 6 and use the existing edges.csv and words.csv files.

Step 1: Download Data

Downloads academic paper metadata from Semantic Scholar's Open Research Corpus

Waleed Ammar et al. 2018. Construction of the Literature Graph in Semantic Scholar. NAACL. https://www.semanticscholar.org/paper/09e3cf5704bcb16e6657f6ceed70e93373a54618

Can be run with default arguments to download all computer science papers. For more information run the script with --help.

python ./01_download_data.py

To download a smaller example dataset, use the --fileLimit parameter.

python ./01_download_data.py --fileLimit 5
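For reference, the heart of the download step is filtering the corpus's gzipped JSON-lines shards down to computer science papers. A minimal sketch, assuming a shard has already been downloaded; the field names ("fieldsOfStudy", "paperAbstract", "year") follow the 2018 Open Research Corpus release and should be verified against the files you receive:

```python
import gzip
import json

def filter_cs_papers(shard_path: str):
    """Yield computer science paper records from one corpus shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:  # one JSON record per line
            paper = json.loads(line)
            if "Computer Science" in paper.get("fieldsOfStudy", []):
                yield paper

for paper in filter_cs_papers("s2-corpus-000.gz"):
    print(paper.get("year"), paper.get("paperAbstract", "")[:80])
```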

Step 2: Extract Sentences

Extracts sentences from paper abstracts

Can be run with default arguments to extract sentences from 1000 papers. For more information run the script with --help.

python ./02_extract_sentences.py

In practice, we set the limit argument to 100000:

python ./02_extract_sentences.py --limit 100000
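The script's exact tokenization may differ, but the sentence extraction amounts to something like the following sketch using spaCy's rule-based sentencizer (treat the details as an assumption about the script's internals):

```python
import spacy

# A blank English pipeline plus the rule-based sentencizer avoids
# downloading a full model (spaCy v3 API; v2 used nlp.create_pipe).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

abstract = ("We build a knowledge graph from paper abstracts. "
            "Keywords are linked by word vector similarity.")

for sent in nlp(abstract).sents:
    print(sent.text)
```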

Step 3: Extract Keywords

Uses YAKE! to extract keywords from paper abstracts. Make sure you install it first.

pip install git+https://github.com/LIAAD/yake

Can be run with default arguments. For more information run the script with --help.

python ./03_extract_keywords.py

In practice, we extracted keywords from all papers with year >= 2010:

python ./03_extract_keywords.py --yearCutOff 2010
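For reference, YAKE! keyword extraction looks roughly like this; the extractor parameters are illustrative, not necessarily the ones the script uses:

```python
import yake

abstract = ("Knowledge graphs built from computer science paper abstracts "
            "support exploration of research topics.")

# n is the maximum n-gram length, top the number of keywords to return.
extractor = yake.KeywordExtractor(lan="en", n=3, top=5)
for keyword, score in extractor.extract_keywords(abstract):
    # Lower YAKE! scores mean more relevant keywords. Note: some older
    # releases returned (score, keyword) tuples instead.
    print(keyword, score)
```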

Step 4: Train the sense2vec GloVe Model

Trains a sense2vec GloVe model on the paper abstracts using the sentences extracted in step 2.

You should use the forked version of sense2vec cloned above, which fixes a formatting issue, until the upstream issue is fixed and the corresponding pull request is merged.

The input arguments depend on the previous steps' output locations. The GloVe build directory and the sense2vec scripts directory are required as inputs. If you used the default arguments so far, the invocation should look something like this:

./04_sense2vec_train.sh ../../GloVe/build/ ../../sense2vec/scripts/ ../data/sense2vec_train/
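Once training finishes, you can sanity-check the exported vectors with the sense2vec v1 API. A sketch, assuming the export lands in a Sense2Vec-loadable directory under the output path passed above (the exact subdirectory depends on the sense2vec export scripts):

```python
from sense2vec import Sense2Vec

# Load the exported vectors; adjust the path to wherever the export
# script wrote its output.
s2v = Sense2Vec().from_disk("../data/sense2vec_train/")

query = "machine_learning|NOUN"  # keys use the word|SENSE format
if query in s2v:
    for key, score in s2v.most_similar(query, n=5):
        print(f"{score:.3f}  {key}")
```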

Step 5: Connect Keywords Based on Word Vector Similarity

Uses the word vectors generated in step 4 and the keywords extracted in step 3 to output a nodes file and an edges file. The nodes are the keywords that have word vectors. An edge exists between two nodes if the cosine similarity of their word vectors is greater than the given threshold.

Outputs an edges.csv and a words.csv file.

Can be run with default arguments. For more information run the script with --help.

python ./05_initial_ontology.py

This step outputs the words.csv and edges.csv files needed for the visualization step.

The files should look something like the examples below.

words.csv

id word v0 v1 v2 v3 ... v127
0 computer_science 0.23 0.123 -0.32 0.832 ... 0.044123
1 machine_learning 0.98 0.123 0.45 -0.32 ... 0.132
...

edges.csv

source target similarity
0 1 0.6
...
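As a reference for what the script computes, here is a vectorized sketch that derives edges.csv from words.csv; it assumes the comma-separated layout shown above and uses the 0.6 threshold of our final graph:

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.6  # cosine similarity cutoff; our final graph used 0.6

# Assumes words.csv is comma-separated with the columns shown above.
words = pd.read_csv("words.csv")
vec_cols = [c for c in words.columns if c.startswith("v")]
normed = words[vec_cols].to_numpy()
normed = normed / np.linalg.norm(normed, axis=1, keepdims=True)

# With unit-length rows, a dot product is exactly cosine similarity.
sims = normed @ normed.T

# Keep each pair once: upper triangle, excluding the diagonal.
src, tgt = np.where(np.triu(sims, k=1) > THRESHOLD)
edges = pd.DataFrame({
    "source": words["id"].to_numpy()[src],
    "target": words["id"].to_numpy()[tgt],
    "similarity": sims[src, tgt],
})
edges.to_csv("edges.csv", index=False)
```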

Step 6: Visualize/Modify

Visualize the words and edges in a graph. Edges can be added and removed as you see fit.

The visualization step requires a words.csv and an edges.csv file in the data/ directory. These should be present if you followed the previous steps with default arguments. We also include demo words.csv and edges.csv files (these files will be overwritten if you run steps 1-5).

Additionally, our final graph is located in data/graph_0.6_threshold. Copying those CSV files into the data/ directory will allow the visualization to display them.
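For example, assuming the default repository layout:

cp data/graph_0.6_threshold/*.csv data/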

The easiest way to run the visualization is with a local Python web server launched from the root of this git repo.

For example

python -m http.server 8000

Finally, navigate to http://localhost:8000/ui/ontology_graph.html in your browser.
