interpretability_aaai2020

This repo contains the code used for the experiments of the AAAI 2020 paper titled "Understanding the semantic content of sparse word embeddings using a commonsense knowledge base".

Steps to reproduce our experiments

Obtaining dense embeddings

The first step is to obtain dense embeddings for later analysis. You can either provide your own in-house trained embeddings or use any of the pre-trained ones. We did the latter in our experiments, i.e. we analysed the pre-trained GloVe embeddings. More precisely, our paper reports results based on the 300-dimensional GloVe-6B vectors that were trained over 6 billion tokens originating from the corpora of Wikipedia 2014 and Gigaword 5.
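For reference, here is a minimal sketch of loading such a GloVe text file into a NumPy matrix (the file name below is illustrative; the vectors themselves can be downloaded from the Stanford NLP GloVe page):

import numpy as np

def load_glove(path):
    # Each line of a GloVe text file reads: word v1 v2 ... v300
    words, vectors = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vectors)

words, X = load_glove('glove.6B.300d.txt')  # illustrative path
print(X.shape)  # expected: (400000, 300)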

Requirements

Python 3.6

Install

Install the latest code from GitHub.

git clone git@github.com:begab/interpretability_aaai2020.git
cd interpretability_aaai2020
pip install -r requirements.txt

Deriving sparse representations based on the dense embeddings

We relied on three different approaches for deriving sparse word representations in the paper. The approaches are dubbed DLSC, k-means and GMPO. For more details, please refer to the paper.
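To give an intuition for the first of these approaches, below is a minimal dictionary-learning sketch based on scikit-learn's MiniBatchDictionaryLearning. It only approximates what DLSC does; the actual implementation used in the repo is sparse_coding.py, invoked further below.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Stand-in for the real dense embedding matrix (one word vector per row)
X = np.random.randn(1000, 300)

# Factorize X into sparse codes and a learned dictionary, i.e.
# X ~ codes @ dictionary, with an l1 penalty (alpha) enforcing sparsity
dl = MiniBatchDictionaryLearning(n_components=1500, alpha=0.5, random_state=0)
codes = dl.fit_transform(X)   # rows are the sparse word representations
print((codes != 0).mean())    # fraction of non-zero coefficients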

The sparse word representations that we computed can be accessed from the following folder. Note that we created further embeddings besides the ones referenced in our article.

The file names have the following structure: glove300d_l_0.[1-5]_{DLSC,kmeans,GMPO}_top400000.emb.gz, where the numeric part corresponds to the regularization coefficient and the middle token to the sparse coding approach.

New sparse word representations can also be computed by invoking the following command:

cd src
python sparse_coding/sparse_coding.py <dense_embeddings_location> <sparse_output_file> <output_dimensions> <regularization_coefficient> <approach>
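For instance (all argument values below are purely illustrative, not the exact settings from the paper):

python sparse_coding/sparse_coding.py ../data/glove.6B.300d.txt glove300d_DLSC.emb.gz 1500 0.5 DLSC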

Preprocess sparse embedding and knowledge base

To preprocess a gzipped sparse embedding:

cd src
python preprocess/preprocess_sparse_embedding.py --embedding <path_to_gzipped_embedding>

This will result in an npz format embedding matrix and word index files in the data directory.
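The exact file names depend on the input embedding; a quick way to inspect the produced matrix (the path below is hypothetical, check the data directory for the actual name):

import numpy as np

with np.load('data/sparse_embedding.npz', allow_pickle=True) as npz:
    print(npz.files)  # names of the stored arrays; read one via npz[name]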

To preprocess a knowledge base (and its assertions), first download the assertion files of the knowledge base, then run:

cd src
python preprocess/preprocess_cnet.py --kb <path_to_json_assertion>

This will result in pickled concept index files and word-concept dictionaries in the data directory.
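For orientation, a ConceptNet 5 style assertion (edge) has roughly the following shape; the fields shown follow the public ConceptNet format and only illustrate what the script consumes:

import json

edge = json.loads('''{
  "rel":   {"label": "IsA"},
  "start": {"label": "dog", "language": "en"},
  "end":   {"label": "animal", "language": "en"},
  "weight": 2.0
}''')
print(edge['start']['label'], edge['rel']['label'], edge['end']['label'])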

Assign knowledge base concepts to sparse embedding dimensions

Make sure the knowledge base (and its assertions) and the sparse embedding have been preprocessed.

cd src

Filter the word embedding matrix so that it only contains words that are also present in the vocabulary of the knowledge base:

python sparse_alignments/filter_embedding.py --embedding <path_to_npz_embedding> --vocabulary <path_to_pickled_knowledge_base_vocabulary>

Then, generate a word-concept matrix:

python sparse_alignments/word_concept_matrix.py --embedding <path_to_filtered_npz_embedding>
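Conceptually, the word-concept matrix is a binary words-by-concepts incidence matrix. A toy sketch (the words, concepts and pairs below are made up):

import numpy as np
from scipy import sparse

words = ['dog', 'cat', 'car']
concepts = ['animal', 'vehicle']
pairs = [('dog', 'animal'), ('cat', 'animal'), ('car', 'vehicle')]

# M[i, j] = 1 iff the knowledge base links word i to concept j
M = sparse.lil_matrix((len(words), len(concepts)), dtype=np.int8)
for w, c in pairs:
    M[words.index(w), concepts.index(c)] = 1
print(M.toarray())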

For the TAB and TAC evaluations, create train and test splits for each concept:

python sparse_alignments/train_test_split.py --embedding <path_to_filtered_npz_embedding>

Compute meta-concepts based on the word-concept matrix:

python sparse_alignments/meta_concepts.py --word-concept <path_to_word_concept_matrix>

Run the alignment based on NPPMI:

python sparse_alignments/alignment_NPPMI.py --embedding <path_to_filtered_npz_embedding>

The results can be found in the results/nppmi folder. The alignment needed for evaluation will be in the results/nppmi/max_concepts folder.
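NPPMI stands for normalized positive pointwise mutual information. Assuming the standard definition (normalized PMI clipped at zero), where x can be read as "the word has a non-zero coefficient along dimension i" and y as "the word is tagged with concept c", a minimal sketch:

import numpy as np

def nppmi(p_xy, p_x, p_y):
    # NPMI rescales PMI into [-1, 1]; clipping at 0 yields NPPMI
    if p_xy == 0.0:
        return 0.0
    return max(np.log(p_xy / (p_x * p_y)) / -np.log(p_xy), 0.0)

print(nppmi(0.05, 0.1, 0.2))  # ~0.31 for these made-up probabilities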

Evaluate alignments

Input pickled files are in the results/nppmi/max_concepts folder.

python sparse_alignments/evaluation.py --alignment <path_to_pickled_alignment>
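To peek at one of the pickled alignments (the file name below is hypothetical; list the folder for the actual names):

import pickle

with open('results/nppmi/max_concepts/alignment.p', 'rb') as f:  # hypothetical name
    alignment = pickle.load(f)
print(type(alignment))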
