Skip to content

Latest commit

 

History

History
32 lines (18 loc) · 1.09 KB

README.md

File metadata and controls

32 lines (18 loc) · 1.09 KB

Sense Embeddings

The goal of this project is to construct and train the fastText algorithm to create a word embedding based on the paper 'Enriching Word Vectors with Subword Information', which was developed by Facebook's AI Research (FAIR) lab.

The dataset used for the training was the EuroSense dataset, which is a multilingual sense-annotated resource in 21 languages, however only the English language was used for this task.

For the correlation evaluation, the dataset WordSimilarity-353 is used.

The training was done using a Google Compute Engine instance running a Tesla K80 GPU.


Dimensionality reduction of 40 words with the highest number of samples

Instructions

  • Generate dictionary

python preprocess.py [resource_folder] [file_name]

  • Train

python train.py [file_name]

  • Score

python train.py [resource_folder] [gold_file] [model_name]

  • Plot PCA

python pca.py [resource_folder] [filtered_vec_name]