The Word Retrofitting Tool is a Jupyter Notebook-based program that performs retrofitting on word vectors using a lexicon of word relations. Retrofitting aims to improve the quality and semantic coherence of word vectors by incorporating information from the lexicon.
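To make the idea concrete, here is a minimal sketch of the commonly used retrofitting update (Faruqui et al., 2015), in which each word vector is repeatedly replaced by the average of its original vector and the mean of its lexicon neighbours. Whether the notebooks use exactly this variant is an assumption; the function name `retrofit`, the dictionary-based inputs, and the default of 10 iterations are illustrative and not taken from the tool.

```python
import numpy as np


def retrofit(vectors, lexicon, n_iters=10):
    """Return retrofitted copies of `vectors` (hypothetical helper).

    vectors: dict mapping word -> 1-D sequence of floats
    lexicon: dict mapping word -> iterable of related words
    """
    original = {w: np.asarray(v, dtype=float) for w, v in vectors.items()}
    new = {w: v.copy() for w, v in original.items()}

    for _ in range(n_iters):
        for word, related in lexicon.items():
            if word not in new:
                continue  # lexicon entry without a vector
            neighbours = [n for n in related if n in new and n != word]
            if not neighbours:
                continue  # nothing to pull this word towards
            k = len(neighbours)
            # Weight 1 on the original vector and 1/k on each neighbour:
            # new vector = (original + mean of neighbours) / 2
            neighbour_sum = sum(new[n] for n in neighbours)
            new[word] = (k * original[word] + neighbour_sum) / (2 * k)
    return new


# Toy example: "chat" and "felin" are related, "voiture" is not in the lexicon.
toy_vectors = {"chat": [1.0, 0.0], "felin": [0.0, 1.0], "voiture": [5.0, 5.0]}
toy_lexicon = {"chat": ["felin"], "felin": ["chat"]}
toy_result = retrofit(toy_vectors, toy_lexicon)
```

Words that do not appear in the lexicon keep their original vectors, and the input dictionary is left unmodified.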
- Python 3.x
- Required Python packages: numpy, pandas
In order to run the Word Retrofitting Tool, make sure you have the following models and libraries downloaded:
- Download the `vecs100-linear-frwiki.zip` or `vecs50-linear-frwiki.bz2` file for French word vectors trained on a Wikipedia dump.
  - Training corpus: a dump of Wikipedia (`frwiki-20140804-corpus.xml.bz2`).
  - Preprocessing:
    - Use `xml2txt.pl` to remove XML tags.
    - Tokenize the text using `sxpipe-light` (e.g., `perl ~/installation/melt-2.0b7/sxpipe-melt/segmenteur.pl < frwiki_raw.txt > frwiki_tokenized.txt`).
  - The `vecs100` or `vecs50` word vectors are trained with the following command: `./word2vec -train frwiki_tokenized.txt -output vecs50 -threads 2 -min-count 100 -cbow 0 -negative 10`
- Alternatively, you can use other word embeddings such as the `cc.fr.300.vec` embeddings. Refer to the fastText documentation for more details.
- Use WordNet for semantic relations.
- For lexical similarity evaluation, use the `rg65_french.txt` file.
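Embeddings such as `cc.fr.300.vec` are distributed as plain text: an optional `<count> <dimension>` header line followed by one word and its vector components per line. The sketch below shows one way to read such a file with numpy. The function name `load_text_vectors` and its `limit` parameter are hypothetical, and the notebooks may load vectors differently.

```python
import io

import numpy as np


def load_text_vectors(handle, limit=None):
    """Read word vectors in word2vec/fastText text format (hypothetical helper).

    handle: an open text file (or any iterable file-like object)
    limit:  optional maximum number of words to keep
    """
    vectors = {}
    first = handle.readline().split()
    if not first:
        return vectors  # empty file

    if len(first) == 2 and all(tok.isdigit() for tok in first):
        # Header line "<count> <dim>": only the dimension is needed here.
        dim = int(first[1])
    else:
        # No header: the first line is already a word vector.
        dim = len(first) - 1
        vectors[first[0]] = np.array([float(x) for x in first[1:]])

    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.split()  # also tolerates trailing spaces
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors


# Small in-memory sample standing in for a real .vec file.
sample = io.StringIO("2 3\nchat 0.1 0.2 0.3 \nchien 0.4 0.5 0.6 \n")
sample_vectors = load_text_vectors(sample)
```

For a real file, the same function could be called on `open("cc.fr.300.vec", encoding="utf-8")`, with `limit` set to keep memory use manageable.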
- Download the `vectors_datatxt_250_sg_w10_i5_c500_gensim_clean.tar.bz2` file for English word vectors.
  - The word vectors are trained using the `gensim` library with the following configuration: `gensim.models.Word2Vec(size=250, min_count=500, window=8, sample=1e-3, workers=8, sg=1, hs=0, negative=10, iter=5)`
- Use WordNet for semantic relations.
- For lexical similarity evaluation, use the `ws353.txt` file.
- For sentiment analysis evaluation, use the `stanford_sentiment_analysis.tar.gz` file from Stanford NLP.
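Lexical similarity datasets such as `rg65_french.txt` and `ws353.txt` pair words with human similarity ratings, and a common way to score word vectors against them is the Spearman rank correlation between those ratings and the cosine similarity of the vectors. The sketch below assumes this is the metric of interest. The function names are illustrative, and parsing of the dataset files is left out because their column layout is not described here.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def evaluate_similarity(vectors, scored_pairs):
    """Return (Spearman rho, number of pairs covered by `vectors`).

    scored_pairs: iterable of (word1, word2, human_score) tuples
    """
    gold, predicted = [], []
    for w1, w2, score in scored_pairs:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            gold.append(float(score))
            predicted.append(cosine(vectors[w1], vectors[w2]))
    if len(gold) < 2:
        raise ValueError("need at least two covered pairs")
    # Spearman correlation is the Pearson correlation of the ranks.
    rho = np.corrcoef(average_ranks(gold), average_ranks(predicted))[0, 1]
    return float(rho), len(gold)


# Toy data: the human scores below are invented for illustration only.
demo_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
demo_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "zzz", 3.0)]
demo_rho, demo_covered = evaluate_similarity(demo_vectors, demo_pairs)
```

Running the same evaluation on the original and the retrofitted vectors gives a direct before/after comparison, provided both are scored on the same set of covered pairs.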
- Open the Jupyter notebook files.
- Run the program by clicking the “Run” button. The program then performs retrofitting on the word vectors using the predetermined parameters and lexicon.
- Once the process is completed, the retrofitted word vectors will be saved in the specified output file.
- Ensure that the input file, lexicon file, and output file paths are correctly specified and accessible.
- The program may take some time to complete retrofitting, depending on the size of the word vectors and the lexicon.
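Since a wrong path typically only surfaces after the notebook has already started loading data, it can help to check the paths up front. The following is a minimal sketch of such a check. The function name `check_paths` and its three arguments are hypothetical and do not correspond to variables in the notebooks.

```python
import os
from pathlib import Path


def check_paths(input_path, lexicon_path, output_path):
    """Return a list of problems found; an empty list means all paths look usable."""
    problems = []

    # The word-vector file and the lexicon file must exist and be readable.
    for label, path in (("input", input_path), ("lexicon", lexicon_path)):
        path = Path(path)
        if not path.is_file():
            problems.append(f"{label} file not found: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"{label} file is not readable: {path}")

    # The output file need not exist yet, but its directory must be writable.
    out_dir = Path(output_path).resolve().parent
    if not out_dir.is_dir():
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")

    return problems
```

Calling `check_paths(...)` in the first notebook cell and printing the returned list gives immediate feedback before the longer retrofitting step begins.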
- Nina alias NinaNusb
- Deeksha alias deekode99
- Iness alias IAiness