Tool for aligning word vectors and inducing bilingual lexicons.
Code and data for the following paper: Classification-Based Self-Learning for Weakly Supervised Bilingual Lexicon Induction
If you use this code, please cite the paper:
@inproceedings{karan2020classification,
title={Classification-based self-learning for weakly supervised bilingual lexicon induction},
author={Karan, Mladen and Vuli{\'c}, Ivan and Korhonen, Anna and Glava{\v{s}}, Goran},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
pages={6915--6922},
year={2020}
}
We also recommend citing the original VecMap paper, on which this code heavily relies.
@inproceedings{artetxe2018acl,
author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
title = {A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2018},
pages = {789--798}
}
NOTE: after checking out the repo, we recommend using a Python virtual environment; testing was done with Anaconda:
conda create -n "cm-env" python=3.6.9
conda activate "cm-env"
pip install -r src/requirements.txt
NOTE: the following steps can be applied to any language pair; as a working example we will use English and Croatian.
NOTE: we recommend running steps 2 and 3 with CUDA; without it they are both somewhat slower and not as well tested.
Run the ./embeddings/dl-embeddings-all.sh script to download all embeddings used in the paper. Alternatively, download only the embedding matrices needed for the example below manually (we used those from https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/ in text format):
English → https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
Croatian → https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.hr.vec
Transform each matrix file to keep only the top 200,000 most frequent words (the word count in the first line of the matrix file must be updated to match). Provided the words in the matrices are sorted by descending frequency, you can do this with:
head -n 200000 wiki.hr.vec | sed "1s/.*/199999 300/" > wiki.200k.hr.vec
head -n 200000 wiki.en.vec | sed "1s/.*/199999 300/" > wiki.200k.en.vec
NOTE: This filtering step is not mandatory but it should speed things up without significant impact on quality.
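Equivalently, the truncation can be done with a short Python snippet that rewrites the header automatically; this is an illustrative sketch, not a script shipped with the repository:

def truncate_vec(in_path, out_path, max_words=200000):
    # Assumes the input .vec file contains at least max_words word vectors,
    # sorted by descending frequency (true for the fastText wiki vectors).
    # Keeps exactly max_words vectors; the head/sed variant above keeps 199,999
    # because head also counts the header line.
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        _, dim = fin.readline().split()        # original header: "<word_count> <dimension>"
        fout.write(f"{max_words} {dim}\n")     # header now matches what is kept
        for i, line in enumerate(fin):
            if i >= max_words:
                break
            fout.write(line)

truncate_vec("wiki.en.vec", "wiki.200k.en.vec")
truncate_vec("wiki.hr.vec", "wiki.200k.hr.vec")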
NOTE: if your source and target languages use different scripts, you must transliterate the words in both the embedding matrices and the train/test files into the same script (most likely Latin) for both languages before running subsequent steps. The scripts translit-matrix.py and translit-dict.py in the src folder do this; use them as:
python translit-matrix.py src_matrix.vec src_lang_code output_matrix.vec
python translit-dict.py src_dict.tsv src_lang_code trg_lang_code output_dict.tsv WHICH
(WHICH can be "SRC", "TRG", or "BOTH", depending on which parts of the dictionary need transliteration)
for example:
python translit-matrix.py wiki.200k.ru.vec ru wiki.200k.ru.latin.vec
python translit-dict.py ../data/en-ru/yacle.train.freq.500.en-ru.tsv en ru ../data/en-ru/yacle.train.freq.500.en-ru.transliterated.tsv TRG
NOTE: Transliteration is based on the https://pypi.org/project/transliterate/ library and will work for most Cyrillic scripts and for Greek; other scripts are currently not supported.
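For reference, transliteration with that library boils down to a single call; a minimal word-level sketch (the repository scripts additionally handle the .vec and .tsv file formats):

from transliterate import translit

# reversed=True converts *from* the given script (here Russian Cyrillic) *to* Latin.
print(translit("привет", "ru", reversed=True))  # -> "privet"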
python classymap.py --train_dict trainfile.tsv --in_src in_src_embeddings.vec --in_tar in_tar_embeddings.vec --src_lid "en" --tar_lid "hr" --out_src out_src_embeddings.vec --out_tar out_tar_embeddings.vec --model_filename modelfile.pickle --idstring experiment_id
for example:
python classymap.py --train_dict "../data/en-hr/yacle.train.freq.500.en-hr.tsv" --in_src "../embeddings/wiki.200k.en.vec" --in_tar "../embeddings/wiki.200k.hr.vec" --src_lid "en" --tar_lid "hr" --out_src "../embeddings/wiki.200k.en-hr-aligned.EN.vec" --out_tar "../embeddings/wiki.200k.en-hr-aligned.HR.vec" --model_filename "./models/ENHR-model.pickle" --idstring ENHRFASTTEXT
Note: type python classymap.py --help for many more options; the default values for all of them should give reasonable performance.
Note: idstring is an arbitrary string that will be used for caching. To avoid caching problems, we recommend using the same idstring for experiments using the same pair of initial embedding matrices and (more importantly) a different idstring for experiments using different pairs of initial embedding matrices.
python eval.py src.aligned.embeddings.vec trg.aligned.embeddings.vec -d dictionary.tsv --cuda
for example:
python eval.py "../embeddings/wiki.200k.en-hr-aligned.EN.vec" "../embeddings/wiki.200k.en-hr-aligned.HR.vec" -d ../data/en-hr/yacle.test.freq.2k.en-hr.tsv --cuda
add "--super --model model_path --src_lid sl --tar_lid tl --idstring id"
to all the above command
model_path
is the model file you generated in the step 2
src_lid
, tar_lid
, and idstring
should be the same you provided in step 2
An example of expanding the command from step 3.1 is as follows:
python eval.py "../embeddings/wiki.200k.en-hr-aligned.EN.vec" "../embeddings/wiki.200k.en-hr-aligned.HR.vec" -d ../data/en-hr/yacle.test.freq.2k.en-hr.tsv --cuda --super --model "./models/ENHR-model.pickle" --src_lid "en" --tar_lid "hr" --idstring ENHRFASTTEXT
Note: Regardless of whether you use reranking, you can add "--output_file filename.txt" to dump the generated candidates into an output file with the following format:
src_word_1 --> first_target_candidate second_target_candidate ....
src_word_2 --> first_target_candidate second_target_candidate ....
...
The candidates are tab separated.
You can specify "--output_top_K K" to limit the number of candidates, e.g. "--output_top_K 500" to limit each line in the file to the top 500 candidates.
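If you want to post-process such a dump, it can be loaded with a few lines of Python; this sketch assumes each line has the form shown above (a " --> " marker followed by tab-separated candidates):

def read_candidates(path):
    # Parse an "--output_file" dump into {source_word: [candidate, ...]},
    # assuming each line looks like: src_word --> cand_1<TAB>cand_2<TAB>...
    candidates = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, _, rest = line.rstrip("\n").partition(" --> ")
            candidates[src] = [c for c in rest.split("\t") if c]
    return candidates

cands = read_candidates("filename.txt")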