https://arxiv.org/pdf/2006.09526.pdf
CRISS is a multilingual sequence-to-sequence pretraining method in which mining and training are applied iteratively, improving cross-lingual alignment and translation ability at the same time.
- faiss: https://github.com/facebookresearch/faiss
- mosesdecoder: https://github.com/moses-smt/mosesdecoder
- flores: https://github.com/facebookresearch/flores
- LASER: https://github.com/facebookresearch/LASER
cd examples/criss
wget https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz
tar -xf criss_3rd_checkpoints.tar.gz
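The download and extraction steps must refer to the same file name. A minimal sketch of one way to keep them in sync is to derive the archive name from the URL. The `archive_name` helper and the `CRISS_URL` and `ARCHIVE` variables are hypothetical names introduced for illustration; they are not part of the CRISS scripts. The sketch only prints the commands it would run.

```shell
# Hypothetical helper (not part of the CRISS scripts): derive the local
# archive name from the download URL so both commands use the same file.
CRISS_URL="https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz"

archive_name() {
  # Strip everything up to and including the last "/" in the URL.
  printf '%s\n' "${1##*/}"
}

ARCHIVE="$(archive_name "$CRISS_URL")"

# Print the commands instead of running them; drop the echo to execute.
echo "wget $CRISS_URL && tar -xf $ARCHIVE"
```

Because the archive name is computed from the URL, changing the URL to a different checkpoint release would update the `tar` argument as well.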
Make sure to run all scripts from the examples/criss directory.
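The working-directory requirement can be checked before launching any script. The following is a sketch under the assumption that the repository is checked out so that the path ends in `examples/criss`; the `in_criss_dir` function is a hypothetical guard and is not shipped with the CRISS scripts.

```shell
# Hypothetical guard (not part of the CRISS scripts): succeed only when the
# given path (default: the current directory) ends in examples/criss.
in_criss_dir() {
  case "${1:-$PWD}" in
    */examples/criss|*/examples/criss/) return 0 ;;
    *) return 1 ;;
  esac
}

# Warn on stderr if the current directory is not examples/criss.
if ! in_criss_dir; then
  echo "warning: run the CRISS scripts from examples/criss" >&2
fi
```

The check only inspects the path string, so it does not verify that the scripts themselves are present in that directory.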
bash download_and_preprocess_flores_test.sh
bash unsupervised_mt/eval.sh
bash download_and_preprocess_tatoeba.sh
bash sentence_retrieval/sentence_retrieval_tatoeba.sh
Follow the instructions at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md to install faiss.
bash mining/mine_example.sh
@article{tran2020cross,
title={Cross-lingual retrieval for iterative self-supervised training},
author={Tran, Chau and Tang, Yuqing and Li, Xian and Gu, Jiatao},
journal={arXiv preprint arXiv:2006.09526},
year={2020}
}