Official implementation of the paper "In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation", with code, scripts and outputs.
The requirements include PyTorch, which should be installed according to the type of GPU available on your device. Once it is installed, we recommend cloning the repository in order to use the code:
git clone https://github.com/ArmelRandy/ICL-MT
cd ICL-MT
pip install -r requirements.txt
In this paper, we study the impact of carefully chosen in-context demonstrations on the quality of MT outputs. We work with decoder-based LLMs and study multiple similarity metrics and their impact on translation from English into other languages. Given a pool of parallel examples, the demonstrations most similar to the sentence to translate are selected as in-context examples.
We consider sentence-embedding (SE) based similarity metrics (cosine similarity between sentence representations): SONAR, Cohere (Embed V3), LaBSE, LASER2 and E5. Additionally, we include BLOOM 7B1 based embeddings obtained by taking the token-wise average hidden state or the last token's hidden state at the first, the middle or the last layer.
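To make the SE-based selection concrete, here is a minimal sketch (not the repository's exact code) using LaBSE through the `sentence-transformers` package: embed the pool and the query sentence, then rank pool entries by cosine similarity.

```python
# Minimal sketch of SE-based example selection, assuming the sentence-transformers
# package; the repository's own retrieval code may differ.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

pool = ["The cat sleeps on the mat.", "I live in Paris.", "The weather is nice today."]
query = "I reside in Berlin."

# With normalized embeddings, cosine similarity reduces to a dot product.
pool_emb = model.encode(pool, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]

scores = pool_emb @ query_emb
top_k = np.argsort(-scores)[:2]  # indices of the 2 most similar pool sentences
print([pool[i] for i in top_k])
```

The decoder-based variant can be sketched in the same spirit; below we use a smaller BLOOM checkpoint as a lighter stand-in for BLOOM 7B1 and read the hidden states off a chosen layer.

```python
# Hedged sketch of decoder-based embeddings: token-wise average or last-token
# hidden state at a given layer (bigscience/bloom-560m as a smaller stand-in).
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModel.from_pretrained("bigscience/bloom-560m")

inputs = tok("I live in Paris.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer = len(out.hidden_states) // 2               # e.g. the middle layer
mean_emb = out.hidden_states[layer].mean(dim=1)   # token-wise average hidden state
last_emb = out.hidden_states[layer][:, -1]        # last token's hidden state
```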
In our experiments, we worked with FLORES-200, using the `devtest` split for evaluation and the `dev` split as the selection pool. The `dev` vectors for each embedding method mentioned above are pre-computed for English (eng), French (fra), German (deu), Swahili (swh) and Wolof (wol) and can be found in `data/flores` (`dev.bin` and `devtest.bin`). They were obtained with the code in `miscellaneous/embedding.py`. Feel free to add any other sentence embedding method; in that case, don't forget to add its name to `SUPPORTED_EMBEDDINGS` (in `utils.py`) and to follow the same naming convention as the existing ones.
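If you add a method, the precompute step can be sketched as follows; note that the actual `.bin` layout and naming convention are those of `miscellaneous/embedding.py`, and the raw float32 dump below is only a plausible stand-in.

```python
# Hypothetical sketch of precomputing and storing pool embeddings; follow
# miscellaneous/embedding.py for the repository's actual serialization.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
dev_sentences = ["I live in Paris.", "The weather is nice today."]  # stand-in for FLORES-200 dev

vectors = model.encode(dev_sentences, normalize_embeddings=True).astype(np.float32)
vectors.tofile("dev.bin")  # raw float32 dump (assumed format)

# Reading back requires knowing the embedding dimension.
loaded = np.fromfile("dev.bin", dtype=np.float32).reshape(-1, vectors.shape[1])
```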
The code also supports other similarity metrics such as RoBERTa-based similarity, BLEU, ROUGE, `bm25`, `RBM25`, LCS POS (`Pos`), BLEU POS (`BLEU_Pos`), etc. However, these techniques are only implemented in an `s2s` fashion.
It is also possible to jointly use a sentence embedding method with BM25 via the strategy `{name}+Bm25`; in this setting, the examples retrieved by the sentence embedding method and by BM25 are merged. The `Random` baseline is also an option.
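For the lexical side, a BM25 retrieval step can be sketched with the `rank_bm25` package (an assumption; the repository may implement BM25 differently):

```python
# Hedged sketch of BM25-based example selection using rank_bm25.
from rank_bm25 import BM25Okapi

pool = ["the cat sleeps on the mat", "i live in paris", "the weather is nice today"]
bm25 = BM25Okapi([s.split() for s in pool])  # whitespace tokenization for simplicity

query = "i reside in paris"
scores = bm25.get_scores(query.split())
best = max(range(len(pool)), key=lambda i: scores[i])
print(pool[best])  # expected: "i live in paris"
```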
In our experiments on scaling the pool, we worked with two datasets, one for French and the other for Swahili. We embedded them with SONAR and stored them in `data/flores`. In `main.py`, you can use the option `augment_pool` to add an extension to the FLORES-200 `dev` set. `pool_name` is the name of the `.bin` file containing the vector representations of the extension's sentences (required for SE-based retrieval). `pool_dataset_name_or_path` is the path to the dataset on Hugging Face (it should contain two columns, each named after the code of the language in which its content is written, e.g. `en` for English, `fr` for French, `sw` for Swahili). `pool_size` is the total size of `dev` + pool extension.
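As an illustration, the Swahili pool-extension dataset used later in this README can be inspected as follows (the `train` split name is an assumption):

```python
# Sketch of loading a pool-extension dataset; columns are named after language codes.
from datasets import load_dataset

ds = load_dataset("ArmelRandy/nllb_en_sw_20K", split="train")  # split name assumed
print(ds.column_names)          # expected: ['en', 'sw']
print(ds[0]["en"], "->", ds[0]["sw"])
```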
The implementation supports Hugging Face's naming convention for LLMs. It uses `generate` (though it might be pretty slow) and vLLM (recommended) via the argument `--use_vllm`. We have a list of supported templates in `templates.py`, but we recommend using template 11:
English Sentence
I live in Paris
French Translation
J'habite à Paris
###
English Sentence
<sentence to translate>
French Translation
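For clarity, here is our reading of how template 11 assembles a k-shot prompt; see `templates.py` for the exact implementation (the helper below is purely illustrative).

```python
# Hypothetical helper mirroring the template-11 layout shown above.
def build_prompt(examples, source, src_lang="English", tgt_lang="French"):
    """examples: list of (source_sentence, target_sentence) demonstration pairs."""
    blocks = [
        f"{src_lang} Sentence\n{src}\n{tgt_lang} Translation\n{tgt}"
        for src, tgt in examples
    ]
    # The final block leaves the translation empty for the model to complete.
    blocks.append(f"{src_lang} Sentence\n{source}\n{tgt_lang} Translation\n")
    return "\n###\n".join(blocks)

print(build_prompt([("I live in Paris", "J'habite à Paris")], "The weather is nice"))
```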
We evaluate the translations using a separate code for each metric: COMET-22, BLEU and chrF++. This work introduces Language-Aware COMET (laCOMET), a variant of COMET that is robust to empty translations and to translations in the wrong target language: it sets the COMET score of such translations to zero. We use FastText for language identification (LID). The evaluation script for both laCOMET and COMET is `comet_eval.py`. You can of course evaluate your generations with classic metrics such as BLEU and chrF++, both supported by sacrebleu.
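A minimal sacrebleu snippet for both metrics (chrF++ being chrF with `word_order=2`):

```python
# Scoring generations with sacrebleu: corpus BLEU and chrF++.
import sacrebleu

hypotheses = ["Ninaishi Paris."]
references = [["Ninaishi Paris."]]  # one reference stream covering all hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")
```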
A quick usage of this codebase involves translating from English to Swahili with gemma-2b, in 10-shot. We can do it in `s2s` (or in `mix`, by adding e.g. `--alpha 0.5`) with few-shot examples selected via SONAR embeddings. From the `ICL-MT` directory, you can use the following command.
torchrun \
--rdzv-backend=c10d\
--rdzv-endpoint=localhost:29500\
main.py \
--model_name_or_path google/gemma-2b\
--tokenizer_name_or_path google/gemma-2b\
--data_path ./data/flores\
--k 10\
--src English\
--tgt Swahili\
--template_key 11\
--seed 122\
--request_batch_size 64\
--num_beams 2\
--max_new_tokens 100\
--temperature 0.0\
--top_p 1.0\
--repetition_penalty 1.0\
--max_samples 1012\
--strategy SONAR\
--output_path ./out/gemma-2b/SONAR\
--format s2s\
--use_vllm
You can use other launchers such as `accelerate launch`, `python -m torch.distributed.launch`, plain `python`, etc.
`--max_samples` indicates the number of `devtest` sentences to translate; it is useful for debugging or for quickly getting some results. `--data_path` is the path to the pre-computed representations of the sentences. This code will create a folder named after the translation direction in `--output_path` (`Eng_to_Swh` in this case), which will contain the JSON file with the translations.
We can evaluate the translations with Language-Aware COMET using the following command.
torchrun \
--rdzv-backend=c10d\
--rdzv-endpoint=localhost:29500\
comet_eval.py \
--model_name_or_path Unbabel/wmt22-comet-da\
--batch_size 64\
--language_identifier_name_or_path facebook/fasttext-language-identification\
--data_dir ./out/gemma-2b\
--output_dir ./out/gemma-2b/comet_la_scores\
--empty
If you do not specify `--language_identifier_name_or_path`, it will compute vanilla COMET-22. You can use `eval.ipynb` to get the BLEU and chrF++ scores of your generations; `sb.ipynb` provides the same services with many more visualization utilities.
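The laCOMET logic itself is simple to sketch: score with COMET as usual, then zero out translations that are empty or whose detected language is not the target. The loading pattern below for `facebook/fasttext-language-identification` is a common one, not necessarily the repository's exact code.

```python
# Hedged sketch of the language-aware filtering step behind laCOMET.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("facebook/fasttext-language-identification", "model.bin")
lid = fasttext.load_model(model_path)

def la_comet(translation, comet_score, target_label="__label__swh_Latn"):
    if not translation.strip():
        return 0.0  # empty translation gets a zero score
    labels, _ = lid.predict(translation.replace("\n", " "))
    return comet_score if labels[0] == target_label else 0.0
```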
For an LLM of interest, you can replicate our experiments by simply using bash variables that range over all the values we tested. For example, you can use LLaMA 3 8B, set `--output_path` to start with `./generations/llama-3-8b`, and use the following template.
MODEL_NAME_OR_PATH=$1 # e.g. meta-llama/Meta-Llama-3-8B
TOKENIZER_NAME_OR_PATH=$2 # e.g. meta-llama/Meta-Llama-3-8B
K=$3 # e.g. 1, 2, 5, 10, 20, 32 etc.
SRC=$4 # e.g. English
TGT=$5 # e.g. French, German, Swahili, Wolof
STRATEGY=$6 # e.g. SONAR, LaBSE, bm25 ...
FORMAT=$7 # e.g. s2s, s2t
PORT_ID=$8 # e.g. 29500
SEED=$9 # e.g. 122
TEMPLATE_KEY=${10} # e.g. 11
ARGS="\
--model_name_or_path $MODEL_NAME_OR_PATH\
--tokenizer_name_or_path $TOKENIZER_NAME_OR_PATH\
--data_path ./data/flores\
--k $K\
--src $SRC\
--tgt $TGT\
--template_key $TEMPLATE_KEY\
--seed $SEED\
--request_batch_size 64\
--num_beams 2\
--max_new_tokens 100\
--temperature 0.0\
--top_p 1.0\
--repetition_penalty 1.0\
--max_samples 1012\
--strategy $STRATEGY\
--output_path ./generations/llama-3-8b/$STRATEGY\
--format $FORMAT\
--use_vllm\
"
torchrun \
--rdzv-backend=c10d\
--rdzv-endpoint=localhost:$PORT_ID\
main.py \
$ARGS
The scaling experiments require adding the following arguments (for Swahili).
--augment_pool\
--pool_name pool_nllb_en_to_sw\
--pool_dataset_name_or_path ArmelRandy/nllb_en_sw_20K\
--pool_size $POOL_SIZE\
with `POOL_SIZE` in `[10, 100, 500, 997, 1997, 5997, 10997, 20997]`. We usually adapted `--output_path` to `./generations/llama-3-8b/scaling/$POOL_SIZE/$STRATEGY`.
We use Jupyter notebooks for data visualization. In `analysis.ipynb`, you can find the code to compute the overlap between the choices of different sentence embeddings; it also helps to compute the Vendi score of each pool composition (`histogram.py` derives the average perplexity). `sb.ipynb` contains the code to compare the retrieval approaches in `s2s`, `s2t`, etc. and to plot the pool diversity figure.
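Since the Vendi score may be unfamiliar, here is a self-contained sketch computed from its definition: the exponential of the Shannon entropy of the eigenvalues of K/n, where K is a similarity matrix over the pool (cosine similarity of normalized embeddings below; the notebook may use a different kernel or the `vendi-score` package).

```python
# Vendi score from its definition: exp(entropy of eigenvalues of K/n).
import numpy as np

def vendi_score(embeddings):
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = (X @ X.T) / len(X)                 # cosine-similarity kernel, scaled by n
    eigvals = np.linalg.eigvalsh(K)
    eigvals = eigvals[eigvals > 1e-12]     # drop numerically zero modes
    return float(np.exp(-(eigvals * np.log(eigvals)).sum()))

rng = np.random.default_rng(0)
print(vendi_score(rng.normal(size=(100, 16))))  # a diverse pool yields a higher score
```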
All the outputs of our experiments can be found in the `generations` folder.
This repository is an easy tool for performing few-shot machine translation with example selection via similarity search. Although all of our experiments involved translating from English into French, German, Swahili or Wolof, you can use this code to translate from English into any FLORES-200 language (see `./data/flores200.jsonl`). This is possible with template 11 and in `s2s`.
If you found this repository or its outputs useful, don't hesitate to cite us:
@misc{zebaze2024incontextexampleselectionsimilarity,
title={In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation},
author={Armel Zebaze and Benoît Sagot and Rachel Bawden},
year={2024},
eprint={2408.00397},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.00397},
}
This work was made possible by ALMAnaCH, the NLP research team of Inria Paris.