Official Code for "Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models" (To Appear in WMT 2022)
Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. In this work, we evaluate knowledge distillation's use to compress MT models focusing on languages with extremely limited training data. Through our analysis across 8 languages, we find that the variance in the performance of the distilled models due to their dependence on priors including the amount of synthetic data used for distillation, the student architecture, training hyperparameters and confidence of the teacher models, makes distillation a brittle compression mechanism. To mitigate this, we explore the use of post-training quantization for the compression of these models. Here, we find that while distillation provides gains across some low-resource languages, quantization provides more consistent performance trends for the entire range of languages, especially the lowest-resource languages in our target set.
We cover 8 languages of diverse linguistic origins, with training data ranging from 7K to 3M samples. The train-test splits for Gondi and Mundari will be released soon; test sets for all other languages are publicly available (listed in the paper).
Language | Train Data (Sentence Pairs) | Links |
---|---|---|
Bribri | ~7000 | Here |
Wixarika | ~8000 | Here |
Mundari | ~11000 | Public Link Available Soon |
Gondi | ~25000 | Here |
Assamese | ~135000 | Here |
Odia | ~1M | Here |
Punjabi | ~2.4M | Here |
Gujarati | ~3M | Here |
Each of the quantized variants is at least 3x smaller than its best-performing model, and the distilled variants are at least 6x smaller. Models and their compressed variants (for plug-and-play usage) coming soon!

| Language | Best Uncompressed Variant (spBLEU) | Best Distilled Variant (spBLEU) | Best Distilled Variant (chrF2) | Best Quantized Variant (spBLEU) | Best Quantized Variant (chrF2) |
|---|:---:|:---:|:---:|:---:|:---:|
| Bribri | 6.4 | 6.8 | 13.2 | 7.4 | 19.4 |
| Wixarika | 6.2 | 4.1 | 17.3 | 7.2 | 26.8 |
| Mundari | 15.9 | 18.2 | 32.7 | 15.7 | 29.3 |
| Gondi | 14.3 | 14.2 | 32.8 | 13.8 | 31.1 |
| Assamese | 10.7 | 9.6 | 27.4 | 6.2 | 25.7 |
| Odia | 27.4 | 20.2 | 40.7 | 21.0 | 41.3 |
| Punjabi | 38.4 | 32.8 | 46.6 | 27.0 | 48.0 |
| Gujarati | 35.9 | 29.8 | 48.6 | 28.4 | 51.4 |
The environment can be set up using the provided requirements file (requires pip > 22.0.2):
pip install -r requirements.txt
├── readme.md
├── requirements.txt
├── scripts                          # Scripts with all the variants of the commands + default hyperparameter values
│   ├── confidence_estimation.sh     # logging the confidence statistics
│   ├── inference.sh                 # inference for both architectures - online and offline graphs
│   ├── preprocess.sh                # preprocessing data for training and evaluation
│   ├── sweep.yaml                   # sweep yaml for hyperparameter trials
│   └── train.sh                     # variants of training and continued pretraining
└── src                              # src files for all the experiments
    ├── confidence_estimation.py     # logging confidence stats: average softmax entropy, standard deviation of log probabilities
    ├── continued_pretraining.py     # continued pretraining of mt5
    ├── inference.py                 # online and graph inference
    ├── preprocess.py                # preprocessing bilingual and monolingual data + vocab and tokenizer creation
    ├── split_saving.py              # generating the offline graphs for both model architectures
    ├── student_labels.py            # generating the student labels for the best model architecture for the models
    ├── train.py                     # training script for vanilla, distilled and pretrained model configuration
    └── utils.py                     # utils like script conversion, checking for deduplication
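The two confidence statistics named for confidence_estimation.py (average softmax entropy and standard deviation of log probabilities) can be illustrated with a small standard-library sketch. This is not the repository's implementation: the function names, the input layout (one list of vocabulary logits per decoding step) and the use of the population standard deviation are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_stats(step_logits, chosen_ids):
    """Hypothetical helper, not taken from the repository.

    step_logits: one list of vocabulary logits per decoding step.
    chosen_ids:  index of the token emitted at each step.
    Returns (average softmax entropy, std. dev. of chosen-token log-probs).
    """
    dists = [softmax(step) for step in step_logits]
    avg_entropy = mean(entropy(d) for d in dists)
    log_probs = [math.log(d[i]) for d, i in zip(dists, chosen_ids)]
    return avg_entropy, pstdev(log_probs)


# Toy input: one maximally uncertain step and one confident step.
uniform = [0.0, 0.0, 0.0, 0.0]
peaked = [10.0, 0.0, 0.0, 0.0]
avg_h, std_lp = confidence_stats([uniform, peaked], [0, 0])
print(f"average entropy: {avg_h:.4f}, log-prob std: {std_lp:.4f}")
```

Lower average entropy indicates a more confident teacher. The standard deviation shows how unevenly that confidence is spread across the emitted tokens.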
1. Run **preprocess.py** to convert the training data to HF format and generate the tokenizer files for the vanilla transformer.
2. Run **train.py** for training and saving the best model. (monitored metric is BLEU with mt13eval tokenizer)
3. Run **split_saving_{model_architecture_type}.py** to quantize the encoder and decoder separately.
4. Run **inference.py** (with offline = True) for offline inference on the quantized graphs.
Sample commands with default hyperparameter values are specified in scripts/.
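To make the idea behind post-training quantization concrete, here is a minimal sketch of symmetric per-tensor int8 quantization on a toy weight list. It illustrates the arithmetic only. The README does not state which quantization scheme or library the repository uses, so the scheme, the function names and the example weights are all assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization (assumed scheme): floats -> ints in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical weights standing in for one trained tensor.
weights = [0.52, -1.27, 0.003, 0.9, -0.41]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage: 4 bytes per float32 weight vs 1 byte per int8 weight plus one float32 scale.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights) + 4
print(q, scale, max_error, fp32_bytes, int8_bytes)
```

For large tensors the single stored scale is negligible, so int8 storage approaches a 4x reduction over float32 for the quantized weights. No training data is needed for this step, which is what distinguishes it from distillation.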
{
"signature": "nrefs:1|case:mixed|eff:no|tok:spm-flores|smooth:exp|version:2.2.0",
"verbose_score":,
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "spm-flores",
"smooth": "exp",
"version": "2.2.0"
}
{
"name": "chrF2",
"signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.2.0",
"nrefs": "1",
"case": "mixed",
"eff": "yes",
"nc": "6",
"nw": "0",
"space": "no",
"version": "2.2.0"
}
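Each metric signature above is a `|`-separated list of `key:value` fields, which the JSON blocks repeat as individual entries. When reproducing the scores, a small parser like the sketch below can confirm that a local evaluation uses the same settings. The helper name `parse_signature` is hypothetical and not part of the repository or of any evaluation library.

```python
def parse_signature(signature):
    """Split a 'key:value|key:value' metric signature into a dict (hypothetical helper)."""
    return dict(field.split(":", 1) for field in signature.split("|"))


# Signature strings copied from the blocks above.
bleu_sig = "nrefs:1|case:mixed|eff:no|tok:spm-flores|smooth:exp|version:2.2.0"
chrf_sig = "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.2.0"

bleu = parse_signature(bleu_sig)
chrf = parse_signature(chrf_sig)

# Settings that differ between the two metrics, or exist in only one of them.
differing = sorted(k for k in set(bleu) | set(chrf) if bleu.get(k) != chrf.get(k))
print(bleu["tok"], chrf["nc"], differing)
```

Comparing the parsed dictionary from a new run against these values is a quick way to catch a mismatched tokenizer or library version before comparing scores.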
- [Wixarika] Mager, M., Carrillo, D., & Meza, I. (2018). Probabilistic finite-state morphological segmenter for Wixarika (Huichol) language. Journal of Intelligent & Fuzzy Systems, 34(5), 3081-3087.
- [Bribri] Feldman, I., & Coto-Solano, R. (2020, December). Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 3965-3976).
- [Mundari]: Data to be released soon.
- [Odia] [Punjabi] [Gujarati] and [Assamese]: Ramesh, G., Doddapaneni, S., Bheemaraj, A., Jobanputra, M., AK, R., Sharma, A., Sahoo, S., Diddee, H., J, M., Kakwani, D., Kumar, N., Pradeep, A., Nagaraj, S., Deepak, K., Raghavan, V., Kunchukuttan, A., Kumar, P., & Khapra, M. S. (2022). Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages. Transactions of the Association for Computational Linguistics, 10, 145-162. https://doi.org/10.1162/tacl_a_00452