Official Code for "Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models" (To Appear in WMT 2022)
Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. In this work, we evaluate knowledge distillation's use to compress MT models focusing on languages with extremely limited training data. Through our analysis across 8 languages, we find that the variance in the performance of the distilled models due to their dependence on multiple priors makes distillation a brittle compression mechanism. We further explore the use of post-training quantization for the compression of these models. Here, we find that quantization provides more consistent performance trends (than distillation) for the entire range of languages, especially the lowest-resource languages in our target set.
We cover 8 languages of diverse linguistic origins, varying data between 7K samples to 3M samples for our study. The train-test splits for Gondi and Mundari will be released soon and testsets for all other languages are publicly available (listed in the paper).
Language | Train Data (Sentence Pairs) | Links |
---|---|---|
Bribri | ~7000 | Here |
Wixarica | ~8000 | Here |
Mundari | ~11000 | Public Link Available Soon |
Gondi | ~25000 | Here |
Assammesse | ~135000 | Here |
Odia | ~1M | Here |
Punjabi | ~2.4M | Here |
Gujarati | ~3M | Here |
Each of the quantized variants is at least 3x smaller than it's best performing model and the distilled variants are at least 6x smaller. Models and their compressed variants (for plug-and-play usage) coming soon!
Language | Best Uncompressed Variant | Best Distilled Variant | Best Quantized Variant | ||
---|---|---|---|---|---|
spBLEU | spBLEU | chrF2 | spBLEU | chrF2 | |
Bribri | 6.4 | 6.8 | 13.2 | 7.4 | 19.4 |
Wixarica | 6.2 | 4.1 | 17.3 | 7.2 | 26.8 |
Mundari | 15.9 | 18.2 | 32.7 | 15.7 | 29.3 |
Gondi | 14.3 | 14.2 | 32.8 | 13.8 | 31.1 |
Assamesse | 10.7 | 9.6 | 27.4 | 6.2 | 25.7 |
Odia | 27.4 | 20.2 | 40.7 | 21.0 | 41.3 |
Punjabi | 38.4 | 32.8 | 46.6 | 27.0 | 48.0 |
Gujarati | 35.9 | 29.8 | 48.6 | 28.4 | 51.4 |
The environment can be setup using the provided requirements file (Requires pip > pip 22.0.2)
pip install -r requirements.txt
├── readme.md
├── requirements.txt
├── scripts # Scripts with all the variants of the commands + default hyperparameter values
│ ├── confidence_estimation.sh # logging the confidence statistics
│ ├── inference.sh # inference for both architectures - online and offline graphs
│ ├── preprocess.sh # preprocessing data for training and evaluation
│ ├── sweep.yaml # sweep yaml for hyperparameter trials
│ └── train.sh # variants of training and continued pretraining
└── src # src files for all the experiments
├── confidence_estimation.py # logging confidence stats: average softmax entropy, standard deviation of log probabilities
├── continued_pretraining.py # continued pretraining of mt5
├── inference.py # online and graph inference
├── preprocess.py # preprocessing bilingual and monolingual data + vocab and tokenizer creation
├── split_saving.py # generating the offline graphs for both model architectures
├── student_labels.py # generating the student labels for the best model architecture for the models
├── train.py # training script for vanilla, distilled and pretrained model configuration
└── utils.py # utils like script conversion, checking for deduplication
1. Run **preprocess.py** to convert training data to HF format and generating the Tokenizer Files for the Vanilla tranformer.
2. Run **train.py** for training and saving the best model. (monitored metric is BLEU with mt13eval tokenizer)
3. Run **split_saving_{model_architecture_type}.py** to quantize the encoder and decoder separately.
4. Run **inference.py** (with offline = True) for offline inference on the quantized graphs.
Sample commands with default hyperparameter values are specified in scripts/
{
"nrefs:1|case:mixed|eff:no|tok:spm-flores|smooth:exp|version:2.2.0",
"verbose_score":,
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "spm-flores",
"smooth": "exp",
"version": "2.2.0"
}
{
"name": "chrF2",
"signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.2.0",
"nrefs": "1",
"case": "mixed",
"eff": "yes",
"nc": "6",
"nw": "0",
"space": "no",
"version": "2.2.0"
}
- [Wixarika] Mager, M., Carrillo, D., & Meza, I. (2018). Probabilistic finite-state morphological sgmenter for wixarika (huichol) language. Journal of Intelligent & Fuzzy Systems, 34(5), 3081-3087.
- [Bribri] Feldman, I., & Coto-Solano, R. (2020, December). Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 3965-3976).
- [Mundari]: Data to be released soon.
- [Odia] [Punjabi] [Gujarati] and [Assamesse]: @article{10.1162/tacl_a_00452, author = {Ramesh, Gowtham and Doddapaneni, Sumanth and Bheemaraj, Aravinth and Jobanputra, Mayank and AK, Raghavan and Sharma, Ajitesh and Sahoo, Sujit and Diddee, Harshita and J, Mahalakshmi and Kakwani, Divyanshu and Kumar, Navneet and Pradeep, Aswin and Nagaraj, Srihari and Deepak, Kumar and Raghavan, Vivek and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh Shantadevi}, title = "{Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages}", journal = {Transactions of the Association for Computational Linguistics}, volume = {10}, pages = {145-162}, year = {2022}, month = {02}, issn = {2307-387X}, doi = {10.1162/tacl_a_00452}, url = {https://doi.org/10.1162/tacl\_a\_00452}, eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00452/1987010/tacl\_a\_00452.pdf}, }
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.