A number of pre-trained SFTs can be loaded directly as follows:

    from sft import SFT

    # model_identifier is one of the identifiers listed below
    pretrained_sft = SFT(model_identifier)

The following single-source (i.e. English, except for nusax_senti) task SFTs are available:
- `cambridgeltl/mbert-task-sft-pos`: Universal Dependencies part-of-speech tagging.
- `cambridgeltl/mbert-task-sft-dp`: Universal Dependency parsing.
- `cambridgeltl/mbert-task-sft-masakhaner` and `cambridgeltl/xlmr-task-sft-masakhaner`: Named Entity Recognition on the restricted tagset `O`, `(B/I)-PER`, `(B/I)-ORG`, `(B/I)-LOC`.
- `cambridgeltl/xlmr-task-sft-nli`: Natural Language Inference.
- `cambridgeltl/xlmr-task-sft-nusax_senti`: Sentiment Analysis, trained on the SMSA Indonesian SA dataset, which has three labels: 0 = negative, 1 = neutral, 2 = positive.
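
The label convention for `cambridgeltl/xlmr-task-sft-nusax_senti` can be captured in a small lookup. The following is a minimal sketch in plain Python. The names `SENTIMENT_LABELS` and `decode_sentiment` are illustrative and not part of the `sft` package, and the input is assumed to be a list of three per-class scores produced by whatever classifier the SFT has been applied to.

```python
# Label ids as documented for the SMSA-trained sentiment SFT.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def decode_sentiment(scores):
    """Map three per-class scores (hypothetical model output) to a label name."""
    if len(scores) != len(SENTIMENT_LABELS):
        raise ValueError("expected one score per sentiment label")
    # The index of the highest score is the predicted label id.
    predicted_id = max(range(len(scores)), key=lambda i: scores[i])
    return SENTIMENT_LABELS[predicted_id]


print(decode_sentiment([0.1, 0.2, 0.7]))
```

The same pattern would apply to the NER tagset, but the order of the tag ids is not stated here, so it should be read from the model configuration rather than assumed.
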
The following multi-source task SFTs are available:
- `cambridgeltl/mbert-task-sft-pos-ms`: Universal Dependencies part-of-speech tagging (15 diverse source languages).
- `cambridgeltl/mbert-task-sft-dp-ms`: Universal Dependency parsing (15 diverse source languages).
- `cambridgeltl/xlmr-task-sft-nli-ms`: Natural Language Inference, trained on the concatenation of MultiNLI (English) and the test data for all languages in XNLI.
- `cambridgeltl/xlmr-task-sft-squadv1-ms`: SQuADv1-style question answering, trained on SQuADv1 (English) plus the test data from MLQA and XQuAD for all languages in MLQA. The XQuAD data for languages NOT in MLQA was used for evaluation, giving the following results (F1/exact match, compared to DeepMind's full fine-tuning baselines):

Base model | Fine-tuning method | Source data | el | ro | ru | th | tr |
---|---|---|---|---|---|---|---|
mBERT | Full | SQuADv1 | 62.6/44.9 | 72.7/59.9 | 71.3/53.3 | 42.7/33.5 | 55.4/40.1 |
XLM-R Large | Full | SQuADv1 | 79.8/61.7 | 83.6/69.7 | 80.1/64.3 | 74.2/62.8 | 75.9/59.3 |
XLM-R Base | LT-SFT | SQuADv1 + MLQA + XQuAD(subset) | 81.9/65.5 | 86.3/73.3 | 81.4/64.6 | 82.4/75.2 | 75.2/58.6 |
- `cambridgeltl/xlmr-task-sft-nusax_senti-ms`: Sentiment Analysis, trained on the SMSA Indonesian SA dataset + NusaX-senti, which contains a subset of SMSA translated into 10 other Indonesian languages + English. Results of our single- and multi-source SFTs on the NusaX-senti test set are as follows (note that for bbc, bug and nij, no language adaptation was applied due to poor or non-existent Wikipedia corpora for these languages):

Source | ace | ban | bbc | bjn | bug | eng | ind | mad | min | jav | nij | sun |
---|---|---|---|---|---|---|---|---|---|---|---|---|
single | 79.7 | 82.0 | 38.7 | 82.2 | 29.6 | 88.5 | 89.7 | 76.7 | 82.0 | 84.2 | 68.0 | 85.8 |
multi | 82.9 | 86.3 | 75.8 | 87.3 | 71.8 | 91.4 | 91.2 | 81.7 | 89.8 | 91.0 | 81.4 | 87.8 |
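
To make the single- versus multi-source comparison easier to read, the per-language difference can be computed directly from the table above. This is a small illustrative sketch in plain Python; the scores are copied from the table, and the variable names are chosen here for illustration only.

```python
# NusaX-senti test scores copied from the table above.
LANGS = ["ace", "ban", "bbc", "bjn", "bug", "eng", "ind", "mad", "min", "jav", "nij", "sun"]
SINGLE = [79.7, 82.0, 38.7, 82.2, 29.6, 88.5, 89.7, 76.7, 82.0, 84.2, 68.0, 85.8]
MULTI = [82.9, 86.3, 75.8, 87.3, 71.8, 91.4, 91.2, 81.7, 89.8, 91.0, 81.4, 87.8]

# Gain of the multi-source SFT over the single-source SFT, per language.
gains = {lang: round(m - s, 1) for lang, s, m in zip(LANGS, SINGLE, MULTI)}

# Languages for which no language adaptation was applied.
no_lang_adaptation = ["bbc", "bug", "nij"]

# Print languages from largest to smallest gain.
for lang in sorted(gains, key=gains.get, reverse=True):
    marker = " (no language adaptation)" if lang in no_lang_adaptation else ""
    print(f"{lang}: +{gains[lang]}{marker}")
```

Multi-source training improves the score for every language in the table, and the three largest gains fall on bug, bbc and nij, the languages evaluated without language adaptation.
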
Identifiers for language SFTs are of the form `cambridgeltl/{base_model}-lang-sft-{lang_code}-small`, e.g. `cambridgeltl/mbert-lang-sft-en-small`. "Small" SFTs have ~7.6M parameters; we may release larger models in the future. Language SFTs are currently available for the following languages/models:

Language | Code | bert-base-multilingual-cased (mbert) | xlm-roberta-base (xlmr) |
---|---|---|---|
Acehnese | ace | ✗ | ✓ |
Amharic | amh | ✗ | ✓ |
Arabic | ar | ✓ | ✓ |
Ashaninka | cni | ✗ | ✓ |
Balinese | ban | ✗ | ✓ |
Bambara | bm | ✓ | ✗ |
Banjarese | bjn | ✗ | ✓ |
Basque | eu | ✓ | ✗ |
Bengali | bn | ✓ | ✗ |
Bribri | bzd | ✗ | ✓ |
Bulgarian | bg | ✗ | ✓ |
Buryat | bxr | ✓ | ✗ |
Cantonese | yue | ✓ | ✗ |
Chinese | zh | ✓ | ✓ |
Czech | cs | ✓ | ✗ |
English | en | ✓ | ✓ |
Erzya | myv | ✓ | ✗ |
Estonian | et | ✓ | ✗ |
Faroese | fo | ✓ | ✗ |
French | fr | ✓ | ✓ |
German | de | ✓ | ✓ |
Greek | el | ✓ | ✓ |
Guarani | gn | ✗ | ✓ |
Hausa | hau | ✓ | ✓ |
Hindi | hi | ✓ | ✓ |
Igbo | ibo | ✓ | ✓ |
Indonesian | id | ✓ | ✓ |
Japanese | ja | ✓ | ✗ |
Javanese | jav | ✗ | ✓ |
Kinyarwanda | kin | ✓ | ✓ |
Komi Zyrian | kpv | ✓ | ✗ |
Korean | ko | ✓ | ✓ |
Livvi | olo | ✓ | ✗ |
Luganda | lug | ✓ | ✓ |
Luo | luo | ✓ | ✓ |
Madurese | mad | ✗ | ✓ |
Maltese | mt | ✓ | ✗ |
Manx | gv | ✓ | ✗ |
Minangkabau | min | ✗ | ✓ |
Nahuatl | nah | ✗ | ✓ |
Nigerian-Pidgin | pcm | ✓ | ✓ |
Otomi | oto | ✗ | ✓ |
Persian | fa | ✓ | ✗ |
Portuguese | pt | ✓ | ✗ |
Quechua | quy | ✗ | ✓ |
Raramuri | tar | ✗ | ✓ |
Romanian | ro | ✓ | ✓ |
Russian | ru | ✓ | ✓ |
Sanskrit | sa | ✓ | ✗ |
Shipibo-Konibo | shp | ✗ | ✓ |
Spanish | es | ✓ | ✓ |
Sundanese | sun | ✗ | ✓ |
Swahili | swa | ✓ | ✓ |
Tamil | ta | ✓ | ✗ |
Thai | th | ✓ | ✓ |
Turkish | tr | ✓ | ✓ |
Upper Sorbian | hsb | ✓ | ✗ |
Urdu | ur | ✗ | ✓ |
Uyghur | ug | ✓ | ✗ |
Vietnamese | vi | ✓ | ✓ |
Wixarika | hch | ✗ | ✓ |
Wolof | wol | ✓ | ✓ |
Yoruba | yor | ✓ | ✓ |
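
The identifier pattern for language SFTs can be assembled programmatically from the short base-model name and the language code in the table above. The helper below is a minimal sketch: `language_sft_id` and `BASE_MODELS` are hypothetical names introduced here for illustration, not part of the `sft` package, and the helper does not check whether a given language/model combination has actually been released.

```python
# Short base-model names used in the identifiers, mapped to the full
# model names given in the table header.
BASE_MODELS = {
    "mbert": "bert-base-multilingual-cased",
    "xlmr": "xlm-roberta-base",
}


def language_sft_id(base_model, lang_code):
    """Build cambridgeltl/{base_model}-lang-sft-{lang_code}-small."""
    if base_model not in BASE_MODELS:
        raise ValueError(f"unknown base model: {base_model!r}")
    return f"cambridgeltl/{base_model}-lang-sft-{lang_code}-small"


print(language_sft_id("mbert", "en"))
print(language_sft_id("xlmr", "jav"))
```

The resulting string can be passed to `SFT(...)` in the same way as `model_identifier` in the loading snippet at the top of this section. Only combinations marked ✓ in the table are available.
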