There are a number of pre-trained SFTs which can be loaded directly as follows:

```python
from sft import SFT

pretrained_sft = SFT(model_identifier)
```

Task SFTs

The following single-source task SFTs are available (all trained on English data, except for nusax_senti):

  • cambridgeltl/mbert-task-sft-pos: Universal Dependencies part-of-speech tagging.
  • cambridgeltl/mbert-task-sft-dp: Universal Dependency parsing.
  • cambridgeltl/mbert-task-sft-masakhaner and cambridgeltl/xlmr-task-sft-masakhaner: Named Entity Recognition on the restricted tagset O, (B/I)-PER, (B/I)-ORG, (B/I)-LOC.
  • cambridgeltl/xlmr-task-sft-nli: Natural Language Inference.
  • cambridgeltl/xlmr-task-sft-nusax_senti: Sentiment Analysis, trained on the SMSA Indonesian sentiment analysis dataset, which uses three labels: 0 = negative, 1 = neutral, 2 = positive (see the sketch after this list).
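
For illustration, below is a minimal sentiment-inference sketch with the single-source nusax_senti SFT. It assumes that SFTs are applied to a compatible base model via an SFT.apply(model) call as described in the repository README, and that the task SFT carries its trained three-label classification head; the Indonesian example sentence is illustrative only.

```python
# Minimal sketch: sentiment inference with the single-source nusax_senti SFT.
# Assumption: SFT.apply(model) applies the SFT (including the trained
# classification head) to the base model; see the repository README for
# the exact API.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sft import SFT

# xlmr-prefixed SFTs pair with xlm-roberta-base; the SFT was trained with
# three sentiment labels (0 = negative, 1 = neutral, 2 = positive).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)

sentiment_sft = SFT("cambridgeltl/xlmr-task-sft-nusax_senti")
sentiment_sft.apply(model)

# Indonesian: "The movie is really good!"
inputs = tokenizer("Filmnya bagus sekali!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

labels = ["negative", "neutral", "positive"]
print(labels[logits.argmax(dim=-1).item()])
```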

The following multi-source task SFTs are available:

  • cambridgeltl/mbert-task-sft-pos-ms: Universal Dependencies part-of-speech tagging (15 diverse source languages).
  • cambridgeltl/mbert-task-sft-dp-ms: Universal Dependency parsing (15 diverse source languages).
  • cambridgeltl/xlmr-task-sft-nli-ms: Natural Language Inference, trained on the concatenation of MultiNLI (English) and the test data for all languages in XNLI.
  • cambridgeltl/xlmr-task-sft-squadv1-ms: SQuADv1-style question answering, trained on SQuADv1 (English) plus the test data from MLQA and XQuAD for all languages covered by MLQA. The XQuAD data for languages NOT in MLQA was used for evaluation, achieving the following results compared to DeepMind's full fine-tuning baselines (a usage sketch follows after this list):

| Base model | Fine-tuning method | Source data | el | ro | ru | th | tr |
|---|---|---|---|---|---|---|---|
| mBERT | Full | SQuADv1 | 62.6/44.9 | 72.7/59.9 | 71.3/53.3 | 42.7/33.5 | 55.4/40.1 |
| XLM-R Large | Full | SQuADv1 | 79.8/61.7 | 83.6/69.7 | 80.1/64.3 | 74.2/62.8 | 75.9/59.3 |
| XLM-R Base | LT-SFT | SQuADv1 + MLQA + XQuAD (subset) | 81.9/65.5 | 86.3/73.3 | 81.4/64.6 | 82.4/75.2 | 75.2/58.6 |

  • cambridgeltl/xlmr-task-sft-nusax_senti-ms: Sentiment Analysis, trained on the SMSA Indonesian sentiment analysis dataset plus NusaX-senti, which contains a subset of SMSA translated into 10 other languages of Indonesia plus English. Results of our single- and multi-source SFTs on the NusaX-senti test set are as follows (note that for bbc, bug and nij, no language adaptation was applied due to poor or non-existent Wikipedia corpora for these languages):

| Source | ace | ban | bbc | bjn | bug | eng | ind | mad | min | jav | nij | sun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| single | 79.7 | 82.0 | 38.7 | 82.2 | 29.6 | 88.5 | 89.7 | 76.7 | 82.0 | 84.2 | 68.0 | 85.8 |
| multi | 82.9 | 86.3 | 75.8 | 87.3 | 71.8 | 91.4 | 91.2 | 81.7 | 89.8 | 91.0 | 81.4 | 87.8 |
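
As referenced above, here is a minimal sketch of extractive question answering with the multi-source SQuADv1 SFT. It makes the same assumptions as the earlier sketch: an SFT.apply(model) call as described in the repository README, and a task SFT that carries the trained QA head; the Greek question and context strings are illustrative only.

```python
# Minimal sketch: extractive QA with the multi-source SQuADv1 task SFT.
# Assumption: SFT.apply(model) applies both the sparse difference and the
# trained QA head to the base model; see the repository README.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
from sft import SFT

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

qa_sft = SFT("cambridgeltl/xlmr-task-sft-squadv1-ms")
qa_sft.apply(model)

# Illustrative Greek example ("Who wrote the Odyssey?" /
# "The Odyssey is attributed to Homer.").
question = "Ποιος έγραψε την Οδύσσεια;"
context = "Η Οδύσσεια αποδίδεται στον Όμηρο."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Decode the highest-scoring answer span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```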

Language SFTs

Identifiers for language SFTs are of the form cambridgeltl/{base_model}-lang-sft-{lang_code}-small, e.g. cambridgeltl/mbert-lang-sft-en-small. "Small" SFTs have ~7.6M parameters; we may release larger models in the future. A language SFT is typically composed with a task SFT on the same base model (see the sketch after the table below). Language SFTs are currently available for the following languages/models:

Language Code bert-base-multilingual-cased (mbert) xlm-roberta-base (xlmr)
Acehnese ace
Amharic amh
Arabic ar
Ashaninka cni
Balinese ban
Bambara bm
Banjarese bjn
Basque eu
Bengali bn
Bribri bzd
Bulgarian bg
Buryat bxr
Cantonese yue
Chinese zh
Czech cs
English en
Erzya myv
Estonian et
Faroese fo
French fr
German de
Greek el
Guarani gn
Hausa hau
Hindi hi
Igbo ibo
Indonesian id
Japanese ja
Javanese jav
Kinyarwanda kin
Komi Zyrian kpv
Korean ko
Livvi olo
Luganda lug
Luo luo
Madurese mad
Maltese mt
Manx gv
Minangkabau min
Nahuatl nah
Nigerian-Pidgin pcm
Otomi oto
Persian fa
Portuguese pt
Quechua quy
Raramuri tar
Romanian ro
Russian ru
Sanskrit sa
Shipibo-Konibo shp
Spanish es
Sundanese sun
Swahili swa
Tamil ta
Thai th
Turkish tr
Upper Sorbian hsb
Urdu ur
Uyghur ug
Vietnamese vi
Wixarika hch
Wolof wol
Yoruba yor
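
The sketch below illustrates composing a language SFT with a task SFT for zero-shot cross-lingual transfer, here named entity recognition in Hausa (hau) with mBERT. It rests on a few assumptions: the SFT.apply(model) call follows the repository README, the Hausa language SFT identifier follows the pattern above (availability is shown in the table), and the NER head uses the 7-label restricted MasakhaNER tagset listed earlier.

```python
# Sketch: compose a language SFT with a task SFT for zero-shot transfer,
# here named entity recognition in Hausa (hau) with mBERT.
# Assumptions: SFT.apply(model) as in the repository README, and a
# 7-label head matching the restricted MasakhaNER tagset listed above
# (O, B/I-PER, B/I-ORG, B/I-LOC).
from transformers import AutoModelForTokenClassification
from sft import SFT

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=7
)

language_sft = SFT("cambridgeltl/mbert-lang-sft-hau-small")  # Hausa language SFT
task_sft = SFT("cambridgeltl/mbert-task-sft-masakhaner")     # NER task SFT

# Apply both SFTs to the base model; the task SFT is expected to carry
# the trained classification head. See the repository README for any
# additional arguments controlling how SFTs are applied.
language_sft.apply(model)
task_sft.apply(model)
```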