A number of pre-trained SFTs are available. Each can be loaded directly by passing its identifier to SFT:

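from sft import SFT

# model_identifier can be any identifier listed below, e.g. the English language SFT for mBERT
model_identifier = 'cambridgeltl/mbert-lang-sft-en-small'
pretrained_sft = SFT(model_identifier)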

Task SFTs

The following single-source task SFTs are available. Each is trained on one source language: English, except for nusax_senti, which is trained on Indonesian.

cambridgeltl/mbert-task-sft-pos: Universal Dependencies part-of-speech tagging.
cambridgeltl/mbert-task-sft-dp: Universal Dependencies dependency parsing.
cambridgeltl/mbert-task-sft-masakhaner and cambridgeltl/xlmr-task-sft-masakhaner: Named Entity Recognition on the restricted tagset O, (B/I)-PER, (B/I)-ORG, (B/I)-LOC.
cambridgeltl/xlmr-task-sft-nli: Natural Language Inference.
cambridgeltl/xlmr-task-sft-nusax_senti: Sentiment Analysis, trained on the SMSA Indonesian SA dataset, which has three labels, 0 = negative, 1 = neutral, 2 = positive.
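The label sets described above can be written down as plain Python for decoding model predictions. This is a minimal sketch: the sentiment mapping follows the indices given for nusax_senti, while the ordering of the NER tags and the helper name decode_sentiment are assumptions made for illustration and are not taken from the SFTs themselves.

```python
# Sentiment labels as documented for cambridgeltl/xlmr-task-sft-nusax_senti.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}

# Restricted NER tagset documented for the masakhaner SFTs.
# ASSUMPTION: the ordering below is illustrative only; the label-index
# mapping actually used by the models is not stated in this document.
NER_TAGS = ["O"] + [
    f"{prefix}-{entity}"
    for entity in ("PER", "ORG", "LOC")
    for prefix in ("B", "I")
]


def decode_sentiment(class_index: int) -> str:
    """Map a predicted class index to its sentiment label (hypothetical helper)."""
    return SENTIMENT_LABELS[class_index]


if __name__ == "__main__":
    print(decode_sentiment(2))
    print(NER_TAGS)
```

Before relying on any index ordering for NER, check the label mapping used by the dataset or model configuration you actually load.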

The following multi-source task SFTs are available:

cambridgeltl/mbert-task-sft-pos-ms: Universal Dependencies part-of-speech tagging (15 diverse source languages).
cambridgeltl/mbert-task-sft-dp-ms: Universal Dependencies dependency parsing (15 diverse source languages).
cambridgeltl/xlmr-task-sft-nli-ms: Natural Language Inference, trained on the concatenation of MultiNLI (English) and the test data for all languages in XNLI.
cambridgeltl/xlmr-task-sft-squadv1-ms: SQuADv1-style question answering, trained on SQuADv1 (English) plus the test data from MLQA and XQuAD for all languages in MLQA. The XQuAD data for languages not in MLQA was held out for evaluation. Results on those languages are shown below, compared to DeepMind's full fine-tuning baselines:

Base model	Fine-tuning method	Source data	el	ro	ru	th	tr
mBERT	Full	SQuADv1	62.6/44.9	72.7/59.9	71.3/53.3	42.7/33.5	55.4/40.1
XLM-R Large	Full	SQuADv1	79.8/61.7	83.6/69.7	80.1/64.3	74.2/62.8	75.9/59.3
XLM-R Base	LT-SFT	SQuADv1 + MLQA + XQuAD(subset)	81.9/65.5	86.3/73.3	81.4/64.6	82.4/75.2	75.2/58.6

cambridgeltl/xlmr-task-sft-nusax_senti-ms: Sentiment Analysis, trained on the SMSA Indonesian SA dataset plus NusaX-senti, which contains a subset of SMSA translated into 10 other languages of Indonesia and into English. Results of our single- and multi-source SFTs on the NusaX-senti test set are as follows. Note that no language adaptation was applied for bbc, bug and nij, because the Wikipedia corpora for these languages are poor or non-existent:

Source	ace	ban	bbc	bjn	bug	eng	ind	mad	min	jav	nij	sun
single	79.7	82.0	38.7	82.2	29.6	88.5	89.7	76.7	82.0	84.2	68.0	85.8
multi	82.9	86.3	75.8	87.3	71.8	91.4	91.2	81.7	89.8	91.0	81.4	87.8
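To make the difference between the two rows easier to read, the per-language gain of the multi-source SFT over the single-source SFT can be computed directly from the table. The sketch below only copies the scores above and does arithmetic on them; the variable names are illustrative.

```python
# Scores copied from the NusaX-senti results table above.
single = {
    "ace": 79.7, "ban": 82.0, "bbc": 38.7, "bjn": 82.2, "bug": 29.6, "eng": 88.5,
    "ind": 89.7, "mad": 76.7, "min": 82.0, "jav": 84.2, "nij": 68.0, "sun": 85.8,
}
multi = {
    "ace": 82.9, "ban": 86.3, "bbc": 75.8, "bjn": 87.3, "bug": 71.8, "eng": 91.4,
    "ind": 91.2, "mad": 81.7, "min": 89.8, "jav": 91.0, "nij": 81.4, "sun": 87.8,
}

# Gain of the multi-source SFT over the single-source SFT, per language.
gains = {lang: round(multi[lang] - single[lang], 1) for lang in single}

# Average gain across the 12 test languages.
mean_gain = sum(gains.values()) / len(gains)

# Language that benefits most from multi-source training.
largest = max(gains, key=gains.get)

if __name__ == "__main__":
    for lang, gain in sorted(gains.items(), key=lambda item: -item[1]):
        print(f"{lang}\t{gain:+.1f}")
    print(f"mean gain: {mean_gain:.1f}")
    print(f"largest gain: {largest}")
```

The largest gains fall on bug and bbc, two of the three languages for which no language adaptation was applied.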

Language SFTs

Identifiers for language SFTs are of the form cambridgeltl/{base_model}-lang-sft-{lang_code}-small, e.g. cambridgeltl/mbert-lang-sft-en-small. "Small" SFTs have ~7.6M parameters; we may release larger models in the future. Language SFTs are currently available for the following languages/models:

Language	Code	bert-base-multilingual-cased (mbert)	xlm-roberta-base (xlmr)
Acehnese	ace	✗	✓
Amharic	amh	✗	✓
Arabic	ar	✓	✓
Ashaninka	cni	✗	✓
Balinese	ban	✗	✓
Bambara	bm	✓	✗
Banjarese	bjn	✗	✓
Basque	eu	✓	✗
Bengali	bn	✓	✗
Bribri	bzd	✗	✓
Bulgarian	bg	✗	✓
Buryat	bxr	✓	✗
Cantonese	yue	✓	✗
Chinese	zh	✓	✓
Czech	cs	✓	✗
English	en	✓	✓
Erzya	myv	✓	✗
Estonian	et	✓	✗
Faroese	fo	✓	✗
French	fr	✓	✓
German	de	✓	✓
Greek	el	✓	✓
Guarani	gn	✗	✓
Hausa	hau	✓	✓
Hindi	hi	✓	✓
Igbo	ibo	✓	✓
Indonesian	id	✓	✓
Japanese	ja	✓	✗
Javanese	jav	✗	✓
Kinyarwanda	kin	✓	✓
Komi Zyrian	kpv	✓	✗
Korean	ko	✓	✓
Livvi	olo	✓	✗
Luganda	lug	✓	✓
Luo	luo	✓	✓
Madurese	mad	✗	✓
Maltese	mt	✓	✗
Manx	gv	✓	✗
Minangkabau	min	✗	✓
Nahuatl	nah	✗	✓
Nigerian-Pidgin	pcm	✓	✓
Otomi	oto	✗	✓
Persian	fa	✓	✗
Portuguese	pt	✓	✗
Quechua	quy	✗	✓
Raramuri	tar	✗	✓
Romanian	ro	✓	✓
Russian	ru	✓	✓
Sanskrit	sa	✓	✗
Shipibo-Konibo	shp	✗	✓
Spanish	es	✓	✓
Sundanese	sun	✗	✓
Swahili	swa	✓	✓
Tamil	ta	✓	✗
Thai	th	✓	✓
Turkish	tr	✓	✓
Upper Sorbian	hsb	✓	✗
Urdu	ur	✗	✓
Uyghur	ug	✓	✗
Vietnamese	vi	✓	✓
Wixarika	hch	✗	✓
Wolof	wol	✓	✓
Yoruba	yor	✓	✓
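The identifier pattern for language SFTs can be assembled programmatically from the short base-model name and the language code. This is a minimal sketch: the helper lang_sft_identifier and the AVAILABLE dictionary are hypothetical and not part of the sft package, and AVAILABLE copies only a few rows of the table above for illustration.

```python
# Short base-model names used in language SFT identifiers.
BASE_MODELS = ("mbert", "xlmr")

# Partial excerpt of the availability table above (illustration only).
AVAILABLE = {
    "en": {"mbert", "xlmr"},
    "bxr": {"mbert"},
    "ace": {"xlmr"},
}


def lang_sft_identifier(base_model: str, lang_code: str) -> str:
    """Build a language SFT identifier (hypothetical helper)."""
    if base_model not in BASE_MODELS:
        raise ValueError(f"unknown base model: {base_model!r}")
    return f"cambridgeltl/{base_model}-lang-sft-{lang_code}-small"


def is_listed(base_model: str, lang_code: str) -> bool:
    """Check the excerpted table for a base model / language pair."""
    return base_model in AVAILABLE.get(lang_code, set())


if __name__ == "__main__":
    print(lang_sft_identifier("mbert", "en"))
    print(is_listed("xlmr", "bxr"))
```

The resulting string can be passed to SFT in the same way as any other identifier, provided the table lists that language for the chosen base model.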
