Latin BERT is a contextual language model for the Latin language, described in more detail in the following:
David Bamman and Patrick J. Burns (2020), Latin BERT: A Contextual Language Model for Classical Philology, ArXiv.
Tested on Python 3.10.13 [Feb. 24 2024].
1.) Create a conda environment (optional):
conda create --name latinbert python=3
conda activate latinbert
2.) Install PyTorch according to your own system requirements (GPU vs. CPU, CUDA version): https://pytorch.org.
3.) Install the remaining libraries:
pip install -r requirements.txt
4.) Install Latin tokenizer models:
python3 -c "from cltk.data.fetch import FetchCorpus; corpus_downloader = FetchCorpus(language='lat');corpus_downloader.import_corpus('lat_models_cltk')"
5.) Download pre-trained BERT model for Latin:
./scripts/download.sh
For a minimal example of how to generate BERT representations for an input sentence, execute the following:
python3 scripts/gen_berts.py --bertPath models/latin_bert/ --tokenizerPath models/subword_tokenizer_latin/latin.subword.encoder > berts.output.txt
This generates BERT representations for two sentences and saves their output with one (token, 768-dimensional final BERT representation) tuple per line. For examples of how to fine-tune Latin BERT for a specific task, see the case studies on POS tagging and WSD.
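To use the saved representations downstream, the output file can be read back into memory. The following is a minimal sketch, not part of the repository. It assumes each non-empty line holds a token, a tab, and then the vector components separated by spaces, with a blank line between sentences. That layout and the helper name `read_berts` are assumptions, so inspect a few lines of `berts.output.txt` and adjust `sep` if the file differs.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, List[float]]]


def read_berts(path: str, sep: str = "\t") -> List[Sentence]:
    """Read (token, vector) pairs and group them into sentences.

    Assumed layout (not confirmed by this README): `token<sep>v1 v2 ... vN`
    on each line, with a blank line separating sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line is treated as a sentence boundary.
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, _, vector_text = line.partition(sep)
            vector = [float(value) for value in vector_text.split()]
            current.append((token, vector))
    if current:
        sentences.append(current)
    return sentences
```

Calling `read_berts("berts.output.txt")` would then return one list per sentence, each holding the tokens with their vectors in order.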
Latin BERT is pre-trained using data from the following sources.
Source | Tokens |
---|---|
Corpus Thomisticum | 14.1M |
Internet Archive | 561.1M |
Latin Library | 15.8M |
Patrologia Latina | 29.3M |
Perseus | 6.5M |
Latin Wikipedia | 15.8M |
Total | 642.7M |
Texts from Perseus and the Latin Library are drawn from the corpora in the Classical Language Toolkit. Texts are tokenized for sentences and words using Latin-specific tokenizers in CLTK. We learn a Latin-specific WordPiece tokenizer using tensor2tensor from this training data.
Since the texts from the Internet Archive (IA) are the product of noisy OCR, we uniformly upsample all non-IA texts to train on a balance of approximately 50% IA texts and 50% non-IA texts.
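The README does not state the exact upsampling factor, but the token counts in the table above allow a back-of-the-envelope estimate. The sketch below assumes that "uniformly upsample" means repeating every non-IA text the same number of times; the function name `upsampling_factor` is illustrative only.

```python
# Token counts (in millions) copied from the table of pre-training sources.
TOKENS_M = {
    "Corpus Thomisticum": 14.1,
    "Internet Archive": 561.1,
    "Latin Library": 15.8,
    "Patrologia Latina": 29.3,
    "Perseus": 6.5,
    "Latin Wikipedia": 15.8,
}


def upsampling_factor(tokens_m, noisy_source="Internet Archive", noisy_share=0.5):
    """Return how often the remaining text must be repeated so that
    `noisy_source` makes up `noisy_share` of all training tokens."""
    noisy = tokens_m[noisy_source]
    clean = sum(count for name, count in tokens_m.items() if name != noisy_source)
    # Solve noisy / (noisy + factor * clean) == noisy_share for factor.
    return noisy * (1.0 - noisy_share) / (noisy_share * clean)


factor = upsampling_factor(TOKENS_M)
print(f"repeat each non-IA text about {factor:.1f} times")
```

Under these assumptions, each non-IA text would be seen roughly seven times for every pass over the IA texts.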
We pre-train Latin BERT using tensorflow on a TPU for 1M steps. Training took approximately 5 days on a TPU v2, and cost ~$540 on Google Cloud (at $4.50 per TPU v2 hour). We set the maximum sequence length to 256 WordPiece tokens.
We convert the resulting tensorflow checkpoint into a BERT model for the HuggingFace transformers library using transformers-cli. The converted model in models/latin_bert can be loaded directly with that library.
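As a minimal sketch of loading the converted weights, the snippet below uses `BertModel.from_pretrained` from the transformers library. The default directory follows the `--bertPath` value in the minimal example above, and the helper name `load_latin_bert` is illustrative. The sketch loads only the encoder: the subword tokenizer is a separate file (passed via `--tokenizerPath`), so turning raw text into input IDs is left to the repository's own scripts.

```python
import os


def load_latin_bert(model_dir: str = "models/latin_bert"):
    """Load the converted checkpoint with HuggingFace transformers.

    Returns None when the directory is missing, for example before
    ./scripts/download.sh has been run.
    """
    if not os.path.isdir(model_dir):
        return None
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import BertModel

    model = BertModel.from_pretrained(model_dir)
    model.eval()  # inference mode: disables dropout
    return model


model = load_latin_bert()
if model is not None:
    # The hidden size should match the dimensionality of the saved representations.
    print(model.config.hidden_size)
```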
Bamman and Burns (2020) illustrate the affordances of Latin BERT with four case studies, summarized briefly below.
Latin BERT demonstrates meaningful part-of-speech distinctions in its representations without further task-specific training.
When trained on POS tagging, Latin BERT achieves a new state of the art on all three Universal Dependencies datasets for Latin.
Method | Perseus | PROIEL | ITTB |
---|---|---|---|
Latin BERT | 94.3 | 98.2 | 98.8 |
Straka et al. (2019) | 90.0 | 97.2 | 98.4 |
Smith et al. (2018) | 88.7 | 96.2 | 98.3 |
Straka (2018) | 87.6 | 96.8 | 98.3 |
Static embeddings | 87.6 | 95.2 | 97.6 |
Boros et al. (2018) | 85.7 | 94.6 | 97.7 |
Latin BERT can be used to generate probabilities for lacunae and other missing words in context. For example, consider the following sentence:
dvces et reges carthaginiensivm hanno et mago qui ___ punico bello cornelium consulem aput liparas ceperunt
The words with the highest probabilities predicted to fill that slot are the following:
Word | Probability |
---|---|
secundo | 0.451 |
primo | 0.385 |
tertio | 0.093 |
altero | 0.018 |
primi | 0.012 |
priore | 0.012 |
quarto | 0.005 |
secundi | 0.004 |
primum | 0.002 |
superiore | 0.002 |
(Note "primo" here is a textual critic's emendation.) Latin BERT is able to reconstruct an exact human-judged emendation 33.1% of the time; in 62.2% of cases, the human emendation is in the top 10 predictions.
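The two figures above are top-1 and top-10 hit rates over ranked predictions. The sketch below shows how such a metric can be computed from per-word scores. It is not the evaluation code from the paper: the helper names are illustrative and the scores are made-up toy values, not output from Latin BERT.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Turn per-word scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - peak) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def top_k(probs: Dict[str, float], k: int) -> List[str]:
    """Return the k most probable words, best first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def precision_at_k(examples: Sequence[Tuple[Dict[str, float], str]], k: int) -> float:
    """Share of examples whose reference word is among the top-k predictions."""
    hits = sum(1 for logits, gold in examples if gold in top_k(softmax(logits), k))
    return hits / len(examples)


# Toy, made-up scores for two gaps, each paired with a reference word.
examples = [
    ({"secundo": 2.0, "primo": 1.8, "tertio": 0.4}, "primo"),
    ({"bello": 3.0, "proelio": 1.0, "pace": 0.2}, "bello"),
]
print(precision_at_k(examples, 1), precision_at_k(examples, 2))  # prints: 0.5 1.0
```

In the first toy example the reference word is ranked second, so it counts as a miss at k=1 but as a hit at k=2, mirroring the "secundo" versus "primo" case in the table.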
Latin BERT is able to distinguish between senses of Latin words. We construct a new word sense disambiguation (WSD) dataset by mining citations from the Lewis and Short Latin Dictionary, and measure how well different methods distinguish between senses given the context of the sentence. In a balanced evaluation (where random choice yields 50% accuracy), Latin BERT outperforms static embeddings by over 8 absolute points.
Method | Accuracy |
---|---|
Latin BERT | 75.4 |
Static embeddings | 67.3 |
Random | 50.0 |
BERT representations are contextual embeddings, so the same word type (e.g., in) will have a different representation in each sentence in which it is used. While static embeddings like word2vec allow us to find words that are most similar to a given word type, BERT (and other contextual embeddings) allow us to find other words that are most similar to a given word token. For example, we can find tokens in context that are most similar to the representation for in within gallia est omnis divisa in partes tres:
Cosine | Text | Citation |
---|---|---|
0.835 | ager romanus primum divisus in partis tris, a quo tribus appellata titiensium ... | Varro, Ling. |
0.834 | in ea regna duodeviginti dividuntur in duas partes. | Sol. |
0.833 | gallia est omnis divisa in partes tres, quarum unam incolunt belgae, aliam ... | Caes., BGall. |
0.824 | is pagus appellabatur tigurinus; nam omnis civitas helvetia in quattuor pagos divisa est. | Caes., BGall. |
0.820 | ea pars, quam africam appellavimus, dividitur in duas provincias, veterem et novam, discretas fossa ... | Plin., HN |
0.817 | eam distinxit in partes quatuor. | Erasmus, Ep. |
0.812 | hereditas plerumque dividitur in duodecim uncias, quae assis appellatione continentur. | Justinian, Inst. |
The most similar tokens not only capture the specific morphological constraints of this sense of in appearing with a noun in the accusative case (denoting into rather than within) but also broadly capture the more specific subsense of division into parts.
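The token-level search described above amounts to ranking every indexed token occurrence by cosine similarity to the query token's vector. The following is a minimal sketch of that ranking step, assuming the vectors have already been generated. The helper names are illustrative, and the 3-dimensional vectors and the second context are invented stand-ins for real 768-dimensional BERT outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_tokens(
    query: Sequence[float],
    index: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Rank indexed token occurrences by cosine similarity to `query`.

    `index` holds (label, vector) pairs, one per token occurrence.
    """
    scored = [(cosine(query, vector), label) for label, vector in index]
    return sorted(scored, reverse=True)[:k]


# Toy vectors: one entry per occurrence of "in", labeled with its context.
index = [
    ("in (divisus in partis tris)", [0.9, 0.1, 0.0]),
    ("in (in urbe manet)", [0.1, 0.9, 0.2]),
    ("in (dividitur in duas provincias)", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.1, 0.0]  # hypothetical vector for "in" in "divisa in partes tres"

for score, label in nearest_tokens(query, index, k=2):
    print(f"{score:.3f}  {label}")
```

A linear scan like this is only practical for small collections; searching a full corpus of token vectors would likely need an approximate nearest-neighbor index.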
With thanks to Todd Cook, Luis Antonio Vasquez Reina, and LuigiOnFire for their contributions.