tibetan-nlp/classical-tibetan-corpus

Classical Tibetan Corpus

This repository contains a small number of Classical Tibetan texts that were linguistically analyzed and annotated by human beings:

  1. མཛངས་བླུན་ཞེས་བྱ་བའི་མདོ། (mdzangs blun)
  2. མར་པ་ལོ་ཙཱའི་རྣམ་ཐར། (mar pa lo cA'i rnam thar)
  3. བུ་སྟོན་ཆོས་འབྱུང་། (bu ston chos 'byung)
  4. མི་ལའི་རྣམ་ཐར། (mi la'i rnam thar)
  5. ཏཱ་ར་ནཱ་ཐ (tA ra nA tha)

With the exception of ཏཱ་ར་ནཱ་ཐ, which was machine-tagged between 2017 and 2020, the above texts were part-of-speech tagged by human beings as part of the TIDC (Tibetan in Digital Communication) project (2012-2015).

The tagset was then simplified in approximate conformance with the Universal POS tags scheme. No information was lost in this process, since many tagging details were encoded as Universal features. For details on this process, see the cg3 grammar tidc2upos in the tibcg3 repository.

The texts were then converted into BRAT standoff format so that they could be further analyzed using the brat rapid annotation tool. Between 2017 and 2020, the work focused on annotating the argument structure of Tibetan verbs, using a modified version of the Universal Dependencies scheme.

At the conclusion of annotation, the BRAT files were exported to CoNLL-U format for broader dissemination and use. Please note that the final BRAT configuration and data files are made available here only for completeness. Moving forward, only the CoNLL-U files will be maintained.
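As a rough illustration of how the CoNLL-U files might be consumed, here is a minimal reader that uses only the Python standard library. It assumes the files follow the standard ten-column, tab-separated CoNLL-U layout; the function names `read_conllu` and `parse_feats` are made up for this sketch, and the sample sentence is an invented illustration, not a line taken from the corpus.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a FEATS string such as 'Case=Gen|Number=Sing' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|") if "=" in pair)


def read_conllu(text: str) -> Iterator[dict]:
    """Yield one dict per sentence, holding its comment lines and tokens."""
    comments: List[str] = []
    tokens: List[dict] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                continue  # skip lines that do not have ten columns
            token = dict(zip(CONLLU_FIELDS, cols))
            token["feats"] = parse_feats(cols[5])
            tokens.append(token)
    # Emit the last sentence if the text does not end with a blank line.
    if tokens:
        yield {"comments": comments, "tokens": tokens}


# Invented two-token sample (hypothetical, not from the corpus).
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tརྒྱལ་པོ\tརྒྱལ་པོ\tNOUN\t_\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\tབྱོན\tབྱོན\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
for tok in sentences[0]["tokens"]:
    print(tok["id"], tok["form"], tok["upos"], tok["deprel"], tok["feats"])
```

To read a file from disk, pass its contents, for example `read_conllu(open(path, encoding="utf-8").read())`. A dedicated CoNLL-U library may be preferable for anything beyond quick inspection.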

English translations of the texts མཛངས་བླུན་ཞེས་བྱ་བའི་མདོ།, མར་པ་ལོ་ཙཱའི་རྣམ་ཐར།, and བུ་སྟོན་ཆོས་འབྱུང་། were obtained and aligned to the Tibetan texts at the sentence or page level. For མཛངས་བླུན་ཞེས་བྱ་བའི་མདོ། and མར་པ་ལོ་ཙཱའི་རྣམ་ཐར།, there are two CoNLL-U files each. The files with the -translated suffix are translation-aligned at the page level, which makes their CoNLL-U sentences very long, and they exclude untranslated pages. The files without the -translated suffix lack translation alignments and use shunits, i.e. shad-delimited units, as CoNLL-U sentences.
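The naming convention above can be used to separate the two kinds of files programmatically. The sketch below groups file names by whether their stem ends in `-translated`. The file names in the example and the `.conllu` extension are assumptions made for illustration, so check the actual names in the repository before relying on them.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def split_by_alignment(paths: Iterable[str],
                       suffix: str = "-translated") -> Dict[str, List[str]]:
    """Group paths by whether the file stem ends with the given suffix."""
    groups: Dict[str, List[str]] = {"translated": [], "untranslated": []}
    for p in paths:
        # Path.stem drops the final extension, e.g. ".conllu".
        key = "translated" if Path(p).stem.endswith(suffix) else "untranslated"
        groups[key].append(p)
    return groups


# Hypothetical file names, used only to show the grouping.
example_files = [
    "mdzangs_blun.conllu",
    "mdzangs_blun-translated.conllu",
    "marpa.conllu",
    "marpa-translated.conllu",
]

groups = split_by_alignment(example_files)
print(groups["translated"])
print(groups["untranslated"])
```

Work on verb argument structure probably calls for the shunit-based files, since their sentences are short, while the `-translated` files are the ones that carry the English alignment.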

You may cite this work by referencing the repository and its authors: Edward Garrett, Nathan Hill, Samyo Rode, Nikolai Solmsdorf, and Sonam Wangyal. We thank the AHRC for its funding of the projects Tibetan in Digital Communication (2012-2015, PI Ulrich Pagel) and Lexicography in Motion (2017-2020, PI Ulrich Pagel).

Here is some metadata about the collection.

| Key | Value |
| --- | --- |
| Text ID | mdzangs_blun |
| Title (eng) | Sutra of the Wise and the Foolish |
| Title (bod) | མཛངས་བླུན་ཞེས་བྱ་བའི་མདོ་ |
| Source (eng) | Frye, Stanley (1981). The Sutra of the Wise and the Foolish, Library of Tibetan Works and Archives. |
| Source (bod) | (?) |
| Date | (?) |
| Author | Unknown |
| Translation | Stanley Frye |
| Tagging | Edward Garrett & Nathan Hill (?) |
| Annotation | Samyo Rode & Nikolai Solmsdorf |
| Alignment | Sonam Wangyal |
| Genre | Religion |
| Region | Tibet |
| Language | Tibetan, Classical |
| Normalization | No |
| Licensing | Creative Commons Attribution 4.0 International License (CC-BY) |
| Annotator's notes | Translated from Chinese into Tibetan ca. 9th/10th century. Canonical text (sDe dge bka’ ’gyur, mDo sde, Vol. 74, fols. 129a–298a). Collection of tales of previous births of the Buddha (Skt. jātaka) that reflects the structure of the translated language (non-Tibetan origin). Formulaic, repetitive narrative structure. Regular grammatical structure, uniform verb frames. |

| Key | Value |
| --- | --- |
| Text ID | marpa |
| Title (eng) | The life of Marpa the Translator |
| Title (bod) | མར་པ་ལོ་ཙཱ་བ་རྣམ་ཐར་ |
| Source (eng) | Trungpa, Chögyam (1982). The Life of Marpa the Translator, Prajna Press. |
| Source (bod) | (?) |
| Date | (?) |
| Author | Gtsang smyon Heruka (1452–1507) |
| Translation | Nalanda Translation Committee under the direction of Chögyam Trungpa |
| Tagging | Edward Garrett & Nathan Hill (?) |
| Annotation | Samyo Rode & Nikolai Solmsdorf |
| Alignment | Sonam Wangyal |
| Genre | Biography |
| Region | Tibet |
| Language | Tibetan, Classical |
| Normalization | No |
| Licensing | Creative Commons Attribution 4.0 International License (CC-BY) |
| Annotator's notes | Composed in 1505. Large percentage of text is songs and poems with vivid language, resembling Colloquial Tibetan in parts. Prose interspersed with songs and poems with rich vocabulary. Diverse verb structures, e.g. light verbs, auxiliary verbs. |

| Key | Value |
| --- | --- |
| Text ID | bu_ston |
| Title (eng) | History of Buddhism |
| Title (bod) | བུ་སྟོན་ཆོས་འབྱུང་ |
| Source (eng) | Obermiller, Eugeny (1931-32). The history of Buddhism (Chos ḥbyung) by Bu-ston, Heidelberg, In Kommission bei O. Harrassowitz. |
| Source (bod) | (?) |
| Date | (?) |
| Author | Bu ston Rin chen grub (1290–1364) |
| Translation | Obermiller, Eugeny |
| Tagging | Edward Garrett & Nathan Hill (?) |
| Annotation | Samyo Rode & Nikolai Solmsdorf |
| Alignment | No |
| Genre | History |
| Region | Tibet |
| Language | Tibetan, Classical |
| Normalization | No |
| Licensing | Creative Commons Attribution 4.0 International License (CC-BY) |
| Annotator's notes | Composed in 1322. History of Buddhism in India and Tibet with a focus on philosophical subjects. Abundant citations from Canonical texts with many lists and enumerations. Verse sections. Few continuous prose sections: Less fruitful for verb-argument-structure. |

| Key | Value |
| --- | --- |
| Text ID | mila |
| Title (eng) | The life of Milarepa |
| Title (bod) | མི་ལའི་རྣམ་ཐར་ |
| Source (eng) | Quintman, Andrew (2010). The Life of Milarepa, Penguin Books. |
| Source (bod) | (?) |
| Date | (?) |
| Author | Gtsang smyon Heruka (1452–1507) |
| Translation | Quintman, Andrew |
| Tagging | Edward Garrett & Nathan Hill (?) |
| Annotation | Samyo Rode & Nikolai Solmsdorf |
| Alignment | Sonam Wangyal |
| Genre | Biography |
| Region | Tibet |
| Language | Tibetan, Classical |
| Normalization | No |
| Licensing | Creative Commons Attribution 4.0 International License (CC-BY) |
| Annotator's notes | Completed in 1488. Vivid language, resembling Colloquial Tibetan in parts. Prose interspersed with songs and poems with rich vocabulary. Diverse verb structures, e.g. light verbs, auxiliary verbs. |

| Key | Value |
| --- | --- |
| Text ID | taranatha |
| Title (eng) | History of Buddhism in India |
| Title (bod) | ཏཱ་ར་ནཱ་ཐའི་རྒྱ་གར་ཆོས་འབྱུང་ |
| Source (eng) | Chimpa, Lama and Chattopadhyaya, Alaka (trans.); Chattopadhyaya, Debiprasad (ed.) (1990). Taranatha's History of Buddhism In India, Motilal Banarsidass. |
| Source (bod) | (?) |
| Date | (?) |
| Author | Tāranātha Kun dga’ snying po (1575–1634) |
| Translation | Lama Chimpa & Alaka Chattopadhyaya |
| Tagging | Marieke Meelen |
| Annotation | Samyo Rode & Nikolai Solmsdorf |
| Alignment | No |
| Genre | History |
| Region | Tibet |
| Language | Tibetan, Classical |
| Normalization | No |
| Licensing | Creative Commons Attribution 4.0 International License (CC-BY) |
| Annotator's notes | Composed in 1608. History of Buddhism in India and Tibet. Mostly prose. Limited vocabulary: Lacking diversified verb structures. |