TowerParse is a Python tool for multilingual dependency parsing, built on top of the HuggingFace Transformers library. Unlike other multilingual dependency parsers (e.g., UDify, UDapter), TowerParse offers a language-dedicated parsing model for each language (in fact, for each test UD treebank: for languages with multiple treebanks, we offer multiple parsing models).
For each language/test treebank, we heuristically selected the training and development treebanks based on treebank sizes and typological proximity between languages. For more details on the heuristic training procedure, see the paper (and if you use TowerParse in your research, please cite it):
@inproceedings{glavas-vulic-2021-climbing,
title = "Climbing the Tower of Treebanks: Improving Low-Resource Dependency Parsing via Hierarchical Source Selection",
author = "Glava{\v{s}}, Goran and Vuli{\'c}, Ivan",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.431",
doi = "10.18653/v1/2021.findings-acl.431",
pages = "4878--4888",
}
To use TowerParse, you first need to download the pretrained model(s) for the language (and genre/treebank) and load them with the TowerParser class. TowerParse operates on pre-tokenized sentences, i.e., it does not include a tokenizer.
from tower import TowerParser
parser = TowerParser("model_directory_path")
The instantiated parser then takes as input a list of sentences, each of which should be a list of (word-level) tokens (see example.py). You additionally need to specify the language code (an ISO 639 code, e.g., "en" for English or "myv" for Erzya).
sentences = [["The", "quick", "brown", "fox", "jumped", "over", "the", "fence", "."],
["Oh", "dear", "I", "did", "n't", "expect", "that", "!"]]
parsed_sents = parser.parse("en", sentences)
TowerParse returns a list of parsed sentences, each of which is a list of 4-tuples, one per input token, consisting of (i) the token index (starting from 1; index 0 denotes the sentence root), (ii) the token text, (iii) the index of the governing token, and (iv) the dependency relation. The token that is the root of the dependency tree has the governing token index 0 and the dependency relation "root". Below are code examples with output for example sentences in Arabic and German.
# Arabic
parser = TowerParser("tower_models/UD_Arabic-PUD")
sentences_ar = [["سوريا", ":", "تعديل", "وزاري", "واسع", "يشمل", "8", "حقائب"]]
parsed_ar = parser.parse("ar", sentences_ar)
print_parsed(parsed_ar)
# Output:
(1, 'سوريا', 0, 'root')
(2, ':', 1, 'punct')
(3, 'تعديل', 6, 'nsubj')
(4, 'وزاري', 3, 'amod')
(5, 'واسع', 3, 'amod')
(6, 'يشمل', 1, 'parataxis')
(7, '8', 6, 'obj')
(8, 'حقائب', 7, 'nmod')
# German
parser.load_model("tower_models/UD_German-GSD")
sentences_de = [["Wie", "stark", "ist", "das", "Coronavirus", "in", "der", "Stadt", "verbreitet", "?"],
["Ein", "Überblick", "über", "die", "aktuelle", "Zahl", "der", "Infizierten", "und", "der", "aktuelle", "Inzidenzwert", "für", "München", "."]]
parsed_de = parser.parse("de", sentences_de)
print_parsed(parsed_de)
# Output:
(1, 'Wie', 2, 'advmod')
(2, 'stark', 9, 'advmod')
(3, 'ist', 9, 'cop')
(4, 'das', 5, 'det')
(5, 'Coronavirus', 9, 'nsubj')
(6, 'in', 8, 'case')
(7, 'der', 8, 'det')
(8, 'Stadt', 9, 'nmod')
(9, 'verbreitet', 0, 'root')
(10, '?', 9, 'punct')
(1, 'Ein', 2, 'det')
(2, 'Überblick', 0, 'root')
(3, 'über', 6, 'case')
(4, 'die', 6, 'det')
(5, 'aktuelle', 6, 'amod')
(6, 'Zahl', 2, 'nmod')
(7, 'der', 8, 'det')
(8, 'Infizierten', 6, 'nmod')
(9, 'und', 12, 'cc')
(10, 'der', 12, 'det')
(11, 'aktuelle', 12, 'amod')
(12, 'Inzidenzwert', 2, 'conj')
(13, 'für', 14, 'case')
(14, 'München', 12, 'nmod')
(15, '.', 2, 'punct')
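The examples above use a print_parsed helper whose definition is not shown here; a minimal sketch, assuming only the 4-tuple output format described above, could look like this:

def print_parsed(parsed_sentences):
    # Each parsed sentence is a list of 4-tuples:
    # (token index, token text, governing token index, dependency relation)
    for sentence in parsed_sentences:
        for token in sentence:
            print(token)
        print()  # blank line between sentences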
You can configure the following in TowerParse:
- The maximal expected length of the input sentences to be parsed, in terms of the number of word-level tokens. This is set via the parameter max_word_len in tower_config.py. Should you feed sentences longer than max_word_len, TowerParse will throw an exception.
- The maximal length of the input, in terms of the subword tokens fed to the XLM-R encoder. This is set via the parameter max_length in tower_config.py. The maximal value you can set for this parameter is 512 (the maximal input length of the XLM-R Base encoder). Smaller values lead to faster parsing, but you need to make sure that max_length (the maximal number of XLM-R subword tokens per sentence) is roughly aligned with max_word_len (the maximal expected number of word-level tokens in your sentences): otherwise, sentences longer than max_length XLM-R subword tokens will be truncated. A good ratio between max_length and max_word_len depends on the language: for higher-resource languages (e.g., English), the number of XLM-R subword tokens will be only slightly larger than the number of word-level tokens of the input sentence; for lower-resource languages, each word-level token may be broken down into several XLM-R subword tokens.
- The processing device: you can run TowerParse both on GPU and CPU, with the former naturally being significantly faster. The processing device is set with the parameter device in tower_config.py.
- Finally, to make parsing faster, you can feed sentences to the parsing model in batches: the larger the batch, the faster the parsing of your sentence collection is going to be (larger batches will naturally occupy more working memory or GPU RAM, depending on where you run the model). The batch size is an optional parameter (default value 1, i.e., no batching) of the parse method of the TowerParser class (see the method signature below):
def parse(self, lang, sentences, batch_size=1)
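For illustration, here is a minimal sketch of batched parsing; the model directory name is an assumption (use whichever pretrained model you have downloaded), and the comments refer to the tower_config.py parameters described above:

from tower import TowerParser

# Path to a downloaded model directory (illustrative name).
parser = TowerParser("tower_models/UD_English-EWT")

sentences = [["The", "quick", "brown", "fox", "jumped", "over", "the", "fence", "."],
             ["Oh", "dear", "I", "did", "n't", "expect", "that", "!"]]

# batch_size > 1 speeds up parsing at the cost of more working memory / GPU RAM;
# make sure max_word_len and max_length in tower_config.py accommodate your longest sentences.
parsed = parser.parse("en", sentences, batch_size=128)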
The parsing speeds we report below were measured over sentences from the UD_English_EWT, UD_German_GSD, and UD_Croatian_SET treebanks, with sentences parsed in batches of size 128. These should be taken as rough estimates, as the processing speed may vary depending on the language, the batch size, and the (average) length of the sentences being processed. We measured the following parsing speeds:
- On a (single) GPU (GeForce RTX 2080 with 11019 MiB of memory): 86 sentences/second
- On CPU (Intel Xeon CPU E5-2698 v4): 12 sentences/second
TowerParse is built on top of HuggingFace Transformers. We have tested it with Transformers version 4.9.2.
We offer 144 pretrained parsing models covering 80 languages.
Note: All models have been trained on (combinations of) treebanks from UD v2.5. Due to mismatches between XLM-R's subword tokenizer and the word-level tokens in the training treebanks for certain languages, we recommend using the following models with caution: all Chinese models (CFL, GSD, GSDSimp, HK, and PUD) and Yoruba (YTB).