Morphemes (prefixes, suffixes, root words) are linguistic descriptions, defined as the smallest meaningful unit of words. Our proposed shared task is morpheme segmentation that converts a text into a sequence of morphemes. In order to prepare a dataset for this task, we integrated all basic types of morphological databases (including UniMorph (Kirov et al., 2018b; McCarthy et al., 2020) – inflectional morphology; MorphyNet (Batsuren et al., 2021) – derivational morphology; Universal Dependencies (Nivre et al., 2017) and ten editions of Wiktionary – compound morphology and root words). In the future, we expect the NLP community will benefit a lot by innovating subword-based tokenization with this task. This shared task has two parts:
- Part 1: Word-level morpheme segmentation
- Part 2: Sentence-level morpheme segmentation
Please join our Google Group to stay up to date. Click here to register for the task!
Please open the issues if you have any questions.
Two subtasks will be scored separately. Participant teams may submit as many systems as they want to as many subtasks as they want.
At the word level, participants will be asked to segment a given word into a sequence of morphemes. Input words contains all types of word forms: root words, derived words, inflected words, and compound words.
Training and development data are UTF-8-encoded tab-separated values files. Each example occupies a single line and consists of input word, the corresponding morpheme sequence, and the corresponding morphological category. The following shows three lines of English data:
inaccuracies in @@accurate @@cy @@s 110
dictionary dictionary 000
screwdriver screw @@drive @@er 011
Note: The third column as the morphological category is an optional feature that can only be used to oversample or undersample training data.
First example is a derived word with prefix (in-) and suffixes (-cy and -s), and second example is a root word. Third example is a compound word. In the test datasets, we will provide only first column of data as input words.
Development languages are:
ces
: Czecheng
: Englishfra
: Frenchhun
: Hungarianspa
: Spanishita
: Italianlat
: Latinrus
: Russian
Surprise language is:
mon
: Mongolian
word class | English | Spanish | Hungarian | French | Italian | Russian | Czech | Latin | Mongolian |
---|---|---|---|---|---|---|---|---|---|
100 | 126544 | 502229 | 410662 | 105192 | 253455 | 221760 | - | 831991 | 7266 |
010 | 203102 | 18449 | 24923 | 67983 | 41092 | 72970 | - | 0 | 2201 |
101 | 13790 | 458 | 101189 | 478 | 317 | 1909 | - | 0 | 35 |
000 | 101938 | 15843 | 6952 | 13619 | 21037 | 2921 | - | 50338 | 1604 |
011 | 5381 | 82 | 1654 | 506 | 140 | 328 | - | 0 | 0 |
110 | 106570 | 346862 | 323119 | 126196 | 237104 | 481409 | - | 0 | 7855 |
001 | 16990 | 248 | 3320 | 1684 | 431 | 259 | - | 0 | 5 |
111 | 3059 | 343 | 54279 | 186 | 158 | 2658 | - | 0 | 0 |
total words | 577374 | 884514 | 926098 | 382797 | 553734 | 784214 | 38682 | 882329 | 18966 |
For some of the development languages, we are providing the word categories so that participant can deal with imbalanced situation of morphological categories.
word class | Description | English example (input ==> output) |
---|---|---|
100 | Inflection only | played ==> play @@ed |
010 | Derivation only | player ==> play @@er |
101 | Inflection and Compound | wheelbands ==> wheel @@band @@s |
000 | Root words | progress ==> progress |
011 | Derivation and Compound | tankbuster ==> tank @@bust @@er |
110 | Inflection and Derivation | urbanizes ==> urban @@ize @@s |
001 | Compound only | hotpot ==> hot @@pot |
111 | Inflection, Derivation, Compound | trackworkers ==> track @@work @@er @@s |
The following table shows the word-level task results of pretrained BertTokenizer
on English. This pretrained model was employed from HuggingFace.
word class | inflection | derivation | compound | R | P | F1 | lev. distance |
---|---|---|---|---|---|---|---|
001 | no | no | yes | 64.11 | 46.60 | 53.97 | 1.42 |
101 | yes | no | yes | 50.12 | 51.57 | 50.83 | 1.51 |
011 | no | yes | yes | 38.93 | 36.85 | 37.86 | 2.96 |
111 | yes | yes | yes | 28.30 | 34.81 | 31.22 | 3.22 |
010 | no | yes | no | 33.90 | 25.29 | 28.97 | 2.75 |
110 | yes | yes | no | 26.17 | 24.92 | 25.53 | 3.31 |
100 | yes | no | no | 19.14 | 12.16 | 14.87 | 2.73 |
000 | no | no | no | 5.55 | 2.02 | 2.96 | 2.11 |
total | - | - | - | 28.79 | 20.99 | 24.28 | 2.69 |
At the sentence level, participating systems are expected to predict a sequence of morphemes for a given sentence. The following shows two lines of English data:
Six weeks of basic training . Six week @@s of base @@ic train @@ing .
Fistfights , please . Fist @@fight @@s , please .
The following shows two lines of Mongolian data:
Гэрт эмээ хоол хийв . Гэр @@т эмээ хоол хийх @@в .
Би өдөр эмээ уусан . Би өдөр эм @@ээ уух @@сан .
In above example, эмээ
is a hononym of two different words, first means a grandmother
and second is medicine
. Depending on the context, the second homonym word is inflectional form of medicine
and it is segmentable.
Development languages are:
ces
: Czecheng
: Englishmon
: Mongolian
train | dev | test | |
---|---|---|---|
Czech | 1000 | 500 | 500 |
English | 11007 | 1783 | 1845 |
Mongolian | 1000 | 500 | 600 |
Language | Precison | Recall | F-measure | Lev. distance |
---|---|---|---|---|
Czech | 36.76 | 30.35 | 33.25 | 21.01 |
English | 63.68 | 65.77 | 64.71 | 5.50 |
Mongolian | 20.00 | 29.95 | 23.99 | 28.86 |
We will provide python evaluation scripts, reporting the following evaluation measures:
- Precision - fraction of correctly predicted morphemes on all predicted morphemes
- Recall - ratio of correctly predicted morphemes on all gold morphemes
- F-measure - the harmonic mean of the precision and recall
- Edit distance - average Levenshtein distance between the predicted output and the gold instance.
Please submit your team's results to khuyagbaatar.b@gmail.com CCing your teammates by May 13th, 2022 (AoE). Each submission should be a .tar.gz or .zip file.
- February 23, 2022: Training splits for development languages are released at https://github.com/sigmorphon/2022SegmentationST/tree/v0.2
- February 28, 2022: Development splits for development languages are released at https://github.com/sigmorphon/2022SegmentationST/tree/v0.2
- March 15, 2022: Evaluation tool of word-level task (English baseline results included) is released at https://github.com/sigmorphon/2022SegmentationST/tree/v0.2
- March 22, 2022: Evaluation tool of sentence-level task is released at
/evaluation
- March 29, 2022: Full baseline results are released at
/baseline
April 8April 18, 2022: Training and development splits for surprise languages released/data/surprise
April 15April 29, 2022: Test splits for development and surprise languages are released at/data
April 29May 13, 2022: Participants' submissions due.
May 13May 27, 2022: Participants' draft system description papers due.May 20June 3, 2022: Participants' camera-ready system description papers due.
- Khuyagbaatar Batsuren (National University of Mongolia)
- Gábor Bella (University of Trento)
- Aryaman Arora (Georgetown University)
- Viktor Martinović (University of Vienna)
- Kyle Gorman (Graduate center, City University Of New York)
- Zdeněk Žabokrtský (Charles University)
- Amarsanaa Ganbold (National University of Mongolia)
- Šárka Dohnalová (Charles University)
- Magda Ševčíková (Charles University)
- Kateřina Pelegrinová (University of Ostrava)
- Fausto Giunchiglia (University of Trento)
- Ryan Cotterell (ETH Zürich)
- Ekaterina Vylomova (University of Melbourne)
The data is released under the Creative Commons Attribution-ShareAlike 3.0 Unported License inherited from Wiktionary itself.
Kirov, C., Cotterell, R., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., Faruqui, M., Mielke, S., McCarthy, A., Kübler, S., Yarowsky, D., Eisner, J., and Hulden, M. (2018). UniMorph 2.0: Universal Morphology. Proceedings of LREC 2018.
McCarthy, A.D., Kirov, C., Grella, M., Nidhi, A., Xia, P., Gorman, K., Vylomova, E., Mielke, S.J., Nicolai, G., Silfverberg, M. and Arkhangelskij, T., (2020). UniMorph 3.0: Universal Morphology.. Proceedings of LREC 2020.
Batsuren, K., Bella, G. and Giunchiglia, F., (2021). MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology. In Proceedings of SIGMORPHON 2021 (pp. 39-48).
Nivre, J., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, M.J., Asahara, M., Ateyah, L., Attia, M., Atutxa, A., Augustinus, L. and Badmaeva, E., (2017). Universal Dependencies 2.1.