
umLabeller

umLabeller is an inspection tool for characterizing the semantic compositionality of subword tokenization, based on morphological information retrieved from UniMorph. Given a word w and its subword tokenization s = (s₁, ..., sₙ), where each sᵢ ∈ V, umLabeller assigns one of four categories: vocab, alien, morph, or n/a:

  • vocabulary subword: the given word w is itself a subword in the vocabulary, i.e. w ∈ V;
  • alien composition: the given subword sequence s is an alien subword composition if at least two subwords sᵢ and sⱼ in s are not meaningful with respect to the meaning of w;
  • morphological composition: the subword sequence s is morphological if it is neither a vocabulary nor an alien subword composition;
  • n/a: UniMorph has no information on the word.

umLabeller can characterize over half a million English words and is compatible with most modern tokenizers.

Examples

input word     subword tokenization    output label
jogging        _j ogging               alien
neutralised    _neutral ised           morph
stepstones     _steps tones            alien
swappiness     _sw appiness            alien
swappiness     _swap pi ness           morph
jogging        _jogging                vocab

Installation

To install from the source, please use the following commands:

!git clone https://github.com/unimorph/umLabeller.git
%cd umLabeller
!pip install .

Note: The instructions above have been tested on Google Colab.

Usage

from umLabeller.umLabeller import UniMorphLabeller

uml = UniMorphLabeller()
# Subwords use the GPT-2-style 'Ġ' word-boundary marker.
print(uml.auto_classify('stepstones', ['Ġsteps', 'tones']))

Output:

alien
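
Since umLabeller is compatible with most modern tokenizers, the subword sequence can also come directly from a tokenizer rather than being written by hand. The following is a minimal sketch, not taken from the umLabeller documentation: it pairs auto_classify with a Hugging Face GPT-2 tokenizer, assuming the transformers package is installed and that the tokenizer's 'Ġ' word-boundary marker is accepted as in the usage example above.

# A minimal sketch, assuming the `transformers` package is available; the
# tokenizer choice ('gpt2') is illustrative, not prescribed by umLabeller.
from transformers import AutoTokenizer
from umLabeller.umLabeller import UniMorphLabeller

tokenizer = AutoTokenizer.from_pretrained('gpt2')
uml = UniMorphLabeller()

for word in ['jogging', 'neutralised', 'stepstones', 'swappiness']:
    # A leading space makes GPT-2 BPE emit the 'Ġ' word-initial marker.
    subwords = tokenizer.tokenize(' ' + word)
    print(word, subwords, uml.auto_classify(word, subwords))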

License

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 license: https://creativecommons.org/licenses/by-sa/3.0/

References

More details can be found in the following article:

Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella. Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge. arXiv:2404.13292. https://arxiv.org/abs/2404.13292
