What does a Tokenizer do?

Let's see a 🤗 tokenizer in action first:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.encode("Let's understand tokens")) 
# Output: [5756, 338, 1833, 16326]
print(tokenizer.batch_decode(tokenizer.encode("Let's understand tokens"))) # convert token ids to tokens
# Output: ['Let', "'s", ' understand', ' tokens']

The tokenizer encodes a piece of text into a sequence of token ids. These token ids are fed into our neural network. The neural network has a special layer at the beginning called the embedding layer. For each token id, the embedding layer stores a unique embedding vector. Given an input sequence of token ids, the embedding layer effectively performs a per-token-id lookup to output a sequence of embedding vectors. Before we go any further, we should ask: What are tokens? How do you decide where to break up a piece of text? What are the different ways in which you can break up text?
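
To make that lookup concrete, here is a minimal sketch in PyTorch. The sizes match GPT-2's vocabulary (50257) and hidden dimension (768), but the embedding layer here is freshly initialized, not GPT-2's actual weights:

import torch
import torch.nn as nn

vocab_size, embedding_dim = 50257, 768          # GPT-2-sized, for illustration
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([[5756, 338, 1833, 16326]])  # ids from the example above
embeddings = embedding_layer(token_ids)               # per-token-id lookup
print(embeddings.shape)  # torch.Size([1, 4, 768]) - one vector per token id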

What are Tokens?

In essence, a token is an atomic piece of text. Tokenization/segmentation is the process of breaking up text into smaller, meaningful units, and it is one of the oldest basic steps in natural language processing. These tokens should ideally be word forms, or in simple terms, some variation/derivative of a word. For some history, the Morphological Annotation Framework (MAF) ISO standard defines typographical units or tokens as a "non-empty contiguous sequence of graphemes or phonemes in a document". The typographic separator we're all familiar with is the whitespace (which separates words), but this is script-dependent (Latin scripts, for example, use whitespace, but the Japanese script does not). Previously, one would have to come up with a set of arbitrary rules to map words to tokens (for example, "don't" -> "do n't").

Side note: Morphosyntactic annotation is a fancy way of saying that you are marking up text to indicate various attributes of words, such as their parts of speech, gender, number, case, etc. MAF was a proposal that came up in 2005 (interestingly, you can still find their slides here). A grapheme is the smallest unit of a writing system of any given language. A phoneme is the smallest unit of sound in a language that can distinguish one word from another. There are different schools of thought on what exactly constitutes a grapheme, and I don't want to get into that mess here.

As expected, there's quite a bit of history behind the evolution of tokenization/segmentation. Initially, it was based solely on meaningful typographic units (for English, words and special characters separated by whitespace), and we've since moved towards more fine-grained, sub-word level tokenization. An excellent study from Mielke et al. provides an in-depth look.

Main approaches to tokenization

Let's quickly go over the different kinds of tokenization. The two extremes are character-based and word-based tokenization. With character tokenization, the vocabulary is small but the splits are too fine-grained, leading to very long tokenized sequences; further, the individual tokens aren't very meaningful, so the model has little linguistic structure to springboard from. With word tokenization, you get meaningful units, but the resulting models are closed-vocabulary: they are unable to deal with rare/novel words that weren't seen during training (out-of-vocabulary words). Even if you did assemble a gigantic corpus of all possible words, the vocabulary would be too large to deal with, because words can have declensions, punctuation, misspellings, etc. The most popular form of tokenization is the middle ground: sub-word tokenization, where the optimal way of breaking text down into component sub-words is learned from a relevant text corpus.

With subword tokenization, it is pretty non-trivial to figure out the best way to split words. Notable examples are words such as don't (this is an example of a contraction, where a combination of words do and not is shortened to don't), or code (an example with type-specific grammar). Further, as you'll see, the statistical nature of tokenization models also gives you weird results - such as numbers 450 and 448 being represented with a different number of tokens. (This is, in essence, a form of overfitting in your tokenizer)
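
You can check the number-of-tokens quirk yourself. The exact counts depend on the tokenizer and on surrounding whitespace, so treat this as an illustration rather than a fixed result:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
for text in ["450", "448", " 450", " 448"]:
    ids = tokenizer.encode(text)
    print(repr(text), ids, len(ids))  # similar-looking numbers can get different token counts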

Popular subword tokenization algorithms

If you're just starting to learn about different subword tokenization algorithms, HuggingFace's NLP course is a must-watch. Their videos have crisp animations that provide a good introduction, and it's very hard to top that in writing. That said, I'll briefly summarize some of the most popular subword tokenization algorithms:

Byte-pair encoding (BPE)

Byte-pair encoding/digram coding comes from information theory, and was first proposed in 1994 (Web archive). BPE tokenization first performs a character-level tokenization of the given corpus. Then, the most frequent pair of adjacent characters is merged into a new, 2-character-long token, and all instances of the pair are replaced by this new token. This process is repeated (now with variable-length tokens, not just characters) until you achieve your desired vocabulary size. Whenever a merge happens, the tokenizer records it as a "merge rule", to be used after training when tokenizing a given piece of text. Going back to the information-theoretic meaning: given a text file, you iteratively replace the most frequent consecutive pair of bytes in your data with an unused byte (a byte that didn't appear in your text file). This is where the term "byte pair encoding" comes from. (There are, as expected, various ways BPE has been used. For example, GPT-2 uses a byte-level BPE, where BPE is applied to raw bytes, not characters.)
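
As a preview of the next chapter, here is a heavily simplified sketch of that training loop. The toy corpus, word frequencies, and helper functions are made up for illustration:

from collections import Counter

# Toy corpus: word (pre-split into characters) -> frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}
merge_rules = []

def count_pairs(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(word, pair):
    merged, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)

num_merges = 10  # in practice: repeat until the desired vocabulary size is reached
for _ in range(num_merges):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)                       # most frequent adjacent pair
    merge_rules.append(best)                               # record the merge rule
    corpus = {merge_pair(w, best): f for w, f in corpus.items()}

print(merge_rules)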

Test-Time Tokenization

Given a piece of text, the BPE algorithm first performs a character-level tokenization, and then uses the merge rules learned during training to iteratively merge tokens until the tokenized sequence can no longer be reduced. The exact time complexity depends on the implementation, but the SentencePiece implementation (which is very popular, and integrated into 🤗 tokenizers) takes $O(N \log N)$, where $N$ is the length of the input sentence.
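
A rough, naive sketch of this test-time procedure (real libraries are far more optimized), using toy merge rules of the kind learned in the training sketch above:

# Toy merge rules, in the order they were learned during training
merge_rules = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w"),
               ("n", "e"), ("ne", "w"), ("new", "est")]

def bpe_tokenize(word, merge_rules):
    tokens = list(word)                      # start from a character-level split
    for pair in merge_rules:                 # apply merges in the order they were learned
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

print(bpe_tokenize("newest", merge_rules))   # ['newest']
print(bpe_tokenize("lowest", merge_rules))   # ['low', 'est']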

WordPiece Tokenization

WordPiece is another popular tokenization strategy, used in BERT, DistilBERT, etc. WordPiece tokenization is very similar to BPE, but it differs in the way pairs are selected during the merging steps. In the case of BPE, pairs are ranked simply by frequency, and the most frequent pair is selected each time. With WordPiece, a custom score is computed for each pair as follows:

$$\text{score} = \frac{\text{freq\_of\_pair}}{\text{freq\_of\_first\_element} \times \text{freq\_of\_second\_element}}$$

The pair with the highest score is merged at each iteration. By normalizing the frequency of the pair by the frequencies of the individual tokens, the algorithm prioritizes merging pairs whose parts are less frequent on their own in the vocabulary.
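
A minimal sketch of the scoring step; the counts and the candidate pairs are made up for illustration (the ## prefix is the WordPiece convention for a token that continues a word):

from collections import Counter

# Toy frequency counts over the current state of the corpus (made up for illustration)
token_freq = Counter({"hug": 15, "##s": 20, "un": 12, "##able": 8})
pair_freq = Counter({("hug", "##s"): 10, ("un", "##able"): 5})

def wordpiece_score(pair):
    return pair_freq[pair] / (token_freq[pair[0]] * token_freq[pair[1]])

best_pair = max(pair_freq, key=wordpiece_score)
print(best_pair, wordpiece_score(best_pair))  # the rarer-parts pair wins despite lower raw frequency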

Test-Time Tokenization

Given a piece of text, the WordPiece algorithm uses a left-to-right, longest-match-first strategy: starting at the beginning of a word, it finds the longest vocabulary token that is a prefix of the remaining text, emits it, and continues from there. (This seems to be linear-time processing, though I still need to dig into the implementation.)
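
Here's a rough sketch of that greedy strategy with a toy vocabulary (as in BERT, ## marks a continuation piece, and a word with no valid segmentation maps to [UNK]):

vocab = {"un", "##aff", "##able", "##ab", "[UNK]"}  # toy vocabulary

def wordpiece_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # find the longest substring starting at `start` that is in the vocabulary
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:            # no match found: the whole word becomes [UNK]
            return ["[UNK]"]
        start = end
    return tokens

print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']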

Unigram Tokenization

Unigram tokenization uses a statistical language model to model token probabilities. The basic assumption made by the unigram model is that the occurrence of each token is independent of the tokens before it (hence, it is a "unigram" language model). This is, of course, not appropriate for any serious model you'd want for generation, but we're only looking at tokenization. Compared to BPE, Unigram's training strategy proceeds in the opposite direction: a large vocabulary is iteratively reduced in size by removing certain tokens. The training process can be summarized as follows: you first start with a large vocabulary, for example by considering all possible substrings or by training BPE and using its learnt vocabulary. At each step of training, the algorithm computes a loss over the training corpus for the current vocabulary. Then, you reduce the vocabulary size by removing the $x$ percent of tokens that cause the loss to increase the least.

Test-Time Tokenization

Given a word, we look at all possible segmentations of that word into tokens and compute the probability of each according to the trained Unigram model. We pick the tokenization/segmentation with the highest probability. Segmentations with more tokens usually end up with a lower total probability (because you multiply more small numbers together), so the model favours segmentations with fewer tokens, which is similar to what we'd expect.
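
A brute-force sketch of this scoring, with toy token log-probabilities made up for illustration (real implementations use the Viterbi algorithm rather than enumerating every segmentation):

import math

# Toy unigram log-probabilities for each token in the vocabulary (made up)
log_prob = {"h": -4.0, "u": -4.0, "g": -4.5, "hu": -3.0, "ug": -3.2, "hug": -2.0}

def segmentations(word):
    # yield every way of splitting `word` into tokens from the vocabulary
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in log_prob:
            for rest in segmentations(word[i:]):
                yield [prefix] + rest

def best_segmentation(word):
    candidates = [(sum(log_prob[t] for t in seg), seg) for seg in segmentations(word)]
    return max(candidates)  # highest total log-probability wins

print(best_segmentation("hug"))  # (-2.0, ['hug']) - the single-token split beats finer splits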

SentencePiece

This is NOT a tokenization algorithm. SentencePiece is a framework for tokenization and detokenization. It is widely used because it has some desirable properties for an end-to-end text processing system, such as being purely data-driven and not relying on a pre-tokenization step (i.e., you don't have to send in text already separated by whitespace). It is also language-independent, and supports both the BPE and Unigram language model algorithms. When you see "a SentencePiece model" in the literature, it simply means that the model was trained using the SentencePiece library, and that the configuration/parameters of the model are available via the SentencePiece model abstraction (roughly speaking, this is like saying "PyTorch model", or "model trained with 🤗 Trainer"). The tokenization algorithm itself can be BPE (the most likely case) or Unigram (or a custom variant, if mentioned).
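
For concreteness, here is roughly what training with the sentencepiece library looks like. The file names and parameter values are placeholders, and the exact option list is best checked against the library's documentation:

import sentencepiece as spm

# Train a tokenizer directly on raw text; model_type can be "unigram" (the default) or "bpe"
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path to a plain-text corpus
    model_prefix="toy_sp",     # writes toy_sp.model and toy_sp.vocab
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Let's understand tokens", out_type=str))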

The Tokenization Pipeline

The full tokenization pipeline is below. I won't define these stages in detail, because you'll find precise definitions in plenty of introductory material on tokenization; a short sketch inspecting these stages on a 🤗 tokenizer follows the list.

  1. Normalization: cleaning steps such as lower-casing, removing accents, etc.
  2. Pre-tokenization: an optional stage of splitting up text into words based on whitespace (if applicable for that language)
  3. Model: the tokenization algorithm that takes in text/a list of words and spits out a list of tokens
  4. Post-processor: adds special tokens like sequence separators, beginning- and end-of-sequence tokens, etc.
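
A quick way to poke at these stages on a fast 🤗 tokenizer is through its tokenizers backend. The exact components exposed vary by model, so treat this as an illustration for BERT:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backend = tokenizer.backend_tokenizer

print(backend.normalizer.normalize_str("Héllò, Tokens!"))                    # normalization
print(backend.pre_tokenizer.pre_tokenize_str("Let's understand tokens"))     # pre-tokenization
print(tokenizer("Let's understand tokens"))   # model + post-processing (adds [CLS]/[SEP])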

An excellent visualization for the BERT tokenizer, from HuggingFace's NLP course:

[Image: the BERT tokenization pipeline]

In the next chapter, we'll dive into the BPE algorithm, and train a simple BPE model from scratch.

Further reading