A readable implementation of fast greedy BPE tokenizer with detailed documentation.


laelhalawani/fast-greedy-bpe-tokenizer


BPE Tokenizer Documentation

Introduction and Overview

The BPE Tokenizer is a fast and greedy Byte Pair Encoding tokenizer for Python. It builds a vocabulary by iteratively merging the most frequent pairs of adjacent characters or tokens, and then uses that vocabulary to split text into subword units. This documentation provides an overview of the BPE Tokenizer and how to use it effectively.

Installation

To install the BPE Tokenizer, you can use:

pip install git+https://github.com/laelhalawani/fast-greedy-bpe-tokenizer.git

or manually download the repository, open a command prompt in the extracted directory, and run the install command:

pip install .

After installation, you can import the tokenizer using import bpe_tokenizer or from bpe_tokenizer import BPETokenizer.
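For example:

import bpe_tokenizer
# or import just the class:
from bpe_tokenizer import BPETokenizer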

BPE Tokenizer Class Usage Guide

The BPE Tokenizer class allows trainable subword tokenization of text using the Byte Pair Encoding algorithm. Here are some examples of how to use the class:

Train and Save or Load Tokenizer

Train and save the tokenizer from scratch:

from bpe_tokenizer import BPETokenizer

train_file = "./training_data.txt"
saved_vocab = "./vocab.json"
tokenizer = BPETokenizer()
tokenizer.train_from_file(train_file, 5000, True)  # desired_vocab_size=5000, word_level=True
tokenizer.save_vocab_file(saved_vocab)

Load a pretrained tokenizer:

saved_vocab = "./vocab.json"
tokenizer = BPETokenizer(vocab_or_json_path=saved_vocab)

Encode Text to Integers

Encode text to integers using a trained or loaded tokenizer:

saved_vocab = "./vocab.json"
tokenizer = BPETokenizer(saved_vocab)
tokens = tokenizer.encode("This is some example text")
print(tokens)

Decode Integers to Text

Decode integers back into text:

saved_vocab = "./vocab.json"
tokenizer = BPETokenizer(saved_vocab)
text = tokenizer.decode([1, 0, -1, -999])
print(text)

API Reference

train

Method Signature:

def train(self, corpus, desired_vocab_size:int, word_level=True)

Performs the core BPE training algorithm on the provided corpus. Iteratively merges the most frequent pair of adjacent symbols in the corpus text to build up vocabulary.
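A minimal sketch of calling train directly, assuming corpus accepts a plain Python string (the signature leaves the type unspecified) and with an illustrative vocabulary size:

from bpe_tokenizer import BPETokenizer

corpus = "the quick brown fox jumps over the lazy dog"  # toy corpus for illustration
tokenizer = BPETokenizer()
tokenizer.train(corpus, desired_vocab_size=300, word_level=True)  # merge pairs until ~300 symbols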

train_from_file

Method Signature:

def train_from_file(self, corpus_file_path:str, desired_vocab_size:int, word_level=True)

Trains the BPE model from a corpus file. It reads the corpus text file line by line and performs the core BPE training algorithm on each line.
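The usage example above passes these arguments positionally; written with keyword arguments from the signature, it reads as:

from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer()
tokenizer.train_from_file(
    corpus_file_path="./training_data.txt",
    desired_vocab_size=5000,
    word_level=True,
)
tokenizer.save_vocab_file("./vocab.json")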

encode

Method Signature:

def encode(self, text, pad_to_tokens=0)

Encodes text into corresponding integer tokens using the trained vocabulary. It iteratively takes the longest matching substring from the vocabulary and emits the integer token. Unknown symbols are replaced with the UNK token.
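A sketch of using pad_to_tokens, assuming it pads the output list up to the requested length (the padding token value and exact behavior are implementation-specific):

from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer("./vocab.json")
tokens = tokenizer.encode("This is some example text", pad_to_tokens=16)
print(tokens)  # assumed: padded out to at least 16 tokens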

decode

Method Signature:

def decode(self, tokens)

Decodes a list of integer tokens into the corresponding text. It looks up each integer token in the decoder dictionary to recover the symbol string.
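A round trip through encode and decode; how closely the output matches the input depends on the trained vocabulary and any UNK replacements:

from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer("./vocab.json")
tokens = tokenizer.encode("This is some example text")
text = tokenizer.decode(tokens)
print(text)  # should closely match the input, up to unknown-symbol handling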

save_vocab_file

Method Signature:

def save_vocab_file(self, json_file_path:str)

Saves the current vocabulary dictionary to the provided file path as a JSON file. It serializes the vocabulary dict as JSON using the json module.

load_vocab_file

Method Signature:

def load_vocab_file(self, json_file_path:str)

Loads the vocabulary dictionary from a JSON file at the given path using the json module.
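Loading into an existing tokenizer instance, as an alternative to passing the vocabulary path to the constructor:

from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer()
tokenizer.load_vocab_file("./vocab.json")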

Advanced Topics

Setting word_level=False enables a character-level BPE model. It is significantly slower to train than a word-level model, but it may be more accurate for complex tasks. Character-level training was used in GPT tokenizers. The default value of word_level in this BPETokenizer implementation is True.
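A sketch of character-level training; the smaller vocabulary size and the output path here are only illustrative:

from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer()
tokenizer.train_from_file("./training_data.txt", 1000, word_level=False)  # character-level merges; slower to train
tokenizer.save_vocab_file("./char_vocab.json")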

License

GNU AGPLv3 2023, laelhalawani@gmail.com.

Contributing

Any and all contributions are welcome, thank you!
