This library is a pure-Python implementation of a modified version of Hugging Face's BERT tokenizer.
Install and update using pip:

```bash
pip install word-piece-tokenizer
```
```python
from word_piece_tokenizer import WordPieceTokenizer

tokenizer = WordPieceTokenizer()

ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]

tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']

tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'
```
Test the tokenizer against Hugging Face's implementation:

```bash
pip install transformers
python tests/tokenizer_test.py
```
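For a quick manual check outside the test suite, the two tokenizers can also be compared directly. The snippet below is a minimal sketch, assuming `transformers` is installed and can download the `bert-base-uncased` vocabulary; it is not part of the library.

```python
# Minimal comparison sketch (assumes transformers can fetch bert-base-uncased).
from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

text = 'reading a storybook!'

hf_ids = BertTokenizer.from_pretrained('bert-base-uncased').encode(text)
our_ids = WordPieceTokenizer().tokenize(text)

# Both should produce the same ids, including [CLS] and [SEP].
assert hf_ids == our_ids, (hf_ids, our_ids)
print(our_ids)  # [101, 3752, 1037, 2466, 8654, 999, 102]
```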
To make the tokenizer more lightweight and versatile for use cases such as embedded systems and browsers, it has been stripped of optional and unused features.

The following features are enabled by default instead of being configurable:
Category | Feature |
---|---|
Tokenizer | - The tokenizer utilises the pre-trained `bert-base-uncased` vocab list.<br>- Basic tokenization is performed before WordPiece tokenization. |
Text Cleaning | - Chinese characters are padded with whitespace.<br>- Characters are converted to lowercase.<br>- The input string is stripped of accents (see the sketch below). |
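The cleaning steps above roughly correspond to the following sketch. It illustrates standard BERT-style cleanup rather than the library's actual code; `_clean_text`, `_is_chinese_char`, and the single CJK range shown are assumptions.

```python
import unicodedata

def _is_chinese_char(cp: int) -> bool:
    # CJK Unified Ideographs block (a simplified subset of the ranges
    # BERT checks; shown here only for illustration).
    return 0x4E00 <= cp <= 0x9FFF

def _clean_text(text: str) -> str:
    # Lowercase, strip accents (NFD decomposition, then drop combining
    # marks), then pad Chinese characters with whitespace.
    text = text.lower()
    text = ''.join(
        ch for ch in unicodedata.normalize('NFD', text)
        if unicodedata.category(ch) != 'Mn'
    )
    out = []
    for ch in text:
        out.append(f' {ch} ' if _is_chinese_char(ord(ch)) else ch)
    return ''.join(out)

print(_clean_text('Café 读书'))  # 'cafe  读  书 '
```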
The following features have been removed from the tokenizer:

- `pad_token`, `mask_token`, and special tokens
- Ability to add new tokens to the tokenizer
- Ability to never split certain strings (`never_split`)
- Unused functions such as `build_inputs_with_special_tokens`, `get_special_tokens_mask`, `get_vocab`, `save_vocabulary`, and more...
The tokenizer's longest-substring token matching algorithm is implemented using a trie instead of greedy longest-match-first.

The original `Trie` class has been modified to suit this matching algorithm. Instead of a `split` function that separates the input string into substrings, the new trie implements a `getLongestMatchToken` function that returns the token value (`int`) of the longest substring match and the remaining unmatched substring (`str`).
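For illustration, a trie supporting this kind of lookup could look like the sketch below. The method name and return signature mirror the description above, but the implementation details (node layout, toy token ids) are assumptions, not the library's actual code.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # set when a complete token ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, token: str, token_id: int) -> None:
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id

    def getLongestMatchToken(self, text: str):
        """Return (token_id, remaining) for the longest prefix of `text`
        that is a known token, or (None, text) if nothing matches."""
        node = self.root
        best_id, best_end = None, 0
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.token_id is not None:
                best_id, best_end = node.token_id, i + 1
        return best_id, text[best_end:]

# Toy usage with hypothetical token ids:
trie = Trie()
trie.add('read', 1)
trie.add('reading', 2)
print(trie.getLongestMatchToken('readings'))  # (2, 's')
print(trie.getLongestMatchToken('xyz'))       # (None, 'xyz')
```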