A Lightweight Word Piece Tokenizer

This library is a pure-Python implementation of a modified version of Hugging Face's BERT tokenizer.

Table of Contents

  1. Usage
  2. Making it Lightweight
  3. Matching Algorithm

Usage

Installing

Install and update using pip:

pip install word-piece-tokenizer

Example

from word_piece_tokenizer import WordPieceTokenizer
tokenizer = WordPieceTokenizer()

ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]

tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']

tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'

Running Tests

Test the tokenizer against Hugging Face's implementation:

pip install transformers
python tests/tokenizer_test.py
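
A minimal sketch of the kind of check such a test performs (this exact comparison is an assumption, not the contents of tests/tokenizer_test.py):

from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

# Both tokenizers should produce identical token ids,
# including the [CLS] and [SEP] special tokens.
hf_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer = WordPieceTokenizer()

text = 'reading a storybook!'
assert tokenizer.tokenize(text) == hf_tokenizer.encode(text)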

Making It Lightweight

To make the tokenizer lightweight and versatile enough for environments such as embedded systems and browsers, it has been stripped of optional and unused features.

Optional Features

The following features are enabled by default instead of being configurable:

Tokenizer
  • The tokenizer uses the pre-trained bert-base-uncased vocab list.
  • Basic tokenization is performed before word piece tokenization.

Text Cleaning (illustrated by the sketch after this list)
  • Chinese characters are padded with whitespace.
  • Characters are converted to lowercase.
  • The input string is stripped of accents.
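
For illustration, the cleaning steps above could look roughly like this (clean_text is a hypothetical helper, not the library's actual function; a real implementation would check additional CJK ranges):

import unicodedata

def clean_text(text):
    # Lowercase every character.
    text = text.lower()
    # Strip accents: decompose to NFD, then drop combining marks (category Mn).
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
    # Pad Chinese characters with whitespace so each is split into its own token.
    padded = []
    for c in text:
        if 0x4E00 <= ord(c) <= 0x9FFF:  # CJK Unified Ideographs (simplified check)
            padded.append(f' {c} ')
        else:
            padded.append(c)
    return ''.join(padded)

print(clean_text('Café 读书'))  # 'cafe  读  书 '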

Unused Features

The following features have been removed from the tokenizer:

  • pad_token, mask_token, and special tokens
  • Ability to add new tokens to the tokenizer
  • Ability to never split certain strings (never_split)
  • Unused functions such as build_inputs_with_special_tokens, get_special_tokens_mask, get_vocab, save_vocabulary, and more...

Matching Algorithm

The tokenizer's longest substring token matching algorithm is implemented using a trie instead of a greedy longest-match-first search.

The Trie

The original Trie class has been reworked to support this modified longest substring token matching algorithm.

Instead of a split function that separates the input string into substrings, the new trie implements a getLongestMatchToken function that returns the token value (int) of the longest substring match along with the remaining unmatched substring (str).
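
A minimal sketch of such a trie (the getLongestMatchToken name comes from the description above; the token ids and internals here are illustrative assumptions, not the library's exact code):

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # id of the token ending at this node, if any

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, token_id):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id

    def getLongestMatchToken(self, text):
        # Walk the trie, remembering the last position where a full token ended.
        node = self.root
        match_id, match_end = None, 0
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.token_id is not None:
                match_id, match_end = node.token_id, i + 1
        # Return the longest match's id and the unmatched remainder.
        return match_id, text[match_end:]

# With a toy vocab of {'story': 7, 'storybook': 8}:
trie = Trie()
trie.add('story', 7)
trie.add('storybook', 8)
print(trie.getLongestMatchToken('storybooks'))  # (8, 's')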