This library is a pure-Python implementation of a modified version of Hugging Face's BERT tokenizer.
Install and update using pip:

```bash
pip install word-piece-tokenizer
```
```python
from word_piece_tokenizer import WordPieceTokenizer

tokenizer = WordPieceTokenizer()

ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]

tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']

tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'
```
Test the tokenizer against Hugging Face's implementation:

```bash
pip install transformers
python tests/tokenizer_test.py
```
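For a quick manual check outside the test suite, the two tokenizers can also be compared directly. The snippet below is a minimal sketch, assuming `transformers` is installed and can download the `bert-base-uncased` vocabulary; it is not part of the library.

```python
# Minimal comparison sketch (assumes transformers can fetch bert-base-uncased).
from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

text = 'reading a storybook!'

hf_ids = BertTokenizer.from_pretrained('bert-base-uncased').encode(text)
our_ids = WordPieceTokenizer().tokenize(text)

# Both should produce the same ids, including [CLS] and [SEP].
assert hf_ids == our_ids, (hf_ids, our_ids)
print(our_ids)  # [101, 3752, 1037, 2466, 8654, 999, 102]
```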
To make the tokenizer more lightweight and versatile for use cases such as embedded systems and browsers, it has been stripped of optional and unused features.

The following features are enabled by default instead of being configurable:
Category | Feature |
---|---|
Tokenizer | - The tokenizer utilises the pre-trained `bert-base-uncased` vocab list.<br>- Basic tokenization is performed before WordPiece tokenization. |
Text Cleaning | - Chinese characters are padded with whitespace.<br>- Characters are converted to lowercase.<br>- The input string is stripped of accents (see the sketch below). |
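The cleaning steps above roughly correspond to the following sketch. It illustrates standard BERT-style cleanup rather than the library's actual code; `_clean_text`, `_is_chinese_char`, and the single CJK range shown are assumptions.

```python
import unicodedata

def _is_chinese_char(cp: int) -> bool:
    # CJK Unified Ideographs block (a simplified subset of the ranges
    # BERT checks; shown here only for illustration).
    return 0x4E00 <= cp <= 0x9FFF

def _clean_text(text: str) -> str:
    # Lowercase, strip accents (NFD decomposition, then drop combining
    # marks), then pad Chinese characters with whitespace.
    text = text.lower()
    text = ''.join(
        ch for ch in unicodedata.normalize('NFD', text)
        if unicodedata.category(ch) != 'Mn'
    )
    out = []
    for ch in text:
        out.append(f' {ch} ' if _is_chinese_char(ord(ch)) else ch)
    return ''.join(out)

print(_clean_text('Café 读书'))  # 'cafe  读  书 '
```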
The following features have been removed from the tokenizer:

- `pad_token`, `mask_token`, and special tokens
- Ability to add new tokens to the tokenizer
- Ability to never split certain strings (`never_split`)
- Unused functions such as `build_inputs_with_special_tokens`, `get_special_tokens_mask`, `get_vocab`, `save_vocabulary`, and more...
The tokenizer's longest-substring token matching algorithm is implemented using a trie instead of greedy longest-match-first.

The original `Trie` class has been modified to suit this matching algorithm. Instead of a `split` function that separates the input string into substrings, the new trie implements a `getLongestMatchToken` function that returns the token value (`int`) of the longest substring match and the remaining unmatched substring (`str`).
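For illustration, a trie supporting this kind of lookup could look like the sketch below. The method name and return signature mirror the description above, but the implementation details (node layout, toy token ids) are assumptions, not the library's actual code.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # set when a complete token ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, token: str, token_id: int) -> None:
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id

    def getLongestMatchToken(self, text: str):
        """Return (token_id, remaining) for the longest prefix of `text`
        that is a known token, or (None, text) if nothing matches."""
        node = self.root
        best_id, best_end = None, 0
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.token_id is not None:
                best_id, best_end = node.token_id, i + 1
        return best_id, text[best_end:]

# Toy usage with hypothetical token ids:
trie = Trie()
trie.add('read', 1)
trie.add('reading', 2)
print(trie.getLongestMatchToken('readings'))  # (2, 's')
print(trie.getLongestMatchToken('xyz'))       # (None, 'xyz')
```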