Add support for `tokenizers` tokenizers #1251

gabe-l-hart · 2024-10-01T22:31:36Z

🚀 The feature, motivation and pitch

The request is to extend the tokenizer module in torchchat to support tokenizers that use the Huggingface tokenizers library.

There are many models out there that use tokenizers which won't be able to run in torchchat until they can be loaded and run either via the tokenizers library directly or via a conversion to tiktoken or sentencepiece.

Alternatives

It may be possible to convert a tokenizers tokenizer to a tiktoken tokenizer. I have a working implementation of this for the llama tokenizer.json model, however other models that use different tokenizers configurations do not work (in particular Granite Code).

Additional context

This issue is a piece of the puzzle for adding support for Granite Code 3b/8b which use the llama architecture in transormers, but take advantage several pieces of the architecture that are not currently supported by torchchat. The work-in-progress for Granite Code can be found on my fork: https://github.com/gabe-l-hart/torchchat/tree/GraniteCodeSupport.

I have a less fully-fleshed working version of this that I plan to put up as a Draft PR for discussion. I am not intimately familiar with the algorithmic differences between tiktoken and the various tokenizers pieces (in particular the pretokenizers). My branch has a python implementation that simply wraps tokenizers, but I have not yet tried to export Granite Code to other formats where I suspect it would break without a corresponding c++ implementation. I plan to investigate this further soon!

RFC (Optional)

No response

The text was updated successfully, but these errors were encountered:

Jack-Khuu · 2024-10-02T23:50:33Z

Thanks for the updates and context! I'd be interested in seeing what your working implementation is for converting out of HF's tokenizer lib.

gabe-l-hart · 2024-10-03T15:16:18Z

Sure! This is what I have for tiktoken_converter.py

tiktoken_converter.py

"""
Helper module for converting tokenizers from the `tokenizers` package to
tiktoken format for use in torchchat.
"""
# Standard
import base64
import json

# First Party
import argparse

# Third Party
from transformers.convert_slow_tokenizer import bytes_to_unicode
import tokenizers

## Helpers #####################################################################


def unicode_to_bytes():
    """Inversion of the lookup table for byte -> string"""
    return {v:k for k, v in bytes_to_unicode().items()}


byte_encoder = bytes_to_unicode()
byte_decoder = unicode_to_bytes()


def token_bytes_to_string(b):
    """
    DIRECTLY COPIED FROM `transformers`
    https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L1461
    """
    return "".join([byte_encoder[ord(char)] for char in b.decode("latin-1")])


def token_string_to_bytes(s):
    """Inversion of the conversion done in token_bytes_to_string"""
    return bytes([byte_decoder[byt] for byt in s])

def get_tokenizer_state(source_tokenizer: tokenizers.Tokenizer) -> dict:
    """The underlying tokenizer is buried in the rust structs, so it's not
    immediately accessible in python. This function leverages the string
    serialization to pull out the core of the configuration.
    """
    return json.loads(source_tokenizer.to_str())

def extract_pattern(source_tokenizer_state: dict) -> str:
    """Extract the string splitting regex for the pre-tokenizer"""
    return source_tokenizer_state['pre_tokenizer']['pretokenizers'][0]['pattern']['Regex']

def extract_special_tokens(source_tokenizer_state: dict) -> dict[str, int]:
    """Extract the special tokens that were added to the vocab"""
    return {
        itm["content"]: itm["id"]
        for itm in source_tokenizer_state["added_tokens"]
        if itm["special"]
    }

def convert_to_ranks(vocab_dict: dict[str, int]) -> dict[bytes, int]:
    """Convert from string form to the bytes form that is needed by tiktoken"""
    return {token_string_to_bytes(k): v for k, v in vocab_dict.items()}


def convert_tokenizers_to_tiktoken(
    source_tokenizer: tokenizers.Tokenizer,
) -> dict[bytes, int]:
    """End-to-end converter from tokenizers to tiktoken"""

    # Parse the serialized state of the source tokenizer
    source_tokenizer_state = get_tokenizer_state(source_tokenizer)

    # Extract the vocab from the tokenizer
    vocab = source_tokenizer.get_vocab()

    # Extract the special tokens from the tokenizer state
    special_tokens = extract_special_tokens(source_tokenizer_state)
    print("SPECIAL TOKENS:")
    for special_token, tok_id in sorted(special_tokens.items(), key=lambda x: x[1]):
        print(f'"{special_token}": {tok_id}')

    # Remove the special tokens from the vocab
    cleaned_vocab = {k: v for k, v in vocab.items() if k not in special_tokens}

    # Convert the cleaned vocab to byte form
    cleaned_vocab_ranks = convert_to_ranks(cleaned_vocab)
    return cleaned_vocab_ranks

def save_tiktoken_model(bpe_ranks: dict[bytes, int], output_path: str):
    """Saves a tiktoken model from an existing tokenizer."""
    with open(output_path, "wb") as handle:
        for token, rank in sorted(bpe_ranks.items(), key=lambda x: x[1]):
            handle.write(base64.b64encode(token) + b" " + str(rank).encode() + b"\n")


def validate_conversion(
    source_tokenizer: tokenizers.Tokenizer,
    output_file: str,
    test_strings: list[str] | None,
    test_files: list[str] | None,
):
    """Validate the tokenization between the source and target tokenizers"""
    # Local
    # NOTE: Local import to avoid hard dependency on torchchat
    from tokenizer.tiktoken import Tokenizer

    # Load the output tokenizer model with tiktoken in torchchat
    target_tokenizer = Tokenizer(output_file)

    # Define local comparison function
    def compare_tokenization(test_text: str):
        source_tokens = source_tokenizer.encode(test_text).ids
        target_tokens = target_tokenizer.encode(test_text, bos=False)
        if source_tokens != target_tokens:
            print("----------------------------")
            print("MISMATCH FOUND")
            print(f"Test text: {test_text}")
            print(f"Source tokens: {source_tokens}")
            print(f"Target tokens: {target_tokens}")
            print()
            #DEBUG
            breakpoint()

    # Validate on manual strings
    for test_string in test_strings or []:
        compare_tokenization(test_string)

    # Validate on file content
    for test_file in test_files or []:
        with open(test_file, "r") as handle:
            test_text = handle.read()
            compare_tokenization(test_text)


## Main ########################################################################

def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("input_file", help="The tokenizer json file to convert.")
    parser.add_argument("--output-file", "-o", default="tokenizer.model", help="The filename for the output tokenizer model")
    parser.add_argument("--test-string", "-ts", nargs="*", help="Strings to validate on")
    parser.add_argument("--test-file", "-tf", nargs="*", help="Files to validate on")
    args = parser.parse_args()

    # Load the tokenizer from the json file
    source_tokenizer = tokenizers.Tokenizer.from_file(args.input_file)

    # Do the conversion
    bpe_ranks = convert_tokenizers_to_tiktoken(source_tokenizer)

    # Save the model
    save_tiktoken_model(bpe_ranks, args.output_file)

    # Validate if requested
    if args.test_string or args.test_file:
        validate_conversion(
            source_tokenizer,
            args.output_file,
            args.test_string,
            args.test_file
        )

if __name__ == "__main__":
    main()

The main gap is around handling the pretokenizer. In tokenizers, there are a number of pretokenizer types that either take multiple regexes evaluated in sequence (later exprs evaluated on the chunks found by previous splits), or are just different classes with different splitting logic (I haven't fully delved this yet).

The other piece that is not yet portable is the addition of special tokens other than those used by the llama* models. The tokenizer.model format seems to only encode the vocab (ranks) and doesn't seem to have a way to include the special tokens from what I can tell.

gabe-l-hart · 2024-10-03T16:00:02Z

Draft PR up: #1261

I've noted some more details on the open investigation questions in the Discussion section of the PR.

gabe-l-hart · 2024-10-04T17:43:06Z

@Jack-Khuu I've been digging into the landscape of the c++ code a bit. It looks like in addition to supporting this in torchchat, we'd need to also extend the tokenizer functionality in executorch which seems to be logically equivalent, but implemented differently. I think the set of issues in both codebases is similar, though:

There is no way to customize the pre-tokenizer regex in the c++ code
There is no way to run multiple pre-tokenizer regexes in the c++ code
There is no way to customize the special tokens in the c++ code
There is no format in the tokenizers.model format for either custom regex(es) or special tokens

The underlying guts of the decode implementation in the two tiktoken.cpp implementations would actually not be terribly hard to update to support 1-3, but doing so would definitely break the 1:1 compatibility nature with the original tiktoken implementation. Similarly, it wouldn't be terribly difficult to add special parsing logic to parse additional metadata fields from the tokenizer.model format, but that would also break compatibility with the true tiktoken format.

Given this, I think we could go one of two ways:

Extend what is currently called tiktoken in both projects to be a generic regex/special-token tokenizer
- If we go this route, we'd probably want to consider renaming it?
Add a net-new implementation with significant code overlap that parses the native tokenizer.json format from tokenizers
- This would require adding json parsing support (e.g. vendoring a copy of nlohmann/json)

Given the compatibility concerns, my initial preference would be for (2), but I want to kick off the conversation since either one would be a pretty significant change.

…er classes pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…rings pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…er classes pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…rings pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…nizer_config.json We may still need to load the merges themselves pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Not a terribly realistic usecase, but this avoids a corner case (that I just might be hitting while tokenizers is stubbed out!) pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Jack-Khuu · 2024-10-07T21:49:28Z

Thanks for the details and analysis, I'll hop over to the PR to comment