Add tokenization support to Disco LLMs #646

JulienVig · 2024-03-05T16:35:24Z

Previous and current work on LLM integration in Disco relies on pre-tokenized datasets and does not account for token decoding after inference.

Full tokenizer support would allow:

Inputting natural text, which is then tokenized in Disco
Decoding tokens after model inference
Using pre-trained LLMs in Disco: this first requires converting the weights to a format compatible with TF.js or other JS libraries, and it also requires the model's pre-trained tokenizer, which must likewise be converted to JavaScript
Training a tokenizer with a particular algorithm (e.g. SentencePiece BPE) on a custom dataset, using it when training a model, and saving it along with the model
Saving and storing a tokenizer alongside the model with which it was used
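
The flow described in the list above (text in, token ids through the model, text out, tokenizer stored with the model) can be sketched with a toy word-level tokenizer. This is only an illustration under stated assumptions: `ToyTokenizer`, `serialize`, `deserialize`, and the `bundle` layout are hypothetical names, not Disco or Transformers.js APIs, and a real tokenizer would use subword units rather than whitespace splitting.

```javascript
// Toy word-level tokenizer standing in for a real pre-trained tokenizer.
// Every name in this block is hypothetical and exists only for illustration.
class ToyTokenizer {
  constructor(tokens) {
    this.tokens = tokens; // id -> token (the id is the array index)
    this.ids = new Map(tokens.map((token, id) => [token, id])); // token -> id
    this.unkId = this.ids.get('<unk>');
  }

  // Natural text in, token ids out; unknown words map to the <unk> id.
  encode(text) {
    return text
      .toLowerCase()
      .split(/\s+/)
      .filter((word) => word.length > 0)
      .map((word) => (this.ids.has(word) ? this.ids.get(word) : this.unkId));
  }

  // Token ids in, text out (what would run after model inference).
  decode(ids) {
    return ids.map((id) => this.tokens[id]).join(' ');
  }

  // The vocabulary is all the state there is, so JSON is enough to persist it.
  serialize() {
    return JSON.stringify({ type: 'toy-word-level', tokens: this.tokens });
  }

  static deserialize(json) {
    return new ToyTokenizer(JSON.parse(json).tokens);
  }
}

const tokenizer = new ToyTokenizer(['<unk>', 'i', 'love', 'transformers']);
const inputIds = tokenizer.encode('I love transformers'); // [1, 2, 3]

// Hypothetical bundle: the tokenizer travels with the (placeholder) model weights.
const bundle = { weights: [0.1, 0.2], tokenizer: tokenizer.serialize() };

// Later, restore the tokenizer from the bundle and decode model output.
const restored = ToyTokenizer.deserialize(bundle.tokenizer);
const text = restored.decode(inputIds); // 'i love transformers'
```

Storing the tokenizer next to the weights means that whoever loads the model decodes with the same vocabulary it was trained on, which is the point of the last item in the list.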

JulienVig · 2024-03-06T16:16:13Z

Transformers.js may be an efficient off-the-shelf solution for adding pre-trained tokenizer support to Disco.
The library brings the Hugging Face Transformers library to JavaScript, including tokenizer support:

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');

gpt-tokenizer is another alternative, which @peacefulotter used in their experiments. It builds on OpenAI's tiktoken library and covers GPT tokenizers only.
For Llama models, llama-tokenizer-js implements the SentencePiece BPE algorithm. This implementation is the basis for the Transformers.js tokenizer implementation.

However, none of these libraries offers tokenizer training capabilities.
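
To make concrete what tokenizer training would involve, here is a minimal sketch of how byte-pair-encoding merge rules can be learned from a small word list. It is plain character-level BPE, not SentencePiece and not code from any of the libraries above; `trainBpe`, `applyMerge`, and `encodeWord` are hypothetical helper names, and ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice.

```javascript
// Replace every adjacent (left, right) occurrence in `symbols` with one merged symbol.
function applyMerge(symbols, pair) {
  const [left, right] = pair;
  const merged = [];
  let i = 0;
  while (i < symbols.length) {
    if (i + 1 < symbols.length && symbols[i] === left && symbols[i + 1] === right) {
      merged.push(left + right);
      i += 2;
    } else {
      merged.push(symbols[i]);
      i += 1;
    }
  }
  return merged;
}

// Learn up to `numMerges` merge rules, starting from single characters.
function trainBpe(words, numMerges) {
  let corpus = words.map((word) => Array.from(word));
  const merges = [];
  for (let step = 0; step < numMerges; step++) {
    // Count every adjacent symbol pair across the corpus.
    const counts = new Map();
    for (const symbols of corpus) {
      for (let i = 0; i + 1 < symbols.length; i++) {
        const key = JSON.stringify([symbols[i], symbols[i + 1]]);
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
    if (counts.size === 0) break; // nothing left to merge

    // Pick the most frequent pair; the first one seen wins ties.
    let bestKey = null;
    let bestCount = 0;
    for (const [key, count] of counts) {
      if (count > bestCount) {
        bestKey = key;
        bestCount = count;
      }
    }
    const pair = JSON.parse(bestKey);
    merges.push(pair);
    corpus = corpus.map((symbols) => applyMerge(symbols, pair));
  }
  return merges;
}

// Tokenize a new word by replaying the learned merges in order.
function encodeWord(word, merges) {
  return merges.reduce((symbols, pair) => applyMerge(symbols, pair), Array.from(word));
}

const merges = trainBpe(['low', 'low', 'lower', 'lowest'], 2);
// merges: [['l', 'o'], ['lo', 'w']]
const pieces = encodeWord('lowly', merges);
// pieces: ['low', 'l', 'y']
```

The learned `merges` list is the whole trained state here, so it could be serialized as JSON and stored with the model.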

JulienVig added the discojs label Mar 5, 2024

JulienVig self-assigned this Mar 5, 2024

tharvik mentioned this issue Mar 13, 2024

discojs-core/models: add gpt #644

Merged

JulienVig mentioned this issue Mar 18, 2024

Add tokenization and prompting API to GPT models #651

Merged

JulienVig closed this as completed in #651 Apr 3, 2024

martinjaggi added this to the v3.0.0 milestone Jul 24, 2024
