Previous and current work on LLM integration in DISCO relies on pre-tokenized datasets and does not account for token decoding after inference.
Full tokenizer support would allow:

- inputting natural text, which is then tokenized within DISCO
- decoding tokens after model inference
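The two capabilities above amount to an encode/decode round trip around the model. A minimal sketch, using a toy whitespace vocabulary as a stand-in for a real trained tokenizer:

```javascript
// Toy stand-in for a trained tokenizer's vocabulary (invented for the example).
const vocab = new Map([['i', 0], ['love', 1], ['disco', 2], ['<unk>', 3]]);
const inverse = new Map([...vocab].map(([tok, id]) => [id, tok]));

// Encoding: raw user text -> token ids fed to the model.
function encode(text) {
  return text.toLowerCase().split(/\s+/)
    .map((tok) => vocab.get(tok) ?? vocab.get('<unk>'));
}

// Decoding: token ids produced by inference -> readable text.
function decode(ids) {
  return ids.map((id) => inverse.get(id)).join(' ');
}

console.log(encode('I love DISCO')); // [0, 1, 2]
console.log(decode([0, 1, 2]));      // 'i love disco'
```

A real tokenizer would use subword units (e.g. BPE) rather than whitespace splitting, but the interface DISCO needs is the same pair of functions.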
Using a pre-trained LLM in DISCO first requires converting the weights to a format compatible with TF.js or other JavaScript libraries. In addition, we need to use the model's pre-trained tokenizer, which must also be converted to JavaScript.
- Train a tokenizer with a particular algorithm (e.g. SentencePiece BPE) on a custom dataset and use it when training a model
- Save and store the tokenizer alongside the model with which it was used
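Storing the tokenizer next to the model could be as simple as serializing its configuration into the same artifact directory. A hedged sketch, where the file names and the shape of `tokenizerConfig` are assumptions, not an existing DISCO API:

```javascript
// Assumed shape of a serializable tokenizer configuration.
const tokenizerConfig = {
  algorithm: 'BPE',                    // e.g. SentencePiece BPE
  vocab: { i: 0, love: 1, disco: 2 },  // toy vocabulary for illustration
  merges: [],
};

function serializeArtifacts(modelWeights, config) {
  // In practice these strings would be written to files or object
  // storage under the same model directory.
  return {
    'model.json': JSON.stringify(modelWeights),
    'tokenizer.json': JSON.stringify(config),
  };
}

function loadTokenizer(artifacts) {
  return JSON.parse(artifacts['tokenizer.json']);
}

const artifacts = serializeArtifacts({ layers: [] }, tokenizerConfig);
console.log(loadTokenizer(artifacts).algorithm); // 'BPE'
```

Keeping the tokenizer in the same artifact guarantees that inference always uses the exact vocabulary the model was trained with.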
Transformers.js may be an efficient off-the-shelf solution for adding pre-trained tokenizer support in DISCO.
The library ports the Hugging Face Transformers library to JavaScript, including tokenizer support:
```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
```
gpt-tokenizer is another alternative, which @peacefulotter used in their experiments. It extends OpenAI's tiktoken library and therefore covers GPT tokenizers only.
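Both libraries are built on byte-pair encoding (BPE). To make the underlying mechanism concrete, here is a toy sketch of the BPE merge loop, with an invented merge table (not the actual tiktoken or gpt-tokenizer internals):

```javascript
// Invented merge ranks: lower rank = higher-priority merge.
const ranks = new Map([
  ['l+o', 0],
  ['lo+w', 1],
  ['e+r', 2],
]);

function bpe(word) {
  // Start from individual characters and repeatedly apply the
  // lowest-rank adjacent merge until no merge applies.
  let parts = [...word];
  while (true) {
    let best = null;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(`${parts[i]}+${parts[i + 1]}`);
      if (rank !== undefined && (best === null || rank < best.rank)) {
        best = { i, rank };
      }
    }
    if (best === null) break;
    parts.splice(best.i, 2, parts[best.i] + parts[best.i + 1]);
  }
  return parts;
}

console.log(bpe('lower')); // ['low', 'er']
```

A production tokenizer adds byte-level pre-processing, a large learned merge table, and an id lookup on top of this loop, but the merge procedure itself is the same.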