Add tokenization support to Disco LLMs #646

Closed
3 of 5 tasks
JulienVig opened this issue Mar 5, 2024 · 1 comment · Fixed by #651
JulienVig commented Mar 5, 2024

Previous and current work on integrating LLMs into DISCO relies on pre-tokenized datasets and doesn't account for token decoding after inference.

Full tokenizer support would allow:

  • To input natural text, which is then tokenized in DISCO
  • To decode tokens after model inference
  • To use a pre-trained LLM in Disco. This first requires converting the weights to a format compatible with TF.js or other JS libraries. We will also need the model's pre-trained tokenizer, which must likewise be converted to JavaScript.
  • To train a tokenizer with a particular algorithm (e.g. SentencePiece BPE) on a custom dataset, use it when training a model, and save it along with the model
  • To save and store the tokenizer alongside the model with which it was used
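
The encode, decode, and save-alongside-the-model steps listed above can be sketched as follows. This is only an illustration under stated assumptions: `CharTokenizer` and its methods are hypothetical names, not part of Disco, and a character-level vocabulary stands in for a real tokenizer.

```javascript
// Hypothetical sketch: a character-level tokenizer standing in for a real one.
class CharTokenizer {
  // vocab: array of unique characters; the array index is the token id
  constructor(vocab) {
    this.vocab = vocab;
    this.ids = new Map(vocab.map((ch, id) => [ch, id]));
  }

  // Build the vocabulary from the characters seen in a corpus
  static fromCorpus(text) {
    return new CharTokenizer([...new Set(text)].sort());
  }

  // Natural text -> token ids (unknown characters throw)
  encode(text) {
    return [...text].map((ch) => {
      const id = this.ids.get(ch);
      if (id === undefined) throw new Error(`unknown character: ${ch}`);
      return id;
    });
  }

  // Token ids -> text, e.g. after model inference
  decode(ids) {
    return ids.map((id) => this.vocab[id]).join('');
  }

  // Plain-object form, so the tokenizer can be stored next to the model weights
  toJSON() {
    return { type: 'char', vocab: this.vocab };
  }

  static fromJSON(json) {
    return new CharTokenizer(json.vocab);
  }
}

const tokenizer = CharTokenizer.fromCorpus('hello disco');
const ids = tokenizer.encode('hello');

// Round-trip through JSON, as if the tokenizer had been saved and reloaded with a model
const restored = CharTokenizer.fromJSON(JSON.parse(JSON.stringify(tokenizer)));
console.log(ids, restored.decode(ids));
```

Keeping the tokenizer as a plain JSON object means it could be bundled with whatever format the model weights are stored in, so a reloaded model always decodes with the vocabulary it was trained on.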
@JulienVig JulienVig added the discojs Related to Disco.js label Mar 5, 2024
@JulienVig JulienVig self-assigned this Mar 5, 2024
@JulienVig

  • Transformers.js may be an efficient off-the-shelf solution for adding pre-trained tokenizer support in Disco.
    The library brings the Hugging Face Transformers library to JavaScript, including tokenizer support:
```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
```

However, none of these libraries offers tokenizer training capabilities.
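
To give a rough idea of what training support would involve, here is a toy byte-pair-encoding (BPE) trainer in plain JavaScript. It is a minimal sketch, not SentencePiece and not an existing Disco or Transformers.js API: `countPairs`, `applyMerge`, `trainBpe`, and `encodeWord` are hypothetical helpers, and the corpus is assumed to be whitespace-separated text.

```javascript
// Toy BPE trainer; all helper names are hypothetical.

// Count how often each adjacent symbol pair occurs across all words
function countPairs(words) {
  const counts = new Map();
  for (const symbols of words) {
    for (let i = 0; i < symbols.length - 1; i++) {
      const key = symbols[i] + '\u0000' + symbols[i + 1];
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

// Replace every occurrence of the pair (a, b) in a symbol list with the merged symbol
function applyMerge(symbols, a, b) {
  const out = [];
  for (let i = 0; i < symbols.length; i++) {
    if (i < symbols.length - 1 && symbols[i] === a && symbols[i + 1] === b) {
      out.push(a + b);
      i++; // skip the second half of the merged pair
    } else {
      out.push(symbols[i]);
    }
  }
  return out;
}

// Learn up to `numMerges` merge rules from whitespace-separated text
function trainBpe(corpus, numMerges) {
  let words = corpus.split(/\s+/).filter(Boolean).map((w) => [...w]);
  const merges = [];
  for (let step = 0; step < numMerges; step++) {
    let best = null;
    let bestCount = 1; // only merge pairs seen at least twice
    for (const [key, count] of countPairs(words)) {
      if (count > bestCount) {
        best = key;
        bestCount = count;
      }
    }
    if (best === null) break;
    const [a, b] = best.split('\u0000');
    merges.push([a, b]);
    words = words.map((symbols) => applyMerge(symbols, a, b));
  }
  return merges;
}

// Tokenize a single word by replaying the learned merges in order
function encodeWord(word, merges) {
  let symbols = [...word];
  for (const [a, b] of merges) symbols = applyMerge(symbols, a, b);
  return symbols;
}

const merges = trainBpe('low low low lower lowest', 2);
console.log(merges, encodeWord('lowest', merges));
```

The learned `merges` list is plain data, so it could be serialized to JSON and stored with the model. A production tokenizer would additionally need byte-level fallback, special tokens, and a far more efficient pair-counting loop.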
