-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tokenization and prompting API to GPT models #651
Conversation
@tharvik @peacefulotter I'd be happy to hear your take on how to integrate tokenization and prompting into the Disco architecture. |
Congrats on the LLM POC milestone! Firstly, I am wondering if having a custom tokenizer is even really relevant in the first place. Is there really a use case for it in Disco? If so, instead of passing the tokenizer object, you could pass a string representing the tokenizer to use. Then an object could map tokenizer ids / names to the corresponding instance. |
Secondly, since it appears tokenization needs to be done on the fly (before or during training) and not as a completely separate step. It could be done as a preprocessing step like Hugo planned to do on his branch |
@peacefulotter yes I want to do both points you mentioned! The second one is already implemented, the tokenization is part of the preprocessing. |
I don't like
after a bit more thinking, I don't think it really matters for now, as long as #639 is not done
indeed, functions can't be represented by msgpack/json so it gets removed. it'll be nice to have a serialization process to ensure that we get function back from the network (a string such as |
Transformer.js requires Node >= 18, I'm waiting on #653 to be merged |
Closes: #632
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work! a bit of nitpicking here and there but nothing really vital
discojs/discojs-core/src/dataset/data/preprocessing/text_preprocessing.ts
Show resolved
Hide resolved
Co-authored-by: Valérian Rousset <tharvik@users.noreply.github.com>
Co-authored-by: Valérian Rousset <tharvik@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created a new models/tokenizer.ts file
LGTM!
a few more comments (I'm a never ending stream of critics, please stop me)
Fixes #646
docs/examples/wikitext.ts