feat: support cl100k_base tokenization and increase performance for GPT2 #3897
Conversation
I'll refactor it to include a safe import for the new dependency, so tiktoken is installed only when using the OpenAI nodes.
After this last commit, I noticed that tests are still being run under Python 3.7. I'll implement a failsafe import mechanism, as tiktoken supports only Python >= 3.8.
Hey @danielbichuetti, could you update your branch with the current version of deepset-ai/haystack main? It looks like you have extra files showing up in the PR diff because you are a few commits behind the current version of main.
Hello @sjrl. I will be in the office in 30 minutes. I'm using Vladimir's latest PromptNode commit as the base. I created some files and edited the import utils to improve typing and fix one small case.
The failed test is caused by an error from GitHub Actions:
Nice PR! I left a couple of questions, otherwise all good from my side 😊
The only thing I have some concerns about is the lack of tests. Our CI works on 3.7, so that's totally not your fault, but I'll see if there's a way to run a subset of tests on Py3.8 to test this code out. I'll let you know soon.
Looking really good! I had one more look over the new changes; this should be my final round of comments.
Looking good!
I will hold this one here for max 1 more day to see what the rest of the team thinks about moving our CI to Py3.8. There are several advantages at this point, including testing this PR properly, so let me give that a shot before merging.
Looks good to me!
Alright, I didn't get a chance to discuss the CI Python upgrade topic just yet with the team (so much stuff going on lately), but there's no point holding it any further. I'll merge for now and bring up the topic asap 🙂
@ZanSara I'll implement a fallback mechanism for this specific scenario.
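A fallback of this kind might look like the sketch below. This is not the PR's actual code, and the helper name `count_tokens` is hypothetical: tiktoken is imported only when available (it requires Python >= 3.8), and a crude whitespace count stands in otherwise so the code path still works on Python 3.7.

```python
# Hypothetical sketch of a guarded optional import; not the PR's actual code.
try:
    import tiktoken  # optional dependency, Python >= 3.8 only
except ImportError:
    tiktoken = None


def count_tokens(text: str) -> int:
    """Count tokens with tiktoken when available, else a rough fallback."""
    if tiktoken is not None:
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    # Crude whitespace fallback so callers without tiktoken don't crash.
    return len(text.split())
```

The guard keeps tiktoken out of the hard dependency list, which matches the earlier comment about installing it only for the OpenAI nodes.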
Related Issues
Proposed Changes:
GPT2TokenizerFast has been replaced by the tiktoken library in the OpenAI embedding encoder and answer generator. tiktoken is faster, open source, sponsored by OpenAI, and supports the new cl100k_base tokenizer used by the v2 generation and embedding models (e.g., text-davinci-003 and text-embedding-ada-002), while keeping compatibility with the old GPT2 tokenizer.
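To illustrate the switch, the sketch below (not the PR's exact code; the `encode` helper is hypothetical) shows how tiktoken exposes both the new cl100k_base encoding and the legacy gpt2 encoding, so the same code path can serve newer and older OpenAI models. The import is guarded because tiktoken requires Python >= 3.8.

```python
# Illustrative sketch, not the PR's exact code.
try:
    import tiktoken
except ImportError:  # tiktoken requires Python >= 3.8
    tiktoken = None


def encode(text: str, encoding_name: str = "cl100k_base") -> list:
    """Tokenize text with the named tiktoken encoding.

    "cl100k_base" covers the newer models (e.g. text-embedding-ada-002),
    while "gpt2" keeps compatibility with the old GPT2 tokenizer.
    """
    if tiktoken is None:
        raise RuntimeError("tiktoken is not installed")
    enc = tiktoken.get_encoding(encoding_name)
    return enc.encode(text)
```

Selecting the encoding by name is what lets one dependency replace GPT2TokenizerFast without dropping support for the older models.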
How did you test it?
Current unit tests and PoC tests.
Notes for the reviewer
Checklist
The PR title starts with one of the conventional commit prefixes: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.