
feat: support cl100k_base tokenization and increase performance for GPT2 #3897

Merged: 18 commits merged into deepset-ai:main on Jan 24, 2023

Conversation

@danielbichuetti (Contributor) commented Jan 19, 2023

Related Issues

Proposed Changes:

GPT2TokenizerFast has been replaced by the tiktoken library when using the OpenAI embedding encoder and answer generator. tiktoken is faster, open-source, sponsored by OpenAI, and supports the new cl100k_base tokenizer used by the v2 generation and embedding models (e.g., text-davinci-003 and text-embedding-ada-002), while keeping compatibility with the old GPT2 tokenizer.
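For illustration only (not the exact code from this PR), here is a minimal sketch of how tiktoken resolves the cl100k_base encoding and tokenizes text; the model name and sample string are placeholders:

```python
import tiktoken

# Look the encoding up by name, or derive it from a model name.
enc = tiktoken.get_encoding("cl100k_base")
# enc = tiktoken.encoding_for_model("text-embedding-ada-002")  # also resolves to cl100k_base

tokens = enc.encode("Haystack makes it easy to build search systems.")
print(len(tokens))         # token count, e.g. for prompt-length budgeting
print(enc.decode(tokens))  # round-trips back to the original string

# The older GPT2 encoding remains available for backward compatibility.
gpt2_enc = tiktoken.get_encoding("gpt2")
```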

How did you test it?

Current unit tests and PoC tests.

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added tests that demonstrate the correct behavior of the change
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@danielbichuetti danielbichuetti requested a review from a team as a code owner January 19, 2023 17:06
@danielbichuetti danielbichuetti requested review from silvanocerza and removed request for a team January 19, 2023 17:06
@danielbichuetti danielbichuetti changed the title from "feat: migration to tiktoken when tokenizing using OpenAI" to "feat: support vl100k_base tokenization and increase performance for GPT2" on Jan 19, 2023
@danielbichuetti (Contributor, Author) commented Jan 19, 2023

I'll refactor it to include a safe import for the new dependency, so that tiktoken needs to be installed only when using OpenAI nodes.

@danielbichuetti (Contributor, Author)

After this last commit, I noticed that tests are still being run under Python 3.7. I'll implement a failsafe import mechanism, as tiktoken supports only Python >=3.8.
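A minimal sketch of what such an optional-import guard could look like (the actual implementation in haystack/utils/import_utils.py may differ; the helper name and error message are illustrative):

```python
# Hypothetical optional-import guard; not necessarily the code used in this PR.
import sys

try:
    import tiktoken  # optional dependency, only required for OpenAI nodes
    TIKTOKEN_AVAILABLE = True
except ImportError:
    tiktoken = None
    TIKTOKEN_AVAILABLE = False


def get_openai_encoding(model_name: str = "text-embedding-ada-002"):
    """Return a tiktoken encoding, or raise a helpful error if it cannot be imported."""
    if not TIKTOKEN_AVAILABLE:
        raise ImportError(
            "tiktoken could not be imported. It requires Python >= 3.8; "
            f"this interpreter is {sys.version_info.major}.{sys.version_info.minor}."
        )
    return tiktoken.encoding_for_model(model_name)
```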

@sjrl (Contributor) commented Jan 20, 2023

Hey @danielbichuetti, could you update your branch with the current version of deepset-ai/haystack main? It looks like extra files are showing up in the PR diff because you are a few commits behind the current version of main.

@danielbichuetti (Contributor, Author) commented Jan 20, 2023

Hello @sjrl.

I will be in the office in 30 minutes.

I'm using Vladimir's latest commit for PromptNode as the base. I created some files and edited the import utils to improve typing and fix one small case.

@sjrl sjrl changed the base branch from main to add_doc2query January 20, 2023 13:44
@sjrl sjrl changed the base branch from add_doc2query to main January 20, 2023 13:44
@danielbichuetti (Contributor, Author)

The failed test is an infrastructure error from Actions (a connection reset), not related to this change:

FAILED test/modeling/test_dpr.py::test_dpr_processor_save_load_non_bert_tokenizer[query_and_passage_model2] - requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
====== 1 failed, 53 passed, 1 deselected, 1 warning in 217.82s (0:03:37) ======
Error: Process completed with exit code 1.

@danielbichuetti danielbichuetti requested review from sjrl and removed request for silvanocerza January 20, 2023 20:05
@danielbichuetti danielbichuetti changed the title from "feat: support vl100k_base tokenization and increase performance for GPT2" to "feat: support cl100k_base tokenization and increase performance for GPT2" on Jan 20, 2023
@sjrl sjrl requested a review from vblagoje January 23, 2023 09:30
@ZanSara (Contributor) left a comment

Nice PR! I left a couple of questions, otherwise all good from my side 😊

The only thing I have some concerns about is the lack of tests. Our CI runs on Python 3.7, so that's totally not your fault, but I'll see if there's a way to run a subset of tests on Python 3.8 to exercise this code. I'll let you know soon.

Review threads (resolved): haystack/nodes/retriever/_base_encoder.py, haystack/utils/import_utils.py, pyproject.toml (two threads)
@sjrl (Contributor) commented Jan 23, 2023

Looking really good! I had one more look over the new changes, and this should be my final round of comments.

@ZanSara (Contributor) left a comment

Looking good!

I will hold this one here for at most one more day to see what the rest of the team thinks about moving our CI to Python 3.8. There are several advantages at this point, including testing this PR properly, so let me give that a shot before merging.

@sjrl (Contributor) left a comment

Looks good to me!

@ZanSara (Contributor) commented Jan 24, 2023

Alright, I didn't get a chance to discuss the CI Python upgrade with the team just yet (so much going on lately), but there's no point holding this any further. I'll merge for now and bring up the topic ASAP 🙂

@ZanSara ZanSara merged commit 739fc22 into deepset-ai:main Jan 24, 2023
ZanSara added a commit that referenced this pull request Jan 24, 2023
@ZanSara ZanSara mentioned this pull request Jan 24, 2023
@danielbichuetti (Contributor, Author) commented Jan 24, 2023

@ZanSara
After your PR mentioned this one, I decided to check the wheels. Indeed, there are no Linux ARM64 wheels available for tiktoken.

I'll implement a fallback mechanism for this specific scenario (see the sketch below).

tiktoken wheels
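A hypothetical sketch of such a fallback (the eventual mechanism may differ; the function name and the choice of GPT2TokenizerFast as the fallback are assumptions based on the tokenizer this PR replaces):

```python
# Hypothetical fallback sketch: prefer tiktoken, but fall back to the Hugging Face
# GPT2 tokenizer on platforms where tiktoken has no wheels (e.g. Linux ARM64).
def load_openai_tokenizer(use_tiktoken_if_available: bool = True):
    if use_tiktoken_if_available:
        try:
            import tiktoken
            # Returns a tiktoken.Encoding; callers use .encode()/.decode().
            return tiktoken.get_encoding("cl100k_base")
        except ImportError:
            pass  # tiktoken unavailable (unsupported Python version or no wheel)
    from transformers import GPT2TokenizerFast
    # Returns a Hugging Face tokenizer; note it exposes a different interface
    # (e.g. __call__ / .tokenize()) than tiktoken's Encoding, so callers must adapt.
    return GPT2TokenizerFast.from_pretrained("gpt2")
```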

Successfully merging this pull request may close these issues.

Migrate OpenAIAnswerGenerator and _OpenAIEmbeddingEncoder to tiktoken library