
fix: Prevent going past token limit in OpenAI calls in PromptNode #4179

Merged
45 commits merged on Mar 3, 2023

Conversation

sjrl
Contributor

@sjrl sjrl commented Feb 16, 2023

Related Issues

Proposed Changes:

  • Refactor the code used to interact with the OpenAI API, aggregating it into one file and removing duplicated code.
  • retry_with_exponential_backoff has been moved to a single location. It wraps the new util function openai_request, which makes the request to the OpenAI API, raises the appropriate errors, and retries on failure. Tested locally to make sure the retry mechanism still works (a sketch of this pattern follows this list).
  • Updated to use MODEL_TO_ENCODING, added in tiktoken==0.2.0, so we can automatically look up the correct tokenizer for the requested model.
  • Prevent token overflow when calling the OpenAI API in PromptNode by truncating the prompt so that it fits within the max token limit (see the second sketch after this list). I considered the approach used in OpenAIAnswerGenerator, which removes documents from the context until it fits within the max token length; I think that is a better solution for retrieval-augmented QA, but for PromptNode the use case cannot easily be determined at model invocation time, so I opted to truncate the end of the prompt instead and log a warning to the user.
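
As a rough illustration of the retry pattern mentioned above (a simplified sketch, not the actual Haystack code; the function signature, error class, and retry parameters here are assumptions):

```python
# Minimal sketch of retry-with-exponential-backoff around an OpenAI request helper.
# The signature, error class, and retry parameters are illustrative assumptions.
import random
import time
from functools import wraps

import requests


class OpenAIRateLimitError(Exception):
    """Raised when the OpenAI API responds with HTTP 429 (rate limit)."""


def retry_with_exponential_backoff(func, initial_delay: float = 1.0, factor: float = 2.0, max_retries: int = 5):
    @wraps(func)
    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except OpenAIRateLimitError:
                if attempt == max_retries - 1:
                    raise
                # Wait with jitter, then grow the delay before the next attempt.
                time.sleep(delay * (1 + random.random()))
                delay *= factor

    return wrapper


@retry_with_exponential_backoff
def openai_request(url: str, headers: dict, payload: dict, timeout: float = 30.0) -> dict:
    """POST a request to the OpenAI API and raise a retryable error on rate limiting."""
    response = requests.post(url, headers=headers, json=payload, timeout=timeout)
    if response.status_code == 429:
        raise OpenAIRateLimitError(f"API rate limit exceeded: {response.text}")
    response.raise_for_status()
    return response.json()
```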
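
And a minimal sketch of the prompt-truncation idea, assuming tiktoken and its MODEL_TO_ENCODING mapping are available; truncate_prompt, max_tokens_limit, and answer_length are illustrative names rather than the actual Haystack API:

```python
# Minimal sketch of truncating a prompt so that prompt + answer fit in the context window.
# `truncate_prompt`, `max_tokens_limit`, and `answer_length` are illustrative names; the
# real Haystack implementation may differ.
import logging

from tiktoken import get_encoding
from tiktoken.model import MODEL_TO_ENCODING

logger = logging.getLogger(__name__)


def truncate_prompt(prompt: str, model_name: str, max_tokens_limit: int, answer_length: int) -> str:
    # Look up the encoding registered for this model; fall back to p50k_base for unknown models.
    encoding = get_encoding(MODEL_TO_ENCODING.get(model_name, "p50k_base"))
    tokens = encoding.encode(prompt)
    allowed_prompt_tokens = max_tokens_limit - answer_length
    if len(tokens) <= allowed_prompt_tokens:
        return prompt
    logger.warning(
        "The prompt was truncated from %d tokens to %d tokens so that the prompt plus the "
        "answer (%d tokens) fit within the model's limit of %d tokens.",
        len(tokens), allowed_prompt_tokens, answer_length, max_tokens_limit,
    )
    return encoding.decode(tokens[:allowed_prompt_tokens])
```

With text-davinci-003, for example, max_tokens_limit would be the 4097-token context length and answer_length the number of tokens reserved for the completion, which is exactly the combination the 400 error reported later in this thread complains about.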

How did you test it?

  • Add a new test to check truncation
  • Refactorings are tested using existing tests

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added tests that demonstrate the correct behavior of the change
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issues

@sjrl sjrl marked this pull request as ready for review February 16, 2023 13:26
@sjrl sjrl requested a review from a team as a code owner February 16, 2023 13:26
@sjrl sjrl requested review from julian-risch and removed request for a team February 16, 2023 13:26
…automatically determine correct tokenizer for the requested model
@TuanaCelik
Contributor

Hey @sjrl - @julian-risch and I just tried this branch out by calling the following:

result = prompter.prompt(prompt_template=template, tweets=twitter_stream)

We get the following response from OpenAI:

Status code: 400
Response body: {
  "error": {
    "message": "This model's maximum context length is 4097 tokens, however you requested 4100 tokens (4000 in your prompt; 100 for the completion). Please reduce your prompt; or completion length.",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}

@sjrl
Contributor Author

sjrl commented Feb 17, 2023

Hey @TuanaCelik, thanks for checking this and catching this bug! I didn't take the answer length into account in the token limit. I'll do that now.

@mathislucka
Member

@sjrl just a question: wouldn't we need the actual vocabulary of GPT-3.5 to count the number of tokens reliably? GPT-2 might tokenize differently, am I wrong? Or is this more of an approximation and we are fine with some divergence?

@sjrl
Contributor Author

sjrl commented Feb 28, 2023

> @sjrl just a question: wouldn't we need the actual vocabulary of GPT-3.5 to count the number of tokens reliably? GPT-2 might tokenize differently, am I wrong? Or is this more of an approximation and we are fine with some divergence?

That is a great point @mathislucka, which is why we also support the tiktoken library, OpenAI's official tokenizer for GPT-3.5 models. However, we keep the GPT-2 tokenizer as a fallback because tiktoken does not ship wheels for ARM64 Linux (issue here), so we aren't able to provide it in the Haystack Docker image.
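
Roughly, that fallback amounts to something like the sketch below; count_tokens is an illustrative helper and the try/except import is an assumption, not Haystack's exact wiring:

```python
# Sketch of counting prompt tokens with tiktoken when available, falling back to the
# Hugging Face GPT-2 tokenizer otherwise. `count_tokens` is an illustrative helper.
try:
    import tiktoken

    USE_TIKTOKEN = True
except ImportError:  # e.g. no tiktoken wheels for ARM64 Linux
    USE_TIKTOKEN = False
    from transformers import GPT2TokenizerFast


def count_tokens(text: str, tokenizer_name: str = "p50k_base") -> int:
    if USE_TIKTOKEN:
        return len(tiktoken.get_encoding(tokenizer_name).encode(text))
    # GPT-2 tokenization only approximates GPT-3.5 token counts, so some divergence is expected.
    gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    return len(gpt2_tokenizer.tokenize(text))
```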

if "davinci" in model_name:
max_tokens_limit = 4000
if USE_TIKTOKEN:
tokenizer_name = MODEL_TO_ENCODING.get(model_name, "p50k_base")
@sjrl
Contributor Author

@mathislucka This is where we automatically load the correct GPT-3.5 tokenizer if the tiktoken library is available.

Member

@julian-risch julian-risch left a comment

Almost ready to go. 👍 The test case for the token limit is failing because the log message is different than the one we check for in the test. This needs to be changed. There are two lines with tt = PromptTemplate that you could refactor to prompt_template = PromptTemplate to increase readability. Other than that the changes look good to me.

test/nodes/test_prompt_node.py (review comment outdated and resolved)
@sjrl sjrl requested a review from julian-risch March 1, 2023 10:25
@sjrl sjrl mentioned this pull request Mar 1, 2023
Member

@julian-risch julian-risch left a comment

Looks good to me! 🚀

@zoltan-fedor
Contributor

zoltan-fedor commented Mar 12, 2023

@sjrl, @julian-risch,

I believe this PR has broken the use of flan-t5 models, which do not have a hard token limit (there is one in the tokenizer, see https://huggingface.co/google/flan-t5-xl/blob/main/tokenizer_config.json#L106, but it does NOT limit what the model can actually handle, so automatically applying the tokenizer's limit is wrong for that type of model).

This is because the T5 models use a relative attention mechanism and so can handle sequences of any length, with the only constraint being GPU memory (see google-research/text-to-text-transfer-transformer#273).

Now the 512-token limit from the tokenizer is automatically applied to the prompt supplied to flan-t5 models.
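
For reference, that 512 value is just the tokenizer's model_max_length, not an architectural limit; a small illustration, assuming transformers and sentencepiece are installed:

```python
# Illustrative only: the 512-token limit comes from the tokenizer config
# (model_max_length), not from the T5 architecture itself.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
print(tokenizer.model_max_length)  # 512, per tokenizer_config.json
```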

Shouldn't there be a way to override this new token limit coming from the tokenizer for scenarios like this?


Successfully merging this pull request may close these issues.

feat: PromptModel should ensure user's prompts don't overflow the token limit