
In OpenAIAnswerGenerator avoid tokenizing all documents #4298

Closed
sjrl opened this issue Feb 28, 2023 · 2 comments · Fixed by #4504
Labels
Contributions wanted! Looking for external contributions P3 Low priority, leave it in the backlog

Comments


sjrl commented Feb 28, 2023

We're tokenizing all documents here:

if leftover_token_len < 0:
    n_docs_tokens = [self._count_tokens(doc.content) for doc in documents]

and here:
n_full_prompt_tokens = self._count_tokens(full_prompt)

Depending on how slow tokenization is, we could tokenize one document at a time and check whether throwing away that document saves enough tokens.
The call to count_openai_tokens(text=doc.content, tokenizer=self._tokenizer) would need to go into the same loop as n_skipped_tokens += doc_token_len. What do you think? It's not the focus of this PR so maybe this improvement could become a separate issue?

Originally posted by @julian-risch in #4179 (comment)

Yeah, I think this is a good idea. It would prevent a user from accidentally sending 100+ docs to the PromptNode or AnswerGenerator and then wondering what is taking so long. But this PR is already fairly large, so I think opening a separate issue for this would be a good idea.

Originally posted by @sjrl in #4179 (comment)
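A minimal sketch of the suggested change, under stated assumptions: `count_tokens` stands in for `self._count_tokens`, and `overflow_tokens` is how far the full prompt exceeds the model's budget (the negated `leftover_token_len`). The names here are illustrative, not Haystack's actual API. The point is that documents are tokenized one at a time inside the skipping loop, and the loop stops as soon as enough tokens have been freed, instead of tokenizing every document up front.

```python
# Hypothetical sketch: drop trailing documents one at a time until the
# overflow is covered, tokenizing only the documents that get dropped.

def truncate_documents(documents, count_tokens, overflow_tokens):
    """Drop documents from the end of the list until at least
    `overflow_tokens` tokens have been freed.

    Returns the kept documents and the number of skipped ones.
    """
    n_skipped_tokens = 0
    n_skipped_docs = 0
    # Walk backwards so the highest-ranked documents are kept.
    for doc in reversed(documents):
        if n_skipped_tokens >= overflow_tokens:
            break
        # Tokenize only this one document, inside the same loop that
        # accumulates n_skipped_tokens (as suggested above).
        n_skipped_tokens += count_tokens(doc)
        n_skipped_docs += 1
    return documents[: len(documents) - n_skipped_docs], n_skipped_docs
```

With 100+ retrieved documents and a typical overflow of only a few documents' worth of tokens, this tokenizes a handful of documents instead of all of them.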

@masci masci added Contributions wanted! Looking for external contributions P3 Low priority, leave it in the backlog labels Mar 9, 2023

masci commented Mar 9, 2023

@sjrl any chance you can work on this?


sjrl commented Mar 24, 2023

Hey @masci, unfortunately I won't have time to work on this for a few weeks, but I can try to get to it when I do.
