We're tokenizing all documents here:

haystack/haystack/nodes/answer_generator/openai.py, lines 316 to 317 in c3a38a5

and here:

haystack/haystack/nodes/answer_generator/openai.py, line 307 in c3a38a5
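For context, here is a minimal sketch of the pattern under discussion: tokenizing every document up front with count_openai_tokens before deciding which documents to drop. The import path and the helper name are assumptions for illustration, not the actual code at the referenced lines.

```python
from haystack.utils.openai_utils import count_openai_tokens  # import path assumed


def total_doc_tokens(documents, tokenizer):
    # Tokenizes every document up front, even ones that may later be skipped.
    return sum(
        count_openai_tokens(text=doc.content, tokenizer=tokenizer)
        for doc in documents
    )
```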
Depending on how slow tokenization is, we could tokenize one document at a time and check whether throwing away that document saves enough tokens. The call to count_openai_tokens(text=doc.content, tokenizer=self._tokenizer) would need to go into the same loop as n_skipped_tokens += doc_token_len. What do you think? It's not the focus of this PR, so maybe this improvement could become a separate issue?

Originally posted by @julian-risch in #4179 (comment)
Yeah, I think this is a good idea. It would avoid a user accidentally sending 100+ docs to the PromptNode or AnswerGenerator and then wondering what is taking so long. But this PR is already fairly large, so I think opening a separate issue for this would be a good idea.
Originally posted by @sjrl in #4179 (comment)
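A minimal sketch of the suggested change, moving the count_openai_tokens call into the same loop that accumulates n_skipped_tokens so tokenization stops as soon as dropping documents frees enough tokens. The function name, the tokens_to_free parameter, and the import path are illustrative assumptions, not the actual implementation.

```python
from haystack.utils.openai_utils import count_openai_tokens  # import path assumed


def skip_documents_lazily(documents, tokenizer, tokens_to_free):
    """Tokenize one document at a time and stop once throwing away the
    collected documents saves enough tokens (tokens_to_free is illustrative)."""
    n_skipped_tokens = 0
    skipped_docs = []
    for doc in documents:
        # The count_openai_tokens call now lives in the same loop as
        # n_skipped_tokens += doc_token_len, so documents that are never
        # considered for skipping are never tokenized.
        doc_token_len = count_openai_tokens(text=doc.content, tokenizer=tokenizer)
        n_skipped_tokens += doc_token_len
        skipped_docs.append(doc)
        if n_skipped_tokens >= tokens_to_free:
            break
    return skipped_docs, n_skipped_tokens
```

With this shape, a user who accidentally passes 100+ documents only pays the tokenization cost for the documents actually inspected, rather than for the entire list up front.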