
Consistent naming for counting tokens #4297

Closed
sjrl opened this issue Feb 28, 2023 · 3 comments
Labels
2.x Related to Haystack v2.0 · breaking change · P2 Medium priority, add to the next sprint if no P1 available · type:documentation Improvements on the docs · type:enhancement

Comments

@sjrl (Contributor)

sjrl commented Feb 28, 2023

I think it's a bit confusing that we use so many different terms for the number of tokens: n_tokens, max_seq_len, max_tokens, max_tokens_limit, max_length, and earlier leftover_token_len, n_full_prompt, n_full_prompt_tokens, n_skipped_tokens. Maybe we could follow a convention of using n_ when we count tokens and max_ when we set a limit? I would drop the _len suffix. So leftover_token_len could become n_leftover_tokens, and max_seq_len could become max_tokens. What do you think?

Originally posted by @julian-risch in #4179 (comment)

I agree that this is confusing and not consistent. Maybe we could combine this with your previous comment into one new PR to handle naming conventions?

Originally posted by @sjrl in #4179 (comment)
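To make the proposed convention concrete, here is a minimal, purely hypothetical sketch (not actual Haystack code): a parameter that sets a limit takes a max_ prefix, a returned count takes an n_ prefix, and the _len suffix is dropped.

```python
# Hypothetical illustration of the proposed naming convention.
# "max_tokens" sets a limit; "n_skipped_tokens" counts tokens; no "_len" suffix.

def truncate_prompt(tokens: list[str], max_tokens: int) -> tuple[list[str], int]:
    """Keep at most max_tokens tokens and report how many were dropped."""
    kept = tokens[:max_tokens]
    n_skipped_tokens = len(tokens) - len(kept)
    return kept, n_skipped_tokens


kept, n_skipped_tokens = truncate_prompt(["a", "b", "c", "d"], max_tokens=2)
```

Under this scheme a reader can tell at a glance whether a name is a user-set limit or a measured count.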

@masci masci added breaking change 2.x Related to Haystack v2.0 P3 Low priority, leave it in the backlog labels Mar 9, 2023
@masci masci added P2 Medium priority, add to the next sprint if no P1 available and removed P3 Low priority, leave it in the backlog labels Dec 11, 2023
@masci masci added this to the 2.0.0 milestone Dec 18, 2023
@ZanSara ZanSara self-assigned this Jan 15, 2024
@ZanSara (Contributor)

ZanSara commented Jan 23, 2024

In Haystack 2.0 most of these parameters are not defined explicitly. They're mostly passed through kwargs, where I believe it makes sense to keep them exactly as their underlying backend expects them instead of normalizing them.

There is only one component that makes use of the "old" convention: ExtractiveReader. I'm going to update the name of this parameter from max_seq_len to max_tokens_per_seq and leave the others untouched.

@sjrl (Contributor, Author)

sjrl commented Jan 23, 2024

@ZanSara I think you're right that this issue is more for Haystack v1. If the ExtractiveReader is the only one doing this, I'd actually prefer to leave it as is, since that is the name of the underlying variable used by Hugging Face, which we decided to expose explicitly in this case instead of passing it through model_kwargs.

@ZanSara (Contributor)

ZanSara commented Jan 23, 2024

Ok, sounds good to me! I'll close the issue as completed then. Of course, if you spot anything I didn't catch, let's open a dedicated issue and reference this one.

@ZanSara ZanSara closed this as completed Jan 23, 2024
4 participants