
Consistent naming for counting tokens #4297

Closed
sjrl opened this issue Feb 28, 2023 · 3 comments
Labels
2.x Related to Haystack v2.0 · breaking change · P2 Medium priority, add to the next sprint if no P1 available · type:documentation Improvements on the docs · type:enhancement

Comments

@sjrl (Contributor)

sjrl commented Feb 28, 2023

I think it's a bit confusing that we use so many different terms for the number of tokens: n_tokens, max_seq_len, max_tokens, max_tokens_limit, max_length, and earlier leftover_token_len, n_full_prompt, n_full_prompt_tokens, n_skipped_tokens. Maybe we could follow a convention of using n_ when we count tokens and max_ when we set a limit? I would drop the _len suffix. So leftover_token_len could become n_leftover_tokens, and max_seq_len could become max_tokens. What do you think?

Originally posted by @julian-risch in #4179 (comment)

I agree that this is confusing and not consistent. Maybe we could combine this with your previous comment into one new PR to handle naming conventions?

Originally posted by @sjrl in #4179 (comment)
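To make the proposed convention concrete, here is a minimal, purely hypothetical sketch (not actual Haystack code): a parameter that sets a limit takes a max_ prefix, a returned count takes an n_ prefix, and the _len suffix is dropped.

```python
# Hypothetical illustration of the proposed naming convention.
# "max_tokens" sets a limit; "n_skipped_tokens" counts tokens; no "_len" suffix.

def truncate_prompt(tokens: list[str], max_tokens: int) -> tuple[list[str], int]:
    """Keep at most max_tokens tokens and report how many were dropped."""
    kept = tokens[:max_tokens]
    n_skipped_tokens = len(tokens) - len(kept)
    return kept, n_skipped_tokens


kept, n_skipped_tokens = truncate_prompt(["a", "b", "c", "d"], max_tokens=2)
```

Under this scheme a reader can tell at a glance whether a name is a user-set limit or a measured count.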

@masci masci added breaking change 2.x Related to Haystack v2.0 P3 Low priority, leave it in the backlog labels Mar 9, 2023
@masci masci added P2 Medium priority, add to the next sprint if no P1 available and removed P3 Low priority, leave it in the backlog labels Dec 11, 2023
@masci masci added this to the 2.0.0 milestone Dec 18, 2023
@ZanSara ZanSara self-assigned this Jan 15, 2024
@ZanSara (Contributor)

ZanSara commented Jan 23, 2024

In Haystack 2.0 most of these parameters are not defined explicitly. They're mostly passed through kwargs, where I believe it makes sense to keep them exactly as their underlying backend expects them instead of normalizing them.

There is only one component that makes use of the "old" convention: ExtractiveReader. I'm going to update the name of this parameter from max_seq_len to max_tokens_per_seq and leave the others untouched.

@sjrl (Contributor, Author)

sjrl commented Jan 23, 2024

@ZanSara I think you're right that this issue is more for Haystack v1. If the ExtractiveReader is the only one doing this, I'd actually prefer to leave it as is, since that is the name of the underlying variable used by Hugging Face, which we decided to expose explicitly in this case instead of passing it through model_kwargs.

@ZanSara (Contributor)

ZanSara commented Jan 23, 2024

Ok, sounds good to me! I'll close the issue as completed then. Of course, if you spot anything I didn't catch, let's open a dedicated issue and reference this one.

@ZanSara ZanSara closed this as completed Jan 23, 2024
4 participants