Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use model token limit not tokenizer ditto #5939

Conversation

JensMadsen
Copy link
Contributor

@JensMadsen JensMadsen commented Jun 9, 2023

This fixes a token limit bug in the SentenceTransformersTokenTextSplitter. Before the token limit was taken from tokenizer used by the model. However, for some models the token limit of the tokenizer (from AutoTokenizer.from_pretrained) does not equal the token limit of the model. This was a false assumption. Therefore, the token limit of the text splitter is now taken from the sentence transformers model token limit.

Twitter: @plasmajens

Before submitting

Who can review?

@hwchase17 and/or @dev2049

@JensMadsen JensMadsen changed the title chore: add type annotations to constructors fix: use model token limit not tokenizer ditto Jun 9, 2023
@JensMadsen JensMadsen force-pushed the fixInconsistentTokenBasedTextSplitterTokenLimits branch from 3c876fb to 6978bea Compare June 9, 2023 19:31
@JensMadsen JensMadsen marked this pull request as ready for review June 9, 2023 19:47
@JensMadsen JensMadsen force-pushed the fixInconsistentTokenBasedTextSplitterTokenLimits branch from 6978bea to 8ac7f3f Compare June 9, 2023 21:25
Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! thanks

@hwchase17 hwchase17 added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jun 10, 2023
@hwchase17 hwchase17 merged commit 1250cd4 into langchain-ai:master Jun 10, 2023
@JensMadsen JensMadsen deleted the fixInconsistentTokenBasedTextSplitterTokenLimits branch June 11, 2023 06:07
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
This fixes a token limit bug in the
SentenceTransformersTokenTextSplitter. Before the token limit was taken
from tokenizer used by the model. However, for some models the token
limit of the tokenizer (from `AutoTokenizer.from_pretrained`) does not
equal the token limit of the model. This was a false assumption.
Therefore, the token limit of the text splitter is now taken from the
sentence transformers model token limit.

Twitter: @plasmajens

#### Before submitting

#### Who can review?

@hwchase17 and/or @dev2049

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
This was referenced Jun 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm PR looks good. Use to confirm that a PR is ready for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants