fix: use model token limit not tokenizer ditto #5939

JensMadsen · 2023-06-09T13:34:58Z

This fixes a token limit bug in the SentenceTransformersTokenTextSplitter. Previously, the token limit was taken from the tokenizer used by the model, on the assumption that the two limits are always equal. That assumption is false: for some models, the token limit of the tokenizer (from `AutoTokenizer.from_pretrained`) does not equal the token limit of the model. The text splitter now takes its token limit from the sentence transformers model itself.
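To illustrate the mismatch, here is a minimal sketch rather than the actual diff. It assumes the model object exposes `max_seq_length` and its tokenizer exposes `model_max_length`. `StubModel`, `StubTokenizer`, and `resolve_tokens_per_chunk` are hypothetical names used so the example runs without downloading a model, and the numbers 384 and 512 are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer."""
    model_max_length: int


@dataclass
class StubModel:
    """Hypothetical stand-in for a sentence transformers model."""
    max_seq_length: int
    tokenizer: StubTokenizer


def resolve_tokens_per_chunk(model, tokens_per_chunk: Optional[int] = None) -> int:
    """Pick the chunk size, bounded by the model's limit (not the tokenizer's)."""
    # The old behaviour, in this sketch, would read model.tokenizer.model_max_length here.
    limit = int(model.max_seq_length)
    if tokens_per_chunk is None:
        return limit
    if tokens_per_chunk > limit:
        raise ValueError(
            f"tokens_per_chunk={tokens_per_chunk} exceeds the model limit of {limit}"
        )
    return tokens_per_chunk


# Illustrative numbers only: the tokenizer reports a larger limit than the model.
model = StubModel(max_seq_length=384, tokenizer=StubTokenizer(model_max_length=512))
default_chunk = resolve_tokens_per_chunk(model)
```

With these stand-in values, a chunk size of 512 would have passed a check against the tokenizer limit but is rejected against the model limit.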

Twitter: @plasmajens

Before submitting

Who can review?

@hwchase17 and/or @dev2049

hwchase17

lgtm! thanks

@hwchase17

This fixes a token limit bug in the SentenceTransformersTokenTextSplitter. Previously, the token limit was taken from the tokenizer used by the model, on the assumption that the two limits are always equal. That assumption is false: for some models, the token limit of the tokenizer (from `AutoTokenizer.from_pretrained`) does not equal the token limit of the model. The text splitter now takes its token limit from the sentence transformers model itself.

Twitter: @plasmajens

#### Before submitting

#### Who can review?

@hwchase17 and/or @dev2049

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>

chore: add type annotations to constructors

6c2387c

JensMadsen changed the title ~~chore: add type annotations to constructors~~ fix: use model token limit not tokenizer ditto Jun 9, 2023

JensMadsen force-pushed the fixInconsistentTokenBasedTextSplitterTokenLimits branch from 3c876fb to 6978bea Compare June 9, 2023 19:31

JensMadsen marked this pull request as ready for review June 9, 2023 19:47

fix: use model token limit not tokenizer ditto

8ac7f3f

JensMadsen force-pushed the fixInconsistentTokenBasedTextSplitterTokenLimits branch from 6978bea to 8ac7f3f Compare June 9, 2023 21:25

cr

10edb96

hwchase17 approved these changes Jun 10, 2023

hwchase17 added the lgtm label Jun 10, 2023

hwchase17 merged commit 1250cd4 into langchain-ai:master Jun 10, 2023

JensMadsen deleted the fixInconsistentTokenBasedTextSplitterTokenLimits branch June 11, 2023 06:07

This was referenced Jun 25, 2023

Zep Authentication #6725

Closed

Zep Authentication #6728

Merged
