
Add 'get_token_ids' method #4784

Merged: 7 commits from vwp/get_token_ids into master on May 22, 2023

Conversation

@vowelparrot (Contributor) commented on May 16, 2023

Also did some reshuffling, since we were using different encoding models depending on the method in the `ChatOpenAI` model, which seemed like a bad idea.
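A rough sketch of the idea, assuming `tiktoken` (illustrative only, not the actual `ChatOpenAI` diff): resolve one encoding for the configured model and reuse it for both token methods so they cannot disagree.

```python
import tiktoken


def _shared_encoding(model_name: str) -> tiktoken.Encoding:
    # Resolve a single encoding for the model; fall back if unrecognized.
    try:
        return tiktoken.encoding_for_model(model_name)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")


def get_token_ids(model_name: str, text: str) -> list[int]:
    return _shared_encoding(model_name).encode(text)


def get_num_tokens(model_name: str, text: str) -> int:
    # Count tokens by reusing the exact same encoding as get_token_ids.
    return len(get_token_ids(model_name, text))
```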

@vowelparrot requested a review from dev2049 on May 16, 2023 at 13:38
@vowelparrot force-pushed the vwp/get_token_ids branch 3 times, most recently from 8f8b5f8 to 459b050, on May 16, 2023 at 14:13
Review comment on langchain/base_language.py (outdated, resolved)
vowelparrot pushed a commit that referenced this pull request on May 17, 2023
@vowelparrot force-pushed the vwp/get_token_ids branch 2 times, most recently from 8232ebc to 1013878, on May 17, 2023 at 18:58
zachschillaci27 and others added 6 commits on May 21, 2023 at 21:50
# Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI
Sometimes it's helpful to inspect your model's tokenization to better understand how certain words are being tokenized. Therefore, I have split the preexisting `get_num_tokens` of `BaseLanguageModel` into a `get_token_ids` method, which returns the list of integer token ids, and the same `get_num_tokens` method, which now calls `get_token_ids` under the hood. I have also implemented these methods for the `OpenAI` and `ChatOpenAI` language models.
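To make the split concrete, here is a minimal sketch of the resulting interface, assuming the default `GPT2TokenizerFast` tokenizer mentioned in the test notes below; the class name and method bodies are illustrative, not the exact PR diff:

```python
from typing import List

from transformers import GPT2TokenizerFast


class TokenCountingModel:
    """Hypothetical stand-in for the BaseLanguageModel default behaviour."""

    def get_token_ids(self, text: str) -> List[int]:
        # Return the integer token ids for the text.
        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        return tokenizer.encode(text)

    def get_num_tokens(self, text: str) -> int:
        # The token count is just the length of the id list.
        return len(self.get_token_ids(text))
```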


## Before submitting

I updated the integration test in `tests/integration_tests/test_schema.py` to also check the tokenization of the default `GPT2TokenizerFast` tokenizer.
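As a rough illustration of such a check (the `fake_llm` fixture and test name here are assumptions, not the contents of the actual test file):

```python
# Hypothetical test sketch: the default get_token_ids should agree with
# GPT2TokenizerFast, and get_num_tokens should be the length of that list.
from transformers import GPT2TokenizerFast


def test_default_tokenization_matches_gpt2(fake_llm) -> None:
    text = "Hello, world!"
    expected = GPT2TokenizerFast.from_pretrained("gpt2").encode(text)
    assert fake_llm.get_token_ids(text) == expected
    assert fake_llm.get_num_tokens(text) == len(expected)
```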


## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

@hwchase17, @agola11 

@vowelparrot force-pushed the vwp/get_token_ids branch 2 times, most recently from 4500fec to c6a0f1d, on May 22, 2023 at 05:36
@vowelparrot merged commit 785502e into master on May 22, 2023
@vowelparrot deleted the vwp/get_token_ids branch on May 22, 2023 at 13:17
@danielchalef mentioned this pull request on Jun 5, 2023
This was referenced on Jun 25, 2023
Labels: none yet
Projects: none yet
4 participants