
Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI #4770

Merged

Conversation

zachschillaci27
Contributor

Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI

Sometimes it's helpful to inspect the tokenization of your model to better understand how certain words are being tokenized. Therefore, I have split the preexisting `get_num_tokens` of `BaseLanguageModel` into a `get_tokens` method, which returns a list of integer tokens, and the same `get_num_tokens` method, now calling `get_tokens` under the hood. I have further implemented these methods for the `OpenAI` and `ChatOpenAI` language models.
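The split described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual LangChain code: the toy whitespace tokenizer here replaces the real `GPT2TokenizerFast` default, and only the relationship between the two methods is meant to match the PR.

```python
class BaseLanguageModel:
    """Illustrative sketch of the method split described in this PR.

    A toy whitespace tokenizer stands in for the real default
    (GPT2TokenizerFast); subclasses like OpenAI/ChatOpenAI would
    override get_tokens with their model's tokenizer.
    """

    def get_tokens(self, text: str) -> list[int]:
        # Toy tokenization: one fake integer id per whitespace-separated word.
        return [hash(word) % 50257 for word in text.split()]

    def get_num_tokens(self, text: str) -> int:
        # Counting now delegates to get_tokens under the hood.
        return len(self.get_tokens(text))
```

With this structure, `get_num_tokens` can never drift out of sync with the tokenization itself, since it is defined as the length of whatever `get_tokens` returns.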

Before submitting

I updated the integration test in `tests/integration_tests/test_schema.py` to also check the tokenization of the default `GPT2TokenizerFast` tokenizer.
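A consistency check like the one the updated test performs can be sketched as below. `FakeModel` and `check_tokenization` are hypothetical names for illustration; the actual test in `tests/integration_tests/test_schema.py` exercises the real models and tokenizer.

```python
class FakeModel:
    """Hypothetical stand-in for a model with the new token methods."""

    def get_tokens(self, text: str) -> list[int]:
        # Stand-in for GPT2TokenizerFast-based tokenization.
        return list(range(len(text.split())))

    def get_num_tokens(self, text: str) -> int:
        return len(self.get_tokens(text))


def check_tokenization(model: FakeModel, text: str) -> None:
    """Assert the invariants a tokenization test would verify."""
    tokens = model.get_tokens(text)
    # get_tokens must return a list of integer token ids...
    assert isinstance(tokens, list)
    assert all(isinstance(t, int) for t in tokens)
    # ...and get_num_tokens must agree with its length.
    assert model.get_num_tokens(text) == len(tokens)
```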

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@hwchase17, @agola11

@zachschillaci27 zachschillaci27 changed the title Llm get tokens Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI May 16, 2023
@vowelparrot vowelparrot changed the base branch from master to vwp/get_token_ids May 16, 2023 13:27
@vowelparrot vowelparrot merged commit f968330 into langchain-ai:vwp/get_token_ids May 16, 2023
@zachschillaci27
Contributor Author

@vowelparrot Thanks for the comments. I realized I missed another change here, on the `OpenAIChat` class.

vowelparrot pushed a commit that referenced this pull request May 17, 2023
# Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI
Sometimes it's helpful to inspect the tokenization of your model to
better understand how certain words are being tokenized. Therefore, I
have split the preexisting `get_num_tokens` of `BaseLanguageModel` into
a `get_tokens` method - which returns a list of integer tokens - and the
same `get_num_tokens` method, now calling `get_tokens` under the hood. I
have further implemented these methods for the `OpenAI` and `ChatOpenAI`
language models.


## Before submitting

I updated the integration test here
`tests/integration_tests/test_schema.py` to also check the tokenization
of the default `GPT2TokenizerFast` tokenizer.


## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

@hwchase17, @agola11 

vowelparrot pushed a commit that referenced this pull request May 17, 2023
vowelparrot pushed a commit that referenced this pull request May 22, 2023
vowelparrot pushed a commit that referenced this pull request May 22, 2023