
Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI #4770

Merged

Conversation

zachschillaci27
Contributor

Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI

Sometimes it's helpful to inspect the tokenization of your model to better understand how certain words are being tokenized. Therefore, I have split the preexisting `get_num_tokens` of `BaseLanguageModel` into a `get_tokens` method, which returns a list of integer tokens, and the same `get_num_tokens` method, now calling `get_tokens` under the hood. I have further implemented these methods for the `OpenAI` and `ChatOpenAI` language models.
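The split described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual LangChain code: the toy whitespace tokenizer here replaces the real `GPT2TokenizerFast` default, and only the relationship between the two methods is meant to match the PR.

```python
class BaseLanguageModel:
    """Illustrative sketch of the method split described in this PR.

    A toy whitespace tokenizer stands in for the real default
    (GPT2TokenizerFast); subclasses like OpenAI/ChatOpenAI would
    override get_tokens with their model's tokenizer.
    """

    def get_tokens(self, text: str) -> list[int]:
        # Toy tokenization: one fake integer id per whitespace-separated word.
        return [hash(word) % 50257 for word in text.split()]

    def get_num_tokens(self, text: str) -> int:
        # Counting now delegates to get_tokens under the hood.
        return len(self.get_tokens(text))
```

With this structure, `get_num_tokens` can never drift out of sync with the tokenization itself, since it is defined as the length of whatever `get_tokens` returns.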

Before submitting

I updated the integration test in `tests/integration_tests/test_schema.py` to also check the tokenization of the default `GPT2TokenizerFast` tokenizer.
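A consistency check like the one the updated test performs can be sketched as below. `FakeModel` and `check_tokenization` are hypothetical names for illustration; the actual test in `tests/integration_tests/test_schema.py` exercises the real models and tokenizer.

```python
class FakeModel:
    """Hypothetical stand-in for a model with the new token methods."""

    def get_tokens(self, text: str) -> list[int]:
        # Stand-in for GPT2TokenizerFast-based tokenization.
        return list(range(len(text.split())))

    def get_num_tokens(self, text: str) -> int:
        return len(self.get_tokens(text))


def check_tokenization(model: FakeModel, text: str) -> None:
    """Assert the invariants a tokenization test would verify."""
    tokens = model.get_tokens(text)
    # get_tokens must return a list of integer token ids...
    assert isinstance(tokens, list)
    assert all(isinstance(t, int) for t in tokens)
    # ...and get_num_tokens must agree with its length.
    assert model.get_num_tokens(text) == len(tokens)
```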

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@hwchase17, @agola11

@zachschillaci27 zachschillaci27 changed the title Llm get tokens Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI May 16, 2023
@vowelparrot vowelparrot changed the base branch from master to vwp/get_token_ids May 16, 2023 13:27
@vowelparrot vowelparrot merged commit f968330 into langchain-ai:vwp/get_token_ids May 16, 2023
@zachschillaci27
Contributor Author

@vowelparrot Thanks for the comments. I realized I missed another change here, on the `OpenAIChat` class.

vowelparrot pushed a commit that referenced this pull request May 17, 2023
# Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI
Sometimes it's helpful to inspect the tokenization of your model to
better understand how certain words are being tokenized. Therefore, I
have split the preexisting `get_num_tokens` of `BaseLanguageModel` into
a `get_tokens` method - which returns a list of integer tokens - and the
same `get_num_tokens` method, now calling `get_tokens` under the hood. I
have further implemented these methods for the `OpenAI` and `ChatOpenAI`
language models.


## Before submitting

I updated the integration test here
`tests/integration_tests/test_schema.py` to also check the tokenization
of the default `GPT2TokenizerFast` tokenizer.


## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

@hwchase17, @agola11 

vowelparrot pushed a commit that referenced this pull request May 17, 2023
vowelparrot pushed a commit that referenced this pull request May 22, 2023
vowelparrot pushed a commit that referenced this pull request May 22, 2023