
Add 'get_token_ids' method #4784

Merged: 7 commits from vwp/get_token_ids into master on May 22, 2023

Conversation

@vowelparrot (Contributor) commented on May 16, 2023

Also did some reshuffling, since we were using different encoding models depending on the method in the `ChatOpenAI` model, which seemed like a bad idea.
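A rough sketch of the idea, assuming `tiktoken` (illustrative only, not the actual `ChatOpenAI` diff): resolve one encoding for the configured model and reuse it for both token methods so they cannot disagree.

```python
import tiktoken


def _shared_encoding(model_name: str) -> tiktoken.Encoding:
    # Resolve a single encoding for the model; fall back if unrecognized.
    try:
        return tiktoken.encoding_for_model(model_name)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")


def get_token_ids(model_name: str, text: str) -> list[int]:
    return _shared_encoding(model_name).encode(text)


def get_num_tokens(model_name: str, text: str) -> int:
    # Count tokens by reusing the exact same encoding as get_token_ids.
    return len(get_token_ids(model_name, text))
```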

@vowelparrot requested a review from dev2049 on May 16, 2023 at 13:38
@vowelparrot force-pushed the vwp/get_token_ids branch 3 times, most recently from 8f8b5f8 to 459b050, on May 16, 2023 at 14:13
Review comment on langchain/base_language.py (outdated, resolved)
vowelparrot pushed a commit that referenced this pull request on May 17, 2023
@vowelparrot force-pushed the vwp/get_token_ids branch 2 times, most recently from 8232ebc to 1013878, on May 17, 2023 at 18:58
zachschillaci27 and others added 6 commits on May 21, 2023 at 21:50
# Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI
Sometimes it's helpful to inspect your model's tokenization to better understand how certain words are being tokenized. Therefore, I have split the preexisting `get_num_tokens` of `BaseLanguageModel` into a `get_token_ids` method, which returns the list of integer token ids, and the same `get_num_tokens` method, which now calls `get_token_ids` under the hood. I have also implemented these methods for the `OpenAI` and `ChatOpenAI` language models.
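To make the split concrete, here is a minimal sketch of the resulting interface, assuming the default `GPT2TokenizerFast` tokenizer mentioned in the test notes below; the class name and method bodies are illustrative, not the exact PR diff:

```python
from typing import List

from transformers import GPT2TokenizerFast


class TokenCountingModel:
    """Hypothetical stand-in for the BaseLanguageModel default behaviour."""

    def get_token_ids(self, text: str) -> List[int]:
        # Return the integer token ids for the text.
        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        return tokenizer.encode(text)

    def get_num_tokens(self, text: str) -> int:
        # The token count is just the length of the id list.
        return len(self.get_token_ids(text))
```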


## Before submitting

I updated the integration test in `tests/integration_tests/test_schema.py` to also check the tokenization of the default `GPT2TokenizerFast` tokenizer.
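As a rough illustration of such a check (the `fake_llm` fixture and test name here are assumptions, not the contents of the actual test file):

```python
# Hypothetical test sketch: the default get_token_ids should agree with
# GPT2TokenizerFast, and get_num_tokens should be the length of that list.
from transformers import GPT2TokenizerFast


def test_default_tokenization_matches_gpt2(fake_llm) -> None:
    text = "Hello, world!"
    expected = GPT2TokenizerFast.from_pretrained("gpt2").encode(text)
    assert fake_llm.get_token_ids(text) == expected
    assert fake_llm.get_num_tokens(text) == len(expected)
```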


## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

@hwchase17, @agola11 

@vowelparrot force-pushed the vwp/get_token_ids branch 2 times, most recently from 4500fec to c6a0f1d, on May 22, 2023 at 05:36
@vowelparrot merged commit 785502e into master on May 22, 2023
@vowelparrot deleted the vwp/get_token_ids branch on May 22, 2023 at 13:17
@danielchalef mentioned this pull request on Jun 5, 2023
This was referenced on Jun 25, 2023
Labels: none yet
Projects: none yet
4 participants