# Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI #4770

Merged: vowelparrot merged 2 commits into `langchain-ai:vwp/get_token_ids` from `zachschillaci27:llm-get-tokens` on May 16, 2023.
## Conversation
zachschillaci27 changed the title from "Llm get tokens" to "Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI" on May 16, 2023.
vowelparrot reviewed on May 16, 2023 (four review passes).
@vowelparrot Thanks for the comments. I realize I missed another change here, on the …
vowelparrot pushed a commit referencing this pull request on May 17, 2023:
# Get tokens method for BaseLanguageModel, OpenAI, & ChatOpenAI

Sometimes it's helpful to inspect the tokenization of your model to better understand how certain words are being tokenized. Therefore, I have split the preexisting `get_num_tokens` method of `BaseLanguageModel` into a `get_tokens` method, which returns a list of integer tokens, and the same `get_num_tokens` method, now calling `get_tokens` under the hood. I have further implemented these methods for the `OpenAI` and `ChatOpenAI` language models.

## Before submitting

I updated the integration test in `tests/integration_tests/test_schema.py` to also check the tokenization of the default `GPT2TokenizerFast` tokenizer.

## Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @hwchase17, @agola11
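The split described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern (base method returning token ids, with the count delegating to it), not the actual LangChain implementation; the placeholder tokenizer is an assumption, and real subclasses would override `get_tokens` with `GPT2TokenizerFast` or `tiktoken`.

```python
from abc import ABC
from typing import List


class BaseLanguageModel(ABC):
    def get_tokens(self, text: str) -> List[int]:
        """Return the integer token ids for `text`.

        Placeholder implementation: a naive whitespace split mapped to
        fake integer ids. Subclasses (e.g. OpenAI/ChatOpenAI wrappers)
        would override this with a real tokenizer.
        """
        return [hash(tok) % 50257 for tok in text.split()]

    def get_num_tokens(self, text: str) -> int:
        """Token count, now calling get_tokens under the hood."""
        return len(self.get_tokens(text))
```

The point of the refactor is that the count and the ids can no longer disagree: `get_num_tokens` is defined as the length of whatever `get_tokens` returns, so a subclass only has to override the tokenizer once.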
vowelparrot pushed another commit referencing this pull request on May 17, 2023.
vowelparrot pushed a commit referencing this pull request on May 22, 2023.
vowelparrot pushed another commit referencing this pull request on May 22, 2023.