Add 'get_token_ids' method #4784
Merged
Conversation
vowelparrot force-pushed the vwp/get_token_ids branch 3 times, most recently from 8f8b5f8 to 459b050 on May 16, 2023 14:13
eyurtsev approved these changes on May 16, 2023
dev2049 reviewed on May 16, 2023
vowelparrot pushed a commit that referenced this pull request on May 17, 2023
vowelparrot force-pushed the vwp/get_token_ids branch 2 times, most recently from 8232ebc to 1013878 on May 17, 2023 18:58
# Get token ids method for BaseLanguageModel, OpenAI, & ChatOpenAI

Sometimes it's helpful to inspect your model's tokenization to better understand how certain words are being tokenized. This PR therefore splits the preexisting `get_num_tokens` method of `BaseLanguageModel` into a `get_token_ids` method, which returns a list of integer token ids, and the same `get_num_tokens` method, which now calls `get_token_ids` under the hood. Both methods are also implemented for the `OpenAI` and `ChatOpenAI` language models.

## Before submitting

The integration test in `tests/integration_tests/test_schema.py` was updated to also check the tokenization of the default `GPT2TokenizerFast` tokenizer.

## Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @hwchase17, @agola11
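The split described above can be sketched as follows. This is a minimal illustration, not the actual LangChain source: the class and method names (`BaseLanguageModel`, `get_token_ids`, `get_num_tokens`) come from the PR description, but the tokenizer here is a trivial whitespace stand-in rather than a real model encoding.

```python
from typing import List


class BaseLanguageModel:
    def get_token_ids(self, text: str) -> List[int]:
        """Return the integer token ids for the text.

        Placeholder tokenizer for illustration: splits on whitespace and
        maps each word to an arbitrary id. Real subclasses (OpenAI,
        ChatOpenAI) would delegate to their model-specific encoding.
        """
        return [hash(word) % 50257 for word in text.split()]

    def get_num_tokens(self, text: str) -> int:
        """Token counting is now derived from get_token_ids."""
        return len(self.get_token_ids(text))


model = BaseLanguageModel()
ids = model.get_token_ids("hello world")
assert model.get_num_tokens("hello world") == len(ids)
```

The point of the refactor is that the count and the ids can no longer disagree: `get_num_tokens` is defined in terms of `get_token_ids`, so a subclass only needs to override the latter.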
vowelparrot force-pushed the vwp/get_token_ids branch 2 times, most recently from 4500fec to c6a0f1d on May 22, 2023 05:36
vowelparrot force-pushed the vwp/get_token_ids branch from c6a0f1d to d1f4103 on May 22, 2023 06:12
Also did some reshuffling: the ChatOpenAI model was using different encoding models depending on the method, which seems bad, so the methods now share a single encoding.
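That reshuffling can be sketched as routing every token method through one encoding helper. This is a hedged illustration of the design, not LangChain's actual code: the names (`ChatModel`, `_get_encoding`) are hypothetical, and the stand-in encoder (UTF-8 bytes) replaces what would really be a tiktoken encoding resolved from the model name.

```python
from typing import Callable, List


class ChatModel:
    """Illustrative model where all token methods share one encoding."""

    model_name = "example-model"

    def _get_encoding(self) -> Callable[[str], List[int]]:
        # Single source of truth for the encoding. In the real change this
        # would resolve one tokenizer from self.model_name; here we use a
        # stand-in that encodes text as its UTF-8 byte values.
        return lambda text: list(text.encode("utf-8"))

    def get_token_ids(self, text: str) -> List[int]:
        return self._get_encoding()(text)

    def get_num_tokens(self, text: str) -> int:
        # Reuses the same encoding as get_token_ids, so the two methods
        # can never count with different tokenizers.
        return len(self.get_token_ids(text))


m = ChatModel()
assert m.get_num_tokens("abc") == len(m.get_token_ids("abc"))
```

Centralizing the encoding choice in one helper is what removes the "different encoding models depending on the method" problem the comment describes.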