Change our tokenizer a bit to be more accurate #616

Merged: 3 commits, Nov 15, 2023

Conversation

dkotter
Collaborator

@dkotter dkotter commented Nov 14, 2023

Description of the Change

We have a lightweight tokenizer class that estimates how many tokens are in a string. It does this mostly by taking the number of characters in the string and dividing by an assumed number of characters per token. The estimate is not meant to be 100% accurate, only close enough for our use case: ensuring we stay within the token limits of our model.

Recently I ran across a very long article and received a token length error when trying to process it. In debugging, I found we weren't counting tokens aggressively enough, so we were underestimating the token count and not trimming enough of the content to stay within the model limits.

This PR lowers the number of characters per token from 4 to 3.5. This fixed the issue I ran into and seems to be more accurate in counting tokens.
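
To illustrate the effect of the ratio change, here is a minimal sketch of a character-based estimator. It is not the plugin's actual tokenizer class: the function names `estimate_tokens` and `trim_to_token_limit`, the 16,000-character input, and the 4,000-token limit are all hypothetical and chosen only for illustration.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    """Approximate the token count from the character count (hypothetical helper)."""
    return math.ceil(len(text) / chars_per_token)


def trim_to_token_limit(text: str, max_tokens: int, chars_per_token: float = 3.5) -> str:
    """Cut the text so its estimated token count stays within max_tokens."""
    max_chars = math.floor(max_tokens * chars_per_token)
    return text[:max_chars]


# Assumed example input: a 16,000-character article and a 4,000-token limit.
article = "x" * 16000

print(estimate_tokens(article, 4.0))  # 4000: looks like it fits, nothing is trimmed
print(estimate_tokens(article, 3.5))  # 4572: over the limit, so trimming kicks in
print(len(trim_to_token_limit(article, 4000, 3.5)))  # 14000 characters kept
```

With 4 characters per token the same article appears to fit exactly, so nothing is trimmed; with 3.5 the estimate is higher and the content is cut to a shorter length, leaving more headroom under the real limit.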

How to test the Change

Send content to OpenAI (either generate an excerpt or generate titles) and ensure things still work and no errors are shown.

Changelog Entry

Fixed - More accurate token counts when trimming content

Credits

Props @dkotter

Checklist:

  • I agree to follow this project's Code of Conduct.
  • I have updated the documentation accordingly.
  • I have added tests to cover my change.
  • All new and existing tests pass.

@dkotter dkotter added this to the 2.5.0 milestone Nov 14, 2023
@dkotter dkotter self-assigned this Nov 14, 2023
@dkotter dkotter requested review from jeffpaul and a team as code owners November 14, 2023 20:36
@dkotter dkotter requested review from a team and faisal-alvi and removed request for a team and jeffpaul November 14, 2023 20:36
Member

@faisal-alvi faisal-alvi left a comment

Thanks for the PR and the brief details. I tested the plugin features on the fix branch and found they work as expected; no errors are shown.

Title Generation

image

Excerpt Generation

image

Post Classification

image

@dkotter dkotter merged commit fffd3ec into develop Nov 15, 2023
13 checks passed
@dkotter dkotter deleted the fix/tokenizer branch November 15, 2023 15:52