Improved interface for split_by_token #18

Merged
merged 1 commit into from
Apr 24, 2023

jackbackes
Contributor

This PR adds the `split_by_token_ordinary` method and its iterator counterpart, `split_by_token_ordinary_iter`, to the `CoreBPE` struct in `vendor_tiktoken.rs`. Both perform ordinary tokenization of a string, that is, without the special tokens from the BPE model.

It also simplifies the name of `split_by_token_with_special_tokens` to just `split_by_token`, and differentiates between methods that return an iterator (the `_iter` variants) and methods that return a collection.
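
For readers skimming the thread, here is a minimal sketch of the naming convention described above. It is not the code from this PR: `ToyTokenizer` is a hypothetical stand-in for `CoreBPE`, it splits on whitespace instead of running BPE, and the signatures (including the name `split_by_token_iter`, which the description does not spell out) are assumptions made for illustration.

```rust
/// Hypothetical stand-in for `CoreBPE`. It splits on whitespace instead of
/// running BPE and knows exactly one special token.
struct ToyTokenizer {
    special: &'static str,
}

impl ToyTokenizer {
    /// Lazy, ordinary variant: the special token gets no dedicated handling,
    /// so it is split like any other text.
    fn split_by_token_ordinary_iter<'a>(
        &'a self,
        text: &'a str,
    ) -> impl Iterator<Item = String> + 'a {
        text.split_whitespace().map(str::to_string)
    }

    /// Collection variant: same pieces, gathered into a `Vec`.
    fn split_by_token_ordinary(&self, text: &str) -> Vec<String> {
        self.split_by_token_ordinary_iter(text).collect()
    }

    /// Lazy, special-token-aware variant: the special token always comes out
    /// as its own piece, even when it is glued to the surrounding text.
    /// (Assumed name; the `_iter` suffix mirrors the ordinary variant.)
    fn split_by_token_iter<'a>(
        &'a self,
        text: &'a str,
    ) -> impl Iterator<Item = String> + 'a {
        // `split_inclusive` keeps the special token at the end of each chunk.
        text.split_inclusive(self.special).flat_map(move |chunk| {
            let (body, tail) = match chunk.strip_suffix(self.special) {
                Some(body) => (body, Some(self.special.to_string())),
                None => (chunk, None),
            };
            // Ordinary pieces first, then the special token if one was found.
            body.split_whitespace().map(str::to_string).chain(tail)
        })
    }

    /// Collection variant of the special-token-aware split.
    fn split_by_token(&self, text: &str) -> Vec<String> {
        self.split_by_token_iter(text).collect()
    }
}
```

With `special` set to `"<|endoftext|>"` and the input `"hello world<|endoftext|>bye"`, the ordinary methods in this toy yield `hello` and `world<|endoftext|>bye`, while `split_by_token` yields `hello`, `world`, `<|endoftext|>`, `bye`. The `_iter` methods suit callers who want to stop early or avoid allocating a `Vec`; the plain names are a convenience on top of them.
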
@jackbackes
Contributor Author

I thought about this some more - I think this interface is more in line with the rest of the codebase.

@zurawiki
Owner

nice!

@zurawiki zurawiki merged commit f2e3962 into zurawiki:main Apr 24, 2023