Is it possible to add an implementation of the GPT3 Tokenizer? #62
Comments
Thanks, but I think I need a library that can be called from Go.
There's no plan for that right now, but we are open to contributions 😄 I guess you can also call github.com/openai/tiktoken as a separate binary from Go.
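A minimal sketch of the "separate binary" idea, assuming the external tokenizer reads text on stdin and prints the token IDs as a JSON array on stdout. The `tokenize.py` script named in `main` is hypothetical and not part of tiktoken; the helper names are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"strings"
)

// parseTokenIDs decodes a JSON array of integers such as "[1, 2, 3]".
func parseTokenIDs(out []byte) ([]int, error) {
	var ids []int
	if err := json.Unmarshal(out, &ids); err != nil {
		return nil, fmt.Errorf("unexpected tokenizer output: %w", err)
	}
	return ids, nil
}

// tokenizeViaCommand pipes text to an external program on stdin and parses
// the JSON array it prints on stdout. The command is chosen by the caller.
func tokenizeViaCommand(text, name string, args ...string) ([]int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = strings.NewReader(text)
	out, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("running %s: %w", name, err)
	}
	return parseTokenIDs(out)
}

func main() {
	// Assumption: tokenize.py is a hypothetical wrapper script around the
	// Python tokenizer that follows the stdin/stdout contract above.
	ids, err := tokenizeViaCommand("hello world", "python3", "tokenize.py")
	if err != nil {
		fmt.Println("tokenizer unavailable:", err)
		return
	}
	fmt.Println(len(ids), "tokens")
}
```

This keeps the Go side free of extra dependencies, at the cost of starting a process per call and requiring a Python environment at runtime.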
Isn't this library in Python? And if it were ported, how would you prefer the port to be scaffolded into your repo? Would it be a separate repo that you then import into go-gpt3, etc.? In other words, I'm trying to understand your vision, in case porting it from Python to Go is feasible.
There's a Go library already: https://github.com/samber/go-gpt-3-encoder
This library can only be used for English characters; it does not produce correct results for other languages.
@ealvar3z It's Rust wrapped in Python: https://github.com/openai/tiktoken/blob/main/src/lib.rs If tokenization could be brought in with zero (or minimal) dependencies, I'm all for merging it. Otherwise, I think it makes sense to implement it in a separate repo.
Since the original issue was opened, there has been some progress! The documentation on the official OpenAI repository currently points to [...]. You can see from the test script that it deals with tokens in different languages and alphabets. It might still get things wrong, but at least they are as wrong as the official OpenAI Python version!

Dependencies currently listed by its go.mod:

module github.com/pkoukk/tiktoken-go
go 1.19
require (
	github.com/dlclark/regexp2 v1.10.0
	github.com/google/uuid v1.3.0
	github.com/stretchr/testify v1.8.2
)
require (
	github.com/davecgh/go-spew v1.1.1 // indirect
	github.com/pmezard/go-difflib v1.0.0 // indirect
	gopkg.in/yaml.v3 v3.0.1 // indirect
)

It's not "zero" dependencies as you'd prefer, but close! I haven't looked into the code very deeply. The dependency upon [...] The inclusion of [...] And [...] Performance, according to the published benchmarks (e.g., those included in its test suite), seems to be the same as the original Python code. I think you've got your tiktokenizer candidate! 😀
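On the regexp2 entry in the dependency list above: one plausible reason for it (an assumption, since the thread does not spell it out) is that tiktoken-style split patterns use lookahead constructs such as `\s+(?!\S)`, which Go's standard `regexp` package does not support. The sketch below only checks whether the standard library can compile a pattern; `supportedByStdlib` is an illustrative helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// supportedByStdlib reports whether Go's standard regexp package (RE2 syntax)
// can compile the given pattern.
func supportedByStdlib(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// A fragment with a negative lookahead, the kind of construct that
	// appears in tiktoken-style split patterns.
	fmt.Println(supportedByStdlib(`\s+(?!\S)`)) // false
	// The same fragment without the lookahead compiles fine.
	fmt.Println(supportedByStdlib(`\s+`)) // true
}
```

A zero-dependency port would therefore need either a hand-written splitter or a pattern rewritten to avoid lookaheads.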
Use Go to implement this function: https://platform.openai.com/tokenizer
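For orientation, here is a toy sketch of the byte-pair merge loop that this kind of tokenizer is built around. It is not the OpenAI implementation: the rank table is made up for illustration, and it skips the regex pre-splitting and the mapping from pieces to token IDs that a real tokenizer needs.

```go
package main

import "fmt"

// bpeMerge splits word into single bytes, then repeatedly merges the adjacent
// pair whose concatenation has the lowest rank, until no adjacent pair is ranked.
func bpeMerge(word string, ranks map[string]int) []string {
	parts := make([]string, 0, len(word))
	for i := 0; i < len(word); i++ {
		parts = append(parts, word[i:i+1])
	}
	for {
		best, bestRank := -1, 0
		for i := 0; i+1 < len(parts); i++ {
			r, ok := ranks[parts[i]+parts[i+1]]
			if ok && (best == -1 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			return parts // nothing left to merge
		}
		merged := parts[best] + parts[best+1]
		// Drop parts[best+1], then store the merged piece at parts[best].
		parts = append(parts[:best+1], parts[best+2:]...)
		parts[best] = merged
	}
}

func main() {
	// Hypothetical rank table; real tables hold tens of thousands of entries.
	ranks := map[string]int{"lo": 0, "low": 1, "er": 2}
	fmt.Println(bpeMerge("lower", ranks)) // [low er]
}
```

Because the loop works on bytes rather than characters, non-English text is handled the same way as ASCII: a multi-byte UTF-8 character simply starts out as several one-byte pieces.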