Is it possible to join the implementation of GPT3 Tokenizer #62

OneSeven · 2023-02-08T03:52:17Z

Could the functionality of this tool be implemented in Go? https://platform.openai.com/tokenizer

sashabaranov · 2023-02-08T06:29:28Z

Related: https://github.com/openai/tiktoken

OneSeven · 2023-02-08T06:58:17Z

> Related: https://github.com/openai/tiktoken

Thanks, but I think I need a library that can be called from Go.

sashabaranov · 2023-02-08T07:50:52Z

@OneSeven sure, I mean, we either would need to be able to embed this library (via cgo or otherwise) or would need to translate it from Rust to Go.

OneSeven · 2023-02-08T07:55:59Z

> @OneSeven sure, I mean, we either would need to be able to embed this library (via cgo or otherwise) or would need to translate it from Rust to Go.

Do you have plans to add this functionality to the current SDK?
I would love to contribute, but my skills are not yet up to it, sorry.

sashabaranov · 2023-02-08T08:01:49Z

There's no plan for that right now, but we are open to contributions 😄

I guess you can also call github.com/openai/tiktoken as a separate binary from Go.
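A minimal sketch of that separate-process approach, for illustration only. It assumes `python3` and the `tiktoken` Python package are installed; the helper script `pyCounter` and the function `countTokensVia` are hypothetical names, not part of this SDK or of tiktoken.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// pyCounter is an assumed helper script (not shipped by any library).
// It reads text from stdin and prints the number of cl100k_base tokens.
const pyCounter = `import sys, tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(sys.stdin.read())))`

// countTokensVia runs an external command, feeds it text on stdin, and
// parses the single integer the command prints on stdout.
func countTokensVia(text, name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = strings.NewReader(text)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	out, err := cmd.Output()
	if err != nil {
		return 0, fmt.Errorf("running %s: %w (stderr: %s)",
			name, err, strings.TrimSpace(stderr.String()))
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(out)))
	if err != nil {
		return 0, fmt.Errorf("unexpected output %q: %w", out, err)
	}
	return n, nil
}

func main() {
	n, err := countTokensVia("Hello, world!", "python3", "-c", pyCounter)
	if err != nil {
		fmt.Println("token count unavailable:", err)
		return
	}
	fmt.Println("tokens:", n)
}
```

Spawning a process per call is slow, so this is probably only reasonable for occasional counts; a long-running helper or a native Go port would avoid that cost.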

ealvar3z · 2023-02-09T04:33:45Z

> @OneSeven sure, I mean, we either would need to be able to embed this library (via cgo or otherwise) or would need to translate it from Rust to Go.

Isn't this library written in Python? And if it were ported, how would you prefer the port to be structured: as a separate repo that go-gpt3 then imports, or some other way? In other words, I am trying to understand your vision, assuming a port from Python to Go is feasible.

marcel · 2023-02-09T05:14:37Z

There's a go library already: https://github.com/samber/go-gpt-3-encoder

OneSeven · 2023-02-09T06:27:28Z

> There's a go library already: https://github.com/samber/go-gpt-3-encoder

This library only works correctly for English text; it does not produce correct results for other languages.

sashabaranov · 2023-02-09T07:58:35Z

@ealvar3z It's Rust wrapped in Python https://github.com/openai/tiktoken/blob/main/src/lib.rs

If tokenization could be added with zero (or minimal) dependencies, I'm all for merging it. Otherwise, I think it makes sense to implement it in a separate repo.

vvatanabe · 2023-07-09T02:14:26Z

Good example of how to count tokens:

GwynethLlewelyn · 2024-03-30T23:19:02Z

Since the original issue was opened, there has been some progress!

The documentation on the official OpenAI repository currently points to pkoukk/tiktoken-go as the Go library for tokenizing (no endorsements, just a link).

You can see from the test script that it deals with tokens in different languages and alphabets. It might still get things wrong, but at least they are as wrong as the official OpenAI Python version!

Dependencies currently listed by its go.mod:

module github.com/pkoukk/tiktoken-go

go 1.19

require (
	github.com/dlclark/regexp2 v1.10.0
	github.com/google/uuid v1.3.0
	github.com/stretchr/testify v1.8.2
)

require (
	github.com/davecgh/go-spew v1.1.1 // indirect
	github.com/pmezard/go-difflib v1.0.0 // indirect
	gopkg.in/yaml.v3 v3.0.1 // indirect
)

It's not "zero" dependencies as you'd prefer, but close! I haven't looked into the code very deeply.

The dependency on google/uuid is pretty standard; one wonders why the Go core developers haven't incorporated it into the Go Standard Library yet (it does have a few quirks, but because it comes from Google itself, I guess it's ok to use).

The inclusion of dlclark/regexp2, as opposed to using the standard regexp package built on Google's RE2 engine, is very likely because the former closely follows the regex engine used by .NET, which might be a requirement for the tokenizer to come up with the same results as tiktoken.
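One concrete difference that can be checked with the standard library alone: Go's regexp package rejects lookaheads such as `(?!\S)`, which GPT-2 style split patterns rely on. The sketch below is illustrative; the pattern text is an approximation from memory and should be treated as an assumption (check the tiktoken source for the exact pattern), and `compilesWithStdlib` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// gpt2Split approximates the GPT-2 style split pattern used by tiktoken.
// The exact text is an assumption; the relevant part is the negative
// lookahead in the `\s+(?!\S)` alternative.
const gpt2Split = `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`

// noLookahead is the same pattern with the lookahead alternative removed.
// It is not equivalent: trailing whitespace would be split differently.
const noLookahead = `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`

// compilesWithStdlib reports whether the standard regexp package accepts
// the given pattern.
func compilesWithStdlib(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println("with lookahead:", compilesWithStdlib(gpt2Split))
	fmt.Println("without lookahead:", compilesWithStdlib(noLookahead))
}
```

The first pattern should fail to compile and the second should succeed, which would explain why a Go port needs either a backtracking engine such as regexp2 or a hand-written splitter to reproduce tiktoken's splits.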

And stretchr/testify is evidently only used for the testing bits; it has no relevance to the overall tokenizer code itself.

Performance, according to the published benchmarks (e.g., those included in its test suite), seems to be the same as the original Python code.

I think you've got your tiktokenizer candidate! 😀

sashabaranov mentioned this issue Mar 29, 2023

Ability to count tokens before sending #200

Closed

vvatanabe mentioned this issue Jun 16, 2023

Add a method to query the remaining tokens of key #380

Closed

vvatanabe added the "enhancement" (New feature or request) label Jul 1, 2023

This was referenced Jul 3, 2023

How do I get the number of tokens spent? #231

Closed

Streamlining and Organizing our Issue Tracker #415

Closed

vvatanabe mentioned this issue Jul 16, 2023

How can i compute the token count？ #442

Closed

vvatanabe mentioned this issue Jul 27, 2023

docs: add Frequently Asked Questions to README.md #462

Merged
