Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate embedded vocabulary maps lazily #6

Merged
merged 2 commits into from
May 10, 2024
Merged

Generate embedded vocabulary maps lazily #6

merged 2 commits into from
May 10, 2024

Conversation

espadolini
Copy link
Contributor

Maps in vars are created at init time, resulting in something like 2.5MiB of allocation at runtime that can be problematic for programs that only optionally use the tokenizer; this PR changes the vocab codegen to replace the map literal in the var with a map literal assignment in a fooInit function that can then get called with a sync.Once to only generate the map on first load.

@ghost
Copy link

ghost commented May 9, 2024

Thanks for this. I'll look into this tomorrow.

@ghost
Copy link

ghost commented May 9, 2024

Ciao, the code looks good. Before I merge it could you make sure the generated code passes gofmt?

(I think the change would be on the generators internal/cmd/vocab.go)

For example this:

var p50kBaseVocabOnce sync.Once
var p50kBaseVocab vocab
func p50kBaseVocabInit() {
p50kBaseVocab = vocab{
    "aasdsd": 1,
     ...
}

becomes

var p50kBaseVocabOnce sync.Once
var p50kBaseVocab vocab

func p50kBaseVocabInit() {
	p50kBaseVocab = vocab{
		"!":  0,
		"\"": 1,
               ....,
	}
}

@espadolini
Copy link
Contributor Author

@mariano-sb I didn't do that initially because I wanted to avoid touching the actual lines with the words (and because it wasn't gofmt-correct to begin with), but if you view the changes in github with "hide whitespace" enabled it's still easy to review.

@bluescreen10 bluescreen10 merged commit 7f65fe7 into tiktoken-go:main May 10, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants