Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align Tokenizer in JetStream #40

Merged
merged 5 commits into from
Apr 24, 2024
Merged

Align Tokenizer in JetStream #40

merged 5 commits into from
Apr 24, 2024

Conversation

JoeZijunZhou
Copy link
Collaborator

  • Return token ids instead of custom decode operation result in JetStream; let client do tokenizer decode operation
  • Update benchmark script and requester tool to use JetStream tokenizer seqio library for tokenizer decode operation
    • Use correct method to let tokenizer decode the whole output token id list
  • Update unit tests for the above change
  • Enforce python type check to benchmark script
  • Update README unit test section

@FanhaiLu1 FanhaiLu1 self-requested a review April 22, 2024 18:04
// List of responses, one per sample.
repeated string response = 1;
// List of responses, one per sample. The list size depends on text generation strategy the engine used.
repeated RepeatedTokenIds response = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add your deleted field number to the reserved list, it may messes up deserialization. Here is the reference: https://protobuf.dev/programming-guides/dos-donts/#reserve-tag-numbers

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep both and let the user choose whether she wants ids or string?

@@ -37,6 +37,10 @@ message DecodeRequest {
int32 max_tokens = 4;
}
message DecodeResponse {
// List of responses, one per sample.
repeated string response = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we still keep a str as option? The internal keep both text and token id.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't want to decode it to str (or piece) in jetstream, since it would have some off in the final result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thanks for making the changes!

@FanhaiLu1 FanhaiLu1 merged commit a0df320 into main Apr 24, 2024
3 checks passed
@FanhaiLu1 FanhaiLu1 deleted the zijun/align-tokenizer branch April 24, 2024 22:25
jwyang-google pushed a commit that referenced this pull request May 6, 2024
* Align Tokenizer in JetStream

* Update requirements with pytest dep

* Remove mix_decode unit test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants