Align Tokenizer in JetStream #40

JoeZijunZhou · 2024-04-13T03:42:52Z

Return token ids from JetStream instead of the result of its custom decode operation; let the client run the tokenizer decode operation
Update the benchmark script and requester tool to use the JetStream tokenizer seqio library for the tokenizer decode operation
- Use the correct method so that the tokenizer decodes the whole output token id list
Update unit tests for the above changes
Enforce Python type checking in the benchmark script
Update the README unit test section
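
To illustrate why decoding the whole output token id list matters, here is a minimal sketch. It assumes a SentencePiece-style vocabulary in which "▁" marks the start of a word; the `VOCAB` table and the `decode` helper are hypothetical stand-ins, not the JetStream or seqio API.

```python
# Toy SentencePiece-style vocabulary (hypothetical; not the real JetStream vocab).
# "▁" marks a piece that begins a new word.
VOCAB = {0: "▁Hello", 1: "▁wor", 2: "ld", 3: "!"}


def decode(token_ids):
    """Decode ids the way a SentencePiece-style tokenizer does:
    join the pieces, turn "▁" into spaces, and drop one leading space."""
    text = "".join(VOCAB[i] for i in token_ids).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


output_ids = [0, 1, 2, 3]

# Decoding each id separately drops the space in front of every word piece.
per_token = "".join(decode([i]) for i in output_ids)

# Decoding the whole list at once keeps the word boundaries intact.
whole = decode(output_ids)

print(per_token)  # Helloworld!
print(whole)      # Hello world!
```

Under these assumptions, a client that receives token ids can decode the full list in one call and avoid the per-token mismatch.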

FanhaiLu1 · 2024-04-23T23:31:41Z

jetstream/core/proto/jetstream.proto

-  // List of responses, one per sample.
-  repeated string response = 1;
+  // List of responses, one per sample. The list size depends on text generation strategy the engine used.
+  repeated RepeatedTokenIds response = 1;


Please add the deleted field number to the reserved list; otherwise it may mess up deserialization. Here is the reference: https://protobuf.dev/programming-guides/dos-donts/#reserve-tag-numbers
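
One way to read the deserialization concern: field number 1 changes from `string` to a message type, and both use the length-delimited wire type, so a reader built from the old schema could parse the new bytes without error and get garbage. The sketch below hand-encodes the wire format in plain Python to show this. It assumes `RepeatedTokenIds` stores packed int32 ids in its own field 1, which the diff above does not show, and the helper names are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number, payload):
    """Encode a wire-type-2 field: key, length, then the payload bytes."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


# Old schema: repeated string response = 1;
old_message = length_delimited(1, "hi".encode("utf-8"))

# New schema: repeated RepeatedTokenIds response = 1;
# Assumption: RepeatedTokenIds holds packed int32 ids in its own field 1.
token_ids = b"".join(encode_varint(t) for t in [1, 2])
new_message = length_delimited(1, length_delimited(1, token_ids))


def read_field_one_as_string(data):
    """What a reader built from the old schema does with field 1.
    Simplified: assumes one field with a single-byte key and length."""
    assert data[0] == (1 << 3) | 2
    length = data[1]
    return data[2:2 + length].decode("utf-8")


print(repr(read_field_one_as_string(old_message)))  # 'hi'
print(repr(read_field_one_as_string(new_message)))  # '\n\x02\x01\x02'
```

The old reader returns control characters instead of failing, which is the kind of silent mismatch that reserving retired tag numbers is meant to prevent.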

Why not keep both and let the user choose whether she wants ids or string?

FanhaiLu1 · 2024-04-23T23:32:25Z

jetstream/core/proto/jetstream.proto

@@ -37,6 +37,10 @@ message DecodeRequest {
  int32 max_tokens = 4;
 }
 message DecodeResponse {
-  // List of responses, one per sample.
-  repeated string response = 1;


Can we still keep a str as an option? Internally, both text and token ids are kept.

I guess we don't want to decode to a str (or piece) in JetStream, since that would introduce some discrepancies in the final result.

Great! Thanks for making the changes!

* Align Tokenizer in JetStream * Update requirements with pytest dep * Remove mix_decode unit test

Align Tokenizer in JetStream

c97c4fd

JoeZijunZhou requested a review from vipannalla as a code owner April 13, 2024 03:42

Update requirements with pytest dep

8016fac

gangji approved these changes Apr 22, 2024

FanhaiLu1 self-requested a review April 22, 2024 18:04

JoeZijunZhou added 3 commits April 24, 2024 12:38

Merge branch 'main' into zijun/align-tokenizer

1e3ddc6

Remove mix_decode unit test

3985e9b

Merge branch 'main' into zijun/align-tokenizer

24f9d74

FanhaiLu1 reviewed Apr 24, 2024

FanhaiLu1 approved these changes Apr 24, 2024

FanhaiLu1 merged commit a0df320 into main Apr 24, 2024
3 checks passed

FanhaiLu1 deleted the zijun/align-tokenizer branch April 24, 2024 22:25

JoeZijunZhou mentioned this pull request Apr 24, 2024

Update maxtext user guide #56

Merged

bhavya01 mentioned this pull request Apr 25, 2024

Add an abstract class for Tokenizer #53

Merged

jwyang-google pushed a commit that referenced this pull request May 6, 2024

Align Tokenizer in JetStream (#40)

0dbb2a5

* Align Tokenizer in JetStream * Update requirements with pytest dep * Remove mix_decode unit test

FanhaiLu1 mentioned this pull request May 8, 2024

Update JetStream grpc proto to support I/O with text and token ids #78

Merged

JoeZijunZhou commented Apr 13, 2024

FanhaiLu1 Apr 23, 2024

qihqi Apr 25, 2024

FanhaiLu1 Apr 23, 2024

JoeZijunZhou Apr 24, 2024

FanhaiLu1 May 14, 2024
