
Detokenize incrementally when streaming #653

Merged: 4 commits merged into main from fix-streaming-detokenizer on Jul 19, 2024
Conversation

@hnyls2002 (Collaborator) commented on Jul 19, 2024

The detokenizer_manager performs very poorly when 1) streaming is enabled and 2) the model forward pass is very fast.

This PR changes the decoding process in detokenizer_manager to incremental detokenization. Note that incremental detokenization is not compatible with mutable output_ids (as produced by jump-forward decoding). A sketch of the idea is below.
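For readers unfamiliar with the technique, here is a minimal sketch of incremental detokenization, assuming a HuggingFace tokenizer; `IncrementalDecoder`, `decode_new`, and the offset fields are illustrative names, not the code merged in this PR:

```python
# A minimal sketch of the incremental-detokenization idea (illustrative,
# not the exact implementation merged here).
from transformers import AutoTokenizer


class IncrementalDecoder:
    """Decodes a growing list of token ids without re-decoding the prefix.

    Calling tokenizer.decode(all_output_ids) on every streamed token makes
    per-token cost grow with the output length; here we only re-decode a
    small trailing window and emit the newly completed text.
    """

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.output_ids = []    # all generated token ids so far
        self.prefix_offset = 0  # start of the window re-decoded each step
        self.read_offset = 0    # ids whose text has already been emitted

    def decode_new(self, new_ids):
        self.output_ids.extend(new_ids)
        # Re-decode from prefix_offset so pieces that only form a valid
        # character together (e.g. split UTF-8 bytes) are stitched correctly.
        prefix_text = self.tokenizer.decode(
            self.output_ids[self.prefix_offset:self.read_offset])
        new_text = self.tokenizer.decode(self.output_ids[self.prefix_offset:])
        if new_text.endswith("\ufffd"):
            # Trailing token is an incomplete byte sequence; hold it back
            # until later tokens complete the character.
            return ""
        self.prefix_offset = self.read_offset
        self.read_offset = len(self.output_ids)
        return new_text[len(prefix_text):]


# Usage: feed token ids one at a time, printing only the new text.
tokenizer = AutoTokenizer.from_pretrained("/root/Meta-Llama-3-8B-Instruct")
decoder = IncrementalDecoder(tokenizer)
for token_id in tokenizer.encode("Hello world", add_special_tokens=False):
    print(decoder.decode_new([token_id]), end="")
```

Because a decoder like this assumes output_ids only ever grows, rewriting already-emitted ids (as jump-forward decoding does) would invalidate the cached offsets, which is the incompatibility noted above.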

@hnyls2002 merged commit a9ef49c into main on Jul 19, 2024
2 checks passed
@hnyls2002 deleted the fix-streaming-detokenizer branch on July 19, 2024 at 00:57
@zhyncs (Member) commented on Jul 19, 2024

@hnyls2002 Nice work! cc @merrymercy @Ying1123 @yzh119

model: Llama 3 8B Instruct
hardware: A100 80G

# server
python3 -m sglang.launch_server --model /root/Meta-Llama-3-8B-Instruct --trust-remote-code --port 23333 --disable-radix-cache --stream-interval 1

# client
python3 benchmark_serving.py --backend openai --host 127.0.0.1 --port 23333 --dataset /root/ShareGPT_V3_unfiltered_cleaned_split.json --model /root/Meta-Llama-3-8B-Instruct --tokenizer /root/Meta-Llama-3-8B-Instruct --num-prompts 1000 --request-rate 128 --trust-remote-code
