Eliminate 2 gpu ops during sampling when logit_bias is zero #338

Qubitium · 2024-03-28T10:31:19Z

Reason for PR: Eliminate 2 GPU ops during sampling when logit_bias is not used (all zeros):

- skip allocation of logit_bias
- skip applying logits.add_(self.logit_bias)
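The two skipped steps can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: plain Python lists stand in for GPU tensors so it runs without PyTorch, and the helper names `build_logit_bias` and `apply_logit_bias` are hypothetical.

```python
from typing import List, Optional

Row = List[float]


def build_logit_bias(
    per_request_bias: List[Optional[Row]], vocab_size: int
) -> Optional[List[Row]]:
    """Return a (batch, vocab) bias matrix, or None if no request uses a bias."""
    bias: Optional[List[Row]] = None
    for i, row in enumerate(per_request_bias):
        if row is None:
            continue
        if bias is None:
            # Allocate lazily: the zero matrix is created only once a
            # request actually needs a bias (hypothetical stand-in for
            # a torch.zeros allocation on the GPU).
            bias = [[0.0] * vocab_size for _ in per_request_bias]
        bias[i] = list(row)
    return bias


def apply_logit_bias(logits: List[Row], bias: Optional[List[Row]]) -> List[Row]:
    """Add the bias in place; skip the op entirely when bias is None."""
    if bias is None:
        # Stands in for skipping logits.add_(self.logit_bias).
        return logits
    for logit_row, bias_row in zip(logits, bias):
        for j, b in enumerate(bias_row):
            logit_row[j] += b
    return logits
```

Using `None` as the "no bias" marker means the common case pays for neither the allocation nor the addition, at the cost of a `None` check wherever the bias is consumed.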

In our internal test of Yi-6B quantized with Marlin on a 4090, we see a tangible improvement: up to 63% higher throughput for workloads with short token outputs when batch size > 1.

Concurrency is 10 requests, so sglang uses batch sizes between 1 and 10.

python/sglang/srt/managers/router/infer_batch.py

hnyls2002 · 2024-04-03T05:55:03Z

@Qubitium Sorry for the trouble. I have pushed the commit directly to your branch.
closes #344

remove allocation and usage of logit_bias when unused (all zeros)

b88f14a

Qubitium changed the title ~~Eliminate 2 gpu ops during inference (sampling) when logit_bias is zero~~ Eliminate 2 gpu ops during sampling when logit_bias is zero Mar 28, 2024

hnyls2002 reviewed Mar 29, 2024


python/sglang/srt/managers/router/infer_batch.py (outdated)

Qubitium and others added 2 commits March 29, 2024 07:19

fix merge condition for 1 of 2 val is not none

e27a7c1
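The diff for this commit is not shown here, so the following is only a guess at the kind of case the message describes: merging two batches when exactly one of them has a logit_bias. Plain lists stand in for tensors, and `merge_logit_bias` and its parameters are hypothetical names.

```python
from typing import List, Optional

Row = List[float]


def merge_logit_bias(
    a: Optional[List[Row]], len_a: int,
    b: Optional[List[Row]], len_b: int,
) -> Optional[List[Row]]:
    """Concatenate two per-batch bias matrices, either of which may be None."""
    if a is None and b is None:
        # Neither batch uses a bias: keep skipping allocation.
        return None
    # Assumes any non-None matrix has at least one row.
    vocab_size = len(a[0]) if a is not None else len(b[0])
    if a is None:
        # Only one side has a bias: back-fill the other with zeros.
        a = [[0.0] * vocab_size for _ in range(len_a)]
    if b is None:
        b = [[0.0] * vocab_size for _ in range(len_b)]
    return a + b
```

The merged result stays `None` only when both inputs are `None`; checking just one side would either drop a real bias or fail on the missing one.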

fix

6469a68

hnyls2002 mentioned this pull request Apr 2, 2024

Eliminate 2 gpu ops during sampling when logit_bias is zero #343

Merged

hnyls2002 closed this Apr 2, 2024

Qubitium mentioned this pull request Apr 2, 2024

PR review process standard #344

Closed

hnyls2002 reopened this Apr 2, 2024

hnyls2002 merged commit c9de3e1 into sgl-project:main Apr 3, 2024
