[Performance, Triton Kernel Args] _decode_grouped_softmax_reducev_fwd… #1845

HaiShaw · 2024-10-30T23:44:12Z

… speedup on ROCm

Motivation

Speed up _decode_grouped_softmax_reducev_fwd.
Testing shows a ~1.0% improvement in median decode throughput on MI300X with Grok-1 in FP8 (b32/i1024/o256).

Modifications

Set tuned kernel launch arguments for _fwd_grouped_kernel_stage2 when running on ROCm.
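As a rough illustration of the pattern, backend-specific launch arguments can be collected in a dict and merged into the kernel call only on ROCm. This is a minimal sketch, not the code from this PR: the helper names `extra_kernel_args` and `launch_kwargs` are hypothetical, and the values shown are placeholders rather than the tuned settings. `waves_per_eu`, `matrix_instr_nonkdim` and `kpack` are compile options understood by Triton's AMD backend; whether this PR sets exactly these is an assumption.

```python
def extra_kernel_args(is_hip: bool) -> dict:
    """Return backend-specific Triton launch kwargs (hypothetical helper)."""
    if not is_hip:
        # On CUDA, pass nothing extra so the launch is unchanged.
        return {}
    # Placeholder values, NOT taken from the PR; real values should come
    # from benchmarking on the target GPU.
    return {"waves_per_eu": 4, "matrix_instr_nonkdim": 16, "kpack": 2}


def launch_kwargs(is_hip: bool, num_warps: int = 4, num_stages: int = 1) -> dict:
    """Merge the common launch kwargs with the backend-specific ones."""
    kwargs = {"num_warps": num_warps, "num_stages": num_stages}
    kwargs.update(extra_kernel_args(is_hip))
    return kwargs


# Example: the keyword arguments each backend would receive.
cuda_kwargs = launch_kwargs(is_hip=False)
rocm_kwargs = launch_kwargs(is_hip=True)
```

In a real launch the dict would be unpacked into the call, along the lines of `_fwd_grouped_kernel_stage2[grid](..., **launch_kwargs(is_hip))`, with `is_hip` derived from something like `torch.version.hip is not None`. Keeping the extra arguments in a separate dict leaves the CUDA path untouched.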

Checklist

[+] Format your code according to the Contributor Guide.
[+] Add unit tests as outlined in the Contributor Guide.
[+] Update documentation as needed, including docstrings or example tutorials.

[Performance, Triton Kernel Args] _decode_grouped_softmax_reducev_fwd…

8f7cf8c

… speedup on ROCm

HaiShaw requested review from merrymercy, Ying1123, zhyncs and ispobock as code owners October 30, 2024 23:44

fix lint/isort

592b67f

merrymercy merged commit 2d4ce1b into sgl-project:main Oct 31, 2024
13 checks passed
