
benchmark script for simple_gla vs mamba2 kernel #50

Merged · 1 commit into fla-org:main on Aug 18, 2024

Conversation

learning-chip (Contributor) commented on Aug 18, 2024

Follow-up to #49.

Amazingly, chunk_simple_gla seems to be much faster than mamba_chunk_scan_combined:

$ python ./benchmark_simple_gla_vs_mamba2.py

Performance:
         T  chunk_simple_gla  mamba2_ssd
0     64.0          0.084992    0.840208
1    128.0          0.100352    0.847920
2    256.0          0.100368    0.848896
3    512.0          0.174080    0.873472
4   1024.0          0.399360    0.880208
5   2048.0          0.776352    1.596416
6   4096.0          1.526784    3.160064
7   8192.0          3.067904    6.251520
8  16384.0          6.220800   12.452864

[Performance plot of the table above]
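For reference, a minimal sketch of how such a timing comparison could be set up (this is not the script added in this PR; the import paths, tensor layouts, and call signatures of chunk_simple_gla and mamba_chunk_scan_combined are assumptions and may differ across fla / mamba_ssm versions):

```python
import torch
import triton

from fla.ops.simple_gla import chunk_simple_gla                           # assumed path
from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined   # assumed path

bsz, nheads, headdim, dstate, chunk = 4, 8, 64, 64, 64
device, dtype = 'cuda', torch.bfloat16

for T in [64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]:
    # chunk_simple_gla inputs: assumed (batch, heads, seq, dim) layout,
    # with a per-token scalar log decay g
    q = torch.randn(bsz, nheads, T, headdim, dtype=dtype, device=device)
    k, v = torch.randn_like(q), torch.randn_like(q)
    g = torch.rand(bsz, nheads, T, dtype=torch.float32, device=device).log()

    # mamba_chunk_scan_combined inputs: assumed (batch, seq, heads, dim) layout
    x = torch.randn(bsz, T, nheads, headdim, dtype=dtype, device=device)
    dt = torch.rand(bsz, T, nheads, dtype=torch.float32, device=device)
    A = -torch.rand(nheads, dtype=torch.float32, device=device)
    B = torch.randn(bsz, T, 1, dstate, dtype=dtype, device=device)
    C = torch.randn_like(B)

    # median runtime in ms, as reported by triton's built-in benchmarking helper
    t_gla = triton.testing.do_bench(lambda: chunk_simple_gla(q, k, v, g))
    t_ssd = triton.testing.do_bench(
        lambda: mamba_chunk_scan_combined(x, dt, A, B, C, chunk_size=chunk))
    print(f"T={T:6d}  chunk_simple_gla {t_gla:8.4f} ms  mamba2_ssd {t_ssd:8.4f} ms")
```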

I left many TODO and NOTE comments in the benchmark script, including:

  • Testing more input shapes
  • Tuning block sizes
  • Analyzing the impact of input memory layout

More importantly:

  • More detailed profiling to understand exactly why it is faster.

Maybe the Mamba-2 kernel incurs more memory I/O (is less "fused")? And why does the short-sequence performance (T < 256) differ by so much?

yzhangcs (Member) commented:

@learning-chip Great job! Appreciate your quick actions.

yzhangcs merged commit c60ada3 into fla-org:main on Aug 18, 2024 (1 check passed)
sustcsonglin (Collaborator) commented:

@learning-chip Mamba2’s official kernel involves three main steps: 1) computation of each chunk’s last hidden state, 2) recurrence at the chunk level, and 3) output computation.

For steps 1) and 2), it stores/loads the hidden state in FP32, which incurs significant I/O costs.

FLA's implementation fuses steps 1) and 2), avoids materializing the FP32 hidden state after step 1), and stores only the BF16 hidden state after step 2), thus reducing I/O costs.
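To make the three steps concrete, here is an illustrative, decay-free chunkwise linear-attention reference in plain PyTorch (a sketch of the structure described above, not of either kernel; the function name and the omission of decay are simplifications). The kv and states intermediates are the tensors that, per the comment above, the Mamba-2 kernel writes to global memory in FP32, while the fused FLA kernel avoids materializing the step-1 output and keeps only a BF16 state after step 2.

```python
import torch

def chunked_linear_attn(q, k, v, chunk=64):
    # q, k, v: (batch, heads, seq, dim); decay is omitted for brevity
    B, H, T, D = q.shape
    assert T % chunk == 0
    q, k, v = (x.view(B, H, T // chunk, chunk, D) for x in (q, k, v))

    # Step 1: each chunk's contribution to the hidden state, K_i^T V_i.
    # In the Mamba-2 kernel this intermediate is materialized in FP32.
    kv = k.transpose(-1, -2) @ v                         # (B, H, N, D, D)

    # Step 2: chunk-level recurrence (an exclusive prefix sum of the
    # per-chunk states). These running states are the other FP32 buffer.
    S = torch.zeros(B, H, D, D, dtype=torch.float32, device=q.device)
    states = []
    for i in range(kv.shape[2]):
        states.append(S)
        S = S + kv[:, :, i].float()
    states = torch.stack(states, dim=2)                  # (B, H, N, D, D)

    # Step 3: output = inter-chunk term (queries against the carried state)
    # plus intra-chunk causal attention within each chunk.
    inter = q @ states.to(q.dtype)
    mask = torch.ones(chunk, chunk, dtype=torch.bool, device=q.device).tril()
    attn = (q @ k.transpose(-1, -2)).masked_fill(~mask, 0)
    intra = attn @ v
    return (inter + intra).reshape(B, H, T, D)
```

Fusing steps 1) and 2) means the per-chunk kv never leaves registers/shared memory, and only the carried state (in BF16) is written out, which is the I/O saving described above.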
