
Minor SDPA optimizations #16566

Merged: cglagovichTT merged 10 commits into main from cglagovich/sdpa_opt on Jan 17, 2025

Conversation

@cglagovichTT (Contributor) commented on Jan 9, 2025

Ticket

Subtask of #16557

Problem description

SDPA has quite a few unnecessary operations which make it inefficient, especially as sequence length grows.

What's changed

  • Remove all block copies by efficiently ping-ponging buffers with aliases and std::swap (see the sketch after this list)
  • Use L1 accumulation to update intermediate output to avoid an extra unpack/add/pack
  • Reuse DST in mul_block_bcast_cols_accumulate
  • Fix bug with DST reuse and dhead=96
  • Don't allocate a CB for the mask if no mask is used; this reduces memory waste and enables larger chunk sizes
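
A rough illustration of the buffer ping-pong from the first bullet (a sketch with hypothetical buffer names, not the actual kernel code; the real change swaps circular-buffer aliases inside the SDPA compute kernel so that the running statistics never need a block copy):

```cpp
// Minimal sketch: instead of copying the "current" running-max / running-sum
// tiles into a "previous" buffer on every KV chunk, keep two buffer handles
// and swap which one plays each role.
#include <cstdint>
#include <utility>

struct StatsBuffers {
    uint32_t cb_cur_max;   // hypothetical circular-buffer indices
    uint32_t cb_prev_max;
    uint32_t cb_cur_sum;
    uint32_t cb_prev_sum;
};

void process_kv_chunks(StatsBuffers& bufs, int num_chunks) {
    for (int chunk = 0; chunk < num_chunks; ++chunk) {
        // ... compute new max/sum into bufs.cb_cur_* using bufs.cb_prev_* ...

        // Before: copy cb_cur_max -> cb_prev_max and cb_cur_sum -> cb_prev_sum.
        // After: just exchange the roles of the two buffers.
        std::swap(bufs.cb_cur_max, bufs.cb_prev_max);
        std::swap(bufs.cb_cur_sum, bufs.cb_prev_sum);
    }
}
```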

For the following test case, we get a nice 1.084x speedup.
tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention.py::test_sdpa_tt_large_seq[1-8-1-131072-128-k128-q128-bf16]

| ID | Total % | OP Code | Device Time | Cores | Math Fidelity |
|---:|--------:|---------|------------:|------:|---------------|
| 2 | 100.0 % | OLD ScaledDotProductAttention | 1,949,831 us | 64 | BF16, BF16 => BF16 |
| 2 | 100.0 % | NEW ScaledDotProductAttention | 1,798,710 us | 64 | BF16, BF16 => BF16 |
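
(The 1.084x figure above is the ratio of device times: 1,949,831 / 1,798,710 ≈ 1.084.)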

Checklist

@caixunshiren (Contributor) left a comment:


Overall looks good!

Review comment on this hunk:

```cpp
        dht_granularity = 1;
        log2_dht_granularity = 0;
    }
    TT_FATAL(dht_granularity == (1 << log2_dht_granularity), "Error");
```

Maybe better error messaging?
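
One way to make the message more descriptive would be to report the offending values, for example (a hypothetical rewording; the exact message wasn't settled in this thread):

```cpp
// Sketch of a more informative assertion than the bare "Error": include the
// actual values so a failure explains itself.
TT_FATAL(
    dht_granularity == (1 << log2_dht_granularity),
    "dht_granularity ({}) must equal 2^log2_dht_granularity (2^{} = {})",
    dht_granularity,
    log2_dht_granularity,
    1 << log2_dht_granularity);
```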

cglagovichTT requested a review from a team as a code owner on January 16, 2025, 20:01

@cglagovichTT (Contributor, Author) commented:

I found that one of the optimizations in this branch, using mul_block_bcast_cols to write directly to cb_out, leads to inexplicable PCC issues in Llama tests. I was able to reproduce this in a chunked prefill unit test, but it's unclear why this optimization leads to different outputs from before.

@caixunshiren (Contributor) left a comment:


LGTM

cglagovichTT merged commit 9a3766d into main on Jan 17, 2025
219 of 223 checks passed
cglagovichTT deleted the cglagovich/sdpa_opt branch on January 17, 2025, 18:51