[PTX] `ldmatrix` builtin to accelerate copying data from shared memory to warp memory #10855

yzh119 · 2022-04-01T02:18:00Z

We already have PTX mma and mma.sp builtin support from #9909 and #10339. However, we do not yet support the corresponding data movement builtins for these mma instructions, so data movement is not as fast as with wmma.

This PR adds the `ldmatrix` builtin, a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix) that loads one, two, or four 8x8 matrices from shared memory to warp memory.
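To make "load 8x8 matrices to warp memory" concrete, here is a small pure-Python model of how the elements are spread across the 32 lanes of a warp. It is only a sketch of the fragment layout as I read it in the PTX documentation linked above (16-bit elements, two elements packed per 32-bit register). It is not the TVM builtin and does not touch a GPU, and the names `fragment_layout` and `row_address_lane` are made up for this illustration.

```python
# Illustrative model only (no GPU, no TVM): which elements each lane holds
# after an ldmatrix of 8x8 matrices with 16-bit elements.
WARP_SIZE = 32


def row_address_lane(matrix, row):
    """Lane expected to supply the shared-memory address of `row` in `matrix`.

    Assumption: lanes 0-7 address matrix 0, lanes 8-15 matrix 1, and so on.
    """
    return matrix * 8 + row


def fragment_layout(num=1, trans=False):
    """Map each lane to the (matrix, row, col) elements it receives.

    Returns layout[lane][matrix] -> list of two (matrix, row, col) tuples,
    i.e. the two 16-bit elements packed into that lane's 32-bit register.
    """
    if num not in (1, 2, 4):
        raise ValueError("ldmatrix loads 1, 2, or 4 matrices")
    layout = []
    for lane in range(WARP_SIZE):
        # Four consecutive lanes share one row; each lane takes two columns.
        row, col = lane // 4, 2 * (lane % 4)
        regs = []
        for m in range(num):
            pair = [(row, col), (row, col + 1)]
            if trans:
                # Assumed .trans behaviour: row and column indices are swapped.
                pair = [(c, r) for r, c in pair]
            regs.append([(m, r, c) for r, c in pair])
        layout.append(regs)
    return layout


layout = fragment_layout(num=4)
print(layout[5][1])  # the two elements of matrix 1 held by lane 5
```

The point of the model is that one instruction fills every lane's registers in exactly the layout the mma operands expect, which is why it can replace a sequence of per-thread shared-memory loads.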

@vinx13 @Hzfengsy

yzh119

It turns out that SplitHostDevice splits the host and device functions at the position of launch_thread. Because launch_thread was always placed under the block-allocated buffers, those buffers were not recognized as device buffers.

I created a boundary block as a workaround.

tests/python/unittest/test_tir_ptx_ldmatrix.py

yzh119 added 2 commits March 31, 2022 19:11

init commit

d424c05

add requires_cuda flag

47a3bb9

vinx13 approved these changes Apr 1, 2022

Hzfengsy approved these changes Apr 2, 2022

fix test

c4b21b8

yzh119 commented Apr 2, 2022

tests/python/unittest/test_tir_ptx_ldmatrix.py

junrushao merged commit 966d018 into apache:main Apr 3, 2022

masahi mentioned this pull request May 18, 2022

[TIR] Support tensorization using ldmatrix + MMA #11355

Merged

driazati mentioned this pull request Jul 14, 2022

TVM v0.9.0.rc0 Release Candidate Notes #12102

Closed
