Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PTX] ldmatrix builtin to accelerate copying data from shared memory to warp memory #10855

Merged
merged 3 commits into from
Apr 3, 2022

Conversation

yzh119
Copy link
Member

@yzh119 yzh119 commented Apr 1, 2022

We already have PTX mma and mma.sp builtin support in #9909 and #10339 . However, we have not supported corresponding data movement builtins for these mma instructions, so the data movement would not be as fast as wmma.

This PR brings the ldmatrix builtin, which is a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix), and we can use it to load several (1/2/4) 8x8 matrices from shared memory to warp memory.

@vinx13 @Hzfengsy

Copy link
Member Author

@yzh119 yzh119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that SplitHostDevice would split device and host function via the position of launch_thread, and launch_thread was always placed under block allocated buffers, thus the allocated buffer would not be recognized as device buffer.

I created a boundary block as a workaround.

tests/python/unittest/test_tir_ptx_ldmatrix.py Outdated Show resolved Hide resolved
@junrushao junrushao merged commit 966d018 into apache:main Apr 3, 2022
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
…y to warp memory (apache#10855)

We already have PTX mma and mma.sp builtin support in apache#9909  and apache#10339 . However, we have not supported corresponding data movement builtins for these mma instructions, so the data movement would not be as fast as wmma.

This PR brings the `ldmatrix` builtin, which is a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix), and we can use it to load several (1/2/4) 8x8 matrices from shared memory to warp memory.
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Apr 11, 2022
…y to warp memory (apache#10855)

We already have PTX mma and mma.sp builtin support in apache#9909  and apache#10339 . However, we have not supported corresponding data movement builtins for these mma instructions, so the data movement would not be as fast as wmma.

This PR brings the `ldmatrix` builtin, which is a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix), and we can use it to load several (1/2/4) 8x8 matrices from shared memory to warp memory.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants