
Optimization to put LHS operand in registers for WGMMA before elementwise ops #17

Closed · 28 commits

Conversation

@ggengnv commented Sep 19, 2024

UPDATE - split into two PRs per @vwbaker's suggestion: #18, #19. Comments will be addressed in these new PRs instead.

Notes

  • Benchmarked locally on an H100 (HBM3, 700W) with a sample i8 -> bf16 workload on two shapes; seeing a 15-20% perf gain.
  • Prefetching into RF will go into a separate PR.
  • The coalescing issue is fixed in the latest commits.
  • I largely reuse the existing Ampere logic for the SMEM-to-RF copy, with minor modifications for Hopper. More changes might be needed if SMEM bank conflicts become an issue or if we can't use this with TMA.

Hopper has two kinds of WGMMA: "SS" (both operands in SMEM) and "RS" (LHS operand A in registers).
In cases where we apply elementwise operations on A before the WGMMA, Triton previously would copy A from global memory (GMEM) into registers (RF), perform the elementwise ops, and then copy the result to shared memory (SMEM) to perform an SS WGMMA.

This PR adds an optimization for the case above to use RS WGMMA instead, with the benefit of SMEM pipelining: copy from GMEM to SMEM, then from SMEM to RF, where the elementwise ops run before the RS WGMMA. The copy from GMEM to SMEM is necessary for coalesced access and allows pipelining with LDGSTS (and potentially TMA in the future).

@gflegar marked this pull request as draft on September 20, 2024 at 17:52.
@ggengnv changed the title from "[DRAFT] Optimization to put LHS operand in registers for WGMMA before elementwise ops" to "Optimization to put LHS operand in registers for WGMMA before elementwise ops" on Sep 20, 2024.
@ggengnv marked this pull request as ready for review on September 20, 2024 at 21:09.
@ggengnv (Author) commented Sep 20, 2024

Fixed the global coalescing issue, and the PR should be ready for full review. Prefetch will be added in another PR, since we're already seeing perf gains with the current changes.

lib/Dialect/TritonGPU/IR/Dialect.cpp (two resolved review threads)
// Can only hoist operand 0 (the LHS, A): RS WGMMA only takes A in registers.
auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
if (!alloc || !alloc.getSrc())
  return failure();
Member comment:

Prefer rewriter.notifyMatchFailure().
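
For reference, a minimal sketch of the suggestion applied to the snippet above (the diagnostic message is illustrative, not from the PR):

// Same guard as above, but recording *why* the pattern failed to apply,
// which MLIR can surface in pattern-rewrite debug output.
auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
if (!alloc || !alloc.getSrc())
  return rewriter.notifyMatchFailure(
      dotOp, "expected operand 0 (A) to be a local_alloc with a source");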

lib/Dialect/TritonGPU/IR/Dialect.cpp (resolved review thread)
}
for (int k = 0; k < n1; ++k)
  if (isHopper) {
    // WGMMA.cpp expects different (m-major) ordering
Member comment:
Simply naming a file here is misleading, since you could have just changed the expectation in that file. Why does it expect a different ordering? Is this about the warp order being different on Hopper than on Ampere?

@ggengnv (Author) commented Sep 23, 2024

AFAIK m-major is the hardware-expected ordering for both Ampere and Hopper; the difference here is probably just code convention.

Ampere uses k-major ordering here, but there is logic when lowering MMAv2 to LLVM that remaps the ordering from k-major to m-major.

WGMMA.cpp has no analogous remapping logic. Before this PR, WGMMA could already accept operand A as values in registers (i.e., an LLVM struct) as produced here. This code lowers convert_layout from MMA (accumulator) encoding to DotOp encoding, which was used for chained MMAs.

Therefore, if I were to change the expectation in WGMMA.cpp, I would also have to change the expectation for chained MMAs above. I expect that would be considerably more effort than the present approach, so I've opted to leave the current convention intact and just conditionally set the ordering here instead.

I can change the comment here to be clearer though.
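
For concreteness, a minimal sketch of the conditional ordering described above (vals, numM, numK, and the exact layout convention are invented for illustration; only isHopper comes from the actual snippet):

// Per-thread values of operand A, assumed here to be produced k-major
// (k fastest-varying). WGMMA.cpp consumes A m-major (m fastest-varying),
// so on Hopper we swap the loop nest instead of remapping later the way
// the MMAv2 path does.
SmallVector<Value> ordered;
if (isHopper) {
  for (int k = 0; k < numK; ++k)
    for (int m = 0; m < numM; ++m)
      ordered.push_back(vals[m * numK + k]); // emit with m fastest
} else {
  ordered = vals; // keep k-major; the MMAv2 lowering remaps it later
}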

Member reply:
Thanks for the explanation! Just a comment is enough, and it looks good now, thanks :)

@vwbaker (Member) commented Sep 23, 2024

As a general comment, I think it would be nicer to review this (and for OpenAI to review it) if it were separated into multiple PRs. You don't need a performance improvement in every PR; it's enough to have a chain of easily readable PRs that give a performance improvement in the end. This PR as a whole is doing a lot, and could at least have separated the enablement of registers in MMAv3 from the pipelining improvements that bring more performance gains.

@ggengnv (Author) commented Sep 23, 2024


I've split this into two PRs as you suggested :) - #18 and #19. I'll address the above comments in those new PRs from now on.

@ggengnv (Author) commented Sep 23, 2024

Closing - split into #18 and #19

@ggengnv (Author) commented Sep 23, 2024

Addressed all of the above comments in #18 and #19.
