[Dev] Enhancing Lower Warp Memory Pass to support decode within warp memory #110

Merged · 102 commits merged into microsoft:main on Jul 30, 2024

Conversation

@LeiWang1999 (Contributor) commented on Jul 30, 2024

As we delved deeper into the contiguous batching optimizations for mixed-precision GEMM, a crucial insight emerged: performing dequantization at the warp tile level can conserve memory bandwidth, though it introduces a small computational overhead. To facilitate this, we must improve the lower warp memory pass, as TVM struggles to manage warp memory with decode intrinsics.

This pull request implements this optimization, and we can now codegen mixed-precision GEMM with warp-level dequantization. Some TODO items still need to be resolved in future development to officially integrate this optimization.
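The generated kernels are not shown in the PR description; the following is a minimal, self-contained CUDA sketch of the general style of LOP3-based decode this enables, where each thread unpacks 4-bit weights into fp16 register fragments (warp memory) instead of staging dequantized values through shared memory. All identifiers (`decode_u4_to_f16`, `decode_kernel`) are illustrative and are not the names used in this repository.

```cuda
// Hedged sketch of warp-tile dequantization: packed uint4 -> fp16 via lop3.b32.
// Compile with e.g.: nvcc -arch=sm_80 decode_sketch.cu
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>

// Decode 8 packed 4-bit values (one uint32) into 8 fp16 values held in registers.
// A single lop3.b32 fuses "mask the nibble" and "OR in the fp16 bias 1024.0";
// a half2 subtract then removes the bias.
__device__ __forceinline__ void decode_u4_to_f16(uint32_t qword, half2 dst[4]) {
  static constexpr uint32_t MASK      = 0x000f000f;            // low nibble of each 16-bit lane
  static constexpr uint32_t FP16_1024 = 0x64006400;            // half2(1024.0, 1024.0) bit pattern
  static constexpr uint32_t IMM_LUT   = (0xf0 & 0xcc) | 0xaa;  // lop3 LUT for (a & b) | c
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    uint32_t src = qword >> (4 * i);  // nibble i -> bit 0, nibble i+4 -> bit 16
    uint32_t h;
    asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
                 : "=r"(h)
                 : "r"(src), "n"(MASK), "n"(FP16_1024), "n"(IMM_LUT));
    // Each 16-bit lane now holds fp16(1024 + q); subtract the bias to recover q.
    // Lane order is (q[i], q[i+4]) -- a permuted layout the surrounding
    // warp-tile indexing would have to account for.
    dst[i] = __hsub2(*reinterpret_cast<half2*>(&h), __float2half2_rn(1024.0f));
  }
}

// Each thread decodes its own packed word entirely in registers (warp memory),
// rather than writing dequantized fp16 values to shared memory first.
__global__ void decode_kernel(const uint32_t* qweight, half* out) {
  half2 frag[4];
  decode_u4_to_f16(qweight[threadIdx.x], frag);
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    out[threadIdx.x * 8 + 2 * i]     = __low2half(frag[i]);   // q[i]
    out[threadIdx.x * 8 + 2 * i + 1] = __high2half(frag[i]);  // q[i+4]
  }
}

int main() {
  uint32_t h_q = 0x76543210u;  // nibbles 0..7
  uint32_t* d_q; half* d_out; half h_out[8];
  cudaMalloc(&d_q, sizeof(uint32_t));
  cudaMalloc(&d_out, 8 * sizeof(half));
  cudaMemcpy(d_q, &h_q, sizeof(uint32_t), cudaMemcpyHostToDevice);
  decode_kernel<<<1, 1>>>(d_q, d_out);
  cudaMemcpy(h_out, d_out, 8 * sizeof(half), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 8; ++i) printf("%.0f ", __half2float(h_out[i]));
  printf("\n");  // expected (permuted order): 0 4 1 5 2 6 3 7
  cudaFree(d_q); cudaFree(d_out);
  return 0;
}
```

Because lop3 fuses the mask and the bias-OR into one instruction, decoding at the warp tile level costs only a handful of extra instructions per fragment, which is why the bandwidth savings described above can outweigh the added compute.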

## TODO

  • Introduce Transform Propagation Level 3, which can also enable weight propagation to eliminate the ldmatrix instruction.
  • Check the correctness of Weight Propagation Stage 3.
  • Optimize the design of the LOP3 tensor intrinsics: they now support not only local scope but also warp scope, and the buffer slot implementation should be converted into a pointer with dynamic offsets instead of a Var.

LeiWang1999 and others added 28 commits July 23, 2024 09:23
@LeiWang1999 merged commit fd07a82 into microsoft:main on Jul 30, 2024
5 checks passed