[Dev] Enhancing Lower Warp Memory Pass to support decode within warp memory #110

Merged · 102 commits merged into microsoft:main on Jul 30, 2024

Conversation

@LeiWang1999 (Contributor) commented on Jul 30, 2024

As we delved deeper into the contiguous batching optimizations for mixed-precision GEMM, a crucial insight emerged: performing dequantization at the warp tile level can conserve memory bandwidth, though it introduces a small computational overhead. To facilitate this, we must improve the lower warp memory pass, as TVM struggles to manage warp memory with decode intrinsics.

This pull request implements this optimization, and we can now codegen mixed-precision GEMM with warp-level dequantization. Some TODO items still need to be resolved in future development to officially integrate this optimization.
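The generated kernels are not shown in the PR description; the following is a minimal, self-contained CUDA sketch of the general style of LOP3-based decode this enables, where each thread unpacks 4-bit weights into fp16 register fragments (warp memory) instead of staging dequantized values through shared memory. All identifiers (`decode_u4_to_f16`, `decode_kernel`) are illustrative and are not the names used in this repository.

```cuda
// Hedged sketch of warp-tile dequantization: packed uint4 -> fp16 via lop3.b32.
// Compile with e.g.: nvcc -arch=sm_80 decode_sketch.cu
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>

// Decode 8 packed 4-bit values (one uint32) into 8 fp16 values held in registers.
// A single lop3.b32 fuses "mask the nibble" and "OR in the fp16 bias 1024.0";
// a half2 subtract then removes the bias.
__device__ __forceinline__ void decode_u4_to_f16(uint32_t qword, half2 dst[4]) {
  static constexpr uint32_t MASK      = 0x000f000f;            // low nibble of each 16-bit lane
  static constexpr uint32_t FP16_1024 = 0x64006400;            // half2(1024.0, 1024.0) bit pattern
  static constexpr uint32_t IMM_LUT   = (0xf0 & 0xcc) | 0xaa;  // lop3 LUT for (a & b) | c
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    uint32_t src = qword >> (4 * i);  // nibble i -> bit 0, nibble i+4 -> bit 16
    uint32_t h;
    asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
                 : "=r"(h)
                 : "r"(src), "n"(MASK), "n"(FP16_1024), "n"(IMM_LUT));
    // Each 16-bit lane now holds fp16(1024 + q); subtract the bias to recover q.
    // Lane order is (q[i], q[i+4]) -- a permuted layout the surrounding
    // warp-tile indexing would have to account for.
    dst[i] = __hsub2(*reinterpret_cast<half2*>(&h), __float2half2_rn(1024.0f));
  }
}

// Each thread decodes its own packed word entirely in registers (warp memory),
// rather than writing dequantized fp16 values to shared memory first.
__global__ void decode_kernel(const uint32_t* qweight, half* out) {
  half2 frag[4];
  decode_u4_to_f16(qweight[threadIdx.x], frag);
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    out[threadIdx.x * 8 + 2 * i]     = __low2half(frag[i]);   // q[i]
    out[threadIdx.x * 8 + 2 * i + 1] = __high2half(frag[i]);  // q[i+4]
  }
}

int main() {
  uint32_t h_q = 0x76543210u;  // nibbles 0..7
  uint32_t* d_q; half* d_out; half h_out[8];
  cudaMalloc(&d_q, sizeof(uint32_t));
  cudaMalloc(&d_out, 8 * sizeof(half));
  cudaMemcpy(d_q, &h_q, sizeof(uint32_t), cudaMemcpyHostToDevice);
  decode_kernel<<<1, 1>>>(d_q, d_out);
  cudaMemcpy(h_out, d_out, 8 * sizeof(half), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 8; ++i) printf("%.0f ", __half2float(h_out[i]));
  printf("\n");  // expected (permuted order): 0 4 1 5 2 6 3 7
  cudaFree(d_q); cudaFree(d_out);
  return 0;
}
```

Because lop3 fuses the mask and the bias-OR into one instruction, decoding at the warp tile level costs only a handful of extra instructions per fragment, which is why the bandwidth savings described above can outweigh the added compute.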

## TODO

  • Introduce Transform Propagation Level 3, which can also enable weight propagation to eliminate the ldmatrix instruction.
  • Check the correctness of Weight Propagation Stage 3.
  • Optimize the design of the LOP3 tensor intrinsics: they now support not only local scope but also warp scope, and the buffer slot implementation should be converted into a pointer with dynamic offsets instead of a Var.

LeiWang1999 and others added 28 commits July 23, 2024 09:23
@LeiWang1999 merged commit fd07a82 into microsoft:main on Jul 30, 2024
5 checks passed