[Dev] Implement ScheduleUnsafeInjectCallArgument Primitive to Hack decoding #124

LeiWang1999 · 2024-08-04T07:18:09Z

This pull request includes several changes to improve the handling of decoding and memory prefetching in the GPU intrinsics and matmul dequantization logic. The key changes involve adding support for offset handling in decoding functions, updating the pass context for TVM transformations, and enhancing the scheduling logic for shared memory prefetching.

Decoding Enhancements:

Added new decoding functions decode_i4_to_f16_scale_offset, decode_i4s_to_f16_scale_offset, and decode_i4u_to_f16_scale_offset to handle offset during decoding (bitblas/gpu/intrin/lop3.py).
Updated get_fast_decode_intrin to append _offset to function names when storage_scope is "warp" and scaling is enabled (bitblas/gpu/intrin/lop3.py).
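The naming rule described above can be sketched as follows. This is only an illustration: the helper name `fast_decode_func_name` and its parameters are hypothetical and are not the signature of `get_fast_decode_intrin`; only the resulting function names come from this PR description.

```python
def fast_decode_func_name(base_name: str, with_scale: bool, storage_scope: str) -> str:
    """Hypothetical helper: derive the decode function name from the options.

    base_name is the name without suffixes, e.g. "decode_i4u_to_f16".
    """
    name = base_name
    if with_scale:
        name += "_scale"
        # Per the PR description, the "_offset" variant is selected only when
        # the storage scope is "warp" and scaling is enabled.
        if storage_scope == "warp":
            name += "_offset"
    return name


if __name__ == "__main__":
    for base in ("decode_i4_to_f16", "decode_i4s_to_f16", "decode_i4u_to_f16"):
        print(fast_decode_func_name(base, with_scale=True, storage_scope="warp"))
```

With scaling disabled, or with any scope other than "warp", the sketch leaves the `_offset` suffix off.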

TVM Transformation Context:

Modified the tvm_callback_cuda_postproc function to include "tir.disable_cse_tir": True in the TVM pass context configuration (bitblas/base/utils.py).
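For readers unfamiliar with TVM pass contexts, a minimal sketch of passing this option is shown here. The helpers `build_pass_config` and `run_under_pass_context` are hypothetical names introduced for illustration and do not appear in bitblas/base/utils.py; `tvm.transform.PassContext` and the `tir.disable_cse_tir` option are the only TVM pieces used, and TVM is imported lazily so the config helper works without it.

```python
from typing import Any, Callable, Dict, Optional

# Option named in the PR description; it turns off common subexpression
# elimination in the TIR pass pipeline.
BASE_PASS_CONFIG: Dict[str, Any] = {"tir.disable_cse_tir": True}


def build_pass_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: merge caller overrides on top of the base config."""
    config = dict(BASE_PASS_CONFIG)
    config.update(overrides or {})
    return config


def run_under_pass_context(fn: Callable[[], Any],
                           overrides: Optional[Dict[str, Any]] = None) -> Any:
    """Hypothetical helper: call fn() inside a TVM PassContext (requires TVM)."""
    import tvm  # imported lazily; only needed when this function is called

    with tvm.transform.PassContext(config=build_pass_config(overrides)):
        return fn()
```

Any lowering or build call made inside `fn` would then see the option for the duration of the `with` block.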

Scheduling and Prefetching:

Enhanced shared memory prefetching logic to handle different reduction depths and weight transform kinds in sch_shared_memory_prefetch_with_config (bitblas/gpu/matmul_mma_dequantize.py).
Updated the get_param_indices and fetch_to_shared functions to support reduction across threads (bitblas/gpu/matmul_mma_dequantize.py).
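As a rough picture of what "reduction across threads" means in general, the sketch below splits one dot product over several groups that each accumulate a partial sum, then combines the partials. It is plain Python for illustration only and is not the BitBLAS schedule; the function name `split_k_dot` and the parameter `reduce_k` are assumptions made for this example.

```python
from typing import List, Sequence


def split_k_dot(a: Sequence[int], b: Sequence[int], reduce_k: int) -> int:
    """Dot product with the reduction axis split over reduce_k groups.

    Group g handles indices k with k % reduce_k == g, mimicking threads that
    each reduce a slice of K before a final cross-thread reduction.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    if reduce_k < 1:
        raise ValueError("reduce_k must be at least 1")

    partials: List[int] = [0] * reduce_k
    for k, (x, y) in enumerate(zip(a, b)):
        partials[k % reduce_k] += x * y  # per-group partial sum

    return sum(partials)  # final reduction across groups


if __name__ == "__main__":
    print(split_k_dot([1, 2, 3, 4], [5, 6, 7, 8], reduce_k=2))
```

The result does not depend on `reduce_k` for integer inputs; the parameter only changes how the work is divided.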

Intrinsic Definitions:

Added new intrinsic definitions for warp scope in intrin_definitions to support various configurations (bitblas/gpu/intrin/lop3.py).

These changes collectively enhance the functionality and performance of the GPU intrinsics and matmul dequantization processes.

LeiWang1999 · 2024-08-04T07:18:45Z

Also fixes the CodeQL warning referenced in #121.

LeiWang1999 · 2024-08-04T11:31:27Z

Added a GPTQ repack test to check the correctness of the integration.

LeiWang1999 added 30 commits July 5, 2024 08:54

Refactor BatchMatMulEmitter and BatchMatMulSelector for improved readability and maintainability

d8884e6

Refactor import statements for improved readability and maintainability

fc84173

Refactor import statements for improved readability and maintainability

02f64de

disable failure email for ci

397eee6

remove email notifications.

20f6ad1

move relax pass from testing to mlc_llm

b93c394

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into main

ba6a6df

Refactor scripts with se check_eual_ref_scripts_with_emitter function

257693a

Lint Fix

9bb7f49

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into main

39e7614

Refactor scripts with se check_eual_ref_scripts_with_emitter function

93eb5a5

bug fix in test

aa66a90

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into dev

ae14a53

lint fix.

79b08e4

test cuda i4 kernel

86fd036

Refactor copyright notice in i4matmul.hpp

6b73a21

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into dev

0ba90c1

Refactor BitBLASLinear test module for improved readability and maintainability

086d208

refactor test as version below python 3.9 cannot handle int32 overflow.

47a3abd

format lint for test

024b247

Refactor test_int4b_fp16_convert.py for improved readability and maintainability

bfedeaa

remove unused design file

e672a23

move tile device from package to base

21e5430

dummy impl for codegen

fd11940

Refactor file structure for ladder_permutate module

9ccfa85

Refactor backend class and fix typos in comments

7c7d73e

Deep refactor Lib related code.

47d5fc5

remove ci pull.

53dd0dd

LintFix

d58ac43

refactor builder for whl build

37cb07c

LeiWang1999 added 20 commits July 31, 2024 11:23

add test for stage3 propagate

802abde

implement propagate func

d339037

Stage3 Ladder Permutate integration

0f6a033

get_ladder_stage3_propagate

00ec916

comments benchmark scirpts as the setting is too big

5316577

ci fix for benchmark

dd070f9

lint fix

6fcc368

chore: Update benchmark workflow to trigger on pull request comments

705580b

Add LDMatrix Transform 3

c5ba940

Support GPTQ Test

1566990

Fuse BlockReduce Schedule

c6c70ef

Support mma propagate 3

36128f3

Support MMA Propagate Stage 3

23ff5f4

Lint Fix

de3bf08

Merge block reduce for dequantze config.

d9830ba

fix codeql

e5a4485

chore: Update submodule reference to latest commit

a04282b

chore: Disable common subexpression elimination in TIR passes

314d3e9

Lint Fix

f7d33bb

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into dev

db633ed

LeiWang1999 added 3 commits August 4, 2024 11:03

4bit related lop3 updates.

201155a

lint fix

2b73662

gptq test fix

1a6a0fd

LeiWang1999 added 3 commits August 4, 2024 16:15

Fix for test

e84e3ef

lint fix

f0fbb55

lint fix

bf30688

LeiWang1999 merged commit 164d1ab into microsoft:main Aug 4, 2024
6 checks passed

LeiWang1999 mentioned this pull request Aug 4, 2024

Fix code scanning alert - Alert Suppression Report #121

Closed

1 task
