
[Dev] Potentially improve performance through block reduction #63

Merged
LeiWang1999 merged 38 commits into microsoft:main on Jun 30, 2024

Conversation

LeiWang1999
Contributor

In our recent evaluations, we observed that batch inference on the matrix multiplication shapes used by 7B/13B models did not achieve the expected theoretical speedup. This bottleneck appears to be linked to the absence of block-reduction support (i.e., splitting the reduction dimension across additional threads) in our schedule template and in TVM.

This pull request includes a variety of changes across multiple files, mainly focusing on introducing new features related to block reduction.

Here are the most important changes:
* python/bitblas/base/roller/policy/tensorcore.py: A block_reduction_depth field was added to the Policy class, and this value is now taken into account in several functions, including _check_small_tile, _enlarge, and _score.
* python/bitblas/gpu/matmul_analysis.py: The check_last_trait function was updated to set block_reduction_depth to 2 for small M values.
* python/bitblas/gpu/matmul_mma.py: The apply_config function was updated to call a new function, apply_block_reduction_with_config, when block_reduction_depth is not None (a minimal sketch of this dispatch follows the list).
* 3rdparty/tvm: The subproject commit was updated.
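
To illustrate the dispatch described in the third item, here is a minimal sketch, not the actual implementation: the block_reduction_depth field and the apply_block_reduction_with_config entry point come from this PR, while the class name, the apply_default_schedule fallback, and the method bodies are hypothetical placeholders.

class MatmulScheduler:
    # Hypothetical wrapper around the MMA matmul schedule template.

    def apply_config(self, func, config):
        # Hints produced by the roller policy may now carry a
        # block_reduction_depth (e.g. 2 for small M). When it is set, the
        # block-reduction schedule path is taken instead of the default one.
        if getattr(config, "block_reduction_depth", None) is not None:
            return self.apply_block_reduction_with_config(func, config)
        return self.apply_default_schedule(func, config)  # hypothetical fallback

    def apply_block_reduction_with_config(self, func, config):
        # Split the reduction (k) axis across an extra thread dimension and
        # accumulate per-block partial sums; the real schedule lives in
        # python/bitblas/gpu/matmul_mma.py.
        ...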

LeiWang1999 and others added 30 commits May 21, 2024 11:51
@LeiWang1999
Contributor Author

BitBLAS performance on small shapes has not yet met our expectations and requires further investigation, so the block-reduction-related items were disabled when auto-tuning is enabled. cc @tzj-fxz if you have time.
Code to reproduce the performance:

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

from bitblas.utils.target_detector import auto_detect_nvidia_target
from bitblas import Matmul, MatmulConfig
import argparse
import bitblas
import tvm
from bitblas.base.roller.policy import TensorCorePolicy, DefaultPolicy
from bitblas.base.roller.arch import CUDA
from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags
from bitblas.base.utils import apply_and_build
import time
from tvm import te, tir

bitblas.set_log_level("DEBUG")
# Initialize the parser  
parser = argparse.ArgumentParser(  
    description="Benchmark BitBLAS int4 on a specific target."  
)  
  
# Add arguments to the parser  
parser.add_argument(  
    "--target",  
    type=str,  
    default=auto_detect_nvidia_target(),  
    help="Specify the target device for benchmarking."  
)  
parser.add_argument(  
    "--group_size",  
    type=int,  
    default=None,  
    help="Group size for grouped quantization."  
)  
parser.add_argument(  
    "--A_dtype",  
    type=str,  
    default="float16",  
    choices=["float16", "float32", "float64", "int32", "int8"],  # Assuming these are the valid choices  
    help="Data type of activation A."  
)  
parser.add_argument(  
    "--W_dtype",  
    type=str,  
    default="uint4",  
    help="Data type of weight W."  
)  
parser.add_argument(  
    "--accum_dtype",  
    type=str,  
    default="float16",  
    help="Data type for accumulation."  
)  
parser.add_argument(  
    "--out_dtype",  
    type=str,  
    default="float16",  
    choices=["float16", "float32", "int32", "int8"],  # Assuming these are the valid choices  
    help="Data type for output."  
)  
parser.add_argument(  
    "--layout",  
    type=str,  
    default="nt",  
    choices=["nt", "nn"],  # Assuming these are the valid choices  
    help="Matrix layout, 'nt' for non-transpose A and transpose W."  
)  
parser.add_argument(  
    "--with_bias",  
    action="store_true",  
    help="Include bias in the benchmark."  
)  
parser.add_argument(  
    "--with_scaling",  
    action="store_true",  
    help="Include scaling factor in the quantization."  
)  
parser.add_argument(  
    "--with_zeros",  
    action="store_true",  
    help="Include zeros in the quantization."  
)  
parser.add_argument(  
    "--zeros_mode",  
    type=str,  
    default=None,  
    choices=["original", "rescale", "quantized"],  # Replace with actual modes if applicable  
    help="Specify the mode for calculating zeros."  
)
parser.add_argument(
    "--propagate_a",
    type=lambda x: str(x).lower() in ("1", "true"),
    default=True,
    help="Whether to apply layout propagation (transform) to activation A."
)
parser.add_argument(
    "--propagate_b",
    type=lambda x: str(x).lower() in ("1", "true"),
    default=True,
    help="Whether to apply layout propagation (transform) to weight W."
)
  
# Parse the arguments  
args = parser.parse_args()  
  
# Assign arguments to variables  
target = args.target  
group_size = args.group_size  
A_dtype = args.A_dtype  
W_dtype = args.W_dtype  
accum_dtype = args.accum_dtype  
out_dtype = args.out_dtype  
layout = args.layout  
with_bias = args.with_bias  
with_scaling = args.with_scaling  
with_zeros = args.with_zeros  
zeros_mode = args.zeros_mode 
propagate_a = args.propagate_a
propagate_b = args.propagate_b

test_shapes = [
    (MatmulConfig, Matmul, (16, 16384, 16384, A_dtype, W_dtype, out_dtype, accum_dtype, layout, with_bias, group_size, with_scaling, with_zeros, zeros_mode)),
]

benchmark_sets = []
benchmark_sets.extend(test_shapes)

# fmt:on

benchmark_results = {}
for config, operator, input_args in benchmark_sets:
    matmul_config = config(*input_args, propagate_a=True, propagate_b=True, fast_decoding=True)
    matmul = operator(matmul_config, target=target, enable_tuning=False)
    func = matmul.prim_func

    intrin_info = bitblas.base.roller.hint.IntrinInfo(
        in_dtype="float16",
        out_dtype="float16",
        trans_b=True,
        input_transform_kind=2,
        weight_transform_kind=2,
    )


    sch_normal = bitblas.gpu.MatmulTensorizationMMAWithDequantizeInfo().sch_shared_memory_prefetch_with_config(
        func,
        bitblas.base.roller.hint.Hint().from_dict({
            "warp": [16, 16],
            "block": [16, 64],
            "rstep": [128],
            "pipeline_stage": 2,
            "use_async": True,
            "intrin_info": intrin_info,
            "shared_scope": "shared",
            "vectorize": {
                "A": 8,
                "B": 8,
            },
            "rasterization_plan": bitblas.base.roller.Rasterization2DColumn(10)
        })
    )
    with tvm.transform.PassContext(config={"tir.use_async_copy": True, "tir.merge_static_smem": False, "cuda.kernels_output_dir": "./debug/bitblas_fp16xint4_fp16_pb_noscale_with_default"}):
        rt_mod = tvm.build(sch_normal.mod, target=matmul.target)
    time_evaluator = rt_mod.time_evaluator(rt_mod.entry_name, tvm.cuda(), number=10)
    profile_tensors = matmul.get_profile_tensors()
    latency = time_evaluator(*profile_tensors).mean * 1e3
    # print(rt_mod.imported_modules[0].get_source())
    print(f"Time cost is: {latency:.3f} ms")

    sch_reduce = bitblas.gpu.MatmulTensorizationMMAWithDequantizeInfo().sch_shared_memory_prefetch_with_config(
        func,
        bitblas.base.roller.hint.Hint().from_dict({
            "warp": [16, 16],
            "block": [16, 64],
            "rstep": [128],
            "pipeline_stage": 2,
            "use_async": True,
            "intrin_info": intrin_info,
            "shared_scope": "shared",
            "vectorize": {
                "A": 8,
                "B": 8,
            },
            "block_reduction_depth": 2,
            "rasterization_plan": bitblas.base.roller.Rasterization2DColumn(10)
        })
    )
    with tvm.transform.PassContext(config={"tir.use_async_copy": True, "tir.merge_static_smem": False, "cuda.kernels_output_dir": "./debug/bitblas_fp16xint4_fp16_pb_noscale_with_default"}):
        rt_mod = tvm.build(sch_reduce.mod, target=matmul.target)
    time_evaluator = rt_mod.time_evaluator(rt_mod.entry_name, tvm.cuda(), number=10)
    latency = time_evaluator(*profile_tensors).mean * 1e3
    # print(rt_mod.imported_modules[0].get_source())
    print(f"Time cost is: {latency:.3f} ms")
 

@LeiWang1999 LeiWang1999 marked this pull request as ready for review June 30, 2024 11:40
@LeiWang1999 LeiWang1999 merged commit e7ed676 into microsoft:main Jun 30, 2024
4 checks passed