[Dev] Potentially improve performance through block reduction #63
Merged
Conversation
The BitBLAS performance on small shapes has not yet met our expectations, indicating that further investigation is necessary, so the block-reduce related items were disabled when auto-tuning is enabled. cc @tzj-fxz if you have time.
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from bitblas.utils.target_detector import auto_detect_nvidia_target
from bitblas import Matmul, MatmulConfig
import argparse
import bitblas
import tvm
from bitblas.base.roller.policy import TensorCorePolicy, DefaultPolicy
from bitblas.base.roller.arch import CUDA
from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags
from bitblas.base.utils import apply_and_build
import time
from tvm import te, tir
bitblas.set_log_level("DEBUG")
# Initialize the parser
parser = argparse.ArgumentParser(
description="Benchmark BitBLAS int4 on a specific target."
)
# Add arguments to the parser
parser.add_argument(
"--target",
type=str,
default=auto_detect_nvidia_target(),
help="Specify the target device for benchmarking."
)
parser.add_argument(
"--group_size",
type=int,
default=None,
help="Group size for grouped quantization."
)
parser.add_argument(
"--A_dtype",
type=str,
default="float16",
choices=["float16", "float32", "float64", "int32", "int8"], # Assuming these are the valid choices
help="Data type of activation A."
)
parser.add_argument(
"--W_dtype",
type=str,
default="uint4",
help="Data type of weight W."
)
parser.add_argument(
"--accum_dtype",
type=str,
default="float16",
help="Data type for accumulation."
)
parser.add_argument(
"--out_dtype",
type=str,
default="float16",
choices=["float16", "float32", "int32", "int8"], # Assuming these are the valid choices
help="Data type for output."
)
parser.add_argument(
"--layout",
type=str,
default="nt",
choices=["nt", "nn"], # Assuming these are the valid choices
help="Matrix layout, 'nt' for non-transpose A and transpose W."
)
parser.add_argument(
"--with_bias",
action="store_true",
help="Include bias in the benchmark."
)
parser.add_argument(
"--with_scaling",
action="store_true",
help="Include scaling factor in the quantization."
)
parser.add_argument(
"--with_zeros",
action="store_true",
help="Include zeros in the quantization."
)
parser.add_argument(
"--zeros_mode",
type=str,
default=None,
choices=["original", "rescale", "quantized"], # Replace with actual modes if applicable
help="Specify the mode for calculating zeros."
)
parser.add_argument(
"--propagate_a",
action="store_true",
default=True,
help="Apply the layout propagation (transform) to input A (enabled by default)."
)
parser.add_argument(
"--propagate_b",
action="store_true",
default=True,
help="Apply the layout propagation (transform) to weight B (enabled by default)."
)
# Parse the arguments
args = parser.parse_args()
# Assign arguments to variables
target = args.target
group_size = args.group_size
A_dtype = args.A_dtype
W_dtype = args.W_dtype
accum_dtype = args.accum_dtype
out_dtype = args.out_dtype
layout = args.layout
with_bias = args.with_bias
with_scaling = args.with_scaling
with_zeros = args.with_zeros
zeros_mode = args.zeros_mode
propagate_a = args.propagate_a
propagate_b = args.propagate_b
# fmt:off
test_shapes = [
(MatmulConfig, Matmul, (16, 16384, 16384, A_dtype, W_dtype, out_dtype, accum_dtype, layout, with_bias, group_size, with_scaling, with_zeros, zeros_mode)),
]
benchmark_sets = []
benchmark_sets.extend(test_shapes)
# fmt:on
benchmark_results = {}
for config, operator, input_args in benchmark_sets:
    matmul_config = config(*input_args, propagate_a=True, propagate_b=True, fast_decoding=True)
    matmul = operator(matmul_config, target=target, enable_tuning=False)
    func = matmul.prim_func
    intrin_info = bitblas.base.roller.hint.IntrinInfo(
        in_dtype="float16",
        out_dtype="float16",
        trans_b=True,
        input_transform_kind=2,
        weight_transform_kind=2,
    )
    # Baseline schedule: shared-memory prefetch without block reduction.
    sch_normal = bitblas.gpu.MatmulTensorizationMMAWithDequantizeInfo().sch_shared_memory_prefetch_with_config(
        func,
        bitblas.base.roller.hint.Hint().from_dict({
            "warp": [16, 16],
            "block": [16, 64],
            "rstep": [128],
            "pipeline_stage": 2,
            "use_async": True,
            "intrin_info": intrin_info,
            "shared_scope": "shared",
            "vectorize": {
                "A": 8,
                "B": 8,
            },
            "rasterization_plan": bitblas.base.roller.Rasterization2DColumn(10)
        })
    )
    with tvm.transform.PassContext(config={"tir.use_async_copy": True, "tir.merge_static_smem": False, "cuda.kernels_output_dir": "./debug/bitblas_fp16xint4_fp16_pb_noscale_with_default"}):
        rt_mod = tvm.build(sch_normal.mod, target=matmul.target)
    time_evaluator = rt_mod.time_evaluator(rt_mod.entry_name, tvm.cuda(), number=10)
    profile_tensors = matmul.get_profile_tensors()
    latency = time_evaluator(*profile_tensors).mean * 1e3
    # print(rt_mod.imported_modules[0].get_source())
    print(f"Time cost is: {latency:.3f} ms")
    # Same schedule hints, but with block reduction enabled (depth 2).
    sch_reduce = bitblas.gpu.MatmulTensorizationMMAWithDequantizeInfo().sch_shared_memory_prefetch_with_config(
        func,
        bitblas.base.roller.hint.Hint().from_dict({
            "warp": [16, 16],
            "block": [16, 64],
            "rstep": [128],
            "pipeline_stage": 2,
            "use_async": True,
            "intrin_info": intrin_info,
            "shared_scope": "shared",
            "vectorize": {
                "A": 8,
                "B": 8,
            },
            "block_reduction_depth": 2,
            "rasterization_plan": bitblas.base.roller.Rasterization2DColumn(10)
        })
    )
    with tvm.transform.PassContext(config={"tir.use_async_copy": True, "tir.merge_static_smem": False, "cuda.kernels_output_dir": "./debug/bitblas_fp16xint4_fp16_pb_noscale_with_default"}):
        rt_mod = tvm.build(sch_reduce.mod, target=matmul.target)
    time_evaluator = rt_mod.time_evaluator(rt_mod.entry_name, tvm.cuda(), number=10)
    latency = time_evaluator(*profile_tensors).mean * 1e3
    # print(rt_mod.imported_modules[0].get_source())
    print(f"Time cost is: {latency:.3f} ms")
In our recent evaluations, we observed that batched inference on the matrix multiplication shapes of 7B/13B models did not achieve the expected theoretical speedup. This performance bottleneck appears to be linked to the absence of block reduction support in our schedule templates and in TVM.
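For intuition only, here is a minimal NumPy sketch of the split-K idea behind block reduction (this is an illustration of the concept, not code from this PR or from the BitBLAS schedule): when M is small, a single output tile exposes too little parallelism, so the reduction dimension K is split into chunks whose partial results are summed in a final step.

import numpy as np

def matmul_block_reduce(A, B, reduce_depth=2):
    # Split K into `reduce_depth` chunks; each chunk produces a partial
    # product, and the partials are summed in a final reduction step.
    M, K = A.shape
    assert K % reduce_depth == 0
    chunk = K // reduce_depth
    partials = [
        A[:, i * chunk:(i + 1) * chunk] @ B[i * chunk:(i + 1) * chunk, :]
        for i in range(reduce_depth)
    ]
    return np.sum(partials, axis=0)

# Small-M shape, similar in spirit to the benchmarked (16, 16384, 16384) case.
A = np.random.randn(16, 1024).astype("float32")
B = np.random.randn(1024, 256).astype("float32")
np.testing.assert_allclose(matmul_block_reduce(A, B), A @ B, rtol=1e-3)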
This pull request includes a variety of changes across multiple files, mainly focusing on introducing new features related to block reduction.
Here are the most important changes:
* python/bitblas/base/roller/policy/tensorcore.py: The block_reduction_depth field was added to the Policy class, and this value is now considered in several functions, including _check_small_tile, _enlarge, and _score.
* python/bitblas/gpu/matmul_analysis.py: The check_last_trait function was updated to set the block_reduction_depth to 2 for small M values.
* python/bitblas/gpu/matmul_mma.py: The apply_config function was updated to call a new function, apply_block_reduction_with_config, if block_reduction_depth is not None (a rough sketch of this dispatch is given after this list).
* 3rdparty/tvm: The subproject commit was updated.
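As a rough sketch of the dispatch described in the matmul_mma.py item above (only apply_config and apply_block_reduction_with_config are named in this PR; everything else here is an assumption, not the actual implementation):

# Hypothetical sketch; the real signatures in python/bitblas/gpu/matmul_mma.py may differ.
def apply_config(self, func, config):
    if getattr(config, "block_reduction_depth", None) is not None:
        # Take the new block-reduction scheduling path introduced by this PR.
        return self.apply_block_reduction_with_config(func, config)
    # Otherwise, fall through to the pre-existing scheduling logic.
    ...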