[TOPI] Support int4/int8 conv2d tensor core with HWNC layout #6121
Conversation
Thanks for the contribution. I was wondering if this has been integrated with Relay so that we can use it for quantized networks.
Thanks for your question. The Relay support is a collaboration with the Berkeley team and lives in a different repo. We are going to send a separate PR for that.
@Shawn-Inspur and @Hzfengsy Could you guys review the PR?
Hi @Laurawly, great work! I am wondering what the performance numbers are for different batch sizes.
Hi Shawn, we have performance numbers for batch sizes 8 and 16.
@Shawn-Inspur @Laurawly Can you please review when you get time? This will unblock us to connect it to Relay and then all the way to QNN. Thanks!
python/tvm/relay/op/strategy/cuda.py
Outdated
elif layout == "HWNC":
    assert kernel_layout in ["HWOI", "HWOI16o16i", "HWOI8o32i", "HWOI32o16i"]
    _, _, N, in_channels = get_const_tuple(data.shape)
    pre_computed = len(kernel.shape) == 6
    if pre_computed:
        _, _, oc_chunk, _, oc_block_factor, _ = get_const_tuple(kernel.shape)
        out_channels = oc_chunk * oc_block_factor
    else:
        _, _, out_channels, _ = get_const_tuple(kernel.shape)
    if topi.cuda.is_shape_tensorcore_direct_qualified(batch=N, in_channels=in_channels, num_filter=out_channels, in_dtype=data.dtype):
        strategy.add_implementation(
            wrap_compute_conv2d(topi.cuda.conv2d_hwnc_tensorcore),
            wrap_topi_schedule(topi.cuda.schedule_conv2d_hwnc_tensorcore),
            name="conv2d_hwnc_tensorcore_direct.cuda",
            plevel=20)
Because there is only a tensor core implementation for the HWNC layout, how do we handle the cases where the shapes are not supported by tensor core?
So far the HWNC layout doesn't support non-tensor-core versions.
@GaryYuyjl Let's add a message clarifying this when layout is HWNC and the shape doesn't satisfy tensor core schedule.
LGTM
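Following that suggestion, here is a minimal sketch (not the exact code that was merged) of how the strategy branch could surface the limitation with an explicit message instead of silently registering no implementation; it reuses the topi.cuda.is_shape_tensorcore_direct_qualified helper already called in the diff above:

from tvm import topi

def check_hwnc_workload(batch, in_channels, out_channels, in_dtype):
    """Sketch: raise a clear error when an HWNC conv2d cannot use tensor cores."""
    if not topi.cuda.is_shape_tensorcore_direct_qualified(
            batch=batch, in_channels=in_channels,
            num_filter=out_channels, in_dtype=in_dtype):
        raise RuntimeError(
            "conv2d with HWNC layout is only implemented for tensor cores; the "
            f"workload (batch={batch}, in_channels={in_channels}, "
            f"out_channels={out_channels}, dtype={in_dtype}) does not satisfy "
            "the tensor core shape requirements."
        )
    # Otherwise, register conv2d_hwnc_tensorcore exactly as in the diff above.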
assert (batch % 16 == 0 and in_channels % 16 == 0 and num_filter % 16 == 0) or \
    (batch % 8 == 0 and in_channels % 16 == 0 and num_filter % 32 == 0) or \
    (batch % 32 == 0 and in_channels % 16 == 0 and num_filter % 8 == 0), \
As indicated by lines 72-74, there is only one shape that can be supported in the non-int4 case. However, the assertion here includes three shapes for m, n, k, which is confusing.
Fixed.
Sorry for the late reply. I was traveling last week.
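For readers following this thread: the assertion above maps a wmma fragment shape (m, n, k) onto (batch, num_filter, in_channels). A tiny hedged helper making that mapping explicit (which of these fragment shapes the HWNC schedule ultimately keeps is exactly what the comment above is about):

# Hypothetical helper: express the shape constraints in terms of a wmma
# fragment (m, n, k), where m maps to batch, n to num_filter, and k to
# in_channels for this conv2d.
def matches_fragment(batch, in_channels, num_filter, fragment):
    m, n, k = fragment
    return batch % m == 0 and num_filter % n == 0 and in_channels % k == 0

# The three shapes in the assertion above correspond to these fragments:
INT8_FRAGMENTS = [(16, 16, 16), (8, 32, 16), (32, 8, 16)]

def any_fragment_matches(batch, in_channels, num_filter):
    return any(matches_fragment(batch, in_channels, num_filter, f)
               for f in INT8_FRAGMENTS)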
assert data_layout == "HWNC" and kernel_layout == "HWOI"

H, W, N, CI = get_const_tuple(data.shape)
KH, KW, CO, _ = get_const_tuple(kernel.shape)

if kernel.dtype in ['int4', 'uint4'] and (CI % 32 != 0 or CO % 8 != 0) or \
        kernel.dtype in ['int8', 'uint8'] and (CI % 16 != 0 or CO % 32 != 0):
    return relay.nn.conv2d(*inputs, **new_attrs)
Because wmma is supported by sm_75 or higher, it is better to have a check on the GPU compute capability.
Fixed
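The architecture check suggested above could look roughly like the following; tvm.contrib.nvcc.parse_compute_version is an existing TVM helper, and treating sm_75 as the minimum for this int4/int8 wmma path follows the reviewer's comment. Obtaining the compute version from the current device or target is left to the caller, since that API differs across TVM versions:

from tvm.contrib import nvcc

def hwnc_tensorcore_supported(compute_version):
    """Sketch: gate the HWNC tensor core legalization on GPU architecture.

    `compute_version` is a string such as "7.5". Per the review comment,
    the int4/int8 wmma path used here needs Turing (sm_75) or newer.
    """
    major, minor = nvcc.parse_compute_version(compute_version)
    return (major, minor) >= (7, 5)

# In the legalization, one would fall back to plain relay.nn.conv2d when this
# returns False, mirroring the channel-divisibility fallback shown above.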
    return unpacked_out

def conv2d_hwnc_tensorcore(data, kernel, strides, padding, dilation, in_dtype, out_dtype='int32'):
    """Compute conv2d internally using conv2d_nchwc layout for int8 dtype"""
It seems this docstring does not describe this kernel.
"""Compute conv2d internally using conv2d_nchwc layout for int8 dtype""" | ||
assert data.dtype in ('int4', 'uint4', 'int8', 'uint8') | ||
assert kernel.dtype in ('int4', 'uint4', 'int8', 'uint8') | ||
# assert data.dtype == kernel.dtype |
Do we need to assert that the data type equals the kernel type here?
    wmma_m,
    wmma_k)

# Kernel: (H, W, OC, IC, ic, oc)
Should the kernel layout be (H, W, OC, IC, oc, ic)?
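For illustration, a hedged sketch (plain NumPy, not the TVM compute in this PR) of how an HWOI kernel might be packed into the HWOI32o16i blocked layout referenced in the strategy assertion; whether the last two axes end up ordered (oc, ic) or (ic, oc) is exactly the question raised above:

import numpy as np

# Hypothetical packing of an int8 HWOI kernel into HWOI32o16i blocks:
# (KH, KW, CO, CI) -> (KH, KW, CO // 32, CI // 16, 32, 16)
kh, kw, co, ci = 3, 3, 64, 64
kernel_hwoi = np.random.randint(-128, 128, size=(kh, kw, co, ci), dtype="int8")

packed = (
    kernel_hwoi
    .reshape(kh, kw, co // 32, 32, ci // 16, 16)  # split OC and IC into blocks
    .transpose(0, 1, 2, 4, 3, 5)                  # (KH, KW, OC_chunk, IC_chunk, oc, ic)
)
assert packed.shape == (kh, kw, co // 32, ci // 16, 32, 16)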
@Shawn-Inspur Could you review the updated changes? Thanks!
LGTM
Thanks @GaryYuyjl @Shawn-Inspur @Hzfengsy!
…6121)

* int4 tensorcore
* a draft for new int4 schedule
* update layout
* add inline option
* clean code
* increase search space
* fix kernel shape
* update intrinsic
* update intrinsic
* support int4/int8 hwnc layout
* remove useless code
* remove useless code
* remove useless code
* remove useless code
* fix int8 transpose
* fix assert
* add asf header
* CI
* CI
* CI
* fix bug fix bug

Co-authored-by: Leyuan Wang <laurawly@gmail.com>
This PR supports int4/int8 conv2d tensor core with the HWNC layout.
The layer-wise results of HWNC and NHWC for batch size 8 are shown here (workloads are from ResNet-18); the HWNC layout runs faster on most of the workloads.
The experiments were done on an AWS G4 instance with an Nvidia T4 GPU.
CC @Laurawly
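For reference, a rough sketch of how the new TOPI entry points might be exercised directly; the shapes are illustrative assumptions (not the benchmark configuration behind the numbers above), and the build uses an untuned fallback config, so in practice this path is reached through the Relay strategy and tuned with AutoTVM:

import tvm
from tvm import te, topi

# Illustrative HWNC int8 workload: 56x56 feature map, batch 8, 64 -> 64 channels.
data = te.placeholder((56, 56, 8, 64), dtype="int8", name="data")     # H, W, N, C
kernel = te.placeholder((3, 3, 64, 64), dtype="int8", name="kernel")  # H, W, O, I

out = topi.cuda.conv2d_hwnc_tensorcore(
    data, kernel, strides=1, padding=1, dilation=1,
    in_dtype="int8", out_dtype="int32")
sched = topi.cuda.schedule_conv2d_hwnc_tensorcore(out)

# Building with an untuned (fallback) config; real benchmarking would tune
# the schedule with AutoTVM first.
func = tvm.build(sched, [data, kernel, out], target="cuda")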