[TOPI] Support int4/int8 conv2d tensor core with HWNC layout #6121

Merged — 23 commits merged into apache:master on Aug 18, 2020

Conversation

@GaryYuyjl (Contributor) commented on Jul 23, 2020

This PR supports int4/int8 conv2d with tensor cores using the HWNC layout.
The layer-wise results for HWNC and NHWC with batch size 8 are below (workloads are from ResNet-18). The HWNC layout runs faster on most of the workloads.

The experiments were run on an AWS G4 instance with an NVIDIA T4 GPU.

| Workload (batch_size, in_channels, in_size, out_channels, kernel_size, stride, padding) | NHWC int4 time (ms) | HWNC int4 time (ms) |
| --- | --- | --- |
| (8, 64, 56, 64, 3, 1, 1) | 0.23432 | 0.1723 |
| (8, 64, 56, 128, 3, 2, 1) | 0.1386 | 0.10278 |
| (8, 64, 56, 64, 1, 2, 0) | 0.04395 | 0.0333 |
| (8, 128, 28, 128, 3, 1, 1) | 0.1715 | 0.15088 |
| (8, 128, 28, 256, 3, 2, 1) | 0.12039 | 0.11548 |
| (8, 128, 28, 256, 1, 2, 0) | 0.04789 | 0.04219 |
| (8, 256, 14, 256, 3, 1, 1) | 0.14469 | 0.05695 |
| (8, 256, 14, 512, 3, 2, 1) | 0.15656 | 0.14456 |
| (8, 256, 14, 512, 1, 2, 0) | 0.04402 | 0.0475 |
| (8, 512, 7, 512, 3, 1, 1) | 0.25156 | 0.147156 |

CC @Laurawly
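For reference, below is a minimal sketch of how one of these workloads could be timed with the new compute and schedule. It is illustrative only: it uses the int8 path (int4 host-side arrays need special handling), the call signature follows the `conv2d_hwnc_tensorcore` definition quoted later in this review, and details such as the random data and the `tvm.gpu(0)` context are assumptions, not code from this PR.

```python
# Illustrative benchmark sketch (not part of the PR): time the
# (8, 64, 56, 64, 3, 1, 1) workload with the HWNC int8 tensor core kernel.
import numpy as np
import tvm
from tvm import te, topi

H = W = 56
N, CI, CO, K = 8, 64, 64, 3          # batch, in_channels, out_channels, kernel size

data = te.placeholder((H, W, N, CI), dtype="int8", name="data")       # HWNC
kernel = te.placeholder((K, K, CO, CI), dtype="int8", name="kernel")  # HWOI

with tvm.target.cuda():
    out = topi.cuda.conv2d_hwnc_tensorcore(
        data, kernel, strides=1, padding=1, dilation=1,
        in_dtype="int8", out_dtype="int32")
    s = topi.cuda.schedule_conv2d_hwnc_tensorcore([out])
    func = tvm.build(s, [data, kernel, out], "cuda")

ctx = tvm.gpu(0)
a = tvm.nd.array(np.random.randint(-64, 64, size=(H, W, N, CI), dtype="int8"), ctx)
w = tvm.nd.array(np.random.randint(-64, 64, size=(K, K, CO, CI), dtype="int8"), ctx)
c = tvm.nd.array(np.zeros([int(d) for d in out.shape], dtype="int32"), ctx)

timer = func.time_evaluator(func.entry_name, ctx, number=100)
print("time: %.5f ms" % (timer(a, w, c).mean * 1e3))
```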

@anijain2305 (Contributor)

Thanks for the contribution. I was wondering if this has been integrated with Relay so that we can use it for quantized networks.

@Laurawly changed the title from "support int4/int8 conv2d tensor core with HWNC layout" to "[TOPI] Support int4/int8 conv2d tensor core with HWNC layout" on Jul 24, 2020
@Laurawly (Contributor) commented on Jul 25, 2020

> Thanks for the contribution. I was wondering if this has been integrated with Relay so that we can use it for quantized networks.

Thanks for your question. The Relay support is collaborative work with the Berkeley team and lives in a different repo. We are going to send a separate PR for that.

@Laurawly self-assigned this on Jul 25, 2020
@Laurawly (Contributor)

@Shawn-Inspur and @Hzfengsy Could you guys review the PR?

@Shawn-IEITSystems (Contributor)

Hi @Laurawly, great work! I am wondering what the performance numbers are for different batch sizes.

@Laurawly (Contributor)

> Hi @Laurawly, great work! I am wondering what the performance numbers are for different batch sizes.

Hi Shawn, we have performance numbers for batch_size 8 and 16.

@anijain2305 (Contributor)

@Shawn-Inspur @Laurawly Can you please review when you get time? This will unblock us to connect it to Relay and all the way up to QNN. Thanks!

Comment on lines 175 to 189
    elif layout == "HWNC":
        assert kernel_layout in ["HWOI", "HWOI16o16i", "HWOI8o32i", "HWOI32o16i"]
        _, _, N, in_channels = get_const_tuple(data.shape)
        pre_computed = len(kernel.shape) == 6
        if pre_computed:
            _, _, oc_chunk, _, oc_block_factor, _ = get_const_tuple(kernel.shape)
            out_channels = oc_chunk * oc_block_factor
        else:
            _, _, out_channels, _ = get_const_tuple(kernel.shape)
        if topi.cuda.is_shape_tensorcore_direct_qualified(batch=N, in_channels=in_channels, num_filter=out_channels, in_dtype=data.dtype):
            strategy.add_implementation(
                wrap_compute_conv2d(topi.cuda.conv2d_hwnc_tensorcore),
                wrap_topi_schedule(topi.cuda.schedule_conv2d_hwnc_tensorcore),
                name="conv2d_hwnc_tensorcore_direct.cuda",
                plevel=20)
@Shawn-IEITSystems (Contributor) commented on Aug 8, 2020

Because there is only a tensor core implementation for the HWNC layout, how do we handle the cases where the shapes do not satisfy the tensor core requirements?

@GaryYuyjl (Contributor, Author)

So far the HWNC layout doesn't have a non-tensor-core implementation.

Contributor

@GaryYuyjl Let's add a message clarifying this when layout is HWNC and the shape doesn't satisfy tensor core schedule.
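A possible shape for that message, as an illustrative sketch continuing the snippet quoted above (not necessarily the wording that was merged):

```python
# Hypothetical sketch: inside the HWNC branch above, fail loudly when the
# shape does not qualify for the tensor core schedule, since no
# non-tensor-core implementation exists for this layout.
if topi.cuda.is_shape_tensorcore_direct_qualified(
        batch=N, in_channels=in_channels,
        num_filter=out_channels, in_dtype=data.dtype):
    strategy.add_implementation(
        wrap_compute_conv2d(topi.cuda.conv2d_hwnc_tensorcore),
        wrap_topi_schedule(topi.cuda.schedule_conv2d_hwnc_tensorcore),
        name="conv2d_hwnc_tensorcore_direct.cuda",
        plevel=20)
else:
    raise RuntimeError(
        "Unsupported shape for conv2d HWNC: only the tensor core "
        "implementation is available for this layout, and batch, "
        "in_channels and out_channels must satisfy its shape constraints.")
```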

@Hzfengsy (Member) left a comment

LGTM

Comment on lines 87 to 89
    assert (batch % 16 == 0 and in_channels % 16 == 0 and num_filter % 16 == 0) or \
        (batch % 8 == 0 and in_channels % 16 == 0 and num_filter % 32 == 0) or \
        (batch % 32 == 0 and in_channels % 16 == 0 and num_filter % 8 == 0), \
Contributor

As indicated by lines 72-74, there is only one shape that can be supported in the non-int4 case. However, the assertion here includes three (m, n, k) shape combinations, which is confusing.

@GaryYuyjl (Contributor, Author)

Fixed. Sorry for the late reply; I was traveling last week.

Comment on lines 175 to 182
    assert data_layout == "HWNC" and kernel_layout == "HWOI"

    H, W, N, CI = get_const_tuple(data.shape)
    KH, KW, CO, _ = get_const_tuple(kernel.shape)

    if kernel.dtype in ['int4', 'uint4'] and (CI % 32 != 0 or CO % 8 != 0) or \
            kernel.dtype in ['int8', 'uint8'] and (CI % 16 != 0 or CO % 32 != 0):
        return relay.nn.conv2d(*inputs, **new_attrs)
@Shawn-IEITSystems (Contributor) commented on Aug 8, 2020

Because wmma is only supported on sm_75 or higher, it would be better to add a check on the GPU compute capability.

@GaryYuyjl (Contributor, Author)

Fixed
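For context, one way such a guard could look, as an illustrative sketch (the actual check merged in the PR may differ, and the helper name is an assumption):

```python
# Hypothetical sketch: int4/int8 wmma needs compute capability 7.5 or higher
# (e.g. the T4 used for the numbers above), so callers can fall back to the
# default conv2d path on older GPUs.
import tvm

def _has_int_wmma():
    # compute_version is a "major.minor" string such as "7.5"
    major, minor = (int(v) for v in tvm.gpu(0).compute_version.split("."))
    return (major, minor) >= (7, 5)
```

Such a check would sit next to the channel-divisibility test quoted above, so that the alter-op-layout pass keeps the plain `relay.nn.conv2d` on GPUs that predate sm_75.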


    return unpacked_out

def conv2d_hwnc_tensorcore(data, kernel, strides, padding, dilation, in_dtype, out_dtype='int32'):
    """Compute conv2d internally using conv2d_nchwc layout for int8 dtype"""
Contributor

It seems this sentence is not describing this kernel.
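One possible rewording, purely illustrative (not the exact docstring that was merged):

```python
def conv2d_hwnc_tensorcore(data, kernel, strides, padding, dilation,
                           in_dtype, out_dtype='int32'):
    """Compute conv2d with HWNC layout using tensor core intrinsics
    for int4/int8 inputs."""
```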

"""Compute conv2d internally using conv2d_nchwc layout for int8 dtype"""
assert data.dtype in ('int4', 'uint4', 'int8', 'uint8')
assert kernel.dtype in ('int4', 'uint4', 'int8', 'uint8')
# assert data.dtype == kernel.dtype
Contributor

Do we need to assert that the data type equals the kernel type here?

wmma_m,
wmma_k)

# Kernel: (H, W, OC, IC, ic, oc)
Contributor

Is the kernel actually (H, W, OC, IC, oc, ic)?

@Laurawly (Contributor)

@Shawn-Inspur Could you review the updated changes? Thanks!

@Shawn-IEITSystems (Contributor) left a comment

LGTM

@Laurawly (Contributor)

Thanks @GaryYuyjl @Shawn-Inspur @Hzfengsy!

@Laurawly merged commit 9cc15a4 into apache:master on Aug 18, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
…6121)

* int4 tensorcore

* a draft for new int4 schedule

* update layout

* add inline option

* clean code

* increase search space

* fix kernel shape

* update intrinsic

* update intrinsic

* support int4/int8 hwnc layout

* remove useless code

* remove useless code

* remove useless code

* remove useless code

* fix int8 transpose

* fix assert

* add asf header

* CI

* CI

* CI

* fix bug

fix bug

Co-authored-by: Leyuan Wang <laurawly@gmail.com>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Sep 2, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Sep 3, 2020
5 participants