[TOPI] Support int4/int8 conv2d tensor core with HWNC layout #6121
Conversation
Thanks for the contribution. I was wondering if this has been integrated with Relay so that we can use it for quantized networks.
Thanks for your question. The Relay support is a collaboration with the Berkeley team and lives in a different repo. We are going to send a separate PR for that.
@Shawn-Inspur and @Hzfengsy Could you guys review the PR?
Hi @Laurawly, great work! I am wondering what the performance numbers are for different batch sizes.
Hi Shawn, we have performance numbers for batch sizes 8 and 16.
@Shawn-Inspur @Laurawly Can you please review when you get time? This will unblock us to connect it to Relay and then all the way to QNN. Thanks!
python/tvm/relay/op/strategy/cuda.py
Outdated
elif layout == "HWNC":
    assert kernel_layout in ["HWOI", "HWOI16o16i", "HWOI8o32i", "HWOI32o16i"]
    _, _, N, in_channels = get_const_tuple(data.shape)
    pre_computed = len(kernel.shape) == 6
    if pre_computed:
        _, _, oc_chunk, _, oc_block_factor, _ = get_const_tuple(kernel.shape)
        out_channels = oc_chunk * oc_block_factor
    else:
        _, _, out_channels, _ = get_const_tuple(kernel.shape)
    if topi.cuda.is_shape_tensorcore_direct_qualified(batch=N, in_channels=in_channels, num_filter=out_channels, in_dtype=data.dtype):
        strategy.add_implementation(
            wrap_compute_conv2d(topi.cuda.conv2d_hwnc_tensorcore),
            wrap_topi_schedule(topi.cuda.schedule_conv2d_hwnc_tensorcore),
            name="conv2d_hwnc_tensorcore_direct.cuda",
            plevel=20)
Because there is only a tensor core implementation for the HWNC layout, how do we handle the cases where the shapes are not supported by tensor core?
So far the HWNC layout doesn't support non-tensor-core versions.
@GaryYuyjl Let's add a message clarifying this when layout is HWNC and the shape doesn't satisfy tensor core schedule.
LGTM
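Following that suggestion, here is a minimal sketch (not the exact code that was merged) of how the strategy branch could surface the limitation with an explicit message instead of silently registering no implementation; it reuses the topi.cuda.is_shape_tensorcore_direct_qualified helper already called in the diff above:

from tvm import topi

def check_hwnc_workload(batch, in_channels, out_channels, in_dtype):
    """Sketch: raise a clear error when an HWNC conv2d cannot use tensor cores."""
    if not topi.cuda.is_shape_tensorcore_direct_qualified(
            batch=batch, in_channels=in_channels,
            num_filter=out_channels, in_dtype=in_dtype):
        raise RuntimeError(
            "conv2d with HWNC layout is only implemented for tensor cores; the "
            f"workload (batch={batch}, in_channels={in_channels}, "
            f"out_channels={out_channels}, dtype={in_dtype}) does not satisfy "
            "the tensor core shape requirements."
        )
    # Otherwise, register conv2d_hwnc_tensorcore exactly as in the diff above.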
assert (batch % 16 == 0 and in_channels % 16 == 0 and num_filter % 16 == 0) or \
    (batch % 8 == 0 and in_channels % 16 == 0 and num_filter % 32 == 0) or \
    (batch % 32 == 0 and in_channels % 16 == 0 and num_filter % 8 == 0), \
As indicated by lines 72-74, there is only one shape that can be supported in the non-int4 case. However, the assertion here includes three shapes for m, n, k, which is confusing.
Fixed.
Sorry for the late reply. I was traveling last week.
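For readers following this thread: the assertion above maps a wmma fragment shape (m, n, k) onto (batch, num_filter, in_channels). A tiny hedged helper making that mapping explicit (which of these fragment shapes the HWNC schedule ultimately keeps is exactly what the comment above is about):

# Hypothetical helper: express the shape constraints in terms of a wmma
# fragment (m, n, k), where m maps to batch, n to num_filter, and k to
# in_channels for this conv2d.
def matches_fragment(batch, in_channels, num_filter, fragment):
    m, n, k = fragment
    return batch % m == 0 and num_filter % n == 0 and in_channels % k == 0

# The three shapes in the assertion above correspond to these fragments:
INT8_FRAGMENTS = [(16, 16, 16), (8, 32, 16), (32, 8, 16)]

def any_fragment_matches(batch, in_channels, num_filter):
    return any(matches_fragment(batch, in_channels, num_filter, f)
               for f in INT8_FRAGMENTS)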
assert data_layout == "HWNC" and kernel_layout == "HWOI"

H, W, N, CI = get_const_tuple(data.shape)
KH, KW, CO, _ = get_const_tuple(kernel.shape)

if kernel.dtype in ['int4', 'uint4'] and (CI % 32 != 0 or CO % 8 != 0) or \
        kernel.dtype in ['int8', 'uint8'] and (CI % 16 != 0 or CO % 32 != 0):
    return relay.nn.conv2d(*inputs, **new_attrs)
Because wmma is supported by sm_75 or higher, it is better to have a check on the GPU compute capability.
Fixed
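The architecture check suggested above could look roughly like the following; tvm.contrib.nvcc.parse_compute_version is an existing TVM helper, and treating sm_75 as the minimum for this int4/int8 wmma path follows the reviewer's comment. Obtaining the compute version from the current device or target is left to the caller, since that API differs across TVM versions:

from tvm.contrib import nvcc

def hwnc_tensorcore_supported(compute_version):
    """Sketch: gate the HWNC tensor core legalization on GPU architecture.

    `compute_version` is a string such as "7.5". Per the review comment,
    the int4/int8 wmma path used here needs Turing (sm_75) or newer.
    """
    major, minor = nvcc.parse_compute_version(compute_version)
    return (major, minor) >= (7, 5)

# In the legalization, one would fall back to plain relay.nn.conv2d when this
# returns False, mirroring the channel-divisibility fallback shown above.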
    return unpacked_out

def conv2d_hwnc_tensorcore(data, kernel, strides, padding, dilation, in_dtype, out_dtype='int32'):
    """Compute conv2d internally using conv2d_nchwc layout for int8 dtype"""
It seems this docstring does not describe this kernel.
"""Compute conv2d internally using conv2d_nchwc layout for int8 dtype""" | ||
assert data.dtype in ('int4', 'uint4', 'int8', 'uint8') | ||
assert kernel.dtype in ('int4', 'uint4', 'int8', 'uint8') | ||
# assert data.dtype == kernel.dtype |
Do we need to assert that the data type equals the kernel type here?
    wmma_m,
    wmma_k)

# Kernel: (H, W, OC, IC, ic, oc)
Should the kernel layout be (H, W, OC, IC, oc, ic)?
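For illustration, a hedged sketch (plain NumPy, not the TVM compute in this PR) of how an HWOI kernel might be packed into the HWOI32o16i blocked layout referenced in the strategy assertion; whether the last two axes end up ordered (oc, ic) or (ic, oc) is exactly the question raised above:

import numpy as np

# Hypothetical packing of an int8 HWOI kernel into HWOI32o16i blocks:
# (KH, KW, CO, CI) -> (KH, KW, CO // 32, CI // 16, 32, 16)
kh, kw, co, ci = 3, 3, 64, 64
kernel_hwoi = np.random.randint(-128, 128, size=(kh, kw, co, ci), dtype="int8")

packed = (
    kernel_hwoi
    .reshape(kh, kw, co // 32, 32, ci // 16, 16)  # split OC and IC into blocks
    .transpose(0, 1, 2, 4, 3, 5)                  # (KH, KW, OC_chunk, IC_chunk, oc, ic)
)
assert packed.shape == (kh, kw, co // 32, ci // 16, 32, 16)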
@Shawn-Inspur Could you review the updated changes? Thanks!
LGTM
Thanks @GaryYuyjl @Shawn-Inspur @Hzfengsy!
…6121)

* int4 tensorcore
* a draft for new int4 schedule
* update layout
* add inline option
* clean code
* increase search space
* fix kernel shape
* update intrinsic
* update intrinsic
* support int4/int8 hwnc layout
* remove useless code
* remove useless code
* remove useless code
* remove useless code
* fix int8 transpose
* fix assert
* add asf header
* CI
* CI
* CI
* fix bug fix bug

Co-authored-by: Leyuan Wang <laurawly@gmail.com>
This PR supports int4/int8 conv2d tensor core with the HWNC layout.
The layer-wise results of HWNC and NHWC for batch size 8 are shown here (workloads are from ResNet-18); the HWNC layout runs faster on most of the workloads.
The experiments were done on an AWS G4 instance with an Nvidia T4 GPU.
CC @Laurawly
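For reference, a rough sketch of how the new TOPI entry points might be exercised directly; the shapes are illustrative assumptions (not the benchmark configuration behind the numbers above), and the build uses an untuned fallback config, so in practice this path is reached through the Relay strategy and tuned with AutoTVM:

import tvm
from tvm import te, topi

# Illustrative HWNC int8 workload: 56x56 feature map, batch 8, 64 -> 64 channels.
data = te.placeholder((56, 56, 8, 64), dtype="int8", name="data")     # H, W, N, C
kernel = te.placeholder((3, 3, 64, 64), dtype="int8", name="kernel")  # H, W, O, I

out = topi.cuda.conv2d_hwnc_tensorcore(
    data, kernel, strides=1, padding=1, dilation=1,
    in_dtype="int8", out_dtype="int32")
sched = topi.cuda.schedule_conv2d_hwnc_tensorcore(out)

# Building with an untuned (fallback) config; real benchmarking would tune
# the schedule with AutoTVM first.
func = tvm.build(sched, [data, kernel, out], target="cuda")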