[CUDA][Schedule] Better Layout Transform Schedules #14167
Conversation
This is now ready for review. cc @tkonolige @masahi
# For each schedule we also want to inline each stage as would be done in normal circumstances
# to prevent extraneous memory access.
block = auto_inline(sch, block)
Is this for the consumer blocks other than layout_transform itself? Will the AutoInline meta schedule rule be applied automatically without calling this? cc @zxybazh
It does not appear the AutoInline meta schedule rule gets applied automatically, so I manually did it. It does seem to apply to fusion before the layout transform, if that makes sense.
That is, if you have
a = x + y * z
b = layout_transform(a)
c = b * b + b
then a's operations will be fused into b, but c will not be fused into b.
Upon closer examination, it appears some PostProcs (RewriteCooperativeFetching) expect thread bindings to be the last terms in the trace (otherwise it may fetch a loop which does not exist in the final schedule), so I am unsure of the best thing to do here. I would expect auto-inlining to be automatic though.
For now I have relaxed RewriteCooperativeFetching's behavior and will add more tests to make sure fusion is working as intended.
It appears that when generating the design space, new blocks added to the schedule do not have rules applied to them. I am not sure how to handle this.
Have you tried using …?
I really like all the comments you have put in. Everything looks good to me except that you are missing tests. Could you add some?
Ah yes, this might be exactly what I am looking for, thanks for bringing it to my attention. Edit: Actually, I think I fundamentally want to deal with a flattened shared buffer to make the analysis easier. Ideally it would be cool to have a rule which automatically annotates the best offsets for a shared memory buffer, though I would need to think about this. Regardless, I think I'm going to push this work down to a later PR.
This is now ready for re-review. There is a lint error because our version of black cannot parse one of the files touched here (see the note below about moving to 22.12.0). The main changes were handling implicit reshapes in the layout transform (e.g. NCHW --> NCHW4c) and adding tests. Tests are composed of some manual cases plus some autogenerated cases. The autogenerated cases also try to fuse compatible ops into the layout transform task and mainly check for correctness. Offline I tested ~8000 autogenerated cases for correctness of the schedule; normal runs test ~9 autogenerated cases, which takes about a minute on my computer.
In general, this looks good to me.
Will depend on #14346 before merge. Will keep it open a few more days for additional comments.
Goes to the latest revision in the same major version (22.12.0). 23.1.0 is released, but it involves some style changes, so we would need to reformat the entire codebase. I need 22.12.0 to properly process some files found in this PR (#14167), where black cannot parse the file in the current version but can in the updated version.
CUDA: Improved Layout Transform Schedule
Summary
This PR does the following
Note: All numbers are on an RTX 3070 unless stated otherwise.
Motivation
The default implementation of layout transform has poor performance in some scenarios. A motivating factor behind all of this was layout transform operators taking up significant time in some stable diffusion models I tried.
NHWC layout is required for tensor-core use, so we had to convert the layout of the conv operators. This introduced layout transform operators, and their fused and unfused versions were extremely slow.
I ended up improving latency by at least 2x for all these operators in this PR:
Now ncu states we achieve 90%+ memory throughput on these operators 😎 so we are close to the theoretical limit.
Algorithm
Currently, layout transform relies on the AutoBind schedule rule, which seems to guarantee contiguous writes to the output buffer when assigning thread and block indices to loops. However, in the case of layout transforms where the inner dimensions of src_layout and dst_layout do not match, it is impossible to have contiguous writes and contiguous reads in the same operation. We ignore the case where the inner dimension matches (though I believe the new schedule might be faster in some scenarios).
Simple Case: Transpose
The way around this is to use a small amount of shared memory and tile loads so that reads from global memory -> shared memory can be coalesced. We carefully load elements so that writes from shared -> global memory also have coalesced access. An example is taking the transpose of a [2048, 1024] matrix -> [1024, 2048]. We can read 32 x 32 tiles of the src matrix: rows of our shared memory tile correspond to coalesced accesses of the src matrix, while columns of our shared memory tile correspond to coalesced accesses of the dst matrix. By doing this, we can maximize the memory throughput of our operations.
(src: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/)
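As a rough host-side illustration of the access pattern (plain NumPy rather than the generated TIR or CUDA kernel; shapes follow the example above), each 32 x 32 tile is read along rows of the source and written along rows of the destination, with the transpose happening inside the staged tile:

```python
import numpy as np

# Minimal sketch of the tiling idea, not the actual kernel or TIR schedule.
# Rows of `src` are read contiguously and rows of `dst` are written contiguously;
# the staged `tile` plays the role of the shared memory buffer.
TILE = 32
src = np.random.rand(2048, 1024).astype("float32")
dst = np.empty((1024, 2048), dtype="float32")

for i0 in range(0, src.shape[0], TILE):
    for j0 in range(0, src.shape[1], TILE):
        tile = src[i0:i0 + TILE, j0:j0 + TILE]    # coalesced reads from src rows
        dst[j0:j0 + TILE, i0:i0 + TILE] = tile.T  # coalesced writes to dst rows

assert np.array_equal(dst, src.T)
```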
General Case
While a simple transpose is easy to read about, how do we guarantee this behavior for general layout transforms where dimensions can be arbitrarily rearranged?
We can have a similar analogue for the general case. Consider the 32 x 32 tile in the previous example. Now we simply want the inner dimension of our tile to correspond to the inner dimensions of src_layout and the outer dimension to correspond to the inner dimensions of dst_layout. Then, in the best case, both reads and writes to global memory from/to the tile will be coalesced.
We factor and split our loops to obtain this desired structure for reads, and then use the loop analysis to guarantee compatible writes automatically.
An example is probably best to show this idea:
Let's say we want a layout transform of ABCD --> DCAB, with shape
[1024_a, 2_b, 32_c, 8_d] --> [8_d, 32_c, 1024_a, 2_b]
and tile size 32.
Then we initially have a coalesced-read loop pattern of:
T.grid(1024_a, 2_b, 32_c, 8_d)
To obtain an inner tile of 32, we factor 4 from 32_c and 8 from 8_d:
T.grid(1024_a, 2_b, 8_c1, 1_d1, 4_c2t, 8_d2t)
and fuse the two inner tile loops (4_c2t, 8_d2t) into a single loop of extent 32:
T.grid(1024_a, 2_b, 8_cr, 1_dr, 32_dim1)
To obtain an outer tile of 32, we factor from B then A (as the dst_layout is DCAB) to follow the contiguous write pattern:
T.grid(64_a1, 1_b1, 8_cr, 1_dr, 16_a2t, 2_b2t, 32_dim1)
and then fuse the two outer tile loops (16_a2t, 2_b2t) into a single loop of extent 32:
T.grid(64_ar, 1_br, 8_cr, 1_dr, 32_dim0, 32_dim1)
which allows us to read a tile with our wanted properties.
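As a sanity check on the factoring above (a plain-Python sketch, not the schedule itself; the split extents are taken directly from the worked example), the fused dim0/dim1 loops still enumerate exactly the original [1024, 2, 32, 8] index space:

```python
# Sketch: verify that T.grid(64_ar, 1_br, 8_cr, 1_dr, 32_dim0, 32_dim1), with
# dim0 the fused (16_a2t, 2_b2t) and dim1 the fused (4_c2t, 8_d2t), visits every
# index of the original T.grid(1024_a, 2_b, 32_c, 8_d) exactly once.
visited = set()
for ar in range(64):                 # a split as 64 x 16
    for br in range(1):              # b split as 1 x 2
        for cr in range(8):          # c split as 8 x 4
            for dr in range(1):      # d split as 1 x 8
                for dim0 in range(32):       # outer tile: a2t * 2 + b2t
                    for dim1 in range(32):   # inner tile: c2t * 8 + d2t
                        a = ar * 16 + dim0 // 2
                        b = br * 2 + dim0 % 2
                        c = cr * 4 + dim1 // 8
                        d = dr * 8 + dim1 % 8
                        visited.add((a, b, c, d))
assert len(visited) == 1024 * 2 * 32 * 8
```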
We now have our read and write dimensions tiled to satisfy those constraints as closely as possible. To handle weirder shapes which don’t divide nicely into tile_size, we pad some dimensions until they divide evenly into tile_size.
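A small sketch of that padding rule (the helper name is illustrative only, not a TVM API): each tiled extent is rounded up to the next multiple of tile_size so the split becomes exact.

```python
import math

# Illustrative helper only: round an extent up so it divides evenly by tile_size.
def padded_extent(extent: int, tile_size: int) -> int:
    return math.ceil(extent / tile_size) * tile_size

assert padded_extent(9, 32) == 32       # e.g. the inner dim of a [1209, 9] transpose
assert padded_extent(1209, 32) == 1216  # 38 tiles of 32
```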
Choice of search space
Coalesced global memory transactions happen within a warp, which is limited to 32 threads. An upper limit of 32 for our tile size is therefore appropriate, as it gives the maximum coalescing possible. In practice we try all candidate tile sizes, since some will divide our dimensions more evenly.
Results:
We present two results in this spreadsheet: https://docs.google.com/spreadsheets/d/1O7GV50KlJTZ5G7mt9b9qDK2LKoBrImOZRV5Bd8O19p0/edit#gid=1854847295
One is obtained by manually invoking the schedule rule and comparing against the default implementation across a variety of tile sizes (8, 16, 32, 64) and shapes + layouts:
The other is done similarly but by using the existing autotuning pipeline:
Areas of Future Improvement:
These areas only matter for unusual layouts or shapes that do not divide nicely. For most common transforms, we now hit 90%+ throughput in ncu.
High Shared Memory From Some Tile Sizes
Analysis when using compute_at seems to fail (or I am missing something) if we sample factors for our inner or outer tile dimension from only the same dimensions for both the dim0 and dim1 tilings. This leads to excessive shared memory use in some scenarios, which can cause failures or performance issues. A lot of the time these schedules are still faster than the default ones, but they use more shared memory than needed. These are a very tiny fraction of tested cases, however, and we always try both the new schedule and AutoBind, so it should be ok for now. I am not sure why this happens and it would need more investigation. An example is transposing [1209, 9] of any type with a tile size of 32.
Shared memory — Bank Conflicts:
Shared memory bank conflicts exist and are common for the strategy used. Consider a simple transpose of a [2048, 1024] matrix to [1024, 2048]. Then with a tile size of 32, the shared memory buffer will be of shape [32, 32]. We might read from global memory into rows of shared memory, and then write columns of shared memory to global memory for contiguous access. However, note the columns of this shared memory buffer lie on the same memory bank!
A common solution is to simply pad the innermost dimension of shared memory, e.g. to [32, 33], which makes accesses along columns bank-conflict free.
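A quick way to see the effect (a Python sketch assuming the usual CUDA configuration of 32 banks with 4-byte words): with a row stride of 32, every element of a column maps to the same bank, while a stride of 33 spreads the column across all 32 banks.

```python
# Sketch: bank index of each element in one column of a [32, 32] vs [32, 33]
# shared memory tile, assuming 32 banks of 4-byte words.
BANKS = 32

def bank(row: int, col: int, row_stride: int) -> int:
    return (row * row_stride + col) % BANKS

col = 5
unpadded_banks = {bank(r, col, 32) for r in range(32)}
padded_banks = {bank(r, col, 33) for r in range(32)}
print(len(unpadded_banks))  # 1  -> 32-way bank conflict on the column
print(len(padded_banks))    # 32 -> conflict-free after padding to [32, 33]
```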
This is planned for a future PR via a new scheduling rule, and it is a general problem throughout all CUDA-generated schedules. To give an idea of the impact, a [1024, 2048] transpose went from 14.34us -> 12.35us after this change, on top of the optimized layout transform described in this PR.
Non-aligned inner dimensions:
This is an issue I did not think of when writing this schedule. The schedule is written from the viewpoint of maximizing coalesced memory access in global memory. However, one small detail is that coalesced memory access must be aligned to the size of the transaction. That is, if we have a coalesced access of 64 bytes (e.g. 32 float16's), then each address accessed must be on the same 64-byte line (e.g. only the last 6 bits of the address may differ).
Consider a layout transform where dimensions are prime numbers, e.g. [3, 1024, 1024, 7] -> [7, 1024, 1024, 3]. Then the current strategy will read 7-element-wide chunks at a time. However, most accesses will straddle coalesced memory boundaries, resulting in two coalesced memory requests instead of just one. E.g. let's say coalesced memory must be 8-byte aligned and we are dealing with a one-byte datatype. The first read of 7 elements might be 0x00, 0x01 … 0x06 and the next will be 0x07, 0x08 … 0x0D. For the second access, 0x07 belongs to the first 8-byte line, while 0x08 … 0x0D belong to the second 8-byte line, requiring two memory transactions.
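A small sketch of the transaction count (assuming, as in the example above, an 8-byte coalescing granularity and a 1-byte element type): consecutive 7-element chunks drift off the line boundary, so most of them span two lines.

```python
# Sketch: how many 8-byte lines each 7-element (1 byte/element) chunk touches.
LINE_BYTES = 8  # hypothetical coalescing granularity from the example above

def lines_touched(start_byte: int, nbytes: int) -> int:
    return (start_byte + nbytes - 1) // LINE_BYTES - start_byte // LINE_BYTES + 1

for chunk in range(4):
    start = chunk * 7
    print(hex(start), lines_touched(start, 7))
# 0x0 -> 1 line, 0x7 -> 2 lines, 0xe -> 2 lines, 0x15 -> 2 lines
```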
One possible way to get around this is to treat the array as flattened and make the accesses coalesced on the flat view, though I am not sure about the details; guaranteeing good access for both src and dst would require some thinking, though it might be possible.
E.g. an interesting thing in this case is that if we do the no-op reshape into [3, 1024, 32, 32, 7] and then into [3, 1024, 32, 32 * 7], then [3, 1024, 32, 7, 32], things become obvious. However, trying something like this initially leads to weird calculated bounds in the compute_at step and excessive shared memory usage, as we must also consider the dst_layout.