[CUDA][Schedule] Better Layout Transform Schedules #14167
Conversation
This is now ready for review. cc @tkonolige @masahi
# For each schedule we also want to inline each stage as would be done in normal circumstances
# to prevent extraneous memory access.
block = auto_inline(sch, block)
Is this for the consumer blocks other than layout_transform itself? Will the AutoInline meta schedule rule be applied automatically without calling this? cc @zxybazh
It does not appear the AutoInline meta schedule rule gets applied automatically, so I manually did it. It does seem to apply to fusion before the layout transform, if that makes sense.
That is, if you have
a = x + y * z
b = layout_transform(a)
c = b * b + b
then a's operations will be fused into b, but c will not be fused into b.
Upon closer examination, it appears some PostProcs (RewriteCooperativeFetching) expect thread bindings to be the last terms in the trace (otherwise it may fetch a loop which does not exist in the final schedule), so I am unsure of the best thing to do here. I would expect auto-inlining to be automatic though.
For now I have relaxed RewriteCooperativeFetching's behavior and will add more tests to make sure fusion is working as intended.
It appears that when generating the design space, new blocks added to the schedule do not have rules applied to them. I am not sure how to handle this.
Have you tried using …?
I really like all the comments you have put in. Everything looks good to me except that you are missing tests. Could you add some?
Ah yes, this might be exactly what I am looking for, thanks for bringing it to my attention. Edit: Actually, I think I fundamentally want to deal with a flattened shared buffer to make the analysis easier. Ideally it would be cool to have a rule which automatically annotates the best offsets for a shared memory buffer, though I would need to think about this. Regardless, I think I'm going to push this work down to a later PR.
This is now ready for re-review. There is a lint error because our version of black cannot parse one of the files touched here (see the note below about moving to 22.12.0). The main changes were handling implicit reshapes in the layout transform (e.g. NCHW --> NCHW4c) and adding tests. Tests are composed of some manual cases plus some autogenerated cases. The autogenerated cases also try to fuse compatible ops into the layout transform task and mainly check for correctness. Offline I tested ~8000 autogenerated cases for correctness of the schedule; normal runs test ~9 autogenerated cases, which takes about a minute on my computer.
In general, this looks good to me.
Will depend on #14346 before merge. Will keep it open a few more days for additional comments.
Goes to the latest revision in the same major version (22.12.0). 23.1.0 is released, but it involves some style changes, so we would need to reformat the entire codebase. I need 22.12.0 to properly process some files found in this PR (#14167), where black cannot parse the file in the current version but can in the updated version.
CUDA: Improved Layout Transform Schedule
Summary
This PR does the following
Note: All numbers are on an RTX 3070 unless stated otherwise.
Motivation
The default implementation of layout transform has poor performance in some scenarios. A motivating factor behind all of this was layout transform operators taking up significant time in some stable diffusion models I tried.
NHWC layout is required for tensor-core use, so we had to convert the layout of the conv operators. This introduced layout transform operators, and their fused and unfused versions were extremely slow.
I ended up improving latency by at least 2x for all these operators in this PR:
Now ncu states we achieve 90%+ memory throughput on these operators 😎 so we are close to the theoretical limit.
Algorithm
Currently, layout transform relies on the AutoBind schedule rule, which seems to guarantee contiguous writes to the output buffer when assigning thread and block indices to loops. However, in the case of layout transforms where the inner dimensions of src_layout and dst_layout do not match, it is impossible to have contiguous writes and contiguous reads in the same operation. We ignore the case where the inner dimension matches (though I believe the new schedule might be faster in some scenarios).
Simple Case: Transpose
The way around this is to use a small amount of shared memory and tile loads so that reads from global memory -> shared memory can be coalesced. We carefully load elements so that writes from shared -> global memory also have coalesced access. An example is taking the transpose of a [2048, 1024] matrix -> [1024, 2048]. We can read 32 x 32 tiles of the src matrix: rows of our shared memory tile correspond to coalesced accesses of the src matrix, while columns of our shared memory tile correspond to coalesced accesses of the dst matrix. By doing this, we can maximize the memory throughput of our operations.
(src: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/)
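As a rough host-side illustration of the access pattern (plain NumPy rather than the generated TIR or CUDA kernel; shapes follow the example above), each 32 x 32 tile is read along rows of the source and written along rows of the destination, with the transpose happening inside the staged tile:

```python
import numpy as np

# Minimal sketch of the tiling idea, not the actual kernel or TIR schedule.
# Rows of `src` are read contiguously and rows of `dst` are written contiguously;
# the staged `tile` plays the role of the shared memory buffer.
TILE = 32
src = np.random.rand(2048, 1024).astype("float32")
dst = np.empty((1024, 2048), dtype="float32")

for i0 in range(0, src.shape[0], TILE):
    for j0 in range(0, src.shape[1], TILE):
        tile = src[i0:i0 + TILE, j0:j0 + TILE]    # coalesced reads from src rows
        dst[j0:j0 + TILE, i0:i0 + TILE] = tile.T  # coalesced writes to dst rows

assert np.array_equal(dst, src.T)
```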
General Case
While a simple transpose is easy to read about, how do we guarantee this behavior for general layout transforms where dimensions can be arbitrarily rearranged?
We can have a similar analogue for the general case. Consider the 32 x 32 tile in the previous example. Now we simply want the inner dimension of our tile to correspond to the inner dimensions of src_layout and the outer dimension to correspond to the inner dimensions of dst_layout. Then, in the best case, both reads and writes to global memory from/to the tile will be coalesced.
We factor and split our loops to obtain this desired structure for reads, and then use the loop analysis to guarantee compatible writes automatically.
An example is probably best to show this idea:
Let's say we want a layout transform of ABCD --> DCAB, with shape
[1024_a, 2_b, 32_c, 8_d] --> [8_d, 32_c, 1024_a, 2_b]
and tile size 32.
Then we initially have a coalesced-read loop pattern of:
T.grid(1024_a, 2_b, 32_c, 8_d)
To obtain an inner tile of 32, we factor 4 from 32_c and 8 from 8_d:
T.grid(1024_a, 2_b, 8_c1, 1_d1, 4_c2t, 8_d2t)
and fuse the two inner tile loops (4_c2t, 8_d2t) into a single loop of extent 32:
T.grid(1024_a, 2_b, 8_cr, 1_dr, 32_dim1)
To obtain an outer tile of 32, we factor from B then A (as the dst_layout is DCAB) to follow the contiguous write pattern:
T.grid(64_a1, 1_b1, 8_cr, 1_dr, 16_a2t, 2_b2t, 32_dim1)
and then fuse the two outer tile loops (16_a2t, 2_b2t) into a single loop of extent 32:
T.grid(64_ar, 1_br, 8_cr, 1_dr, 32_dim0, 32_dim1)
which allows us to read a tile with our wanted properties.
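As a sanity check on the factoring above (a plain-Python sketch, not the schedule itself; the split extents are taken directly from the worked example), the fused dim0/dim1 loops still enumerate exactly the original [1024, 2, 32, 8] index space:

```python
# Sketch: verify that T.grid(64_ar, 1_br, 8_cr, 1_dr, 32_dim0, 32_dim1), with
# dim0 the fused (16_a2t, 2_b2t) and dim1 the fused (4_c2t, 8_d2t), visits every
# index of the original T.grid(1024_a, 2_b, 32_c, 8_d) exactly once.
visited = set()
for ar in range(64):                 # a split as 64 x 16
    for br in range(1):              # b split as 1 x 2
        for cr in range(8):          # c split as 8 x 4
            for dr in range(1):      # d split as 1 x 8
                for dim0 in range(32):       # outer tile: a2t * 2 + b2t
                    for dim1 in range(32):   # inner tile: c2t * 8 + d2t
                        a = ar * 16 + dim0 // 2
                        b = br * 2 + dim0 % 2
                        c = cr * 4 + dim1 // 8
                        d = dr * 8 + dim1 % 8
                        visited.add((a, b, c, d))
assert len(visited) == 1024 * 2 * 32 * 8
```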
We now have our read and write dimensions tiled to satisfy those constraints as closely as possible. To handle weirder shapes which don’t divide nicely into tile_size, we pad some dimensions until they divide evenly into tile_size.
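A small sketch of that padding rule (the helper name is illustrative only, not a TVM API): each tiled extent is rounded up to the next multiple of tile_size so the split becomes exact.

```python
import math

# Illustrative helper only: round an extent up so it divides evenly by tile_size.
def padded_extent(extent: int, tile_size: int) -> int:
    return math.ceil(extent / tile_size) * tile_size

assert padded_extent(9, 32) == 32       # e.g. the inner dim of a [1209, 9] transpose
assert padded_extent(1209, 32) == 1216  # 38 tiles of 32
```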
Choice of search space
Coalesced global memory transactions happen within a warp, which is limited to 32 threads. An upper limit of 32 for our tile size is therefore appropriate, as it gives the maximum coalescing possible. In practice we try all candidate tile sizes, since some will divide our dimensions more evenly.
Results:
We present two results in this spreadsheet: https://docs.google.com/spreadsheets/d/1O7GV50KlJTZ5G7mt9b9qDK2LKoBrImOZRV5Bd8O19p0/edit#gid=1854847295
One is obtained by manually invoking the schedule rule and comparing against the default implementation across a variety of tile sizes (8, 16, 32, 64) and shapes + layouts:
The other is done similarly but by using the existing autotuning pipeline:
Areas of Future Improvement:
These areas only matter for unusual layouts or shapes that do not divide nicely. For most common transforms, we now hit 90%+ throughput in ncu.
High Shared Memory From Some Tile Sizes
Analysis when using compute_at seems to fail (or I am missing something) if we sample factors for our inner or outer tile dimension from only the same dimensions for both the dim0 and dim1 tilings. This leads to excessive shared memory use in some scenarios, which can cause failures or performance issues. A lot of the time these schedules are still faster than the default ones, but they use more shared memory than needed. These are a very tiny fraction of tested cases, however, and we always try both the new schedule and AutoBind, so it should be ok for now. I am not sure why this happens and it would need more investigation. An example is transposing [1209, 9] of any type with a tile size of 32.
Shared memory — Bank Conflicts:
Shared memory bank conflicts exist and are common for the strategy used. Consider a simple transpose of a [2048, 1024] matrix to [1024, 2048]. Then with a tile size of 32, the shared memory buffer will be of shape [32, 32]. We might read from global memory into rows of shared memory, and then write columns of shared memory to global memory for contiguous access. However, note the columns of this shared memory buffer lie on the same memory bank!
A common solution is to simply pad the innermost dimension of shared memory, e.g. to [32, 33], which makes accesses along columns bank-conflict free.
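A quick way to see the effect (a Python sketch assuming the usual CUDA configuration of 32 banks with 4-byte words): with a row stride of 32, every element of a column maps to the same bank, while a stride of 33 spreads the column across all 32 banks.

```python
# Sketch: bank index of each element in one column of a [32, 32] vs [32, 33]
# shared memory tile, assuming 32 banks of 4-byte words.
BANKS = 32

def bank(row: int, col: int, row_stride: int) -> int:
    return (row * row_stride + col) % BANKS

col = 5
unpadded_banks = {bank(r, col, 32) for r in range(32)}
padded_banks = {bank(r, col, 33) for r in range(32)}
print(len(unpadded_banks))  # 1  -> 32-way bank conflict on the column
print(len(padded_banks))    # 32 -> conflict-free after padding to [32, 33]
```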
This is planned for a future PR via a new scheduling rule, and it is a general problem throughout all CUDA-generated schedules. To give an idea of the impact, a [1024, 2048] transpose went from 14.34us -> 12.35us after this change, on top of the optimized layout transform described in this PR.
Non-aligned inner dimensions:
This is an issue I did not think of when writing this schedule. The schedule is written from the viewpoint of maximizing coalesced memory access in global memory. However, one small detail is that coalesced memory access must be aligned to the size of the transaction. That is, if we have a coalesced access of 64 bytes (e.g. 32 float16's), then each address accessed must be on the same 64-byte line (e.g. only the last 6 bits of the address may differ).
Consider a layout transform where dimensions are prime numbers, e.g. [3, 1024, 1024, 7] -> [7, 1024, 1024, 3]. Then the current strategy will read 7-element-wide chunks at a time. However, most accesses will straddle coalesced memory boundaries, resulting in two coalesced memory requests instead of just one. E.g. let's say coalesced memory must be 8-byte aligned and we are dealing with a one-byte datatype. The first read of 7 elements might be 0x00, 0x01 … 0x06 and the next will be 0x07, 0x08 … 0x0D. For the second access, 0x07 belongs to the first 8-byte line, while 0x08 … 0x0D belong to the second 8-byte line, requiring two memory transactions.
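A small sketch of the transaction count (assuming, as in the example above, an 8-byte coalescing granularity and a 1-byte element type): consecutive 7-element chunks drift off the line boundary, so most of them span two lines.

```python
# Sketch: how many 8-byte lines each 7-element (1 byte/element) chunk touches.
LINE_BYTES = 8  # hypothetical coalescing granularity from the example above

def lines_touched(start_byte: int, nbytes: int) -> int:
    return (start_byte + nbytes - 1) // LINE_BYTES - start_byte // LINE_BYTES + 1

for chunk in range(4):
    start = chunk * 7
    print(hex(start), lines_touched(start, 7))
# 0x0 -> 1 line, 0x7 -> 2 lines, 0xe -> 2 lines, 0x15 -> 2 lines
```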
One possible way to get around this is to treat the array as flattened and make the accesses coalesced on the flat view, though I am not sure about the details; guaranteeing good access for both src and dst would require some thinking, though it might be possible.
E.g. an interesting thing in this case is that if we do the no-op reshape into [3, 1024, 32, 32, 7] and then into [3, 1024, 32, 32 * 7], then [3, 1024, 32, 7, 32], things become obvious. However, trying something like this initially leads to weird calculated bounds in the compute_at step and excessive shared memory usage, as we must also consider the dst_layout.