
[CUDA][Schedule] Better Layout Transform Schedules #14167

Merged

Conversation

AndrewZhaoLuo
Contributor

@AndrewZhaoLuo commented on Mar 1, 2023

CUDA: Improved Layout Transform Schedule

Summary

This PR does the following:

  • Adds a scheduling rule for the layout transform operator on the CUDA target
  • Forwards this information by adding a "schedule_rule" attribute to the layout transform TOPI
  • Annotates the layout transform TOPI with information needed during scheduling

Note: All numbers are on an RTX 3070 unless stated otherwise.

Motivation

The default implementation of layout transform has poor performance in some scenarios. A motivating factor behind all of this was layout transform operators taking up significant time in some stable diffusion models I tried.

NHWC layout is required for tensor-core use, so we had to convert the layout of the convolution operators. This introduced layout transform operators whose fused and unfused versions were both extremely slow.

I ended up improving latency by at least 2x for all these operators in this PR:

| Shape | src_layout | dst_layout | dtype | default (ms) | new_sch (ms) |
| --- | --- | --- | --- | --- | --- |
| 1, 384, 512, 128 | NHWC | NCHW | float16 | 2.32 | 0.247 |
| 1, 128, 384, 512 | NCHW | NHWC | float16 | 0.521 | 0.25 |
| 1, 576, 384, 256 | NHWC | NCHW | float16 | 6.14 | 0.556 |
| 2, 72, 48, 960 | NHWC | NCHW | float16 | 0.217 | 0.067 |
| 16, 3456, 3456 | ABC | ACB | float16 | 36.84 | 1.88 |

ncu now reports that we achieve 90%+ memory throughput on these operators 😎, so we are close to the theoretical limit.

Algorithm

Currently, layout transform relies on the AutoBind schedule rule, which seems to guarantee contiguous writes to the output buffer when assigning thread and block indices to loops. However, in the case of layout transforms where the inner dimensions of src_layout and dst_layout do not match, it is impossible to have both contiguous writes and contiguous reads in the same operation.

We ignore the case where the inner dimensions match (though I believe the new schedule might be faster in some of those scenarios too).

Simple Case: Transpose

The way around this is to use a small amount of shared memory and tile loads so that reads from global memory -> shared memory can be coalesced. We then carefully order the stores so that writes from shared memory -> global memory are also coalesced. An example is the transpose of a [2048, 1024] matrix -> [1024, 2048]. We can read 32 x 32 tiles of the src matrix: rows of our shared-memory tile correspond to coalesced accesses of the src matrix, while columns of the tile correspond to coalesced accesses of the dst matrix. By doing this, we maximize the memory throughput of our operations.

[Figure: tiled matrix transpose through shared memory]
(src: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/)
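The access pattern can be sketched in plain Python (a CPU simulation for clarity, not the actual TIR schedule; `TILE` and the function name are illustrative):

```python
TILE = 32  # tile side length, matching the warp-sized tiles described above

def tiled_transpose(src, rows, cols):
    """Transpose a flat row-major [rows, cols] buffer via 32x32 tiles.

    Each tile is read row-by-row from src (the coalesced-read direction),
    staged in a "shared memory" buffer, then emitted so that consecutive
    dst addresses come from a column of the tile (the coalesced-write
    direction). Edge tiles are clipped with min().
    """
    dst = [0] * (rows * cols)
    for tr in range(0, rows, TILE):
        for tc in range(0, cols, TILE):
            tile = [[0] * TILE for _ in range(TILE)]
            # Coalesced loads: consecutive elements of a src row.
            for i in range(min(TILE, rows - tr)):
                for j in range(min(TILE, cols - tc)):
                    tile[i][j] = src[(tr + i) * cols + (tc + j)]
            # Coalesced stores: consecutive dst addresses walk down a
            # column of the staged tile.
            for j in range(min(TILE, cols - tc)):
                for i in range(min(TILE, rows - tr)):
                    dst[(tc + j) * rows + (tr + i)] = tile[i][j]
    return dst
```

On the GPU the two inner loop nests become the thread block's cooperative load and store phases; here they just make the two access orders explicit.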

General Case

While a simple transpose is easy to reason about, how do we guarantee this behavior for general layout transforms where dimensions can be arbitrarily rearranged?

We can construct a similar analogue for the general case. Consider the 32 x 32 tile in the previous example. We simply want the inner dimension of our tile to correspond to the inner dimensions of src_layout, and the outer dimension to correspond to the inner dimensions of dst_layout. Then, in the best case, both reads and writes to global memory from/to the tile will be coalesced.

We factor and split our loops to obtain this desired structure for reads, and then use the loop analysis to guarantee compatible writes automatically.

An example is probably best to show this idea:
Let's say we want a layout transform of ABCD --> DCAB. With shape
[1024_a, 2_b, 32_c, 8_d] --> [8_d, 32_c, 1024_a, 2_b]

And tile size 32.

Then we initially have a coalesced-read loop pattern of:
T.grid(1024_a, 2_b, 32_c, 8_d)

To obtain an inner tile of 32, we factor 4 from 32_c and 8 from 8_d, then fuse the factored loops (4_c2t, 8_d2t) into 32_dim1:
T.grid(1024_a, 2_b, 8_c1, 1_d1, 4_c2t, 8_d2t)
T.grid(1024_a, 2_b, 8_cr, 1_dr, 32_dim1)

To obtain an outer tile of 32, we factor 2 from B and then 16 from A (as the dst_layout is DCAB) to follow the contiguous write pattern, then fuse (16_a2t, 2_b2t) into 32_dim0:

T.grid(64_a1, 1_b1, 8_cr, 1_dr, 16_a2t, 2_b2t, 32_dim1)
T.grid(64_ar, 1_br, 8_cr, 1_dr, 32_dim0, 32_dim1)

Which allows us to read a tile with our wanted properties.
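The factoring walk above can be sketched as a small helper (a simplification of what the rule does; the function name and the assumption that trailing extents divide the tile size evenly are mine, not the PR's API):

```python
def split_for_inner_tile(extents, tile):
    """Factor trailing loop extents so the factored-off parts multiply to `tile`.

    Walks from the innermost extent outward, peeling factors until their
    product reaches `tile`. Assumes the trailing extents divide `tile` evenly.
    Returns (outer_extents, tile_factors).
    """
    outers = list(extents)
    factors = []
    remaining = tile
    for i in range(len(extents) - 1, -1, -1):
        if remaining == 1:
            break
        f = min(extents[i], remaining)
        outers[i] = extents[i] // f
        factors.insert(0, f)
        remaining //= f
    return outers, factors

# Read tiling over src order [1024_a, 2_b, 32_c, 8_d]: outers [1024, 2, 8, 1]
# and factors [4, 8], matching T.grid(1024_a, 2_b, 8_c1, 1_d1, 4_c2t, 8_d2t).
print(split_for_inner_tile([1024, 2, 32, 8], 32))

# Write tiling over the dst-trailing dims [1024_a, 2_b] (dst is DCAB):
# outers [64, 1] and factors [16, 2], matching 64_a1, 1_b1, 16_a2t, 2_b2t.
print(split_for_inner_tile([1024, 2], 32))
```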

We now have our read and write dimensions tiled to satisfy these constraints as closely as possible. To handle awkward shapes whose dimensions do not divide evenly into tile_size, we pad the relevant dimensions up to a multiple of tile_size.
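The padding itself is just round-up arithmetic (a sketch; the actual schedule realizes this with padding plus predication over the tail):

```python
def pad_to_tile(extent, tile):
    # Round `extent` up to the next multiple of `tile`; the padded tail
    # is masked out (predicated away) in the generated kernel.
    return -(-extent // tile) * tile

print(pad_to_tile(1209, 32))  # 1216
print(pad_to_tile(9, 32))     # 32
```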

Choice of search space

Coalesced global memory transactions happen within a warp, which is limited to 32 threads. An upper limit of 32 for our tile size is therefore appropriate, as that is the maximum coalescing possible. In practice we try all the tile sizes during tuning, since some tile sizes will divide our dimensions more evenly than others.
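In practice the tuner simply measures every candidate, but the trade-off it explores can be illustrated with a toy cost model (entirely my own construction, not the PR's logic): larger tiles coalesce better, while tiles that don't divide the extents force padding.

```python
def padding_waste(extent, tile):
    # Elements of padding needed to round `extent` up to a multiple of `tile`.
    return -(-extent // tile) * tile - extent

def best_tile_size(src_inner, dst_inner, candidates=(32, 16, 8, 4)):
    # Toy heuristic: minimize total padding across both tiled extents,
    # breaking ties in favor of the larger (better-coalescing) tile.
    return min(
        candidates,
        key=lambda t: (padding_waste(src_inner, t) + padding_waste(dst_inner, t), -t),
    )

print(best_tile_size(1024, 2048))  # 32: everything divides, so max coalescing wins
print(best_tile_size(9, 9))        # 4: a 32-wide tile would waste most of its footprint
```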

Results:

We present two results in this spreadsheet: https://docs.google.com/spreadsheets/d/1O7GV50KlJTZ5G7mt9b9qDK2LKoBrImOZRV5Bd8O19p0/edit#gid=1854847295

One is obtained by manually invoking the schedule rule and comparing against the default implementation across a variety of tile sizes (8, 16, 32, 64) and shapes + layouts:

Speedup:
  • Geometric mean: 3.042796464
  • Min: 0.8902785674
  • Max: 35.51726767
  • Median: 2.202399163

The other is done similarly but by using the existing autotuning pipeline:

Speedup:
  • Geometric mean: 2.567999905
  • Mean: 4.290743877
  • Median: 1.88365594
  • Min: 0.9612520382
  • Max: 42.05803413

Areas of Future Improvement:

These issues only affect unusual layouts or awkward shapes. For most common transforms, we now hit 90%+ throughput in ncu.

High Shared Memory From Some Tile Sizes

The compute_at analysis seems to fail (or I am missing something) if we sample factors for our inner or outer tile dimension from the same dimensions for both dim0 and dim1 tiling. This leads to excessive shared memory use in some scenarios, which can cause failures or performance issues. Often these schedules are still faster than the default, but they use more shared memory than needed. These are a very small fraction of tested cases, however, and we always try both the new schedule and AutoBind, so it should be ok for now.

I am not sure why this happens and it needs more investigation. An example is transposing [1209, 9] of any type with tile size 32.

Shared memory — Bank Conflicts:

Shared memory bank conflicts exist and are common for the strategy used. Consider a simple transpose of a [2048, 1024] matrix to [1024, 2048]. With a tile size of 32, the shared memory buffer will have shape [32, 32]. We read from global memory into rows of shared memory, then write columns of shared memory to global memory for contiguous access. However, note that the elements of a column of this shared memory buffer all lie on the same memory bank!

A common solution is to simply pad the innermost dimension of the shared memory buffer, e.g. to [32, 33], which makes accesses along columns bank-conflict free.
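The conflict and its fix can be checked with bank-index arithmetic (assuming 32 four-byte banks, the usual NVIDIA configuration; the helper name is mine):

```python
BANKS = 32  # shared memory banks on current NVIDIA GPUs (4-byte words)

def banks_hit_by_column(width, col=0):
    # Banks touched when 32 threads each read one element of a fixed
    # column of a [32, width] shared-memory tile of 4-byte words.
    return {(row * width + col) % BANKS for row in range(32)}

print(len(banks_hit_by_column(32)))  # 1: all 32 accesses land on a single bank
print(len(banks_hit_by_column(33)))  # 32: padding spreads the column over every bank
```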

This is planned for a future PR via a new scheduling rule, and it is a general problem throughout all generated CUDA schedules. To give an idea of the impact: a [1024, 2048] transpose went from 14.34 us -> 12.35 us after applying this change on top of the optimized layout transform described in this PR.

Non-aligned inner dimensions:

This is an issue I did not think of when writing this schedule. The schedule is designed to maximize coalesced access to global memory. However, one small detail is that coalesced memory access must be aligned to the size of the transaction. That is, if we have a coalesced access of 64 bytes (e.g. 32 float16s), then each address accessed must be on the same 64-byte line (e.g. only the last 6 bits of the address may differ).

Consider a layout transform whose inner dimensions are prime. E.g. [3, 1024, 1024, 7] -> [7, 1024, 1024, 3]. The current strategy will then read 7-element-wide chunks at a time, and most accesses will straddle coalesced memory boundaries, resulting in two coalesced memory requests instead of one. E.g. suppose coalesced memory must be 8-byte aligned and we are dealing with a one-byte datatype. The first read of 7 elements might be 0x00, 0x01 … 0x06 and the next 0x07, 0x08 … 0x0D. For the second access, 0x07 belongs to the first 8-byte line, while 0x08 … 0x0D belong to the second 8-byte line, requiring two memory transactions.
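Counting transactions for one contiguous access makes the penalty concrete (a sketch under the 8-byte-line assumption from the example; real coalescing granularities are 32 or 128 bytes):

```python
def transactions(start_byte, nbytes, line=8):
    # Memory-line transactions needed for one contiguous `nbytes`-wide
    # access against a `line`-byte coalescing granularity.
    first = start_byte // line
    last = (start_byte + nbytes - 1) // line
    return last - first + 1

print(transactions(0, 7))  # 1: 0x00..0x06 fits inside the first 8-byte line
print(transactions(7, 7))  # 2: 0x07..0x0D straddles two lines
```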

One possible way around this is to treat the array as flattened and issue coalesced accesses over the flat buffer. I am not sure about the details; guaranteeing good access for both src and dst would require some thought, though it might be possible.

An interesting observation for this case: if we do the no-op reshape into [3, 1024, 32, 32, 7], then into [3, 1024, 32, 32 * 7], then [3, 1024, 32, 7, 32], things become obvious. However, trying something like this initially leads to strange calculated bounds in the compute_at step and excessive shared memory usage, since we must also consider the dst_layout.

@tvm-bot
Collaborator

tvm-bot commented Mar 1, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: cuda, schedule. See #10317 for details

Generated by tvm-bot

@AndrewZhaoLuo changed the title from [CUDA][Schedule] Better Layout Transform Schedules to [DRAFT][CUDA][Schedule] Better Layout Transform Schedules on Mar 1, 2023
@AndrewZhaoLuo force-pushed the aluo/cuda-better-layout-transform branch 2 times, most recently from 2004992 to b5701f1 on March 3, 2023 01:21
@AndrewZhaoLuo marked this pull request as ready for review on March 3, 2023 21:09
@AndrewZhaoLuo
Contributor Author

This is now ready for review. cc @tkonolige @masahi

@AndrewZhaoLuo changed the title from [DRAFT][CUDA][Schedule] Better Layout Transform Schedules to [CUDA][Schedule] Better Layout Transform Schedules on Mar 3, 2023
(Resolved review thread on include/tvm/topi/transform.h)

# For each schedule we also want to inline each stage as would be done in normal circumstances
# to prevent extraneous memory access.
block = auto_inline(sch, block)
Member


Is this for the consumer blocks other than layout_transform itself? Will the AutoInline meta schedule rule be applied automatically without calling this? cc @zxybazh

Contributor Author


It does not appear the AutoInline meta schedule rule gets applied automatically, so I manually did it. It does seem to apply fusion before the layout transform, if that makes sense.

That is, if you have:

a = x + y * z
b = layout_transform(a)
c = b * b + b

then a's operations will be fused into b, but c will not be fused into b.

Upon closer examination, it appears some PostProcs (RewriteCooperativeFetching) expect thread bindings to be the last terms in the trace (otherwise they may fetch a loop which does not exist in the final schedule), so I am unsure of the best thing to do here. I would expect auto-inlining to be automatic, though.

For now I have relaxed the RewriteCooperativeFetching behavior and will add more tests to make sure fusion works as intended.

Contributor Author

@AndrewZhaoLuo commented on Mar 14, 2023


Perhaps this happens because blocks newly created while generating the design space do not have rules applied to them. Not sure how to handle this.

@masahi
Member

masahi commented Mar 3, 2023

> Shared memory — Bank Conflicts

Have you tried using the storage_align schedule primitive? It achieves something similar to shmem padding for power-of-two-sized shmem.

Contributor

@tkonolige tkonolige left a comment


I really like all the comments you have put in. Everything looks good to me except that you are missing tests. Could you add some?

@AndrewZhaoLuo
Contributor Author

AndrewZhaoLuo commented Mar 9, 2023

> Shared memory — Bank Conflicts
>
> Have you tried using storage_align sch primitive? It achieves similar things as shmem padding for power of two size shmem.

Ah yes, this might be exactly what I am looking for, thanks for bringing it to my attention.

Edit: Actually I think I fundamentally want to deal with a flattened shared buffer to make the analysis easier.

Ideally it would be cool to have a rule which automatically annotates the best offsets for a shared memory buffer, though I would need to think about this. Regardless, I'm going to push this work to a later PR.

@AndrewZhaoLuo force-pushed the aluo/cuda-better-layout-transform branch from 30eff4a to ea354bd on March 10, 2023 18:08
@AndrewZhaoLuo
Contributor Author

cc @tkonolige @vinx13

This is now ready for re-review.

There is a lint error because our version of black is out of date, which might take a while to fix unfortunately (since we have to update CI).

The main change was handling implicit reshapes in the layout transform (e.g. NCHW --> NCHW4c) and adding tests. Tests are composed of some manual cases plus some autogenerated cases. The autogenerated cases also try to fuse compatible ops into the layout transform task and mainly check for correctness.

I tested ~8000 autogenerated cases offline for correctness of the schedule; normal runs test ~9 autogenerated cases, which takes about a minute on my computer.

Contributor

@tkonolige tkonolige left a comment


In general, this looks good to me.

(Resolved review thread on src/meta_schedule/postproc/rewrite_cooperative_fetch.cc)
@AndrewZhaoLuo
Contributor Author

Will depend on #14346 before merge.

Will keep it open a few more days for additional comments.

junrushao pushed a commit that referenced this pull request on Mar 21, 2023:

Goes to the latest revision in the same major version (22.12.0).

23.1.0 is released, but it involves some style changes, so we would need to reformat the entire codebase.

I need 22.12.0 to properly process some files found in this PR (#14167), where black cannot parse the file in the current version but can in the updated version.
@AndrewZhaoLuo force-pushed the aluo/cuda-better-layout-transform branch 2 times, most recently from 0ed2f24 to 449516d on March 21, 2023 22:25
@AndrewZhaoLuo force-pushed the aluo/cuda-better-layout-transform branch from 73ba40d to bf60774 on March 23, 2023 03:23
@AndrewZhaoLuo merged commit e5ae434 into apache:main on Mar 23, 2023