Some question about GTC2020 cutlass's talk about conflict free load shared memory #1130

MARD1NO · 2023-10-08T11:28:53Z

MARD1NO
Oct 8, 2023

In previous discussion, manishucsd suggest to read cutlass GTC2020's talk: https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/ . and I have some question about load shared memory in cutlass.

When load global memory data in shared memory, each thread load is vectorized. Since tensorcore need a special layout. each thread in warp need other's thread's data.

So use ldmatrix instruction can directly get other's thread's data and avoid 4 times lds32.

thread0 need thread T0,8,16,24's shared memory data to load in registers, to avoid bank conflicts, cutlass has a special shared memory store layout. (Am i right?)

In slides 45 - 48, it discuss about LOADING FROM SHARED MEMORY TO REGISTERS. I cannot understand it, in above we have discussed, it seems we have done LOADING FROM SHARED MEMORY TO REGISTERS. Or it means 45-48 slides is talk about we accumulate result in shared memory, load the matmul result to register and finish epilogue? (I'm not sure....)

Answered by hwu36

Oct 9, 2023

thread0 need thread T0,8,16,24's shared memory data to load in registers

correct

cutlass has a special shared memory store layout.

correct.

slides 45-48 just deep dive into an example to show why the special layout can avoid bank conflicts when loading from the shared memory to the registers.

View full answer

hwu36 · 2023-10-09T16:53:43Z

hwu36
Oct 9, 2023
Maintainer

thread0 need thread T0,8,16,24's shared memory data to load in registers

correct

cutlass has a special shared memory store layout.

correct.

slides 45-48 just deep dive into an example to show why the special layout can avoid bank conflicts when loading from the shared memory to the registers.

3 replies

MARD1NO Oct 10, 2023
Author

So much thanks!!!

MARD1NO Oct 13, 2023
Author

If I use ldmatrix to load shared memory to registers, there is no need for the process in slides 45-48, right?

hwu36 Oct 13, 2023
Maintainer

these slides are about how to use ldmatrix in a bank conflict way.

alexsamardzic · 2023-11-06T19:11:31Z

alexsamardzic
Nov 6, 2023

(I hope it's ok to follow-up on this discussion, instead of opening a new issue.)

I also find the part of this talk on the shared memory layout, that is pretty much all of slides 36-55, rather confusing. Here are my initial questions:

Is the matrix layout in the global memory, as shown on slides, supposed to be column-major, and do we have to care about that, and say also about coalescing global memory access patterns, at all here?
The shape of the rectangle used to represent values hold by single thread in shared memory changes between slides 38 and 39. Is this is an omission, and we should just suppose that these rectangles always represent the same number of bits (and is this number of bits equal 128), or the change actually has some meaning?

0 replies

hwu36 · 2023-11-06T19:24:48Z

hwu36
Nov 6, 2023
Maintainer

it does not have to be column major. we use column major as an example here. the key idea here is the same no matter which operand or which layout. We use xor (row ^ col) to swizzle a 8x8 basic block to prevent bank conflicts during both load and store.

i don't understand your 2nd question. what shape of the rectangle is changed?

0 replies

alexsamardzic · 2023-11-06T20:03:22Z

alexsamardzic
Nov 6, 2023

The rectangles representing values stored to shared memory: on slide 38 these are "vertical", on the next slide they're "horizontal". I mean - it's silly, probably an omission, but when ones is trying to understand this stuff, it raises a question does it mean something... Anyway, more important part of that question for me is: how many bits each of these rectangles represents? I assume that for global memory, they represent 128 bits, and for registers they're actually 32 bits. Are they consistently representing 128 bits for shared memory on these slides?

0 replies

hwu36 · 2023-11-07T03:08:32Z

hwu36
Nov 7, 2023
Maintainer

128bits in both global and shared memory - 8 consecutive fp16 data. We then use ldmatrix to load them from the shared memory to the registers. ldmatrix can automatically distribute the data across the warp to the format wanted by tensor cores. ldmatrix is like shared_memory_load + shuffle.

0 replies

alexsamardzic · 2023-11-07T13:03:19Z

alexsamardzic
Nov 7, 2023

Ok, so the rectangles do represent 128 bits for global and shared memory, and 32 bits for registers; again, it's nitpicking but it's confusing that they are almost of the same size in the slides. There is lots of alike details in the talk that make it confusing, for example at the slide 36 presenter says: "the matrix A is in row-major order", even if it's clearly depicted in column-major order, this is why I asked above about it.

Anyway, thanks for the clarifications! After looking further into this, I think I got it; for me, the key for understanding the "scattering" of data throughout shared memory was to stress that on slide 39 and the next couple slides, denoting a rectangle in global and shared memory say by "T0" means that thread 0 is doing cp.async between corresponding regions in global/shared memory; on the other side, from slide 45 and onwards, the "assignment" of threads to rectangles representing data means that corresponding thread is passing the address of given 128 bit segment to ldmatrix. So in order to avoid conflicts, and looking backwards from loading from shared memory into registers: for example for group of threads T0-T7, rectangles passed by these threads to ldmatrix must be in different columns of shared memory representation in the bottom left of the slides, but that means that during the cp.async step threads that stored these rectangles from global memory to shared memory had to put them in different columns - in this particular case, these were threads in the set (T0, T2, T4, T6, T8, T10, T12, T14). At the same time, during the cp.async step, it's best to have consecutive threads copying consecutive blocks from global to shared memory, so here groups of thread T0-T7, and so on, have also to put their rectangles into different columns of shared memory representation. So the arrangement of data in shared memory is to be chosen to satisfy these two sets of requirements, that boils down to having threads within each of following 8 sets of threads: T0-T7, T8-T15, T16-T23, T24-T32, (T0, T2, T4, T6, T8, T10, T12, T14), (T1, T3, T5, T7, T9, T11, T13, T15), (T16, T18, T20, T22, T24, T26, T28, T30) and (T17, T19, T21, T23, T25, T27, T29, T31) to put their rectangles into different columns of shared memory representation. This is doable in multiple ways, but the arrangement chosen for CUTLASS is such that mapping thread indices to shared memory pointers is very fast, during both store and load. I guess "gaps" in shared memory (as on slide 45 and onwards) are the price to be paid for the indexing, but conveniently these could be utilized to store data for the next ldmatrix call, again by simple and fast change of corresponding pointers.

0 replies

hwu36 · 2023-11-07T15:25:27Z

hwu36
Nov 7, 2023
Maintainer

your understanding is correct (i still don't know what do you mean "gaps" though).

at last, this is just gemm, high school stuff, not rocket science. don't be scared of it.

0 replies

kimbaol · 2023-11-17T06:28:38Z

kimbaol
Nov 17, 2023

(I also want to make sure my understanding is correct)
I read test/unit/gemm/device/default_gemm_configuration.hpp and sm80_mma_multistage.hpp.
It seems the slides on GTC2020's talk is describing the "A row-major, small k-bkock(k=32)" config

//
// F16: 128-by-128-by-32 (small k-block)
//

/// Operand A - Row-major (K-Major)
template <>
struct DefaultGemm_TensorOpSm80_OperandA<half_t, layout::RowMajor, 8, 32>
{
  // Smem
  using SmemLayoutAtom = decltype(
    composition(Swizzle<2,3,3>{},
                Layout<Shape < _8,_32>,
                       Stride<_32, _1>>{}));
  using SmemCopyAtom = Copy_Atom<SM75_U32x4_LDSM_N, half_t>;

  // Gmem
  using GmemTiledCopy = decltype(
    make_tiled_copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<cute::uint128_t>, half_t>{},
                    Layout<Shape <_32,_4>,
                           Stride< _4,_1>>{},
                    Layout<Shape < _1,_8>>{}));
};

If we use (F16: 128-by-128-by-64) block, and A col-major/M-major
The ldmatrix of operand A would be:

Am I right?

0 replies

hwu36 · 2023-11-17T14:55:29Z

hwu36
Nov 17, 2023
Maintainer

your A is correct.

0 replies

xws117 · 2024-04-16T07:33:32Z

xws117
Apr 16, 2024

Need for help !
I am sorry,but i still have a lot of questions about this part even i have read all the answers above . here are my questions:

in the part processing gmem memory to smem, swizzle use 4 phases to avoid bank conflict, what if do not use swizzle, the will be 4 way bank conflict , those two method all take 4 cycle clock, so ,what the advantage of swizzle?
after use swizzle ,the smem layout is no bank conflict ,why don not use the same method to trans the location of thread to load data ,but use a different way ,
i dont understand the layout of logical view of threadblock tile here , is the layout row major or column major , if it is row major , the first part of ldmatrix should read T0,2,4,8...., if it is column major , then with one load of Gmem -> Smem, we can not get complete smem , am i right ?

1 reply

hwu36 Apr 24, 2024
Maintainer

swizzle use 4 phases to avoid bank conflict,

I don't really know what 4 phases are. Anyway, swizzling is computed before the mainloop (more precisely, the constructor of the iterators), in the mainloop, we usually just need to do simple math (e.g. adding a constant offset) to reach the next location (e.g. see the operator++() method of the iterators).

4 way bank conflict , those two method all take 4 cycle clock

4 way bank conflict takes way more than 4 cycles.

why don not use the same method to trans the location of thread to load data ,but use a different way ,

I don't understand this question.

i dont understand the layout of logical view of threadblock tile here , is the layout row major or column major.

You can consider this figure as A matrix with column major. Every thread holds eight fp16 data which is 128bit totally. The shared memory cache liine is 128B which can hold 8 threads' data. One mma.fp16.m16n8k16 instruction needs 16x16 data from A matrix which are the 32 blue and green blocks from the logic view. One warp wise ldmatrix can load all these 32 blocks. The next ldmatrix will then load the yellow and purple blocks.

Ther-LF · 2024-05-05T08:43:46Z

Ther-LF
May 5, 2024

Hi, @hwu36 . Currently I am reading cutlass's source code about shared memory layout. But I find there are some strange things.
First, I want to make sure my understanding of gmem data access and smem layout is correct.

Assuming the half_t Matrix A has a size of 32 x 64 in Row Major format, and the ThreadblockShape is <32, 32>. Additionally, in the kernel, we have only one thread block that includes 2 warps.

According to

cutlass/include/cutlass/gemm/threadblock/default_mma_core_sm80.h

Line 1458 in 033d9ef

using IteratorThreadMapA = transform::PitchLinearWarpRakedThreadMap<

and https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/transform/pitch_linear_thread_map.h[](pitch_linear_thread_map.h),
the first data access for each warp is shown as follows:

and the following data access of gmem is shown as follows:

The same process applies to TB1 as well.

For now, if we consider Stage = 2 and use mma_pipelined.h, we will have a FragmentA which will contain ElementPerAccess × Iterations::kCount = 8 × 4 = 32 half elements. And then we need to store it back to Shared Memory. That's my understanding of data movement from gmem to smem, is it correct?

During a single data access, we will obtain the results of 8 x 4 data accesses. Since the size of 4 data accessed is 4 × 128b = 4 × 16B = 64B, which doesn't fullfill 32 banks, we will flatten the 8 x 4 data accessed into a data distribution of 4 x 8 in smem, which is shown as follows:

And for each thread's single data accessed, to avoid bank conflict in loading and storing, cutlass swizzle each data accessed based on XOR's characteristic. What cutlass applies to each data accesed is new_col = original_row xor original_col, which is shown as follows:

That's my understanding of One single data accessed 's smem layout, is it correct?

And for several data accessed, I guess they should be placed in order?. Like this:

But when I look at the source code of tensor_op_multiplicand_sm75.h and the

cutlass/include/cutlass/layout/tensor_op_multiplicand_sm75.h

Line 54 in 033d9ef

struct TensorOpMultiplicand {

, it contains some components that is very confusing.

I am wondering what is the function of

cutlass/include/cutlass/layout/tensor_op_multiplicand_sm75.h

Line 89 in 033d9ef

static int const kFactor =

and

cutlass/include/cutlass/layout/tensor_op_multiplicand_sm75.h

Line 112 in 033d9ef

using PartitionShape = PitchLinearShape<4, 4>;

, and the swizzle of

cutlass/include/cutlass/layout/tensor_op_multiplicand_sm75.h

Line 194 in 033d9ef

int element_contiguous = (tile_contiguous_idx * TileShape::kContiguous +

is also hard to understand.

I initially thought that I misunderstood the concept of swizzle, but I have confirmed my previous understanding by using the dump_reg_shmem.cu code for verification.

0 replies

hwu36 · 2024-07-10T19:24:23Z

hwu36
Jul 10, 2024
Maintainer

kFactor means how many rows of data in gmem are needed to fill one shared memory 128B cache line. In your case, this number is 2. pretty much the same as what you said in "we will flatten the 8 x 4 data accessed into a data distribution of 4 x 8 in smem"

PartitionShape is the basic swizzling block unit. We basically do this

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

=>

0 1 2 3
5 4 7 6
10 11 8 9 
15 14 13 12

basically new_col_id = row_id ^ col_id

As to

    int element_contiguous = (tile_contiguous_idx * TileShape::kContiguous +
                              permuted_partition_contiguous_within_tile *
                                  PartitionShape::kContiguous +
                              permuted_vec_contiguous_within_partition) *
                                 kElementsPerAccess +
                             (coord.contiguous() % kElementsPerAccess);

your TileShape is 8x4, your PartitionShape is 4x4. So, there are two PartitionShape inside one TileShape in the contiguous dimension. tile_contiguous_idx means which TileShape you are in, permuted_partition_contiguous_within_tile means which one of two PartitionShape you are in within the current TileShape. permuted_vec_contiguous_within_partition is basically the new_col_id

0 replies

linuxlonelyeagle · 2024-12-04T07:11:04Z

linuxlonelyeagle
Dec 4, 2024

@hwu36 I know it's bad to @ people, but I need your help.I am writing Gemm and using ncu I am seeing bank conflict.I read the answers above and in the power point and my question is more basic & simple.Why is this access pattern bank conflict?Thanks you.

7 replies

Ther-LF Dec 4, 2024

Oh, sorry, I misunderstood the question earlier. If a warp accesses 128 banks, it will divide the process into 4 phases, with each phase handled by T0-T7, T8-T15, T16-T23, and T24-T31, respectively. For the diagram above, during the process of loading data from global memory to shared memory, there will be no bank conflict. However, if ldmatrix is used to load data into registers, bank conflicts will occur. For more details, please refer to cutlass swizzle机制解析（一）.

hwu36 Dec 4, 2024
Maintainer

The zhihu link is great. Actually, there are many great cutlass articles in zhihu.

I prefer people to @ me directly. Too many things going on in parallel these days.

linuxlonelyeagle Dec 6, 2024

@hwu36 I'd like to ask if there are any optimization tips for a100 to go from registers to share memory after gemm is done? I found huge bank conflicts here.Thank you.

hwu36 Dec 6, 2024
Maintainer

Cutlass epilogue uses padding zeros to avoid bank conflict. It is a classical approach.

linuxlonelyeagle Dec 6, 2024

ok. but the earlier thinking... Can I do a swizzle from register to share memory, and from share memory to global memory, which should ensure the correct ordering of the data? But share memory to global memory is definitely going to have bank conflicts.I am using vectorized access.it will not divide the process into 4 phases.
Also, I think the content in zhihu only removes bank conflicts from ldmtrix.The process of global memory to share memory will inevitably cause bank conflicts for the same reason as above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some question about GTC2020 cutlass's talk about conflict free load shared memory #1130

{{title}}

Replies: 13 comments 11 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Some question about GTC2020 cutlass's talk about conflict free load shared memory #1130

Replies: 13 comments · 11 replies

hwu36 Oct 9, 2023 Maintainer

MARD1NO Oct 10, 2023 Author

MARD1NO Oct 13, 2023 Author

hwu36 Oct 13, 2023 Maintainer

hwu36 Nov 6, 2023 Maintainer

hwu36 Nov 7, 2023 Maintainer

hwu36 Nov 7, 2023 Maintainer

hwu36 Nov 17, 2023 Maintainer

hwu36 Apr 24, 2024 Maintainer

hwu36 Jul 10, 2024 Maintainer

hwu36 Dec 4, 2024 Maintainer

hwu36 Dec 6, 2024 Maintainer

Replies: 13 comments 11 replies

hwu36
Oct 9, 2023
Maintainer

MARD1NO Oct 10, 2023
Author

MARD1NO Oct 13, 2023
Author

hwu36 Oct 13, 2023
Maintainer

hwu36
Nov 6, 2023
Maintainer

hwu36
Nov 7, 2023
Maintainer

hwu36
Nov 7, 2023
Maintainer

hwu36
Nov 17, 2023
Maintainer

hwu36 Apr 24, 2024
Maintainer

hwu36
Jul 10, 2024
Maintainer

hwu36 Dec 4, 2024
Maintainer

hwu36 Dec 6, 2024
Maintainer