-
In a previous discussion, manishucsd suggested reading the CUTLASS GTC 2020 talk: https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/ . I have some questions about loading shared memory in CUTLASS. When loading global memory data into shared memory, each thread's load is vectorized. Since the tensor cores need a special layout, each thread in a warp needs other threads' data; for example, thread T0 needs the shared memory data written by threads T0, T8, T16, and T24 to load into its registers. To avoid bank conflicts, CUTLASS uses a special shared memory store layout. (Am I right?) Slides 45-48 discuss this layout.
Replies: 13 comments 11 replies
-
correct. slides 45-48 just deep dive into an example to show why the special layout can avoid bank conflicts when loading from shared memory to the registers.
-
(I hope it's ok to follow up on this discussion instead of opening a new issue.) I also find the part of this talk on the shared memory layout (pretty much all of slides 36-55) rather confusing. Here are my initial questions:
(questions omitted)
-
it does not have to be column-major; we use column-major as an example here. the key idea is the same no matter which operand or which layout. We use xor (row ^ col) to swizzle an 8x8 basic block to prevent bank conflicts during both load and store. i don't understand your 2nd question. what shape of which rectangle is changed?
-
The rectangles representing values stored to shared memory: on slide 38 these are "vertical", on the next slide they're "horizontal". I mean, it's silly, probably an omission, but when one is trying to understand this stuff, it raises the question of whether it means something... Anyway, the more important part of that question for me is: how many bits does each of these rectangles represent? I assume that for global memory they represent 128 bits, and for registers they're actually 32 bits. Do they consistently represent 128 bits for shared memory on these slides?
-
128 bits in both global and shared memory: 8 consecutive fp16 values. We then use the xor swizzle described earlier to place them in shared memory.
-
Ok, so the rectangles do represent 128 bits for global and shared memory, and 32 bits for registers; again, it's nitpicking, but it's confusing that they are almost the same size on the slides. There are lots of details like this in the talk that make it confusing; for example, on slide 36 the presenter says "the matrix A is in row-major order" even though it's clearly depicted in column-major order, which is why I asked about it above. Anyway, thanks for the clarifications! After looking further into this, I think I got it; for me, the key to understanding the "scattering" of data throughout shared memory was to stress that on slide 39 and the next couple of slides, labeling a rectangle in global and shared memory by, say, "T0" means that thread 0 is doing that particular copy.
-
your understanding is correct (i still don't know what you mean by "gaps", though). after all, this is just gemm, high school stuff, not rocket science. don't be scared of it.
Beta Was this translation helpful? Give feedback.
-
-
your A is correct.
-
-
Hi @hwu36. Currently I am reading CUTLASS's source code for the shared memory layout, but I find some strange things. Assuming the configuration (snippet omitted): according to (snippet omitted) and [pitch_linear_thread_map.h](https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/transform/pitch_linear_thread_map.h), the first data access for each warp is shown as follows (figure omitted), and the following data accesses of gmem are shown as follows (figure omitted). The same process applies to TB1 as well.
For now, if we consider (snippet omitted): during a single data access we obtain the results of (figure omitted). And for each thread's single data access, to avoid bank conflicts in loading and storing, CUTLASS swizzles each accessed chunk based on (snippet omitted). That's my understanding of (snippet omitted). And for several data accesses, I guess they should be placed in order, like this (figure omitted)? But when I look at the source code of tensor_op_multiplicand_sm75.h and the (snippet omitted), it contains some components that are very confusing. I am wondering what the function of (snippet omitted) is, and the swizzle of (snippet omitted) is also hard to understand. I initially thought that I had misunderstood the concept of swizzle, but I have confirmed my previous understanding by using the (snippet omitted).
-
(snippet omitted) => (snippet omitted)
basically, (explanation omitted). As to (snippet omitted),
your (snippet omitted).
-
@hwu36 I know it's bad to @ people, but I need your help. I am writing a GEMM and, using ncu, I am seeing bank conflicts. I read the answers above and the PowerPoint, and my question is more basic and simple: why does this access pattern cause bank conflicts? Thank you.