Make vectorized store convert and perform multiple stores if required #111
Similarly to #109, when doing a vectorized store, convert the elements so that they match the element type of the layout. In addition, if we want to store more values than a single vectorized store can handle, perform multiple stores.
This matters when the element type of the global layout differs from that of the shared layout. For example, if `global_layout` is `AlignedColMajor{Float16}`, but `shared_layout_a` is `AlignedColMajor{Float32}`, we initially load 8 `Float16`s, but then only store 4 values into the `Float32` shared-memory workspace, because the number of elements is currently always computed as `16 ÷ sizeof(T)`.
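The mismatch in element counts can be sketched as follows (a Python stand-in for the `16 ÷ sizeof(T)` computation; the type sizes are the standard ones, but the names are illustrative, not the actual GemmKernels.jl code):

```python
VEC_BYTES = 16                      # width of one vectorized memory operation, in bytes
sizeof = {"Float16": 2, "Float32": 4}

def num_elements(T):
    # The `16 ÷ sizeof(T)` computation from the text.
    return VEC_BYTES // sizeof[T]

loaded = num_elements("Float16")    # 8 elements per vectorized load from global memory
stored = num_elements("Float32")    # only 4 elements per vectorized store to shared memory
print(loaded, stored, loaded // stored)   # 8 4 2: two stores are needed per load
```

So a single 16-byte store covers only half of the loaded elements once they are widened to `Float32`, which is why the store path must both convert and issue multiple stores.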
The above happens when doing Float16xFloat32, in which case I'm trying to use `T = promote_type(Float16, Float32) = Float32`, causing a type mismatch between global and shared memory. I want to take this approach (as opposed to keeping the shared layout `Float16` too) because this opens up a path to using WMMA for incompatible inputs (say, for Float16xFloat32 we could then use WMMA on a GPU that has `Float32xFloat32 = ...` tensor cores).
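The overall store strategy described above can be sketched like this (a hedged Python sketch with hypothetical names; the real code operates on GPU memory with vectorized instructions, not Python lists):

```python
VEC_BYTES = 16   # width of one vectorized store, in bytes

def store_vectorized(workspace, values, dst_sizeof, convert):
    """Convert `values` to the destination element type, then issue as many
    vectorized stores as needed to cover all of them."""
    converted = [convert(v) for v in values]          # e.g. Float16 -> Float32
    per_store = VEC_BYTES // dst_sizeof               # elements per single vectorized store
    for i in range(0, len(converted), per_store):     # multiple stores if required
        workspace.append(converted[i:i + per_store])  # stands in for one vectorized store

ws = []
store_vectorized(ws, list(range(8)), dst_sizeof=4, convert=float)
print(len(ws), [len(chunk) for chunk in ws])   # 2 [4, 4]: 8 loaded values, 2 stores of 4
```

The key point is that the number of stores is derived from the converted element count and the destination element size, rather than assuming one vectorized load maps to exactly one vectorized store.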