Make vectorized store convert and perform multiple stores if required #111

Merged on Jun 29, 2023 (4 commits)

maleadt (Member) commented Jun 27, 2023

Similar to #109: when performing a vectorized store, convert the elements so that they match the element type of the layout. In addition, when there are more values to store than fit in a single vectorized store, perform multiple stores.

This matters when the element type of the global layout differs from that of the shared layout. For example, if global_layout is AlignedColMajor{Float16} but shared_layout_a is AlignedColMajor{Float32}, we initially load 8 Float16 values but then store only 4 of them into the Float32 shared-memory workspace, because the number of elements is currently always computed as 16 ÷ sizeof(T).
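To make the arithmetic concrete, here is a minimal CPU-side sketch of the behaviour described above. It is not the code in src/layout.jl: `VECTOR_BYTES` and `chunked_store!` are hypothetical names, and the 16-byte access width is assumed from the `16 ÷ sizeof(T)` rule. The sketch converts the loaded values to the destination element type and splits them into as many stores as are needed.

```julia
# Hypothetical model of a vectorized store on plain arrays (no GPU required).
# Assumption: one vectorized access moves 16 bytes, as in 16 ÷ sizeof(T).
const VECTOR_BYTES = 16

function chunked_store!(dest::AbstractVector{T}, values::NTuple{N,Any},
                        offset::Int=1) where {T,N}
    per_store = VECTOR_BYTES ÷ sizeof(T)          # elements that fit in one store
    converted = map(v -> convert(T, v), values)   # match the destination element type
    nstores = 0
    for lo in 1:per_store:N                       # one iteration per vectorized store
        hi = min(lo + per_store - 1, N)
        for i in lo:hi
            dest[offset + i - 1] = converted[i]
        end
        nstores += 1
    end
    return nstores
end

loaded = ntuple(i -> Float16(i), VECTOR_BYTES ÷ sizeof(Float16))  # 8 values from one load
shared = zeros(Float32, length(loaded))                           # Float32 workspace
nstores = chunked_store!(shared, loaded)
println("values loaded: ", length(loaded), ", stores issued: ", nstores)
```

With a Float16 source and a Float32 destination, one 16-byte load yields 8 values while one 16-byte store holds only 4, so two stores are needed to avoid dropping half of the loaded values.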

This happens for Float16xFloat32, where I'm trying to use T = promote_type(Float16, Float32) = Float32, which causes a type mismatch between global and shared memory. I prefer this approach over also keeping the shared layout as Float16 because it opens up a path to using WMMA for incompatible inputs (say, for Float16xFloat32 we could then use WMMA on a GPU that has Float32xFloat32=... tensor cores).
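The promotion step can be checked in plain Julia. The snippet below is only an illustration of where the mismatch comes from; the variable names are made up for this example and do not appear in the package.

```julia
# Element types of the two inputs in the Float16xFloat32 case.
A_type = Float16
B_type = Float32

# Base.promote_type picks the common type used for the shared-memory layout.
shared_type = promote_type(A_type, B_type)

# A value loaded from Float16 global memory must be converted before the store.
x_global = Float16(1.5)
x_shared = convert(shared_type, x_global)

println("shared element type: ", shared_type)
println("size in global memory: ", sizeof(x_global), " bytes, in shared memory: ", sizeof(x_shared), " bytes")
```

Because the promoted type is twice as wide as Float16, the same number of elements occupies twice as many bytes in shared memory as in global memory, which is why a single store no longer covers a single load.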

codecov (bot) commented Jun 27, 2023

Codecov Report

Patch coverage: 50.00% and project coverage change: +0.34 🎉

Comparison is base (2a6ad1d) 29.61% compared to head (8881635) 29.96%.

❗ Current head 8881635 differs from pull request most recent head 27903f6. Consider uploading reports for the commit 27903f6 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #111      +/-   ##
==========================================
+ Coverage   29.61%   29.96%   +0.34%     
==========================================
  Files          11       11              
  Lines         763      761       -2     
==========================================
+ Hits          226      228       +2     
+ Misses        537      533       -4     
Impacted Files Coverage Δ
src/layout.jl 21.60% <50.00%> (+1.91%) ⬆️


maleadt (Member, Author) commented Jun 29, 2023

Benchmark results for commit 2b25c99 (comparing to 48279bc):

ID before after change

@maleadt maleadt force-pushed the tb/vstore_mismatch branch 4 times, most recently from acc3c48 to 73cf2ae Compare June 29, 2023 11:50
maleadt (Member, Author) commented Jun 29, 2023

The 1.6 CI failures are caused by JuliaGPU/GPUCompiler.jl#481.

@maleadt maleadt changed the title Make vectorized store convert elements Make vectorized store convert and perform multiple stores if required Jun 29, 2023
@maleadt maleadt merged commit ee0fe8d into master Jun 29, 2023
@maleadt maleadt deleted the tb/vstore_mismatch branch June 29, 2023 13:11