
Fix vector op indexing and add boundscheck. #127

Merged · 2 commits · Jul 3, 2023

Conversation

@maleadt (Member) commented Jul 3, 2023

Previously, the pointers passed to vstore etc. were being offset based on the size of the vector. However, when working with e.g. a 127×127 input, the offset of element [1,2] is 127*sizeof(T), i.e., not a multiple of the vector size. Although this doesn't matter much in practice, insofar as such a pointer is not sufficiently aligned and thus not compatible with vector operations anyway, fixing the calculation at least makes the code throw an unaligned memory access error instead of silently computing wrong results.

While at it, also add a bounds check.
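To make the offset arithmetic concrete, here is a minimal sketch (in Python rather than Julia, purely for illustration; the names `byte_offset`, `checked_vload_offset`, `LD`, etc. are hypothetical and not part of the package) showing why element [1,2] of a 127×127 Float32 matrix lands at a byte offset that is not a multiple of the 16-byte vector size, and what a bounds check before a vectorized access could look like:

```python
SIZEOF_T = 4   # sizeof(Float32)
VEC = 4        # vector width: 4 x Float32 => 16-byte vectorized loads/stores
LD = 127       # leading dimension of a 127x127 column-major matrix

def byte_offset(i, j, ld=LD, elsize=SIZEOF_T):
    """Byte offset of 0-based element [i, j] in column-major storage."""
    return (j * ld + i) * elsize

def checked_vload_offset(i, j, rows=LD, cols=LD):
    """Hypothetical bounds check before a vectorized access of VEC elements
    along the first dimension."""
    if not (0 <= i and i + VEC <= rows and 0 <= j < cols):
        raise IndexError("vector access out of bounds")
    return byte_offset(i, j)

# Element [1, 2] in 1-based indexing is (i=0, j=1) here:
off = byte_offset(0, 1)
print(off)                      # 508 == 127 * sizeof(Float32)
print(off % (VEC * SIZEOF_T))   # 12: not a multiple of the 16-byte vector size
```

With an element-based (rather than vector-based) offset, the misalignment is at least detectable: the address modulo the vector size is nonzero, so the hardware faults instead of silently touching the wrong elements.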

@maleadt (Member, Author) commented Jul 3, 2023

Benchmark results for commit a03ed7d (comparing to 70ff3fb):

| ID | before | after | change |
|----|--------|-------|--------|
| ["BLAS", "Float16*Float16'=Float32 (4096×4096×4096, alpha)"] | 6.745 ms ± 39.298 μs | 6.658 ms ± 298.967 μs | 24.0% ✅ |

@codecov bot commented Jul 3, 2023

Codecov Report

Patch coverage has no change; project coverage changed by -0.31% ⚠️

Comparison: base (51bf8ee) at 30.27%, head (8256f53) at 29.97%.

❗ Current head 8256f53 differs from the pull request's most recent head a03ed7d. Consider uploading reports for commit a03ed7d to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #127      +/-   ##
==========================================
- Coverage   30.27%   29.97%   -0.31%     
==========================================
  Files          11       11              
  Lines         786      794       +8     
==========================================
  Hits          238      238              
- Misses        548      556       +8     
Impacted Files     Coverage Δ
src/epilogue.jl    5.55% <0.00%> (ø)
src/kernel.jl      5.30% <0.00%> (ø)
src/layout.jl      20.27% <0.00%> (-1.21%) ⬇️


@maleadt (Member, Author) commented Jul 3, 2023

Weird; locally this consistently gives an 8% speed-up, due to a 3% reduction in register usage. AFAICT the change makes it possible for LLVM to address all vector operations as constant offsets from a single base pointer instead of recomputing each pointer over and over.

Before:

❯ grep ld.shared.v4 master/matmul_pipelined_1.asm
	ld.shared.v4.f32 	{%f577, %f578, %f579, %f580}, [%rd409];
	ld.shared.v4.f32 	{%f581, %f582, %f583, %f584}, [%rd424];
	ld.shared.v4.f32 	{%f585, %f586, %f587, %f588}, [%rd438];
	ld.shared.v4.f32 	{%f589, %f590, %f591, %f592}, [%rd452];
	ld.shared.v4.f32 	{%f593, %f594, %f595, %f596}, [%rd466];
	ld.shared.v4.f32 	{%f597, %f598, %f599, %f600}, [%rd480];
	ld.shared.v4.f32 	{%f601, %f602, %f603, %f604}, [%rd494];
	ld.shared.v4.f32 	{%f605, %f606, %f607, %f608}, [%rd508];
	ld.shared.v4.f32 	{%f609, %f610, %f611, %f612}, [%rd522];
	ld.shared.v4.f32 	{%f613, %f614, %f615, %f616}, [%rd536];
	ld.shared.v4.f32 	{%f617, %f618, %f619, %f620}, [%rd550];
	ld.shared.v4.f32 	{%f621, %f622, %f623, %f624}, [%rd564];
	ld.shared.v4.f32 	{%f625, %f626, %f627, %f628}, [%rd578];
	ld.shared.v4.f32 	{%f629, %f630, %f631, %f632}, [%rd592];
	ld.shared.v4.f32 	{%f633, %f634, %f635, %f636}, [%rd606];
	ld.shared.v4.f32 	{%f637, %f638, %f639, %f640}, [%rd620];

After:

❯ grep ld.shared.v4 vector/matmul_pipelined_1.asm
	ld.shared.v4.f32 	{%f577, %f578, %f579, %f580}, [%rd245+-16];
	ld.shared.v4.f32 	{%f581, %f582, %f583, %f584}, [%rd245+4080];
	ld.shared.v4.f32 	{%f585, %f586, %f587, %f588}, [%rd245+8176];
	ld.shared.v4.f32 	{%f589, %f590, %f591, %f592}, [%rd245+12272];
	ld.shared.v4.f32 	{%f593, %f594, %f595, %f596}, [%rd245+16368];
	ld.shared.v4.f32 	{%f597, %f598, %f599, %f600}, [%rd245+20464];
	ld.shared.v4.f32 	{%f601, %f602, %f603, %f604}, [%rd245+24560];
	ld.shared.v4.f32 	{%f605, %f606, %f607, %f608}, [%rd245+28656];
	ld.shared.v4.f32 	{%f609, %f610, %f611, %f612}, [%rd245+32752];
	ld.shared.v4.f32 	{%f613, %f614, %f615, %f616}, [%rd245+36848];
	ld.shared.v4.f32 	{%f617, %f618, %f619, %f620}, [%rd245+40944];
	ld.shared.v4.f32 	{%f621, %f622, %f623, %f624}, [%rd245+45040];
	ld.shared.v4.f32 	{%f625, %f626, %f627, %f628}, [%rd245+49136];
	ld.shared.v4.f32 	{%f629, %f630, %f631, %f632}, [%rd245+53232];
	ld.shared.v4.f32 	{%f633, %f634, %f635, %f636}, [%rd245+57328];
	ld.shared.v4.f32 	{%f637, %f638, %f639, %f640}, [%rd245+61424];

That shaves almost 15% off the instruction count. The effect also exists at the SASS level, albeit less pronounced:

❯ grep STG master/matmul_pipelined_1.sass
        STG.E.128 [R72.64], R36 ;
        STG.E.128 [R70.64], R32 ;
        STG.E.128 [R68.64], R28 ;
        STG.E.128 [R66.64], R24 ;
        STG.E.128 [R64.64], R20 ;
        STG.E.128 [R74.64], R16 ;
        STG.E.128 [R76.64], R12 ;
        STG.E.128 [R80.64], R8 ;
        STG.E.128 [R78.64], R4 ;
        STG.E.128 [R26.64], R40 ;
        STG.E.128 [R24.64], R32 ;
        STG.E.128 [R28.64], R20 ;
        STG.E.128 [R26.64], R16 ;
        STG.E.128 [R2.64], R12 ;
        STG.E.128 [R24.64], R8 ;
        STG.E.128 [R30.64], R4 ;

❯ grep STG vector/matmul_pipelined_1.sass
        STG.E.128 [R94.64+-0x8], R64 ;
        STG.E.128 [R92.64+-0x8], R60 ;
        STG.E.128 [R90.64+-0x8], R56 ;
        STG.E.128 [R88.64+-0x8], R52 ;
        STG.E.128 [R86.64+-0x8], R48 ;
        STG.E.128 [R84.64+-0x8], R44 ;
        STG.E.128 [R82.64+-0x8], R40 ;
        STG.E.128 [R80.64+-0x8], R36 ;
        STG.E.128 [R78.64+-0x8], R32 ;
        STG.E.128 [R76.64+-0x8], R28 ;
        STG.E.128 [R74.64+-0x8], R24 ;
        STG.E.128 [R96.64+-0x8], R20 ;
        STG.E.128 [R104.64+-0x8], R16 ;
        STG.E.128 [R102.64+-0x8], R12 ;
        STG.E.128 [R100.64+-0x8], R8 ;
        STG.E.128 [R98.64+-0x8], R4 ;

... so I expect this to matter even more for compute-heavy kernels.

@thomasfaingnaert (Member) commented:
> AFAICT the change makes it possible for LLVM to offset all vector operations from a single pointer instead of having to re-compute over and over.

Strange, it used to be the case that all memory operations used constant offsets w.r.t. a single base address; I guess that regressed at some point... Anyway, I also noticed a substantial performance improvement when I originally added this optimisation.

@maleadt maleadt merged commit 1222d80 into master Jul 3, 2023
@maleadt maleadt deleted the tb/vector_ops branch July 3, 2023 14:29