
Fix vector op indexing and add boundscheck. #127

Merged · 2 commits · Jul 3, 2023

Conversation

@maleadt (Member) commented Jul 3, 2023

Previously, the pointers passed to vstore etc. were being offset based on the size of the vector. However, when working with e.g. a 127×127 input, the offset of element [1,2] is 127*sizeof(T), i.e., not a multiple of the vector size. Although this doesn't matter much in practice, insofar as such a pointer is not sufficiently aligned and thus not compatible with vector operations anyway, fixing the calculation at least makes the code throw an unaligned memory access error instead of silently computing wrong results.

While at it, also add a bounds check.
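To make the offset arithmetic concrete, here is a minimal sketch (in Python rather than Julia, purely for illustration; the names `byte_offset`, `checked_vload_offset`, `LD`, etc. are hypothetical and not part of the package) showing why element [1,2] of a 127×127 Float32 matrix lands at a byte offset that is not a multiple of the 16-byte vector size, and what a bounds check before a vectorized access could look like:

```python
SIZEOF_T = 4   # sizeof(Float32)
VEC = 4        # vector width: 4 x Float32 => 16-byte vectorized loads/stores
LD = 127       # leading dimension of a 127x127 column-major matrix

def byte_offset(i, j, ld=LD, elsize=SIZEOF_T):
    """Byte offset of 0-based element [i, j] in column-major storage."""
    return (j * ld + i) * elsize

def checked_vload_offset(i, j, rows=LD, cols=LD):
    """Hypothetical bounds check before a vectorized access of VEC elements
    along the first dimension."""
    if not (0 <= i and i + VEC <= rows and 0 <= j < cols):
        raise IndexError("vector access out of bounds")
    return byte_offset(i, j)

# Element [1, 2] in 1-based indexing is (i=0, j=1) here:
off = byte_offset(0, 1)
print(off)                      # 508 == 127 * sizeof(Float32)
print(off % (VEC * SIZEOF_T))   # 12: not a multiple of the 16-byte vector size
```

With an element-based (rather than vector-based) offset, the misalignment is at least detectable: the address modulo the vector size is nonzero, so the hardware faults instead of silently touching the wrong elements.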

@maleadt (Member, Author) commented Jul 3, 2023

Benchmark results for commit a03ed7d (comparing to 70ff3fb):

| ID | before | after | change |
|----|--------|-------|--------|
| ["BLAS", "Float16*Float16'=Float32 (4096×4096×4096, alpha)"] | 6.745 ms ± 39.298 μs | 6.658 ms ± 298.967 μs | 24.0% ✅ |

@codecov bot commented Jul 3, 2023

Codecov Report

Patch coverage has no change; project coverage changed by -0.31% ⚠️

Comparison: base (51bf8ee) at 30.27%, head (8256f53) at 29.97%.

❗ Current head 8256f53 differs from the pull request's most recent head a03ed7d. Consider uploading reports for commit a03ed7d to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #127      +/-   ##
==========================================
- Coverage   30.27%   29.97%   -0.31%     
==========================================
  Files          11       11              
  Lines         786      794       +8     
==========================================
  Hits          238      238              
- Misses        548      556       +8     
Impacted Files     Coverage Δ
src/epilogue.jl    5.55% <0.00%> (ø)
src/kernel.jl      5.30% <0.00%> (ø)
src/layout.jl      20.27% <0.00%> (-1.21%) ⬇️


@maleadt (Member, Author) commented Jul 3, 2023

Weird; locally this consistently gives an 8% speed-up, due to a 3% reduction in register usage. AFAICT the change makes it possible for LLVM to address all vector operations as constant offsets from a single base pointer instead of recomputing each pointer over and over.

Before:

❯ grep ld.shared.v4 master/matmul_pipelined_1.asm
	ld.shared.v4.f32 	{%f577, %f578, %f579, %f580}, [%rd409];
	ld.shared.v4.f32 	{%f581, %f582, %f583, %f584}, [%rd424];
	ld.shared.v4.f32 	{%f585, %f586, %f587, %f588}, [%rd438];
	ld.shared.v4.f32 	{%f589, %f590, %f591, %f592}, [%rd452];
	ld.shared.v4.f32 	{%f593, %f594, %f595, %f596}, [%rd466];
	ld.shared.v4.f32 	{%f597, %f598, %f599, %f600}, [%rd480];
	ld.shared.v4.f32 	{%f601, %f602, %f603, %f604}, [%rd494];
	ld.shared.v4.f32 	{%f605, %f606, %f607, %f608}, [%rd508];
	ld.shared.v4.f32 	{%f609, %f610, %f611, %f612}, [%rd522];
	ld.shared.v4.f32 	{%f613, %f614, %f615, %f616}, [%rd536];
	ld.shared.v4.f32 	{%f617, %f618, %f619, %f620}, [%rd550];
	ld.shared.v4.f32 	{%f621, %f622, %f623, %f624}, [%rd564];
	ld.shared.v4.f32 	{%f625, %f626, %f627, %f628}, [%rd578];
	ld.shared.v4.f32 	{%f629, %f630, %f631, %f632}, [%rd592];
	ld.shared.v4.f32 	{%f633, %f634, %f635, %f636}, [%rd606];
	ld.shared.v4.f32 	{%f637, %f638, %f639, %f640}, [%rd620];

After:

❯ grep ld.shared.v4 vector/matmul_pipelined_1.asm
	ld.shared.v4.f32 	{%f577, %f578, %f579, %f580}, [%rd245+-16];
	ld.shared.v4.f32 	{%f581, %f582, %f583, %f584}, [%rd245+4080];
	ld.shared.v4.f32 	{%f585, %f586, %f587, %f588}, [%rd245+8176];
	ld.shared.v4.f32 	{%f589, %f590, %f591, %f592}, [%rd245+12272];
	ld.shared.v4.f32 	{%f593, %f594, %f595, %f596}, [%rd245+16368];
	ld.shared.v4.f32 	{%f597, %f598, %f599, %f600}, [%rd245+20464];
	ld.shared.v4.f32 	{%f601, %f602, %f603, %f604}, [%rd245+24560];
	ld.shared.v4.f32 	{%f605, %f606, %f607, %f608}, [%rd245+28656];
	ld.shared.v4.f32 	{%f609, %f610, %f611, %f612}, [%rd245+32752];
	ld.shared.v4.f32 	{%f613, %f614, %f615, %f616}, [%rd245+36848];
	ld.shared.v4.f32 	{%f617, %f618, %f619, %f620}, [%rd245+40944];
	ld.shared.v4.f32 	{%f621, %f622, %f623, %f624}, [%rd245+45040];
	ld.shared.v4.f32 	{%f625, %f626, %f627, %f628}, [%rd245+49136];
	ld.shared.v4.f32 	{%f629, %f630, %f631, %f632}, [%rd245+53232];
	ld.shared.v4.f32 	{%f633, %f634, %f635, %f636}, [%rd245+57328];
	ld.shared.v4.f32 	{%f637, %f638, %f639, %f640}, [%rd245+61424];

That shaves almost 15% off the instruction count. The effect also exists at the SASS level, albeit less pronounced:

❯ grep STG master/matmul_pipelined_1.sass
        STG.E.128 [R72.64], R36 ;
        STG.E.128 [R70.64], R32 ;
        STG.E.128 [R68.64], R28 ;
        STG.E.128 [R66.64], R24 ;
        STG.E.128 [R64.64], R20 ;
        STG.E.128 [R74.64], R16 ;
        STG.E.128 [R76.64], R12 ;
        STG.E.128 [R80.64], R8 ;
        STG.E.128 [R78.64], R4 ;
        STG.E.128 [R26.64], R40 ;
        STG.E.128 [R24.64], R32 ;
        STG.E.128 [R28.64], R20 ;
        STG.E.128 [R26.64], R16 ;
        STG.E.128 [R2.64], R12 ;
        STG.E.128 [R24.64], R8 ;
        STG.E.128 [R30.64], R4 ;

❯ grep STG vector/matmul_pipelined_1.sass
        STG.E.128 [R94.64+-0x8], R64 ;
        STG.E.128 [R92.64+-0x8], R60 ;
        STG.E.128 [R90.64+-0x8], R56 ;
        STG.E.128 [R88.64+-0x8], R52 ;
        STG.E.128 [R86.64+-0x8], R48 ;
        STG.E.128 [R84.64+-0x8], R44 ;
        STG.E.128 [R82.64+-0x8], R40 ;
        STG.E.128 [R80.64+-0x8], R36 ;
        STG.E.128 [R78.64+-0x8], R32 ;
        STG.E.128 [R76.64+-0x8], R28 ;
        STG.E.128 [R74.64+-0x8], R24 ;
        STG.E.128 [R96.64+-0x8], R20 ;
        STG.E.128 [R104.64+-0x8], R16 ;
        STG.E.128 [R102.64+-0x8], R12 ;
        STG.E.128 [R100.64+-0x8], R8 ;
        STG.E.128 [R98.64+-0x8], R4 ;

... so I expect this to matter even more for compute-heavy kernels.

@thomasfaingnaert (Member) commented:
> AFAICT the change makes it possible for LLVM to offset all vector operations from a single pointer instead of having to re-compute over and over.

Strange, it used to be the case that all memory operations used constant offsets w.r.t. a single base address; I guess that regressed at some point... Anyway, I also noticed a substantial performance improvement when I originally added this optimisation.

@maleadt maleadt merged commit 1222d80 into master Jul 3, 2023
@maleadt maleadt deleted the tb/vector_ops branch July 3, 2023 14:29