combining full loads/stores instead of masked ones #2018
Replies: 3 comments 1 reply
-
Hi! Thanks for sharing. Do you mind sharing the asm? I have trouble reading this, and I think the asm would help. At the moment I am not sure what you need the loads for.
-
Judging from the asm, what you seem to be doing is: have an optimal function for each of the possible combinations of offsets. Not an unreasonable thing to do, though it seems like a lot of work. Let me try poking holes in it. I'll start by looking at uops.info.
-
OK, so masked store: 10/11 cycles latency on Skylake. Now, what about insert? I think you might have a problem in the benchmark design. You load the same data you stored. That means, since the masked store is high latency, you may be encountering a very long dependency chain. Can you do an array copy with your loads, so that the memory is independent? BTW, there is 100% some potential in doing masked stores with partial stores, especially on platforms where there are none. There was a discussion that led nowhere here: https://stackoverflow.com/questions/62183557/how-to-most-efficiently-store-a-part-of-m128i-m256i-while-ignoring-some-num and here is the memcpy code: https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#304
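To make the dependency-chain point concrete, here is a minimal sketch (the function names, the add operation, and the loop structure are invented for illustration, and it assumes a toolchain where the AVX-512 intrinsics in core::arch::x86_64 are available). The first loop keeps reloading what it just stored, so the masked store's latency sits on the critical path; the second copies between independent buffers, so only throughput matters.

```rust
use core::arch::x86_64::*;

// read-modify-write of the *same* location: every iteration's load waits on
// the previous iteration's masked store (the 10/11 cycle latency quoted
// above), so the dependency chain dominates the measurement.
#[target_feature(enable = "avx512f")]
unsafe fn rmw_same_location(ptr: *mut f64, mask: __mmask8, iters: usize) {
    for _ in 0..iters {
        let v = _mm512_maskz_loadu_pd(mask, ptr);
        let v = _mm512_add_pd(v, _mm512_set1_pd(1.0));
        _mm512_mask_storeu_pd(ptr, mask, v);
    }
}

// copy between independent buffers: the stored data is never reloaded, so
// iterations can overlap and the store latency is hidden.
#[target_feature(enable = "avx512f")]
unsafe fn copy_independent(src: &[f64], dst: &mut [f64], mask: __mmask8) {
    for (s, d) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        let v = _mm512_maskz_loadu_pd(mask, s.as_ptr());
        _mm512_mask_storeu_pd(d.as_mut_ptr(), mask, v);
    }
}
```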
-
instead of using instructions like
_mm512_mask_store_pd
for the prologue and epilogue of an array, i've been experimenting with codegen-ing a bunch of functions, one for each prologue and epilogue mask (and intersections thereof). i then dispatch to the desired one at runtime based on the start and end positions of the elements i want to select. with avx512 and f32, this "only" requires 256 functions, each of which is only a few instructions long, so it's not that bad for code size.
codegen here https://github.com/sarah-quinones/pulp/blob/91ce445b7273ec25149dce469f25a51f6068aff2/pulp/build.rs
benchmark here https://github.com/sarah-quinones/pulp/blob/91ce445b7273ec25149dce469f25a51f6068aff2/pulp/examples/mask_store.rs
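to illustrate the idea, here's a rough sketch (hypothetical, not the generated code: it uses avx2 with 4×f64 and only writes out a single specialization). each generated function does a full load of the destination, blends the new lanes in registers, and does a full store; the dispatcher then picks the right function from the (start, end) pair.

```rust
use core::arch::x86_64::*;

// one specialization: write lanes 1 and 2 of `src` into dst[1..3], keeping
// dst[0] and dst[3] intact, using only full-width loads/stores and a blend.
// note: this reads and rewrites dst[0] and dst[3], so the full 4-lane slot
// must be valid memory.
#[target_feature(enable = "avx")]
unsafe fn store_lanes_1_and_2(dst: *mut f64, src: __m256d) {
    let old = _mm256_loadu_pd(dst);
    // immediate 0b0110: take lanes 1 and 2 from `src`, lanes 0 and 3 from `old`
    let merged = _mm256_blend_pd::<0b0110>(old, src);
    _mm256_storeu_pd(dst, merged);
}

// the real codegen emits one such function per (start, end) combination and
// dispatches on that pair; only one arm is sketched here.
unsafe fn store_partial(dst: *mut f64, src: __m256d, start: usize, end: usize) {
    match (start, end) {
        (1, 3) => store_lanes_1_and_2(dst, src),
        _ => unreachable!("the other combinations are generated the same way"),
    }
}
```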
results are 80ns (mask (load + store) × 16)
vs 38ns (separate loads + stores + combining them)
for reference, the full unmasked load + store takes 22ns
i think this could be a good improvement to the current implementation. it's also more portable, since you can use the same strategy on avx2 and arm at the very least.
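for example, a hypothetical neon equivalent of one such specialization (not actual pulp code) blends with vbslq_f32, since neon has no masked loads/stores at all:

```rust
use core::arch::aarch64::*;

// write lanes 1 and 2 of `src` into dst[1..3]: full load, bitwise select in
// registers, full store. lanes 0 and 3 keep their old values.
#[target_feature(enable = "neon")]
unsafe fn store_lanes_1_and_2_neon(dst: *mut f32, src: float32x4_t) {
    let old = vld1q_f32(dst);
    // select bits from `src` where the mask is all-ones, from `old` elsewhere
    let sel = vld1q_u32([0u32, u32::MAX, u32::MAX, 0u32].as_ptr());
    vst1q_f32(dst, vbslq_f32(sel, src, old));
}
```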