combining full loads/stores instead of masked ones #2018
Replies: 3 comments 1 reply
-
Hi! Thanks for sharing. Do you mind sharing the asm? I have trouble reading this, and I think the asm would help. At the moment I am not sure what you need the loads for.
-
Judging from the asm, what you seem to be doing is: have an optimal function for each of the possible combinations of offsets. Not an unreasonable thing to do, though it seems like a lot of work. Let me try poking holes in it. I'll start by looking at uops.info.
-
OK, so masked store: 10/11 cycles latency on Skylake. Now, what about insert? I think you might have a problem in the benchmark design. You load the same data you stored. That means, since the masked store is high latency, you may be encountering a very long dependency chain. Can you do an array copy with your loads, so that the memory is independent? BTW, there is 100% some potential in doing masked stores with partial stores, especially on platforms where there are none. There was a discussion that led nowhere here: https://stackoverflow.com/questions/62183557/how-to-most-efficiently-store-a-part-of-m128i-m256i-while-ignoring-some-num and here is the memcpy code: https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#304
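To make the dependency-chain point concrete, here is a minimal sketch (the function names, the add operation, and the loop structure are invented for illustration, and it assumes a toolchain where the AVX-512 intrinsics in core::arch::x86_64 are available). The first loop keeps reloading what it just stored, so the masked store's latency sits on the critical path; the second copies between independent buffers, so only throughput matters.

```rust
use core::arch::x86_64::*;

// read-modify-write of the *same* location: every iteration's load waits on
// the previous iteration's masked store (the 10/11 cycle latency quoted
// above), so the dependency chain dominates the measurement.
#[target_feature(enable = "avx512f")]
unsafe fn rmw_same_location(ptr: *mut f64, mask: __mmask8, iters: usize) {
    for _ in 0..iters {
        let v = _mm512_maskz_loadu_pd(mask, ptr);
        let v = _mm512_add_pd(v, _mm512_set1_pd(1.0));
        _mm512_mask_storeu_pd(ptr, mask, v);
    }
}

// copy between independent buffers: the stored data is never reloaded, so
// iterations can overlap and the store latency is hidden.
#[target_feature(enable = "avx512f")]
unsafe fn copy_independent(src: &[f64], dst: &mut [f64], mask: __mmask8) {
    for (s, d) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        let v = _mm512_maskz_loadu_pd(mask, s.as_ptr());
        _mm512_mask_storeu_pd(d.as_mut_ptr(), mask, v);
    }
}
```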
-
instead of using instructions like
_mm512_mask_store_pd
for the prologue and epilogue of an array, i've been experimenting with codegen-ing a bunch of functions, one for each prologue and epilogue mask (and intersections thereof). i then dispatch to the desired one at runtime based on the start and end positions of the elements i want to select. with avx512 and f32, this "only" requires 256 functions, each of which is only a few instructions long, so it's not that bad for code size.
codegen here https://github.com/sarah-quinones/pulp/blob/91ce445b7273ec25149dce469f25a51f6068aff2/pulp/build.rs
benchmark here https://github.com/sarah-quinones/pulp/blob/91ce445b7273ec25149dce469f25a51f6068aff2/pulp/examples/mask_store.rs
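to illustrate the idea, here's a rough sketch (hypothetical, not the generated code: it uses avx2 with 4×f64 and only writes out a single specialization). each generated function does a full load of the destination, blends the new lanes in registers, and does a full store; the dispatcher then picks the right function from the (start, end) pair.

```rust
use core::arch::x86_64::*;

// one specialization: write lanes 1 and 2 of `src` into dst[1..3], keeping
// dst[0] and dst[3] intact, using only full-width loads/stores and a blend.
// note: this reads and rewrites dst[0] and dst[3], so the full 4-lane slot
// must be valid memory.
#[target_feature(enable = "avx")]
unsafe fn store_lanes_1_and_2(dst: *mut f64, src: __m256d) {
    let old = _mm256_loadu_pd(dst);
    // immediate 0b0110: take lanes 1 and 2 from `src`, lanes 0 and 3 from `old`
    let merged = _mm256_blend_pd::<0b0110>(old, src);
    _mm256_storeu_pd(dst, merged);
}

// the real codegen emits one such function per (start, end) combination and
// dispatches on that pair; only one arm is sketched here.
unsafe fn store_partial(dst: *mut f64, src: __m256d, start: usize, end: usize) {
    match (start, end) {
        (1, 3) => store_lanes_1_and_2(dst, src),
        _ => unreachable!("the other combinations are generated the same way"),
    }
}
```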
results are 80ns (mask (load + store) × 16)
vs 38ns (separate loads + stores + combining them)
for reference, the full unmasked load + store takes 22ns
i think this could be a good improvement to the current implementation. it's also more portable, since you can use the same strategy on avx2 and arm at the very least.
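for example, a hypothetical neon equivalent of one such specialization (not actual pulp code) blends with vbslq_f32, since neon has no masked loads/stores at all:

```rust
use core::arch::aarch64::*;

// write lanes 1 and 2 of `src` into dst[1..3]: full load, bitwise select in
// registers, full store. lanes 0 and 3 keep their old values.
#[target_feature(enable = "neon")]
unsafe fn store_lanes_1_and_2_neon(dst: *mut f32, src: float32x4_t) {
    let old = vld1q_f32(dst);
    // select bits from `src` where the mask is all-ones, from `old` elsewhere
    let sel = vld1q_u32([0u32, u32::MAX, u32::MAX, 0u32].as_ptr());
    vst1q_f32(dst, vbslq_f32(sel, src, old));
}
```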