Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make u8x16 and u8x32 have Vector call ABI #589
Make u8x16 and u8x32 have Vector call ABI #589
Changes from 2 commits
3240a1a
3ab2236
8004ada
4da205c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps better to have
bytes(&self) -> &[u8; 32]
andbytes_mut(&mut self) -> &mut [u8; 32]
? It would be more Rust-idiomatic than providing a getter+setter, and in some cases might give a faster way to change separate bytes without replacing the whole thing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does that
[u8; 32]
live ? (same forbytes_mut
). Or are you suggesting transmuting&__m256i
into a&[u8; 32]
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The body of the function wouldn't change - you'd still use
transmute
, just between references and not values.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example with both implementations and you can see how code for setting a single byte differs between two: https://rust.godbolt.org/z/XF7m95
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the
__m256i
is in a SIMD register, it doesn't have a memory address, so creating a pointer to it requires spilling it to the stack to be able to give it an address. Once the modification on the stack is done, moving it back into a SIMD register requires copying the whole thing, not a single byte (at least if dynamic indices are involved, if the indices are compile-time constants, for some index values, sometimes, the compiler can do better).One doesn't want this back-and-forth to happen accidentally, every time one modifies a part of the vector, and returning a
&[u8; 32]
would encourage that.The proposed API forces its users to explicitly move the vector contents between the registers and memory. This revealed a couple of places where, e.g., instead of doing this back-and-forth on every iteration of a loop, one can just load the vector into memory once before the loop, operate on memory, and move the contents back into a SIMD register after the loop has completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I'd argue that having several
.bytes_mut()[...] = ...
patterns in a row would also be pretty obvious and most people would store the reference when they could, but I see how your proposed API forces this a bit better.However, if that's the goal, I wonder if it's worth to go one step further and take mutations as a callback like
.with_bytes(|bytes| ...)
where one can modify contents in any way they want inside, but then on the type system level you wouldn't be able to do anything else outside, and transformation back to a register would be guaranteed to happen as part of the call?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some code only wants to read the bytes, so forcing a write would be bad for that code - I've pinged you in one example.
I also don't see really an advantage in restricting access to a scope. If you wanted to mutate two vectors, you would need to nest scopes. For an API that's only intended for internal consumption, this API would feel even more like overengineering than the one I proposed - and I consider mine borderline overengineering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, fair enough. I'm not 100% comfortable with
replace_bytes
vsbytes_mut
for idiomaticity reasons, but I can see how this is the least of the evils :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, this should be properly benchmarked before being merged. Those examples do not really use the intrinsics, so the code is being generated in isolation without any objective in mind. That is not very representative of what this library actually does. If you actually try to use the APIs, you'll see that they generate the exact same code when all optimizations are turned on: https://rust.godbolt.org/z/Z8sdH9
But this is not about optimizing the implementation, this is about optimizing the amount of work that LLVM has to do to produce efficient code. If you look at the LLVM-IR produced in debug mode: https://rust.godbolt.org/z/sMfxHm the version using
replace_bytes
produces 134 lines of LLVM-IR, while the version usingbytes_mut
produces 758 lines of LLVM-IR (if you look at the assembly at opt-level=1, you also see a much better result). There are a couple of factors at play here, but producing 6x more LLVM-IR for this isn't really worth it. LLVM can optimize it without problems, at least in this case where everything is private.