-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
riscv64: Implement insertlane
#6408
Conversation
(rule (gen_vec_mask mask) | ||
(rv_vmv_sx (imm $I64 mask) (vstate_from_type $I64X2))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know much about RISC-V vector instructions, but is this the most optimal this can be? For example are there instructions which move immediates into registers if the immediate is small?
I ask because it seems sort of out of place abstractly to have these sorts of operations when dealing with the immediates of insertlane/extractlane, but hey if it's the way it is then that's how it's gonna be.
One thing I thought though is that this might be good to delegate to a general vconst
helper. Right now vconst
always performs a load but it might be reasonable to fold a rule like this where if the immediate fits in an i64 it could do the x->v register movement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know much about RISC-V vector instructions, but is this the most optimal this can be? For example are there instructions which move immediates into registers if the immediate is small?
I'm not sure!
There are a few combos here that I can think of that could potentially be better than this in terms of materializing the mask. One of them like you mentioned is using something like the immediate splat instruction vmv.v.i
, but that only covers a weird subset of masks. Where the mask result can fit in a signed imm5 per lane. So with that one we can do a mask like 0b001111
but not 0b011111
, though we do get the benefit of it being just a single instruction.
I didn't want to try to pull off those sort of weird edge cases yet, but it might be easier than I think if we can look at the whole thing through the lens of vconst
, I'm going to give that a try.
I ask because it seems sort of out of place abstractly to have these sorts of operations when dealing with the immediates of insertlane/extractlane, but hey if it's the way it is then that's how it's gonna be.
Do you mean having a separate move instruction into the X / F registers that has no immediate? We used to have instructions that matched exactly those semantics, but they were removed with the following reasoning:
NOTE: The complementary
vins.v.x
instruction, which allow a write to
any element in a vector register, has been removed. This instruction
would be the only instruction (apart fromvsetvl
) that requires two
integer source operands, and also would be slow to execute in an
implementation with vector register renaming, relegating its main use
to debugger modifications to state. The alternative and more
generally usefulvslide1up
andvslide1down
instructions can be
used to update vector register state in place over a debug link
without accessing memory.
vslide1up
and vslide1down
aren't exactly applicable to insertlane
and extractlane
. And they have a special note in the standard stating that they are slow instructions, pretty much just reserved for debuggers.
One thing I thought though is that this might be good to delegate to a general vconst helper. Right now vconst always performs a load but it might be reasonable to fold a rule like this where if the immediate fits in an i64 it could do the x->v register movement.
Yeah, that seems like a good idea! I'm also going to try to merge the splat const rules into that.
On a broader note I checked what the other engines are doing here w.r.t insertlane
:
V8 seems to be doing exactly what we do.
LLVM does a vslideup
with a tail undisturbed policy, which is also an option and it does mean they can avoid materializing a mask since they can use vslideup.vi
with the lane as part of the immediate. I have no idea if this is better or worse than the option above since it also emits 2 or 3 vsetvli
's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, and thanks for explaining all that! To be clear I don't think any changes are necessary for this PR, I was mostly just musing. With prior art in V8 this is definitely fine to land (even without that I think it's ok). I agree it's probably not worth it yet to have lots of special cases for various immediates here and there, better to start simple and build out. I was mostly curious if there was a pattern that was going to be implemented later that would simplify this or whether this would be the way forward for awhile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer handling the merging all of the vconst stuff in a future PR. But yeah, making mask generation as part of that makes a ton of sense!
Some of these carry their source in vs2
Remove the space between , and the register. This is inline with the rest of our formatting.
89a6b01
to
a5952ba
Compare
👋 Hey,
This PR implements
insertlane
. The implementation is essentially the same as thesplat
instruction, but using the masked version ofvmv
where only one lane is enabled. Maskedvmv
's are calledvmerge
, and they are only special in the fact that they don't follow the mask policy for masked out elements. Instead, masked out elements are sourced from one of the operands, instead of being ignored.Almost all instructions in the V spec support a mask. The mask is determined by a single bit in the instruction encoding, and when enabled always sources register
v0
as a mask. Masks are represented as bitfields where each bit corresponds to one lane.The WASM
replacelane
tests exposed a preexisting bug in the encoding ofvmv.s.x
andvfmv.s.f
, where their source operands are encoded in thevs1
field, instead ofvs2
. A fix for that is also included in this PR.