x64: Add more support for more AVX instructions #5931
Conversation
I think this change makes sense but I do feel the performance concern in the StackOverflow question is overstated, at least for this context. The OP for that question does modify the upper bits of a YMM register with a load and thus incurs a significant performance penalty; that is not really possible in Cranelift at the moment. If anyone stumbles on this PR later and is interested, I would highly recommend reading the entire section that question references in Intel's optimization manual — "Mixing AVX Code with SSE Code" in section 15.3. My reading is that in the Cranelift context (for Skylake and later micro-architectures) we only ever cycle on the "Clean UpperState" node of figure 15-2 — Cranelift only emits SSE instructions or 128-bit AVX instructions, avoiding the expensive transition penalty (the ~70 cycle slowdown mentioned in the StackOverflow question).
Now, there could be a partial register dependency but even that is not entirely clear: that may only apply when the upper bits of a YMM register are dirty, which again I do not think is possible here. Again, let me just recommend reading the section directly or running experiments as @alexcrichton has done (with no performance effect, right?).
All in all, though, I think this change looks good to me! The nits are around renaming some helpers but those are not crucial.
(decl x64_movdqu_store (SyntheticAmode Xmm) SideEffectNoResult)
(rule (x64_movdqu_store addr data)
      (x64_xmm_movrm (SseOpcode.Movdqu) addr data))
Maybe a rename?
      (x64_xmm_movrm (SseOpcode.Movdqu) addr data))   ;; current
      (xmm_movrm_sse (SseOpcode.Movdqu) addr data))   ;; suggested rename
Back when I was adding things like this, I tried to keep a convention of `x64_<instruction>`, but it's been a while.
I opted to drop the `_sse` suffix since the other `xmm_*` helpers don't have that, although it might be good to go back and rename `Xmm*` to `Sse` and `Xmm*Vex` to `Avx*` perhaps.
Oh good points! Shame on me for not actually reading all the way through on these bits... So hypothetically, if the host uses ymm registers in its own code, that might cause stalls, but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much? Otherwise, though, locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.
…ecodealliance#5930)

* x64: Add lowerings for `punpck{h,l}wd`

  Add some special cases for `shuffle` for more specialized x86 instructions.

* x64: Add `shuffle` lowerings for `pshufd`

  This commit adds special-cased lowerings for the x64 `shuffle` instruction when the `pshufd` instruction alone is necessary. This is possible when the shuffle immediate permutes 32-bit values within one of the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

  This adds specific permutations for some x86 instructions which specifically interleave high/low bytes for 32 and 64-bit values. This corresponds to the preexisting specific lowerings for interleaving 8 and 16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

  This commit adds targeted lowerings for the `shuffle` instruction that match the pattern that `shufps` supports. The `shufps` instruction selects two elements from the first vector and two elements from the second vector, which means that while it's not generally applicable it should still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

  This commit adds special lowering cases for these instructions which permute 16-bit values within a 128-bit value, either within the upper or lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

  Instead of loading the all-zeros immediate from a rip-relative address at the end of the function, generate a zero with a `pxor` instruction and then use `pshufb` to do the broadcast.

* Review comments
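To make the `pshufd` specialization above concrete, a minimal ISLE-style sketch of such a rule might look like the following. The `shuffle32_from_imm` extractor is hypothetical (the PR's actual extractor names may differ); `x64_pshufd` stands for the usual per-instruction helper.

;; Hypothetical extractor: succeeds when the 16-byte `shuffle` immediate is a
;; permutation of the four 32-bit lanes of its input, yielding the 8-bit
;; immediate with which `pshufd` encodes that permutation.
(decl shuffle32_from_imm (u8) Immediate)
(extern extractor shuffle32_from_imm shuffle32_from_imm)

;; When both shuffle operands are the same vector and the byte shuffle is
;; really a 32-bit lane permute, a single `pshufd` covers the whole lowering.
(rule 1 (lower (shuffle a a (shuffle32_from_imm imm)))
        (x64_pshufd a imm))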
Yeah, that's a good point. I guess we should remember that, beyond just the normal overhead of switching between guest and host, this YMM transition penalty could add to the switch overhead. Maybe it's worthwhile to think about running `vzeroupper` in the trampolines between host and guest?
Yeah, I wanted to say this earlier but didn't want to sound cavalier: one might have to work rather hard to make the partial register dependency become a noticeable issue in a real benchmark. I'm not saying it can't be done or that we shouldn't try to avoid it, just... the StackOverflow answer ("you are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions") felt more alarmist than I thought was warranted.
Oh sorry, I didn't mean to raise any alarms or convey any sense of urgency. I should probably more succinctly put it as: "I was interested in filling out more AVX instructions, but had no technical motivation to document as the reason to do so, so I picked the first Google result and pasted it here." I'll need to read up more on VZEROUPPER as I'm not sure what it does and how it affects performance myself.
This will benefit from not needing alignment, unlike the `pshufd` instruction, when working with a memory operand. Additionally, as I've just learned, this reduces dependencies between instructions because the `v*` instructions zero the upper bits, as opposed to preserving them, which could accidentally create false dependencies in the CPU between instructions.
This commit adds VEX-encoded versions of instructions such as `mov{ss,sd,upd,ups,dqu}` for load and store operations. This also changes some signatures so the `load` helpers specifically take a `SyntheticAmode` argument, which led to a small refactoring of the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.
This commit refactors the internal ISLE helpers for creating zero'd xmm registers to leverage the AVX support for all other instructions. This moves away from picking opcodes to instead picking instructions with a bit of reorganization.
All existing users can be replaced with usage of the `xmm_uninit_value` helper instruction so there's no longer any need for these otherwise constant operations. This additionally reduces manual usage of opcodes in favor of instruction helpers.
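As a rough sketch of the zeroed-register refactor described in the two messages above (the wrapper below is hypothetical; `xmm_uninit_value` and `x64_pxor` are the kinds of instruction helpers referred to), the idea is to xor an uninitialized register with itself through a normal instruction helper so the SSE-vs-AVX choice happens in one place:

;; Hypothetical helper for illustration: xor'ing a register with itself is
;; zero regardless of its prior contents, and routing through `x64_pxor`
;; lets the instruction helper pick the AVX form when it's enabled.
(decl xmm_zeroed () Xmm)
(rule (xmm_zeroed)
      (let ((tmp Xmm (xmm_uninit_value)))
        (x64_pxor tmp tmp)))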
The optimization manual says to do that, but I think I remember at least one of those trampolines does a tail-call, so it doesn't have the opportunity to do this when the callee returns, which I suppose could lead to surprising results too. Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.
Honestly, I hadn't thought too much about this idea until today so I don't know, but if we did add…
One aspect of AVX that I have just recently become aware of is that there's apparently a performance penalty associated with mixing AVX and SSE instructions. One reason for this is that AVX instructions which operate over 128-bit values always zero the bits above 128 of each register they operate on. SSE instructions, however, don't do this. This means that false dependencies can be created between instructions, because SSE instructions look like they're intentionally preserving the higher bits where AVX instructions intentionally zero them. According to this StackOverflow question the processor also tracks whether an instruction has been executed, and there's a "scary red line" for mixing AVX/SSE.
On the local meshoptimizer benchmark this PR doesn't actually have any effect on the generated code's performance, or at least not one that I can measure. In that sense this is more of a hygiene thing than anything else.
Specifically, the changes here refactor many ISLE helpers that were generating instructions with `SseOpcode.XXX` manually to instead use the instruction helpers, which will use the AVX variant if enabled. Additionally, more AVX instructions were added for moving data to/from memory and such. I don't think this handles 100% of the SSE instructions Cranelift can generate when AVX is enabled, but it at least raises the bar further and removes a bunch of cases of SSE-generated instructions when AVX is enabled.
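For readers unfamiliar with the pattern being described, here is a rough sketch reusing the `x64_movdqu_store` helper quoted earlier; the `use_avx` predicate and the `_vex`/`AvxOpcode` names are illustrative and may not match the tree exactly. The helper defaults to the legacy SSE encoding, and a higher-priority rule switches to the VEX encoding when AVX is available:

;; Default rule: legacy SSE encoding.
(decl x64_movdqu_store (SyntheticAmode Xmm) SideEffectNoResult)
(rule (x64_movdqu_store addr data)
      (x64_xmm_movrm (SseOpcode.Movdqu) addr data))

;; Higher-priority rule, gated on an AVX-availability predicate (names here
;; are illustrative): emit the VEX-encoded `vmovdqu` instead.
(rule 1 (x64_movdqu_store addr data)
        (if-let $true (use_avx))
        (x64_xmm_movrm_vex (AvxOpcode.Vmovdqu) addr data))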