x64: Add more support for more AVX instructions #5931

Merged
7 commits merged into bytecodealliance:main from the x64-more-avx branch on Mar 10, 2023

Conversation

alexcrichton (Member)

One aspect of AVX that I have just recently become aware of is that there's apparently a performance penalty associated with mixing AVX and SSE instructions. One reason for this is that AVX instructions which operate over 128-bit values always zero the bits above 128 of each register they operate on. SSE instructions, however, don't do this. This means that false dependencies can be created between instructions, because SSE instructions look like they're intentionally preserving the higher bits where AVX instructions intentionally zero them. According to this StackOverflow question the processor also tracks whether an instruction has been executed, and there's a "scary red line" for mixing AVX/SSE.
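For illustration, here is a minimal standalone Rust sketch (hand-written inline assembly, not instructions Cranelift actually emits) of the two encodings side by side: the VEX-encoded `vaddps` zeroes the upper bits of the destination's YMM register, while the legacy-SSE `addps` preserves them, which is what creates the false dependency.

use std::arch::asm;
use std::arch::x86_64::__m128;

// Sketch of mixing VEX and legacy-SSE encodings on the same register.
// Illustrative only; this is not code generated by Cranelift.
#[target_feature(enable = "avx")]
unsafe fn mixed_encodings(a: __m128, b: __m128, c: __m128) -> __m128 {
    let result: __m128;
    asm!(
        // VEX encoding: writes all of ymm0, zeroing bits 255:128, so the
        // result depends only on the named sources.
        "vaddps xmm0, xmm1, xmm2",
        // Legacy SSE encoding: writes only the low 128 bits and preserves
        // bits 255:128, so the CPU has to carry along the old upper half of
        // ymm0, which is the false dependency described above.
        "addps xmm0, xmm3",
        out("xmm0") result,
        in("xmm1") a,
        in("xmm2") b,
        in("xmm3") c,
        options(pure, nomem, nostack),
    );
    result
}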

On the local meshoptimizer benchmark this PR doesn't actually have any effect on the generated code's performance, or at least not one that I can measure. In that sense this is more of a hygiene thing than anything else.

Specifically, the changes here refactor many ISLE helpers that were manually generating instructions with SseOpcode.XXX to instead use the instruction helpers, which pick the AVX variant when it's enabled. Additionally, more AVX instructions were added for moving data to/from memory and such.
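As a rough illustration of the shape of that refactoring (the names below are made up for this sketch and are not Cranelift's actual ISLE helpers or Rust types), the idea is that call sites ask one helper for the operation and that helper picks the VEX form when AVX is enabled:

// Hypothetical sketch of centralizing the SSE/AVX choice in a single helper.
#[derive(Debug, PartialEq)]
enum VecStoreOpcode {
    Sse(&'static str), // e.g. "movdqu"
    Avx(&'static str), // e.g. "vmovdqu"
}

fn xmm_store_opcode(avx_enabled: bool) -> VecStoreOpcode {
    // Call sites no longer hard-code `SseOpcode.Movdqu`; the helper decides.
    if avx_enabled {
        VecStoreOpcode::Avx("vmovdqu")
    } else {
        VecStoreOpcode::Sse("movdqu")
    }
}

fn main() {
    assert_eq!(xmm_store_opcode(true), VecStoreOpcode::Avx("vmovdqu"));
    assert_eq!(xmm_store_opcode(false), VecStoreOpcode::Sse("movdqu"));
}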

I don't think this handles 100% of the SSE instructions Cranelift can generate when AVX is enabled, but it at least raises the bar further and removes a bunch of cases where SSE instructions were generated with AVX enabled.

github-actions bot added the cranelift (Issues related to the Cranelift code generator), cranelift:area:x64 (Issues related to x64 codegen), and isle (Related to the ISLE domain-specific language) labels Mar 4, 2023
github-actions bot commented Mar 4, 2023

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

alexcrichton requested a review from abrown March 8, 2023
abrown (Contributor) left a comment

I think this change makes sense but I do feel the performance concern in the StackOverflow question is overstated, at least for this context. The OP for that question does modify the upper bits of a YMM register with a load and thus incurs a significant performance penalty; that is not really possible in Cranelift at the moment. If anyone stumbles on this PR later and is interested, I would highly recommend reading the entire section that question references in Intel's optimization manual — "Mixing AVX Code with SSE Code" in section 15.3. My reading is that in the Cranelift context (for Skylake and later micro-architectures) we only ever cycle on the "Clean UpperState" node of figure 15-2 — Cranelift only emits SSE instructions or 128-bit AVX instructions, avoiding the expensive transition penalty (the ~70 cycle slowdown mentioned in the StackOverflow question).

[Figure 15-2 from Intel's optimization manual: the AVX-SSE transition state diagram referenced above.]

Now, there could be a partial register dependency but even that is not entirely clear: that may only apply when the upper bits of a YMM register are dirty, which again I do not think is possible here. Again, let me just recommend reading the section directly or running experiments as @alexcrichton has done (with no performance effect, right?).

All in all, though, I think this change looks good to me! The nits are around renaming some helpers but those are not crucial.


(decl x64_movdqu_store (SyntheticAmode Xmm) SideEffectNoResult)
(rule (x64_movdqu_store addr data)
(x64_xmm_movrm (SseOpcode.Movdqu) addr data))
abrown (Contributor):
Maybe a rename?

Suggested change:
- (x64_xmm_movrm (SseOpcode.Movdqu) addr data))
+ (xmm_movrm_sse (SseOpcode.Movdqu) addr data))

Back when I was adding things like this, I tried to keep a convention of x64_<instruction> but it's been a while.

alexcrichton (Member, Author):
I opted to drop the _sse suffix since the other xmm_* helpers don't have that, although it might be good to go back and rename Xmm* to Sse and Xmm*Vex to Avx* perhaps.

(A further review thread on cranelift/codegen/src/isa/x64/inst.isle was marked outdated and resolved.)
alexcrichton (Member, Author)

Oh good points! Shame on me for not actually reading all the way through on these bits...

So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?

Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.

…ecodealliance#5930)

* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle` instruction when the `pshufd` instruction alone is sufficient. This is possible when the shuffle immediate permutes 32-bit values within one of the vector inputs of the `shuffle` instruction, but not both (a sketch of this immediate check follows the commit list below).

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which interleave the high or low 32-bit and 64-bit elements of the two inputs. This corresponds to the preexisting specialized lowerings for interleaving 8-bit and 16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that match the pattern that `shufps` supports. The `shufps` instruction selects two elements from the first vector and two elements from the second vector, which means that while it's not generally applicable it should still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address at the end of the function, generate a zero with a `pxor` instruction and then use `pshufb` to do the broadcast.

* Review comments
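To make the `pshufd` case above concrete, here is a minimal Rust sketch (illustrative only, not the actual Cranelift lowering code) of the immediate check that commit describes: every 32-bit lane of the 16-byte `shuffle` mask must be four consecutive, 4-byte-aligned indices, and all lanes must come from the same input vector. When the mask qualifies, the function returns the 8-bit `pshufd` immediate.

// Returns the `pshufd` immediate encoding a 16-byte shuffle mask, if the mask
// permutes whole 32-bit lanes taken from a single input vector (byte indices
// 0..16 select the first input, 16..32 the second).
fn pshufd_imm_from_shuffle_mask(mask: [u8; 16]) -> Option<u8> {
    let from_second_input = mask[0] >= 16;
    let mut imm = 0u8;
    for lane in 0..4 {
        let base = mask[lane * 4];
        // Each lane must start on a 4-byte boundary and stay in range.
        if base >= 32 || base % 4 != 0 {
            return None;
        }
        // The lane must be four consecutive byte indices.
        for offset in 1..4u8 {
            if mask[lane * 4 + offset as usize] != base + offset {
                return None;
            }
        }
        // All lanes must select from the same input vector.
        if (base >= 16) != from_second_input {
            return None;
        }
        // Record which 32-bit element (0..=3) this lane selects.
        imm |= ((base % 16) / 4) << (lane * 2);
    }
    Some(imm)
}

fn main() {
    // Identity shuffle of the first input: pshufd immediate 0b11_10_01_00.
    let identity: [u8; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    assert_eq!(pshufd_imm_from_shuffle_mask(identity), Some(0b11_10_01_00));
    // A mask mixing both inputs is not expressible as a single pshufd.
    let mixed: [u8; 16] = [0, 1, 2, 3, 16, 17, 18, 19, 8, 9, 10, 11, 24, 25, 26, 27];
    assert_eq!(pshufd_imm_from_shuffle_mask(mixed), None);
}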
abrown (Contributor) commented Mar 9, 2023

So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?

Yeah, that's a good point. I guess we should remember that, beyond just the normal overhead of switching between guest and host, this YMM transition penalty could add to the switch overhead. Maybe it's worthwhile to think about running VZEROUPPER in the "host to guest" trampoline so that we feel more sure that guest code will be in the "Clean UpperState"? cc: @cfallin, @elliottt, @jameysharp; I guess this is a "better safe than sorry" kind of thought, but that goes along with the intent of this PR.

Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.

Yeah, I wanted to say it earlier but didn't want to sound cavalier: one might have to work rather hard to make the partial register dependency become a noticeable issue in a real benchmark. I'm not saying it can't be done or that we shouldn't try to avoid it, just... the StackOverflow answer ("you are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions") felt more alarmist than I thought was warranted.

alexcrichton (Member, Author)

Oh sorry, I didn't mean to raise any alarms or convey any sense of urgency. I should probably more succinctly put it as "I was interested in filling out more AVX instructions, but had no technical motivation to document as the reason for doing so, so I picked the first Google result and pasted it here".

I'll need to read up more on VZEROUPPER as I'm not sure what it does and how it affects performance myself.

This benefits from not needing alignment, unlike the `pshufd` instruction, when working with a memory operand. Additionally, as I've just learned, it reduces dependencies between instructions because the `v*` instructions zero the upper bits as opposed to preserving them, which could accidentally create false dependencies in the CPU between instructions.
This commit adds VEX-encoded versions of instructions such as
`mov{ss,sd,upd,ups,dqu}` for load and store operations. This also
changes some signatures so the `load` helpers specifically take a
`SyntheticAmode` argument which ended up doing a small refactoring of
the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.
This commit refactors the internal ISLE helpers for creating zeroed
xmm registers to leverage the AVX support for all other instructions.
This moves away from picking opcodes to instead picking instructions,
with a bit of reorganization.
All existing users can be replaced with usage of the `xmm_uninit_value`
helper instruction so there's no longer any need for these otherwise
constant operations. This additionally reduces manual usage of opcodes
in favor of instruction helpers.
jameysharp (Contributor)

The optimization manual says that vzeroupper "has zero latency" so I guess the only cost is instruction decode. Given that, adding one no-operand instruction to Wasmtime's trampolines sounds reasonable to me. (I guess it should be added for transitions in both directions between host and guest, based on the optimization manual's recommendations.)

I think I remember at least one of those trampolines does a tail-call, so it doesn't have the opportunity to do this when the callee returns, which I suppose could lead to surprising results too.

Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.

abrown (Contributor) commented Mar 9, 2023

Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.

Honestly, I hadn't thought too much about this idea until today so I don't know, but if we did add VZEROUPPER in the "host to guest" direction, for example, I think we would want to do so before we fill in any registers with passed v128 values.
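For reference, a minimal Rust sketch (hypothetical, not Wasmtime's actual trampoline code) of the idea being discussed, using the `_mm256_zeroupper` intrinsic from `std::arch::x86_64`, which compiles to `vzeroupper`:

use std::arch::x86_64::_mm256_zeroupper;

// Hypothetical sketch of clearing the upper YMM state around a host-to-guest
// transition; `enter_guest` stands in for whatever actually jumps into
// compiled Wasm code.
#[target_feature(enable = "avx")]
unsafe fn call_guest_with_clean_upper_state(enter_guest: unsafe fn()) {
    unsafe {
        // VZEROUPPER zeroes bits 255:128 of every YMM register, putting the
        // core in the "Clean UpperState" before any guest SSE/AVX-128 code
        // runs.
        _mm256_zeroupper();
        enter_guest();
        // The same on the way back covers the guest-to-host direction
        // mentioned in the optimization manual's recommendation.
        _mm256_zeroupper();
    }
}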

github-actions bot added the cranelift:area:machinst (Issues related to instruction selection and the new MachInst backend) label Mar 9, 2023
alexcrichton added this pull request to the merge queue Mar 9, 2023
Merged via the queue into bytecodealliance:main with commit 83f21e7 Mar 10, 2023
alexcrichton deleted the x64-more-avx branch March 10, 2023