x64 backend: merge loads into ALU ops when appropriate. #2389
Conversation
If there are no atomic accesses in between, no effectful operations and no stores, then duplicating loads should be fine, I think.

I thought so too at first, but the interesting case occurs once we have threads/shared memory -- if a store to a particular address interleaves between two loads L1 and L2, which are two instances originating from the same CLIF-level load L, then L1 and L2 could produce different values, which could result in impossible executions. I think that some compilers may reason that such a case is undefined according to the memory consistency model, so anything can happen, but I'm a little uncomfortable allowing for this from a security / risk-mitigation perspective. Thoughts?

That's why there must not be an atomic operation, nor any instruction with side effects, in between. If neither exists in between, it is guaranteed that there is no synchronization between the current thread and another, and as such a store racing with the non-atomic loads to the same location would constitute a data race, which is UB.

There could be a new …

@cfallin +1 for not allowing loads to be duplicated. We might be able to construct some complex story about why this is OK, but it's extra verification/reasoning-overhead/fragility that we don't want to carry if we don't have to. Besides, loads are expensive and generally a hindrance to ILP.
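The hazard discussed above can be illustrated deterministically in a single thread by modeling the racing store as a callback that fires between the two machine-level loads. This is a hypothetical sketch, not Cranelift code; `load_once` and `load_twice` are invented names for illustration:

```rust
/// What the single CLIF-level load means: one observation of memory,
/// with the value used twice. A store landing afterward cannot split it.
fn load_once(mem: &mut u32, store: impl FnOnce(&mut u32)) -> (u32, u32) {
    let l = *mem;
    store(&mut *mem); // simulates a racing store from another thread
    (l, l)
}

/// A duplicated lowering: two machine loads of the same address, with the
/// store interleaving between them. The two loads can now disagree.
fn load_twice(mem: &mut u32, store: impl FnOnce(&mut u32)) -> (u32, u32) {
    let l1 = *mem;
    store(&mut *mem); // same racing store, now between the two loads
    let l2 = *mem;
    (l1, l2)
}

fn main() {
    let mut m = 1u32;
    println!("{:?}", load_once(&mut m, |m| *m = 2)); // both uses see one value

    let mut m = 1u32;
    println!("{:?}", load_twice(&mut m, |m| *m = 2)); // uses see (1, 2)
}
```

The `(1, 2)` outcome is the "impossible execution" from the source program's point of view: a single load produced two different values.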
Nice (and small); but a few more comments wouldn't go amiss.
This PR makes use of the support in bytecodealliance#2366 for sinking effectful instructions and merging them with consumers. In particular, on x86, we want to make use of the ability of many instructions to load one operand directly from memory. That is, instead of this:

```
movq 0(%rdi), %rax
addq %rax, %rbx
```

we want to generate this:

```
addq 0(%rdi), %rbx
```

As described in more detail in bytecodealliance#2366, sinking and merging the load is only possible under certain conditions. In particular, we need to ensure that the use is the *only* use (otherwise the load happens more than once), and we need to ensure that it does not move across other effectful ops (see bytecodealliance#2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is in place: we simply pattern-match a load on one operand of an ALU instruction that takes an RMI (reg, mem, or immediate) operand, and generate the mem form when we match.

Also makes a drive-by improvement in the x64 backend to use statically-monomorphized `LowerCtx` types rather than a `&mut dyn LowerCtx`.

On `bz2.wasm`, this results in ~1% instruction-count reduction. More is likely possible by following up with other instructions that can merge memory loads as well.
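The core of the pattern-match can be sketched as follows. This is a deliberately simplified, hypothetical model, not Cranelift's actual lowering code; `Load`, `Operand`, and `lower_alu_rhs` are invented names for illustration:

```rust
// Fold a load into an ALU op's RMI operand only when the ALU op is the
// load's sole use; otherwise keep a separate load into a register, so the
// memory access is never duplicated.

#[derive(Debug, PartialEq)]
enum Operand {
    Reg(u8),
    Mem { base: u8, offset: i32 }, // memory operand folded into the ALU op
}

struct Load {
    base: u8,
    offset: i32,
    use_count: u32, // how many instructions consume this load's result
}

/// Choose the right-hand operand for, e.g., `addq`: generate the mem form
/// when the load has exactly one use, else fall back to the register that
/// holds the already-loaded value.
fn lower_alu_rhs(load: &Load, loaded_reg: u8) -> Operand {
    if load.use_count == 1 {
        Operand::Mem { base: load.base, offset: load.offset }
    } else {
        Operand::Reg(loaded_reg)
    }
}
```

The single-use check is what enforces the "only use" condition from the description: folding a multi-use load would re-execute it once per consumer.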
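The `LowerCtx` drive-by change is a standard Rust trade-off between static and dynamic dispatch; a minimal sketch, with invented types (`X64Ctx`, `lower_dyn`, `lower_static` are not Cranelift's actual names):

```rust
// With `&mut dyn LowerCtx`, every method call is an indirect call through a
// vtable. With a generic bound, the compiler monomorphizes the lowering
// function per concrete context type and can inline the calls.

trait LowerCtx {
    fn emit(&mut self, inst: &str);
}

struct X64Ctx {
    emitted: Vec<String>,
}

impl LowerCtx for X64Ctx {
    fn emit(&mut self, inst: &str) {
        self.emitted.push(inst.to_string());
    }
}

// Dynamic dispatch: one compiled body, indirect calls.
fn lower_dyn(ctx: &mut dyn LowerCtx) {
    ctx.emit("addq 0(%rdi), %rbx");
}

// Static dispatch: monomorphized per `C`, calls eligible for inlining.
fn lower_static<C: LowerCtx>(ctx: &mut C) {
    ctx.emit("addq 0(%rdi), %rbx");
}

fn main() {
    let mut ctx = X64Ctx { emitted: Vec::new() };
    lower_dyn(&mut ctx);
    lower_static(&mut ctx);
    println!("{:?}", ctx.emitted);
}
```

Both forms behave identically here; the monomorphized version simply gives the optimizer more to work with on a hot path like lowering.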
This PR includes #2366 and also #2376 (I built on top of the latter because
otherwise there would be some merge conflicts due to their overlap); both
of those should land before this does.