x64 backend: merge loads into ALU ops when appropriate. #2389
Conversation
If there are no atomic accesses in between, no effectful operations and no stores, then duplicating loads should be fine, I think.

I thought so too at first, but the interesting case occurs once we have threads/shared memory -- if a store to a particular address interleaves between two loads L1 and L2, which are two instances originating from the same CLIF-level load L, then L1 and L2 could produce different values, which could result in impossible executions. I think that some compilers may reason that such a case is undefined according to the memory consistency model, so anything can happen, but I'm a little uncomfortable allowing for this from a security / risk-mitigation perspective. Thoughts?

That's why there must not be an atomic operation, nor any instruction with side effects, in between. If neither exists in between, it is guaranteed that there is no synchronization between the current thread and another, and as such a store racing with the non-atomic loads to the same location would constitute a data race, which is UB.

There could be a new …

@cfallin +1 for not allowing loads to be duplicated. We might be able to construct some complex story about why this is OK, but it's extra verification/reasoning-overhead/fragility that we don't want to carry if we don't have to. Besides, loads are expensive and generally a hindrance to ILP.
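The hazard discussed above can be illustrated deterministically in a single thread by modeling the racing store as a callback that fires between the two machine-level loads. This is a hypothetical sketch, not Cranelift code; `load_once` and `load_twice` are invented names for illustration:

```rust
/// What the single CLIF-level load means: one observation of memory,
/// with the value used twice. A store landing afterward cannot split it.
fn load_once(mem: &mut u32, store: impl FnOnce(&mut u32)) -> (u32, u32) {
    let l = *mem;
    store(&mut *mem); // simulates a racing store from another thread
    (l, l)
}

/// A duplicated lowering: two machine loads of the same address, with the
/// store interleaving between them. The two loads can now disagree.
fn load_twice(mem: &mut u32, store: impl FnOnce(&mut u32)) -> (u32, u32) {
    let l1 = *mem;
    store(&mut *mem); // same racing store, now between the two loads
    let l2 = *mem;
    (l1, l2)
}

fn main() {
    let mut m = 1u32;
    println!("{:?}", load_once(&mut m, |m| *m = 2)); // both uses see one value

    let mut m = 1u32;
    println!("{:?}", load_twice(&mut m, |m| *m = 2)); // uses see (1, 2)
}
```

The `(1, 2)` outcome is the "impossible execution" from the source program's point of view: a single load produced two different values.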
Nice (and small); but a few more comments wouldn't go amiss.
This PR makes use of the support in bytecodealliance#2366 for sinking effectful instructions and merging them with consumers. In particular, on x86, we want to make use of the ability of many instructions to load one operand directly from memory. That is, instead of this:

```
movq 0(%rdi), %rax
addq %rax, %rbx
```

we want to generate this:

```
addq 0(%rdi), %rbx
```

As described in more detail in bytecodealliance#2366, sinking and merging the load is only possible under certain conditions. In particular, we need to ensure that the use is the *only* use (otherwise the load happens more than once), and we need to ensure that it does not move across other effectful ops (see bytecodealliance#2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is in place: we simply pattern-match a load on one operand of an ALU instruction that takes an RMI (reg, mem, or immediate) operand, and generate the mem form when we match.

Also makes a drive-by improvement in the x64 backend to use statically-monomorphized `LowerCtx` types rather than a `&mut dyn LowerCtx`.

On `bz2.wasm`, this results in ~1% instruction-count reduction. More is likely possible by following up with other instructions that can merge memory loads as well.
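The core of the pattern-match can be sketched as follows. This is a deliberately simplified, hypothetical model, not Cranelift's actual lowering code; `Load`, `Operand`, and `lower_alu_rhs` are invented names for illustration:

```rust
// Fold a load into an ALU op's RMI operand only when the ALU op is the
// load's sole use; otherwise keep a separate load into a register, so the
// memory access is never duplicated.

#[derive(Debug, PartialEq)]
enum Operand {
    Reg(u8),
    Mem { base: u8, offset: i32 }, // memory operand folded into the ALU op
}

struct Load {
    base: u8,
    offset: i32,
    use_count: u32, // how many instructions consume this load's result
}

/// Choose the right-hand operand for, e.g., `addq`: generate the mem form
/// when the load has exactly one use, else fall back to the register that
/// holds the already-loaded value.
fn lower_alu_rhs(load: &Load, loaded_reg: u8) -> Operand {
    if load.use_count == 1 {
        Operand::Mem { base: load.base, offset: load.offset }
    } else {
        Operand::Reg(loaded_reg)
    }
}
```

The single-use check is what enforces the "only use" condition from the description: folding a multi-use load would re-execute it once per consumer.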
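The `LowerCtx` drive-by change is a standard Rust trade-off between static and dynamic dispatch; a minimal sketch, with invented types (`X64Ctx`, `lower_dyn`, `lower_static` are not Cranelift's actual names):

```rust
// With `&mut dyn LowerCtx`, every method call is an indirect call through a
// vtable. With a generic bound, the compiler monomorphizes the lowering
// function per concrete context type and can inline the calls.

trait LowerCtx {
    fn emit(&mut self, inst: &str);
}

struct X64Ctx {
    emitted: Vec<String>,
}

impl LowerCtx for X64Ctx {
    fn emit(&mut self, inst: &str) {
        self.emitted.push(inst.to_string());
    }
}

// Dynamic dispatch: one compiled body, indirect calls.
fn lower_dyn(ctx: &mut dyn LowerCtx) {
    ctx.emit("addq 0(%rdi), %rbx");
}

// Static dispatch: monomorphized per `C`, calls eligible for inlining.
fn lower_static<C: LowerCtx>(ctx: &mut C) {
    ctx.emit("addq 0(%rdi), %rbx");
}

fn main() {
    let mut ctx = X64Ctx { emitted: Vec::new() };
    lower_dyn(&mut ctx);
    lower_static(&mut ctx);
    println!("{:?}", ctx.emitted);
}
```

Both forms behave identically here; the monomorphized version simply gives the optimizer more to work with on a hot path like lowering.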
This PR includes #2366 and also #2376 (I built on top of the latter because
otherwise there would be some merge conflicts due to their overlap); both
of those should land before this does.