
x64 backend: merge loads into ALU ops when appropriate. #2389

Merged 1 commit on Nov 17, 2020

Conversation

cfallin (Member) commented Nov 11, 2020

This PR makes use of the support in #2366 for sinking effectful
instructions and merging them with consumers. In particular, on x86, we
want to make use of the ability of many instructions to load one operand
directly from memory. That is, instead of this:

    movq 0(%rdi), %rax
    addq %rax, %rbx

we want to generate this:

    addq 0(%rdi), %rbx

As described in more detail in #2366, sinking and merging the load is
only possible under certain conditions. In particular, we need to ensure
that the use is the *only* use (otherwise the load happens more than
once), and we need to ensure that it does not move across other
effectful ops (see #2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is
in place: we simply pattern-match a load on one operand of an ALU
instruction that takes an RMI (reg, mem, or immediate) operand, and
generate the mem form when we match.
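
Schematically, the decision looks like this (a toy sketch with stand-in types; the actual helper names in `lower.rs` differ):

```
// Toy model of the lowering rule -- types and helpers here are stand-ins,
// not Cranelift's actual lowering API.

enum Operand {
    Reg(u8),                                      // value already in a register
    Load { addr: u8, uses: u32, sinkable: bool }, // value produced by a CLIF load
    Imm(u32),                                     // known constant
}

/// x64 ALU instructions accept a reg/mem/imm (RMI) source operand.
enum RegMemImm {
    Reg(u8),
    Mem(u8), // operand read straight from memory: `addq 0(%rdi), %rbx`
    Imm(u32),
}

fn lower_rmi_operand(op: Operand) -> RegMemImm {
    match op {
        Operand::Imm(i) => RegMemImm::Imm(i),
        // Merge the load only when this is its *sole* use (so the load runs
        // exactly once) and the #2366 machinery says it may sink past any
        // intervening effectful instructions.
        Operand::Load { addr, uses: 1, sinkable: true } => RegMemImm::Mem(addr),
        // Otherwise, materialize the load into a register first.
        Operand::Load { addr, .. } => RegMemImm::Reg(load_into_reg(addr)),
        Operand::Reg(r) => RegMemImm::Reg(r),
    }
}

fn load_into_reg(_addr: u8) -> u8 {
    0 // toy stand-in for emitting a separate `movq (addr), reg`
}
```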

Also makes a drive-by improvement in the x64 backend to use
statically-monomorphized `LowerCtx` types rather than a `&mut dyn LowerCtx`.
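
In the abstract, that change swaps dynamic for static dispatch (a minimal sketch; the trait here is a stand-in, not the real `LowerCtx` signature):

```
trait LowerCtx {
    fn emit(&mut self, inst: &str);
}

// Before: every context call goes through a vtable.
fn lower_dyn(ctx: &mut dyn LowerCtx) {
    ctx.emit("addq ...");
}

// After: monomorphized per concrete context type, so calls are statically
// dispatched and can be inlined.
fn lower_generic<C: LowerCtx>(ctx: &mut C) {
    ctx.emit("addq ...");
}
```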

On bz2.wasm, this results in ~1% instruction-count reduction. More is
likely possible by following up with other instructions that can merge
memory loads as well.

This PR includes #2366 and also #2376 (I built on top of the latter because
otherwise there would be some merge conflicts due to their overlap); both
of those should land before this does.

cfallin added the cranelift:area:x64 label on Nov 11, 2020; the github-actions bot added the cranelift, cranelift:area:machinst, cranelift:area:aarch64, cranelift:meta, and cranelift:wasm labels on Nov 11, 2020.
bjorn3 (Contributor) commented Nov 11, 2020

> In particular, we need to ensure that the use is the only use (otherwise the load happens more than once)

If there are no atomic accesses in between, no effectful operations and no stores, then duplicating loads should be fine I think.

cfallin (Member, Author) commented Nov 11, 2020

> > In particular, we need to ensure that the use is the only use (otherwise the load happens more than once)
>
> If there are no atomic accesses in between, no effectful operations and no stores, then duplicating loads should be fine I think.

I thought so too at first, but the interesting case occurs once we have threads/shared memory -- if a store to a particular address interleaves between two loads L1 and L2, which are two instances originating from the same CLIF-level load L, then L1 and L2 could produce different values, which could result in impossible executions.

I think that some compilers may reason that such a case is undefined according to the memory consistency model, so anything can happen, but I'm a little uncomfortable allowing for this from a security / risk-mitigation perspective. Thoughts?
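
To make the hazard concrete, here is a standalone Rust illustration (not Cranelift code; a relaxed atomic stands in for the plain Wasm load so the example itself is race-free):

```
// If one source-level load is duplicated into two machine loads, a racing
// store between them can make the two copies observe different values.
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::thread;

static X: AtomicU64 = AtomicU64::new(0);

fn main() {
    let writer = thread::spawn(|| X.store(1, Relaxed));
    // Two loads standing in for the duplicated copies of one CLIF load:
    let l1 = X.load(Relaxed);
    let l2 = X.load(Relaxed);
    writer.join().unwrap();
    // With a single (un-duplicated) load, l1 + l2 is always even; after
    // duplication an odd sum -- an "impossible execution" -- can occur.
    println!("l1 + l2 = {}", l1 + l2);
}
```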

bjorn3 (Contributor) commented Nov 11, 2020

> if a store to a particular address interleaves between two loads L1 and L2, which are two instances originating from the same CLIF-level load L, then L1 and L2 could produce different values, which could result in impossible executions.

That's why there must not be an atomic operation, nor any instruction with side effects, in between. If neither exists, it is guaranteed that there is no synchronization between the current thread and any other, and as such non-atomic memory accesses to the same location from multiple threads (at least one of them a write) would constitute a data race, which is UB.

bjorn3 (Contributor) commented Nov 11, 2020

> I think that some compilers may reason that such a case is undefined according to the memory consistency model, so anything can happen, but I'm a little uncomfortable allowing for this from a security / risk-mitigation perspective. Thoughts?

There could be a new no_race memory flag to enable this optimization.
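
For illustration, such a flag might gate the duplication decision like this (purely a sketch; `no_race` is not an existing Cranelift `MemFlags` bit, and the type below is a stand-in):

```
/// Stand-in for Cranelift's `MemFlags`, extended with the proposed bit.
#[derive(Clone, Copy, Default)]
struct MemFlagsSketch {
    readonly: bool,
    no_race: bool, // proposed: producer asserts no concurrent writes here
}

impl MemFlagsSketch {
    /// Duplicating a load is only safe when no racing store can make the
    /// two copies observe different values.
    fn allows_load_duplication(self) -> bool {
        self.readonly || self.no_race
    }
}
```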

julian-seward1 (Contributor) commented

@cfallin +1 for not allowing loads to be duplicated. We might be able to construct some complex story about why this is OK, but it's extra verification/reasoning-overhead/fragility that we don't want to carry if we don't have to. Besides, loads are expensive and generally a hindrance to ILP.

julian-seward1 (Contributor) left a review comment


Nice (and small); but a few more comments wouldn't go amiss.

Review threads:
- cranelift/codegen/src/isa/x64/lower.rs (2 threads, resolved)
- cranelift/filetests/filetests/isa/x64/load-op.clif (2 threads, outdated, resolved)