Cranelift sinking loop-invariant code into a loop #7283

Closed · alexcrichton opened this issue Oct 18, 2023 · 1 comment · Fixed by #7306

@alexcrichton (Member) commented:
Given this input CLIF:

function %foo(i64, i64) {
block0(v0: i64, v1: i64):
    ;; Create a loop-invariant value `v10` which is some operation which
    ;; includes a constant somewhere.
    v8 = load.f64 v0+100
    v9 = f64const 0x1.0000000000000p1
    v10 = fdiv v8, v9

    ;; jump to the loop
    v3 = iconst.i64 0
    jump block2(v3)  ; v3 = 0

block2(v11: i64):
    ;; store the loop-invariant `v10` to memory "somewhere"
    v15 = iadd v0, v11
    store.f64 v10, v15

    ;; loop breakout condition
    v17 = iadd_imm v11, 1
    v19 = icmp_imm ne v17, 100
    brif v19, block2(v17), block1

block1:
    return
}

this currently optimizes to:

$ cargo run compile -p ./foo.clif --target aarch64 --set opt_level=speed
function %foo(i64, i64) fast {
block0(v0: i64, v1: i64):
    v8 = load.f64 v0+100
    v3 = iconst.i64 0
    jump block2(v3)  ; v3 = 0

block2(v11: i64):
    v9 = f64const 0x1.0000000000000p1
    v10 = fdiv.f64 v8, v9  ; v9 = 0x1.0000000000000p1
    v15 = iadd.i64 v0, v11
    store v10, v15
    v20 = iconst.i64 1
    v17 = iadd v11, v20  ; v20 = 1
    v21 = iconst.i64 100
    v19 = icmp ne v17, v21  ; v21 = 100
    brif v19, block2(v17), block1

block1:
    return
}

This notably sinks the fdiv operation, which is loop invariant and originally outside of the loop, into the loop.

After debugging with @elliottt, the reason this seems to happen is that elaborating the `fdiv` instruction first elaborates its inputs. One of those inputs is an `f64const`, which is rematerializable, so the constant is materialized inside the loop. Then, when the `fdiv` itself is elaborated, elaboration sees that one of its inputs lives inside the loop and concludes that the `fdiv` must be placed inside the loop as well. Put another way, rematerializing the constant argument forces the `fdiv` into the loop along with it. This hypothesis was tested by commenting out these lines, which avoided putting the `fdiv` into the loop. That isn't a complete fix, however, because rematerialized integer constants would still cause integer operations to be erroneously sunk into loops.
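To make that interaction concrete, here is a minimal sketch of the "deepest argument block wins" placement rule described above. This is illustrative only: the names, the `Block` type, and the `loop_depth` stub are hypothetical, not Cranelift's actual elaborator code.

// Hypothetical sketch of the placement rule described above; names and
// types are illustrative, not Cranelift's actual API.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Block(u32);

/// Loop depth of each block; in Cranelift this would come from a loop
/// analysis. Stubbed here so that Block(1) counts as "inside the loop".
fn loop_depth(block: Block) -> u32 {
    block.0
}

/// Place a pure instruction given where its (already-elaborated)
/// arguments live: the argument in the deepest block pins it there.
fn place_pure_inst(arg_blocks: &[Block]) -> Block {
    arg_blocks
        .iter()
        .copied()
        .max_by_key(|b| loop_depth(*b))
        .expect("a pure instruction has at least one argument")
}

fn main() {
    let preheader = Block(0); // outside the loop: holds the load.f64 (v8)
    let loop_body = Block(1); // the remat'd f64const (v9) lands here

    // Because the remat'd constant "lives" in the loop body, the fdiv
    // that uses it is placed in the loop body too: the sinking this
    // issue describes.
    assert_eq!(place_pure_inst(&[preheader, loop_body]), loop_body);
}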

I remember first noticing this behavior in an older issue about false dependencies where a division operation was sunk into a loop, although the performance regression in that issue turned out to be due to something else. @elliottt's and my investigation of #7246, however, turned up that this sinking is the cause of the slowdown there: the `vdivsd` instruction is present in the loop even though the `f64.div` in the original wasm was not. Commenting out the `remat.isle` rules as above restores the expected runtime performance of that test case.

cc @elliottt and @cfallin as y'all are likely interested in this.

@cfallin (Member) commented Oct 18, 2023:

This is fascinating -- thank you both for debugging this!

I agree that this "reverse LICM" should be, well, reversed. (Though I'm tempted to argue that "loop invariant code motion" doesn't specify which way the code should move 😅.) Basically, the remat should not affect code placement decisions at all.

It seems to me that there should be a relatively small change possible to the remat logic. The part that perplexes me a little bit right now, without going through a tracelog, is why we're remat'ing into block2 at all, if there are no other uses of v9. In any case I should be able to look at this more later today and hopefully fix it -- thank you for the very helpful minimal example!

cfallin added a commit to cfallin/wasmtime that referenced this issue Oct 19, 2023
This reworks the way that remat and LICM interact during aegraph
elaboration. In principle, both happen during the same single-pass "code
placement" algorithm: we decide where to place pure instructions (those
that are eligible for movement), and remat pushes them one way while
LICM pushes them the other.

The interaction is a little more subtle than simple heuristic priority,
though -- it's really a decision ordering issue. A remat'd value wants to sink
as deep into the loop nest as it can (to the use's block), but we don't
know *where* the uses go until we process them (and make LICM-related
choices), and we process uses after defs during elaboration. Or more
precisely, we have some work at the use before recursively processing
the def, and some work after the recursion returns; and the LICM
decision happens after recursion returns, because LICM wants to know
where the defs are to know how high we can hoist. (The recursion is
itself unrolled into a state machine on an explicit stack so that's a
little hard to see but that's what is happening in principle.)

The solution here is to make remat a separate just-in-time thing, once
we have arg values. Just before we plug the final arg values into the
elaborated instruction, we ask: is this a remat'd value, and if so, do
we have a copy of the computation in this block yet. If not, we make
one. This has to happen in two places (the main elab loop and the
toplevel driver from the skeleton).

The one downside of this solution is that it doesn't handle *recursive*
rematerialization by default. This means that if we, for example, decide
to remat single-constant-arg adds (as we actually do in our current
rules), we won't then also recursively remat the constant arg to those
adds. This can be seen in the `licm.clif` test case. This doesn't seem
to be a dealbreaker to me because most such cases will be able to fold
the constants anyway (they happen mostly because of pointer
pre-computations: a loop over structs in Wasm computes heap_base + p +
offset, and naive LICM pulls a `heap_base + offset` out of the loop for
every struct field accessed in the loop, with horrible register pressure
resulting; that's why we have that remat rule. Most such offsets are
pretty small.).

Fixes bytecodealliance#7283.
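To sketch the just-in-time approach the commit message describes, here is a minimal illustration in Rust. The `Elaborator` shape, its field names, and `maybe_remat_arg` are hypothetical, assumed purely for illustration; the real change lives in Cranelift's egraph elaboration code.

use std::collections::{HashMap, HashSet};

// Hypothetical sketch of the just-in-time remat check: remat decisions
// happen only once the final arg values are known, so they no longer
// influence where the consuming instruction is placed.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Value(u32);
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Block(u32);

struct Elaborator {
    /// Values whose defining instructions are cheap enough to duplicate.
    remat_values: HashSet<Value>,
    /// Per-block cache of remat'd copies, so each block gets at most one.
    remat_copies: HashMap<(Block, Value), Value>,
    next_value: u32,
}

impl Elaborator {
    /// Called just before plugging a final arg value into an elaborated
    /// instruction: if the value is remat'd, return (creating if needed)
    /// a copy in `use_block`; otherwise leave placement to LICM alone.
    fn maybe_remat_arg(&mut self, arg: Value, use_block: Block) -> Value {
        if !self.remat_values.contains(&arg) {
            return arg;
        }
        if let Some(&copy) = self.remat_copies.get(&(use_block, arg)) {
            return copy; // reuse the copy already made in this block
        }
        // Clone the defining instruction into `use_block` (elided here)
        // and remember the resulting value.
        let copy = Value(self.next_value);
        self.next_value += 1;
        self.remat_copies.insert((use_block, arg), copy);
        copy
    }
}

fn main() {
    let mut elab = Elaborator {
        remat_values: HashSet::from([Value(9)]), // e.g. the f64const v9
        remat_copies: HashMap::new(),
        next_value: 100,
    };
    let loop_body = Block(2);
    let a = elab.maybe_remat_arg(Value(9), loop_body);
    let b = elab.maybe_remat_arg(Value(9), loop_body);
    assert_eq!(a, b); // at most one remat'd copy per block
}

As the commit message notes, this check runs in two places (the main elaboration loop and the toplevel driver over the side-effecting skeleton), and it deliberately does not recurse, which is why the constant args of remat'd adds are not themselves remat'd.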
cfallin added further commits to cfallin/wasmtime that referenced this issue on Oct 19 and Oct 20, 2023, each carrying the same commit message as above.

github-merge-queue bot pushed a commit that referenced this issue Oct 20, 2023, with the same commit message (ending "Fixes #7283.").