
Cranelift, egraphs, jump threading, and missed lowering rules #6154

Open
alexcrichton opened this issue Apr 5, 2023 · 5 comments
Labels
cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift:mid-end clif-to-clif related passes, legalizations, etc... cranelift:wasm

Comments

@alexcrichton (Member)

At a high level this is a pretty simple issue where this module:

(module
  (memory 1)
  (func (param i32) (result v128)
    local.get 0
    v128.load32_splat
  )
)

generates this code by default on x86_64:

$ cargo -q run compile foo.wat --cranelift-enable has_avx && objdump -S foo.cwasm

foo.cwasm:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_wasm_function_0>:
       0:       55                      push   %rbp
       1:       48 89 e5                mov    %rsp,%rbp
       4:       4c 8b 5f 50             mov    0x50(%rdi),%r11
       8:       8b f2                   mov    %edx,%esi
       a:       45 8b 5c 33 00          mov    0x0(%r11,%rsi,1),%r11d
       f:       c4 41 79 6e c3          vmovd  %r11d,%xmm8
      14:       c4 c1 79 70 c0 00       vpshufd $0x0,%xmm8,%xmm0
      1a:       48 89 ec                mov    %rbp,%rsp
      1d:       5d                      pop    %rbp
      1e:       c3                      retq
        ...

The cranelift-wasm translation of the v128.load32_splat instruction is a load followed by a splat, and that is the pattern lowered here: the load at 0xa, then the splat at 0xf and 0x14. With AVX, however, this should lower to a single vbroadcastss instruction. This is where the issue gets odd: if egraphs are disabled, everything works as expected:

$ cargo -q run compile foo.wat --cranelift-enable has_avx --cranelift-set use_egraphs=false && objdump -S foo.cwasm

foo.cwasm:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_wasm_function_0>:
       0:       55                      push   %rbp
       1:       48 89 e5                mov    %rsp,%rbp
       4:       44 8b ca                mov    %edx,%r9d
       7:       4c 8b 57 50             mov    0x50(%rdi),%r10
       b:       c4 82 79 18 44 0a 00    vbroadcastss 0x0(%r10,%r9,1),%xmm0
      12:       48 89 ec                mov    %rbp,%rsp
      15:       5d                      pop    %rbp
      16:       c3                      retq
        ...

So there's an issue here where egraphs transform the code into something that instruction selection can't pattern-match. According to logging, when egraphs are enabled, this is the CLIF input to lowering (after egraphs):

function u0:0(i64 vmctx, i64, i32) -> i8x16 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+80
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32):
@0020                               v5 = load.i64 notrap aligned readonly v0+80
@0020                               v4 = uextend.i64 v2
@0020                               v6 = iadd v5, v4
@0020                               v7 = load.i32 little heap v6
@0024                               jump block1

                                block1:
@0020                               v8 = splat.i32x4 v7
@0024                               v9 = bitcast.i8x16 little v8
                                    v3 -> v9
@0024                               return v9
}

I believe the issue is that the splat has moved across basic blocks, which means load sinking can't fire. The input function to egraphs, however, was:

function u0:0(i64 vmctx, i64, i32) -> i8x16 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+80
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32):
@0020                               v4 = uextend.i64 v2
@0020                               v5 = load.i64 notrap aligned readonly v0+80
@0020                               v6 = iadd v5, v4
@0020                               v7 = load.i32 little heap v6
@0020                               v8 = splat.i32x4 v7
@0024                               v9 = bitcast.i8x16 little v8
                                    v3 -> v9
@0024                               jump block1

                                block1:
@0024                               return v3
}

where it can clearly be seen that egraphs moved the splat and bitcast instructions across basic blocks.
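To make the failure mode concrete, here is a toy model in Rust (not Cranelift's actual data structures or API) of the conservative condition under which a load can be sunk into its consumer during lowering: the load must have a single use and be in the same block as that use. Once the e-graph pass moves the splat into block1, the check fails and the backend emits a separate load and splat instead of one vbroadcastss.

```rust
// Toy illustration of a conservative load-sinking check. Sinking a load
// into its user is only sound if nothing between the load's original
// position and the user can clobber memory; a simple way to guarantee
// that is to require colocation in one block and exactly one use.

#[derive(Clone, Copy, PartialEq, Debug)]
struct Block(u32);

struct Inst {
    block: Block,
    is_load: bool,
    uses: u32, // number of consumers of this instruction's result
}

/// Decide (conservatively) whether `load` may be merged into `user`
/// during instruction selection.
fn can_sink_load(load: &Inst, user: &Inst) -> bool {
    load.is_load && load.uses == 1 && load.block == user.block
}

fn main() {
    // Same block (the pre-egraph CLIF): load merges into the splat.
    let load = Inst { block: Block(0), is_load: true, uses: 1 };
    let splat = Inst { block: Block(0), is_load: false, uses: 1 };
    assert!(can_sink_load(&load, &splat));

    // After egraphs moved the splat into block1: sinking is rejected.
    let moved_splat = Inst { block: Block(1), is_load: false, uses: 1 };
    assert!(!can_sink_load(&load, &moved_splat));
}
```

This is only a sketch of the invariant, not the real sinkable-load machinery, but it captures why the block boundary alone defeats the vbroadcastss pattern.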


So that's a basic description of the problem! How best to fix it depends. This was discussed briefly at today's Cranelift meeting; some ways it could be tackled are:

  • Technically there's no need for cranelift-wasm to generate the block1 block in the first place. This is likely done for convenience of translation, but it may be possible to make translation "fancier" and not eagerly allocate a basic block for the return instruction.
  • Today Cranelift does not have any sort of jump-threading/branch-folding pass. Abstractly, block1 has a single predecessor that ends in a jump, so block1 doesn't need to exist: block0 and block1 could be fused. Such a pass could run before egraphs, for example. Note that this optimization does already happen to a degree during lowering, via MachBuffer optimizations.
  • Perhaps even more fancifully, load sinking could be souped up to work across blocks in this case. Given the complications, this is not likely to be a viable solution.
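The jump-threading idea in the second bullet can be sketched on a toy CFG. This is a minimal illustration, not Cranelift's IR (a real pass would also have to forward block parameters and update any analyses): whenever a block ends in an unconditional jump to a block with exactly one predecessor, splice the successor into the predecessor.

```rust
use std::collections::HashMap;

// Toy CFG: blocks hold opaque instruction strings and a terminator.
#[derive(Debug, PartialEq)]
enum Term {
    Jump(usize), // unconditional jump to the block with this id
    Return,
}

struct BlockData {
    insts: Vec<String>,
    term: Term,
}

// Count how many predecessors each block has via unconditional jumps.
fn pred_counts(blocks: &HashMap<usize, BlockData>) -> HashMap<usize, usize> {
    let mut preds = HashMap::new();
    for b in blocks.values() {
        if let Term::Jump(t) = b.term {
            *preds.entry(t).or_insert(0) += 1;
        }
    }
    preds
}

/// Repeatedly merge each block ending in `jump blockN` with `blockN`
/// when `blockN` has exactly one predecessor.
fn jump_thread(blocks: &mut HashMap<usize, BlockData>) {
    loop {
        let preds = pred_counts(blocks);
        let candidate = blocks.iter().find_map(|(&id, b)| match b.term {
            Term::Jump(t) if preds.get(&t) == Some(&1) && t != id => Some((id, t)),
            _ => None,
        });
        let Some((pred, succ)) = candidate else { break };
        let succ_data = blocks.remove(&succ).unwrap();
        let p = blocks.get_mut(&pred).unwrap();
        p.insts.extend(succ_data.insts);
        p.term = succ_data.term;
    }
}

fn main() {
    // Mirrors the CLIF above: block0 does the load, block1 the splat/return.
    let mut blocks = HashMap::new();
    blocks.insert(0, BlockData {
        insts: vec!["v7 = load".to_string(), "v8 = splat v7".to_string()],
        term: Term::Jump(1),
    });
    blocks.insert(1, BlockData {
        insts: vec!["v9 = bitcast v8".to_string()],
        term: Term::Return,
    });
    jump_thread(&mut blocks);
    assert_eq!(blocks.len(), 1);
    assert_eq!(blocks[&0].term, Term::Return);
}
```

With block1 folded into block0, the load and splat end up in one block again, which is exactly the shape the backend's load-sinking pattern needs.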

At a basic level I think this is an issue worth fixing, but at a higher level I think it showcases Cranelift's lack of jump threading and the need for it now that egraphs move instructions around. Hence the title of this issue and the predicted fix: a jump-threading pass of sorts in Cranelift.

@alexcrichton alexcrichton added cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift:wasm cranelift:mid-end clif-to-clif related passes, legalizations, etc... labels Apr 5, 2023
@tschneidereit (Member)

Is this still relevant with #6818 landed?

@alexcrichton (Member, Author)

Unfortunately, yes. The reproductions are now:

  • Uses vbroadcastss: wasmtime compile foo.wat --target x86_64 -Ccranelift-has-avx -O opt-level=0 && objdump -S foo.cwasm
  • Does not use vbroadcastss: wasmtime compile foo.wat --target x86_64 -Ccranelift-has-avx && objdump -S foo.cwasm

@cfallin (Member) commented Oct 23, 2023

@jameysharp and I talked about this briefly this morning, and with a fresh look, we think that an explicit jump-threading opt that runs in the mid-end (probably before e-graphs because it exposes additional opportunities) is the simplest approach. #6818's discussion was more about whether we could remove the backend's branch opts, which include jump-threading but at a lower level, and the conclusion was that we shouldn't; but this doesn't preclude us from also doing this at the IR level (both give us different things).

It's probably a relatively small pass (50-100 lines); I'll see if I can get to this at some point soon!

@dimitris-aspetakis commented Sep 6, 2024

I can't help but note here my feeling, from trying to implement mid-end instruction scheduling in #6260, that instruction scheduling should be more tightly integrated with the register-allocation and lowering stages. In particular, I believe the heuristics we use for register spilling might benefit from some sort of integration with the register allocator.

In that spirit, I tend to support the first suggestion from @alexcrichton, where (if I understood correctly), elaborating instructions back to the layout is delayed. I mention our work in #6260 also because it could probably provide a solution to the issues discussed here and in #8787, with special cases in the scheduling pass.

I understand that my comments are a bit too abstract — what I mostly seek is feedback (more like a vibe-check) before I dive deeper with possible implementation details.

As a side-note: would it make sense for pattern-matching fusing optimizations from the instruction lowering stage to work on an AST-like representation directly generated from e-graphs? It sounds more expensive, but it also seems like it could catch more of such optimizations.

@cfallin (Member) commented Sep 6, 2024

@dimitris-aspetakis I think you're talking about two different things here:

I think that if we could integrate lowering more directly with the sea-of-nodes aegraph representation, that would probably uncover some additional opportunity -- the most direct one being that lowering rules could act as a sort of natural cost function that replaces the ad-hoc one in aegraph extraction. That's probably quite a difficult thing to do (open design/research questions, maybe 6 months or so of work) but would be interesting to see the results of.

Integrating regalloc with instruction selection and/or scheduling (basically, altering the code we produce based on regalloc signals) is, I think, far more difficult. There is at least one PhD thesis (David Koes, CMU, 2009) I'm aware of on the topic. It's a combinatorial constraint problem, everything affects everything else, and it's basically asking for a redesign of the entire compiler backend. I'm not saying it isn't possible, but the scope is likely multiple years, and there's no guarantee we'd land at a new interesting point in the design space in terms of runtime and compile time. Please do feel free to play with this more, propose designs, and experiment (I'd recommend starting with some canonical examples that "don't work well" currently); just be aware of the difficulty.
