-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cranelift, egraphs, jump threading, and missed lowering rules #6154
Comments
Is this still relevant with #6818 landed? |
Unfortunately, yes. The reproductions are now:
|
@jameysharp and I talked about this briefly this morning, and with a fresh look, we think that an explicit jump-threading opt that runs in the mid-end (probably before e-graphs because it exposes additional opportunities) is the simplest approach. #6818's discussion was more about whether we could remove the backend's branch opts, which include jump-threading but at a lower level, and the conclusion was that we shouldn't; but this doesn't preclude us from also doing this at the IR level (both give us different things). It's probably a relatively small pass (50-100 lines); I'll see if I can get to this at some point soon! |
I can't help but note here my feeling from trying to implement mid-end instruction scheduling in #6260, that instruction scheduling should be more integrated with the register allocation and lowering stages. I believe the heuristics we use for register spilling might benefit from some sort of integration with the register allocator specifically. In that spirit, I tend to support the first suggestion from @alexcrichton, where (if I understood correctly), elaborating instructions back to the layout is delayed. I mention our work in #6260 also because it could probably provide a solution to the issues discussed here and in #8787, with special cases in the scheduling pass. I understand that my comments are a bit too abstract — what I mostly seek is feedback (more like a vibe-check) before I dive deeper with possible implementation details. As a side-note: would it make sense for pattern-matching fusing optimizations from the instruction lowering stage to work on an AST-like representation directly generated from e-graphs? It sounds more expensive, but it also seems like it could catch more of such optimizations. |
@dimitris-aspetakis I think you're talking about two different things here: I think that if we could integrate lowering more directly with the sea-of-nodes aegraph representation, that would probably uncover some additional opportunity -- the most direct one being that lowering rules could act as a sort of natural cost function that replaces the ad-hoc one in aegraph extraction. That's probably quite a difficult thing to do (open design/research questions, maybe 6 months or so of work) but would be interesting to see the results of. Integrating regalloc with instruction selection and/or scheduling (basically, altering the code we produce based on regalloc signals) is I think far more difficult however. There is at least one PhD thesis (David Koes, CMU, 2009) I'm aware of on the topic. It's a combinatorial constraint problem, everything affects everything else, and it's basically asking for a redesign of the entire compiler backend. I'm not saying it isn't possible but the scope is likely multiple years and there's no guarantee we'll come up with something that lands at a new interesting point in the design space in terms of runtime and compile time. Please do feel free to play with this more and propose designs and experiment with it (I'd recommend starting with some canonical examples that "don't work well" currently), just be aware of the difficulty. |
At a high level this is a pretty simple issue where this module:
generates this code by default on x86_64:
Naively this instruction lowering pattern matches to the cranelift-wasm lowering of the
v128.load32_splat
instruction, is to generate a load (the instruction at0xa
) followed by a splat (the instructions at0xf
and0x14
). With AVX, however, this instruction should lower to a singlevbroadcastss
instruction. This is where this issue gets odd. If egraphs are disabled, then everything works ok:So there's an issue here where egraphs are transforming the code into something that can't be pattern-matched by instruction selection. According to logging, when egraphs are enabled, this is the input CLIF into lowering (after egraphs):
I believe that the issue here is that the
splat
has moved across basic blocks. This means that the load sinking can't fire. The input function to egraphs, however, was:where it can clearly be seen that egraphs are moving the
splat
andbitcast
instructions across basic blocks.So that's a basic description of the problem! How best to fix this, though, depends. This was talked briefly about at today's Cranelift meeting but some ways that this could be tackled are:
cranelift-wasm
to generate theblock1
block in the first place. This is likely done for convenience of translation, but it may be possible to make translation "fancier" and not eagerly allocate a basic block for thereturn
instruction.block1
has one predecessor which has ajump
instruction, so there's no need forblock1
to exist and theblock0
andblock1
blocks could be fused. This optimization pass could happen before egraphs, for example. Note that this optimization does already happen to a degree during lowering due toMachBuffer
optimizations.At a basic level this I think is an issue worth fixing, but at a higher level I think that this issue showcases the lack of jump-threading in Cranelift and the need for it as egraphs are moving around instructions. Hence the title of this issue and the predicted way to fix it, which would be a jump-threading pass of sorts in Cranelift.
The text was updated successfully, but these errors were encountered: