egraph support: rewrite to work in terms of CLIF data structures. #5382
Conversation
I'll see about splitting this into logical pieces in separate commits tomorrow; I wanted to get the rebased / cleaned-up state up first. I'll run some more benchmarks as well.
Force-pushed from 1d69034 to 743f8f1.
How are alias values printed in the textual format of clif ir? And can it be read back?
Force-pushed from 6096ff0 to 88e6d00.
They currently aren't; because union values aren't present in the layout, the usual CLIF writer won't see them (it ignores all insts and values not "in the function" as per the layout). I agree it would be great to come up with a text format for egraph-stage CLIF; maybe as a followup to the initial PR?
Force-pushed from 88e6d00 to 777bd97.
This work rewrites the "egraph"-based optimization framework in Cranelift to operate on aegraphs (acyclic egraphs) represented in the CLIF itself rather than as a separate data structure to which and from which we translate the CLIF.

The basic idea is to add a new kind of value, a "union", that is like an alias but refers to two other values rather than one. This allows us to represent an eclass of enodes (values) as a tree. The union node allows a value to have *multiple representations*: either constituent value could be used, and (in well-formed CLIF produced by correct optimization rules) they must be equivalent.

Like the old egraph infrastructure, we take advantage of acyclicity and eager rule application to do optimization in a single pass. Like before, we integrate GVN (during the optimization pass) and LICM (during elaboration).

Unlike the old egraph infrastructure, everything stays in the DataFlowGraph. "Pure" enodes are represented as instructions that have values attached, but that are not placed into the function layout. When entering "egraph" form, we remove them from the layout while optimizing. When leaving "egraph" form, during elaboration, we can place an instruction back into the layout the first time we elaborate the enode; if we elaborate it more than once, we clone the instruction.

The implementation performs two passes overall:

- One, a forward pass in RPO (to see defs before uses), that (i) removes "pure" instructions from the layout and (ii) optimizes as it goes. As before, we eagerly optimize, so we form the entire union of optimized forms of a value before we see any uses of that value. This lets us rewrite uses to use the most "up-to-date" form of the value and canonicalize and optimize that form. The eager rewriting and acyclic representation make each other work (we could not eagerly rewrite if there were cycles; and acyclicity does not miss optimization opportunities only because the first time we introduce a value, we immediately produce its "best" form). This design choice is also what allows us to avoid the "parent pointers" and fixpoint loop of traditional egraphs. This forward optimization pass keeps a scoped hashmap to "intern" nodes (thus performing GVN), and also interleaves on a per-instruction level with alias analysis. The interleaving with alias analysis allows alias analysis to see the most optimized form of each address (so it can see equivalences), and allows the next value to see any equivalences (reuses of loads or stored values) that alias analysis uncovers.

- Two, a forward pass in domtree preorder, that "elaborates" pure enodes back into the layout, possibly in multiple places if needed. This pass tracks the loop nest and hoists nodes as needed, performing LICM as it goes. Note that by doing this in forward order, we avoid the "fixpoint" that traditional LICM needs: we hoist a def before its uses, so when we place a node, we place it in the right place the first time rather than moving it later.

This PR replaces the old (a)egraph implementation. It removes both the cranelift-egraph crate and the logic in cranelift-codegen that uses it.

On `spidermonkey.wasm` running a simple recursive Fibonacci microbenchmark, this work shows 5.5% compile-time reduction and 7.7% runtime improvement (speedup).

Most of this implementation was done in (very productive) pair programming sessions with Jamey Sharp, thus:

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
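To make the "union" value idea described above concrete, here is a minimal, self-contained Rust sketch. The names (`ValueDef`, `Dfg`, `eclass_members`) are hypothetical stand-ins rather than Cranelift's actual `DataFlowGraph` types; the sketch only illustrates how a tree of union values can encode an eclass of equivalent values:

```rust
// Hypothetical, simplified stand-ins for CLIF's entity handles; the real
// DataFlowGraph stores value definitions in entity-indexed tables.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Value(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Inst(u32);

// A value is either the result of an instruction, or a "union": a reference
// to two other values that rewrite rules claim are equivalent. Because a
// union only refers to previously created values, the structure is acyclic.
enum ValueDef {
    Result(Inst),
    Union(Value, Value),
}

struct Dfg {
    defs: Vec<ValueDef>, // indexed by Value
}

impl Dfg {
    // Collect every instruction-defined representation reachable from `v`
    // by walking the union tree; together these form the eclass of `v`.
    fn eclass_members(&self, v: Value) -> Vec<Value> {
        let mut out = Vec::new();
        let mut stack = vec![v];
        while let Some(v) = stack.pop() {
            match &self.defs[v.0 as usize] {
                ValueDef::Result(_) => out.push(v),
                ValueDef::Union(x, y) => {
                    stack.push(*x);
                    stack.push(*y);
                }
            }
        }
        out
    }
}
```

In this picture, elaboration would pick one member of the eclass (by some cost heuristic), place its defining instruction back into the layout the first time it is needed, and clone the instruction if it must be materialized in more than one place.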
Force-pushed from 777bd97 to 149b8c8.
(posting partial review because the diff keeps changing as I go and making things outdated)
cranelift/codegen/src/egraph.rs
// A pure node always has exactly one result.
let orig_value = self.func.dfg.inst_results(inst)[0];
- What about something like `iadd_carry`, which is a pure instruction with two results; how does that fit in here? We used to return tuples and then project fields out of them, but now that we are using the DFG directly, how is this handled?
- This could be simplified. Suggested change: replace

// A pure node always has exactly one result.
let orig_value = self.func.dfg.inst_results(inst)[0];

with

let orig_value = self.func.dfg.first_result(inst);
Ah, great point with `first_result`, I hadn't remembered that method existed. Updated!

Regarding multi-output instructions in general: yes, this was a bit of a compromise for now. Jamey and I went back and forth a bit on an `Inst`-centric vs. `Value`-centric design in the elaborator, and ultimately basing everything on `Value`s was far, far simpler; so to keep everything straight, we need to deal only with single-result instructions.

We judged this to be acceptable-ish because the most common multi-result instructions are calls. There are indeed also adds-with-carries but, for now, not optimizing these probably will not have a material impact on speedups.

Ultimately I'm curious whether we might be able to adapt a tuple-types-and-projection approach to CLIF as well, though that's a much larger change.
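As an illustration of the trade-off described above (purely a sketch with hypothetical types, not the PR's actual elaborator): keying elaboration on `Value` works out cleanly precisely because each pure instruction has a single result, so "has this value already been materialized?" is a simple one-to-one map. With multi-result instructions, the same bookkeeping would also have to record which result of a cloned instruction corresponds to which original value, which is the tuple-and-projection problem mentioned above.

```rust
use std::collections::HashMap;

// Hypothetical handles, standing in for CLIF entity references.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Value(u32);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Inst(u32);

// Sketch of elaborator state: a Value -> Value memo of what has already
// been materialized. This stays a flat map only because every pure
// instruction has exactly one result.
struct Elaborator {
    materialized: HashMap<Value, Value>,
}

impl Elaborator {
    fn elaborate(&mut self, v: Value, def: Inst) -> Value {
        *self.materialized.entry(v).or_insert_with(|| {
            // The real pass would insert (or clone) `def` into the layout
            // and return its single result; here we just return `v`.
            let _ = def;
            v
        })
    }
}
```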
Sure.
LGTM with all my feedback addressed. Really excited for this!
Updated and addressed everything, I think! (Left the one thread about multi-result instructions open; that's something we'll want to address somehow eventually, but I think it isn't a quick/easy thing to address in this PR.) Thanks for the review!
Force-pushed from f1e5926 to 40928ab.
Interesting benchmark discovery:
This seems to indicate to me that on worse code (explicit bounds checks), this approach is a little more efficient, while otherwise it's more-or-less equivalent. Given the two other benefits of flexibility (writing rewrite rules in a simple way) and verifiability, on balance I suspect this will still make sense to turn on by default once fuzz-clean. (I'm doing a followup PR to add it to fuzzing configs.) But I'm very open to discussion on that; one alternative is to keep adding rewrite rules until we see a very clear code-quality (runtime) win.
This PR reverts bytecodealliance#5128 (commit b3333bf), adding back the ability for the fuzzing config generator to set the `use_egraphs` Cranelift option. This will start to fuzz the egraphs-based optimization framework again, now that bytecodealliance#5382 has landed.
I just did some benchmarking of egraphs and the perf improvement is huge on the benchmark I tried:
(release uses cranelift's speed_and_size opt_level as well as mir inlining and other rustc mir optimizations)
This PR follows up on #5382 and #5391, which rebuilt the egraph-based optimization framework to be more performant, by enabling it by default. Based on performance results in #5382 (my measurements on SpiderMonkey and bjorn3's independent confirmation with cg_clif), it seems that this is reasonable to enable. Now that we have been fuzzing compiler configurations with egraph opts (#5388) for 6 weeks, having fixed a few fuzzbugs that came up (#5409, #5420, #5438) and subsequently received no further reports from OSS-Fuzz, I believe it is stable enough to rely on. This PR enables `use_egraphs`, and also normalizes its meaning: previously it forced optimization (it basically meant "turn on the egraph optimization machinery"), now it runs egraph opts if the opt level indicates (it means "use egraphs to optimize if we are going to optimize"). The conditionals in the top-level pass driver are a little subtle, but will get simpler once we can remove the non-egraph path (which we plan to do eventually!). Fixes #5181.
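For context, here is a sketch of how an embedder would have configured these flags through the Cranelift settings API at the time; the helper name is made up, and the flag and value names are simply those referenced in this thread:

```rust
use cranelift_codegen::settings::{self, Configurable};

// Hypothetical helper: build Flags with optimization and egraph opts enabled.
fn optimized_flags() -> settings::Flags {
    let mut builder = settings::builder();
    // After this change, `use_egraphs` no longer forces optimization on its
    // own: egraph optimization runs only when the opt level asks for it.
    builder.set("opt_level", "speed").unwrap();
    builder.set("use_egraphs", "true").unwrap();
    settings::Flags::new(builder)
}
```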
In #5382 ("egraph support: rewrite to work in terms of CLIF data structures"), we added the `trace-log` feature to the set of default features for `cranelift-codegen`. I think this was an accident, probably added while debugging and overlooked when cleaning up to merge. So let's undo that change. Fixes #6548.