Add a basic alias analysis with redundant-load elim and store-to-load forwarding opts. #4163
Conversation
This PR adds a basic *alias analysis*, and optimizations that use it. This is a "mid-end optimization": it operates on CLIF, the machine-independent IR, before lowering occurs.

The alias analysis (or maybe more properly, a sort of memory-value analysis) determines when it can prove a particular memory location is equal to a given SSA value, and when it can, it replaces any loads of that location.

This subsumes two common optimizations (a small CLIF sketch of both follows below):

* Redundant load elimination: when the same memory address is loaded two times, and it can be proven that no intervening operations will write to that memory, then the second load is *redundant* and its result must be the same as the first. We can use the first load's result and remove the second load.

* Store-to-load forwarding: when a load can be proven to access exactly the memory written by a preceding store, we can replace the load's result with the store's data operand, and remove the load.

Both of these optimizations rely on a "last store" analysis that is a sort of coloring mechanism, split across disjoint categories of abstract state. The basic idea is that every memory-accessing operation is put into one of N disjoint categories; it is disallowed for memory to ever be accessed by an op in one category and later accessed by an op in another category. (The frontend must ensure this.)

Then, given this, we scan the code and determine, for each memory-accessing op, when a single prior instruction is a store to the same category. This "colors" the instruction: it is, in a sense, a static name for that version of memory.

This analysis provides an important invariant: if two operations access memory with the same last-store, then *no other store can alias* in the time between that last store and these operations. This must-not-alias property, together with a check that the accessed address is *exactly the same* (same SSA value and offset) and that other attributes of the access (type, extension mode) are the same, lets us prove that the results are the same.

Given last-store info, we scan the instructions and build a table from "memory location" key (last store, address, offset, type, extension) to the known SSA value stored in that location. A store inserts a new mapping. A load may also insert a new mapping, if we didn't already have one. Then when a load occurs and an entry already exists for its "location", we can reuse the value. This will be either RLE or store-to-load forwarding, depending on where the value came from.

Note that this *does* work across basic blocks: the last-store analysis is a full iterative dataflow pass, and we are careful to check dominance of a previously-defined value before aliasing to it at a potentially redundant load. So we will do the right thing if we only have a "partially redundant" load (loaded already, but only in one predecessor block), but we will also correctly reuse a value if there is a store or load above a loop and a redundant load of that value within the loop, as long as no potentially-aliasing stores happen within the loop.

Passes tests and runs SpiderMonkey correctly locally; benchmarks TBD.
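To make the two rewrites concrete, here is a small hypothetical CLIF sketch (function and value names invented for illustration; not taken from this PR's tests):

```
function %rle(i64) -> i32, i32 {
block0(v0: i64):
    v1 = load.i32 v0+8      ;; first load of [v0+8]
    v2 = load.i32 v0+8      ;; redundant: no intervening store,
                            ;; so v2 can be replaced by v1
    return v1, v2
}

function %fwd(i64, i32) -> i32 {
block0(v0: i64, v1: i32):
    store v1, v0+8          ;; writes [v0+8]
    v2 = load.i32 v0+8      ;; store-to-load forwarding: v2 reads
                            ;; exactly what the store wrote, so v2
                            ;; can be replaced by v1
    return v2
}
```

In each case the pass replaces uses of v2 with v1, and the second memory access goes away.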
Great! I will take a look at this tomorrow.
I ran the Sightglass benchmarks on this and the only delta is for
(with no impact on compilation time for any benchmark). I'm not too surprised we don't see more opportunity in Wasm benchmarks, because a lot of the RLE and store-to-load opts will have already been done by the Wasm toolchain. Regardless, it seems nice to have this in if it doesn't hurt, and it could become more applicable if/when we inline.
//!
//! We partition memory state into several *disjoint pieces* of
//! "abstract state". There are a finite number of such pieces:
//! currently, we call them "heap", "table", "vmctx", and "other". Any
It would be nice if each stack slot were its own category until its address is leaked using stack_addr.
Sure, that would be nice, but entails an escape analysis, which is outside the scope of this PR. Happy to discuss further (and review PRs!) once this is in, of course.
Even stopping optimization entirely if stack_addr is used on a stack slot would likely help cg_clif a lot. Having each stack slot whose address is never leaked be optimized independently is the most important for cg_clif. I'm fine with leaving that for a later PR.
I think you will need to add support for volatile loads and stores for this not to miscompile Rust code. It would also be nice if you could state an exhaustive list of the UB it depends on. For example, that reading/writing memory racing with the compiled CLIF function is UB unless it is an atomic or volatile load/store.
By the way, what is the time complexity of this optimization pass?
@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are? Otherwise removing a load and replacing it with a known value is always legal, given CLIF semantics today. (If it helps, imagine the disjoint alias categories don't exist: by default all loads/stores fall into "other", so memory locations are named by the last store of any kind.) Atomics should use atomic ops at the CLIF level of course; I believe those should act as full barriers across all categories, as calls do. (I don't think I've actually added this logic yet; will ensure it's there tomorrow.)
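To illustrate the barrier behavior described above, a hypothetical CLIF sketch (function names invented; assuming calls are treated as full barriers across all categories, as stated):

```
function %barrier(i64) -> i32, i32 {
    fn0 = %opaque()
block0(v0: i64):
    v1 = load.i32 v0
    call fn0()              ;; full barrier: the callee may store
                            ;; to any category of memory
    v2 = load.i32 v0        ;; NOT redundant: must actually re-load
    return v1, v2
}
```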
Same as any of the other iterative dataflow analyses: the worst-case number of passes over any given basic block is bounded by the maximum descending chain depth in the lattice. Here the last-store vector meets to "current inst" on any conflict at a merge point, so the lattice is effectively three levels deep, a constant; so that's O(|blocks|) to converge last-store info (likely one visit per block, rarely two). Then one more visit per block to discover and use aliases.
Volatile memory operations can't be changed, as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized, as Rust defines data races to be UB.
This is a slight regression for the simple-raytracer benchmark of cg_clif. I would guess it increases live-range sizes, causing a regalloc pessimization. Implementing #4163 (comment) would make it a beneficial optimization, I think. I used to have a simple implementation of that in cg_clif which also looked at which bytes of a stack slot were accessed by the load/store, to effectively allow SROA, but I removed it in bjorn3/rustc_codegen_cranelift@a793be8 as it was buggy and I didn't know how to fix it. It also only operated within a single basic block. See https://github.com/bjorn3/rustc_codegen_cranelift/issues/846.
OK, so if I understand correctly, cg_clif is currently compiling volatiles at the Rust level to normal loads/stores at the CLIF level? That, I agree, would result in miscompiles; the issue, though, isn't alias analysis, as the semantics of
@bjorn3 I may be misreading your benchmark result, but:
looks like
Overall looks great. I'd like to dig more into the tests before giving my sign-off, and I think having test alias-analysis and comments will really help with that.
Thanks!
Right, my bad.
Also, fix a bug in which basic blocks were skipped.
Updated, thanks!
Hang on, I think I know why it didn't help much for cg_clif. cg_clif uses stack_load and stack_store rather than the stack_addr+load/store this opt pass needs.
@bjorn3 the alias analysis / redundant-load pass runs after legalization, so it should see plain loads/stores rather than stack_load/stack_store.
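For reference, a rough sketch of that legalization (hedged: the exact legalized sequence may differ, but the shape is a stack_addr feeding an ordinary load, which the pass can then analyze):

```
;; before legalization:
v1 = stack_load.i32 ss0+8

;; after legalization (roughly), which is what the pass sees:
v2 = stack_addr.i64 ss0+8
v1 = load.i32 v2
```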
Thanks! A whole lot easier to review with test alias-analysis :)
Looks great. A few nitpicks/optional suggestions below. r=me with everything addressed/considered.
FWIW, I've noticed a ≈15% regression, and bisection points to this commit. The results for 0824abb:
The previous good commit, 89ccc56:
This particular benchmark is a standalone wasm file, i.e. no imports. The entry point loads a test wasm binary and interprets it with a compiled-in wasmi. The test binary is also standalone. The test wasm generates random numbers using xorshift and takes a keccak hash of them. I can publish the wasm or the source to get the wasm. UPD: uname: Linux hetzner 5.10.78 #1-NixOS SMP Sat Nov 6 13:10:10 UTC 2021 x86_64 GNU/Linux
Well, it gets weird. If I apply the following change on top of 0824abb:

```diff
diff --cc crates/wasmtime/src/config.rs
index 9c1ed87ca,9c1ed87ca..6e9bcd1f7
--- a/crates/wasmtime/src/config.rs
+++ b/crates/wasmtime/src/config.rs
@@@ -1287,6 -1287,6 +1287,7 @@@ fn compiler_builder(strategy: Strategy
          },
      )
      .unwrap();
++    builder.set("enable_pinned_reg", "true").unwrap();
      Ok(builder)
  }
diff --cc crates/wasmtime/src/engine.rs
index c81ae6e67,c81ae6e67..cc150af3a
--- a/crates/wasmtime/src/engine.rs
+++ b/crates/wasmtime/src/engine.rs
@@@ -312,7 -312,7 +312,7 @@@ impl Engine
          "baldrdash_prologue_words" => *value == FlagValue::Num(0),
          "enable_llvm_abi_extensions" => *value == FlagValue::Bool(false),
          "emit_all_ones_funcaddrs" => *value == FlagValue::Bool(false),
--        "enable_pinned_reg" => *value == FlagValue::Bool(false),
++        "enable_pinned_reg" => *value == FlagValue::Bool(true),
          "enable_probestack" => *value == FlagValue::Bool(false),
          "use_colocated_libcalls" => *value == FlagValue::Bool(false),
          "use_pinned_reg_as_heap_base" => *value == FlagValue::Bool(false),
```

then the result would be like this:
i.e. the regression disappears. I'm confused because I'm not sure how that could possibly be related.
@pepyakin I'm happy to take a look at this next week (I'm out of office now); can you publish the .wasm somewhere in the meantime?
Sure!
@pepyakin I'm just getting to this now (sorry for the delay!) and I'm actually seeing a speedup from alias analysis, consistently, of about 6% (6.2s to 5.8s):
These were built with current
and I double-checked I wasn't swapping the two. Looking at a diff of the disassemblies, it's sort of what I expect: a few extra loads are removed and carried in registers instead, and this I would indeed expect to improve performance, as it does. Can you confirm you're still seeing a slowdown, not a speedup, and if so, at which commit, and anything else about your environment and measurements?
…elimination. (#4349) This allows for experiments as in [1], and also generally gives an option to anyone who is concerned that the extra optimization may be counterproductive or take too much time. The optimization remains enabled by default. [1] #4163 (comment)
With the following script:

```shell
./wasmtime-0824abbae compile -g -O wasmibox.wasm -o wasmibox-0824abbae.cwasm
./wasmtime-89ccc56e4 compile -g -O wasmibox.wasm -o wasmibox-89ccc56e4.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=true -o wasmibox-a2197ebbe-aa.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=false -o wasmibox-a2197ebbe-no-aa.cwasm
hyperfine --show-output --warmup 2 "./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run"
```
I am getting the following results:
```shell
++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
Time (mean ± σ): 6.887 s ± 0.046 s [User: 6.872 s, System: 0.013 s]
Range (min … max): 6.848 s … 6.990 s 10 runs
++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
Time (mean ± σ): 5.941 s ± 0.024 s [User: 5.927 s, System: 0.013 s]
Range (min … max): 5.914 s … 5.978 s 10 runs
++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
Time (mean ± σ): 4.666 s ± 0.037 s [User: 4.651 s, System: 0.013 s]
Range (min … max): 4.627 s … 4.743 s 10 runs
++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
Time (mean ± σ): 4.524 s ± 0.133 s [User: 4.510 s, System: 0.013 s]
  Range (min … max):    4.472 s …  4.898 s    10 runs
```

It confirms that on my machine the pre-AA commit performs considerably better than the commit where AA was introduced. The current main performs better than both of them, regardless of whether AA is enabled. No-AA still performs better than with-AA on main, though. However, if I remove the
Now, it starts to make more sense, although I still don't understand how that could possibly be related.
OK, I agree that it doesn't make any sense that
This PR adds a basic alias analysis, and optimizations that use it.
This is a "mid-end optimization": it operates on CLIF, the
machine-independent IR, before lowering occurs.
The alias analysis (or maybe more properly, a sort of memory-value
analysis) determines when it can prove a particular memory
location is equal to a given SSA value, and when it can, it replaces any
loads of that location.
This subsumes two common optimizations:
* Redundant load elimination: when the same memory address is loaded two
  times, and it can be proven that no intervening operations will write
  to that memory, then the second load is redundant and its result
  must be the same as the first. We can use the first load's result and
  remove the second load.

* Store-to-load forwarding: when a load can be proven to access exactly
  the memory written by a preceding store, we can replace the load's
  result with the store's data operand, and remove the load.
Both of these optimizations rely on a "last store" analysis that is a
sort of coloring mechanism, split across disjoint categories of abstract
state. The basic idea is that every memory-accessing operation is put
into one of N disjoint categories; it is disallowed for memory to ever
be accessed by an op in one category and later accessed by an op in
another category. (The frontend must ensure this.)
Then, given this, we scan the code and determine, for each
memory-accessing op, when a single prior instruction is a store to the
same category. This "colors" the instruction: it is, in a sense, a
static name for that version of memory.
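A hypothetical illustration of this coloring (names invented; the ;; comments show analysis state, not extra CLIF syntax):

```
function %coloring(i64, i32) -> i32, i32, i32 {
block0(v0: i64, v1: i32):
    v2 = load.i32 v0        ;; last store: entry state (none seen yet)
    store v1, v0            ;; store S1 begins a new version of memory
    v3 = load.i32 v0        ;; last store: S1 -> forwardable to v1
    v4 = load.i32 v0        ;; last store: S1, same address and type
                            ;; as v3, so v4 reuses that value
    return v2, v3, v4
}
```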
This analysis provides an important invariant: if two operations access
memory with the same last-store, then no other store can alias in the
time between that last store and these operations. This must-not-alias
property, together with a check that the accessed address is exactly
the same (same SSA value and offset), and other attributes of the
access (type, extension mode) are the same, lets us prove that the
results are the same.
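A hypothetical sketch of this exactness requirement (names invented): the location key must match on address SSA value, offset, type, and extension mode, so neither load below can reuse the stored value:

```
function %no_match(i64, i32) -> i32, i32 {
block0(v0: i64, v1: i32):
    store v1, v0+8
    v2 = load.i32 v0+12     ;; different offset: location key differs,
                            ;; so no forwarding
    v3 = uload8.i32 v0+8    ;; different type/extension: key differs,
                            ;; so no forwarding
    return v2, v3
}
```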
Given last-store info, we scan the instructions and build a table from
"memory location" key (last store, address, offset, type, extension) to
known SSA value stored in that location. A store inserts a new mapping.
A load may also insert a new mapping, if we didn't already have one.
Then when a load occurs and an entry already exists for its "location",
we can reuse the value. This will be either RLE or store-to-load
forwarding, depending on where the value came from.
Note that this does work across basic blocks: the last-store analysis
is a full iterative dataflow pass, and we are careful to check dominance
of a previously-defined value before aliasing to it at a potentially
redundant load. So we will do the right thing if we only have a
"partially redundant" load (loaded already but only in one predecessor
block), but we will also correctly reuse a value if there is a store or
load above a loop and a redundant load of that value within the loop, as
long as no potentially-aliasing stores happen within the loop.
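For instance (a hypothetical sketch), a store above a loop can feed a load inside the loop body, because the last store seen at the loop header is the same on both the entry edge and the backedge:

```
function %loop_reuse(i64, i32) -> i32 {
block0(v0: i64, v1: i32):
    store v1, v0+8          ;; last store for [v0+8] on every path below
    jump block1(v1)

block1(v2: i32):
    v3 = load.i32 v0+8      ;; forwarded to v1: no store in the loop,
                            ;; so the last store is still the one above
    v4 = iadd v2, v3
    brnz v4, block1(v4)     ;; loop while nonzero
    jump block2

block2:
    return v4
}
```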
Fixes #4131.
Passes tests and runs SpiderMonkey correctly locally; benchmarks TBD.
Creating this as a draft for early feedback; will likely refine the comments
and explanations a bit more, and benchmark this.