Add epoch-based interruption for cooperative async timeslicing. #3699
Conversation
I think there is value in having this exist in addition to the fuel mechanism rather than as a replacement. There are cases where determinism is essential, whether for reproducibility or to implement an rr-like debugger. In other cases performance is indeed more important; in those cases the epoch-based interruption can be used.
@bjorn3 yes, indeed, I guess I was thinking "replace many uses of fuel" rather than "actually remove fuel", but that was ambiguous, sorry. I agree that determinism is very useful in other circumstances and so we probably don't want to actually remove fuel; this PR keeps it around and perhaps I can add some documentation on the tradeoff and when to use each.
Force-pushed from a394558 to 6362d2b.
@alexcrichton I think I've addressed most of your comments; thanks!
This all looks fantastic to me, I'm excited to see the impact of a lesser-cost fuel!
/// Configure whether async execution of WebAssembly will yield
/// whenever the "epoch" is incremented on the Engine via
/// [`Engine::increment_epoch`].
pub fn epoch_interruption(&mut self, enable: bool) -> &mut Self {
The docs here look great to me, thanks!
But yeah if you'd be up for adding an epoch-equivalent example that'd be awesome.
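For readers following along, here is a minimal sketch of enabling the option. It assumes only the `Config::epoch_interruption` method quoted above plus ordinary wasmtime setup, and it is not the example that was eventually added to the docs:

```rust
// Minimal sketch: enable epoch-based interruption when building an Engine.
// Only `Config::epoch_interruption` (quoted above) is taken from this PR;
// the rest is ordinary wasmtime setup.
use anyhow::Result;
use wasmtime::{Config, Engine};

fn build_engine() -> Result<Engine> {
    let mut config = Config::new();
    config.async_support(true);      // epoch yields apply to async execution
    config.epoch_interruption(true); // emit epoch checks in generated code
    Engine::new(&config)
}
```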
Force-pushed from 6362d2b to 49d6e84.
Updated and added an example -- will fix the broken doc-comment links and then merge, if all's well otherwise. Thanks!
Force-pushed from 49d6e84 to 6f3f9d9.
Force-pushed from 6f3f9d9 to 8a55b5c.
If a block is marked cold but has side-effect-free code that is only used by side-effectful code in non-cold blocks, we will erroneously fail to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely on block ordering to visit uses before defs (except for backedges) so that we can effectively do an inline liveness analysis and skip lowering operations that are not used anywhere. This "inline DCE" is needed because instruction lowering can pattern-match and merge one instruction into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in bytecodealliance#3698 was oblivious to this -- it just changed the block sort order. For efficiency reasons, we generate code in its final order directly, so it would not be tenable to generate it in e.g. RPO first and then reorder cold blocks to the bottom; we really do want to visit in the same order as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk to emission-time instead. This is cheaper than either trying to visit blocks during lowering in RPO but add to VCode out-of-order, or trying to do some expensive analysis to recover proper liveness. It's not clear that the latter would be possible anyway -- the need to lower some instructions depends on other instructions' isel results/merging success, so we really do need to visit in RPO, and we can't simply lower all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has cold blocks inline; so in the text format (and hence compile-tests) it's not possible to see the sinking. This PR adds a test for cold-block sinking that actually verifies the machine code. (The test also includes an add-instruction in the cold path that would have been incorrectly skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current use of cold blocks in bytecodealliance#3699, because there the only operation in the cold block was an (always effectful) call instruction. The worst-case effect of the bug in other code would be a regalloc panic; no silent miscompilations could result.

Depends on bytecodealliance#3708.
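To make the fix concrete, here is a rough sketch of the emission-time reordering described above. This is not Cranelift's actual code (the function and parameter names are made up for illustration); the point is that lowering still visits blocks in an order where uses precede defs, and cold blocks are only moved to the end when the final layout is computed:

```rust
/// Illustrative sketch only -- not the real Cranelift implementation.
/// Given the block order used during lowering, compute the order used at
/// emission time: non-cold blocks keep their relative order, cold blocks
/// are sunk to the bottom.
fn emission_order(lowering_order: &[u32], is_cold: &dyn Fn(u32) -> bool) -> Vec<u32> {
    let mut hot = Vec::new();
    let mut cold = Vec::new();
    for &block in lowering_order {
        if is_cold(block) {
            cold.push(block);
        } else {
            hot.push(block);
        }
    }
    // Cold blocks are appended at the end, so lowering (and its inline DCE)
    // is unaffected; only the final machine-code layout changes.
    hot.extend(cold);
    hot
}
```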
(Builds on #3698)
This PR introduces a new way of performing cooperative timeslicing that
is intended to replace the "fuel" mechanism. The tradeoff is that this
mechanism interrupts with less precision: not at deterministic points
where fuel runs out, but rather when the Engine enters a new epoch. The
generated code instrumentation is substantially faster, however, because
it does not need to do as much work as when tracking fuel; it only loads
the global "epoch counter" and does a compare-and-branch at backedges
and function prologues.
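As a rough illustration of how cheap that check is, the instrumentation amounts to something like the following sketch in Rust rather than the actual generated machine code; `epoch_check` and `yield_or_trap` are stand-in names for illustration only:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the per-prologue / per-backedge check; the real check is
/// emitted inline as machine code, not as a Rust function call.
#[inline]
fn epoch_check(engine_epoch: &AtomicU64, deadline: u64) {
    // One load of the engine-wide epoch counter plus a compare-and-branch.
    if engine_epoch.load(Ordering::Relaxed) >= deadline {
        yield_or_trap(); // cold, out-of-line path back into the runtime
    }
}

fn yield_or_trap() {
    // Stand-in: in the real system this yields to the async executor
    // (or traps), depending on how the Store is configured.
}
```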
This change has been measured as ~twice as fast as fuel-based
timeslicing for some workloads, especially control-flow-intensive
workloads such as the SpiderMonkey JS interpreter on Wasm/WASI.
The intended interface is that the embedder of the `Engine` performs an
`engine.increment_epoch()` call periodically, e.g. once per millisecond.
An async invocation of a Wasm guest on a `Store` can specify a number of
epoch-ticks that are allowed before an async yield back to the executor's
event loop. (The initial amount and automatic "refills" are configured on
the `Store`, just as for fuel.) This call does only signal-safe work (it
increments an `AtomicU64`) so could be invoked from a periodic signal, or
from a thread that wakes up once per period.