externref: implement stack map-based garbage collection #1832
Conversation
crates/environ/src/cranelift.rs (outdated):

    impl StackMapSink {
        fn finish(mut self) -> Vec<StackMapInformation> {
            self.infos.sort_by_key(|info| info.code_offset);

Suggested change:

    -            self.infos.sort_by_key(|info| info.code_offset);
    +            self.infos.sort_unstable_by_key(|info| info.code_offset);
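For context, a self-contained sketch of what such a sink might look like with the unstable sort (the type and field names here are assumptions based on the snippet above, not the exact wasmtime definitions):

```rust
/// Stand-in for wasmtime's stack map record; only the sort key matters here.
struct StackMapInformation {
    code_offset: u32,
}

#[derive(Default)]
struct StackMapSink {
    infos: Vec<StackMapInformation>,
}

impl StackMapSink {
    fn finish(mut self) -> Vec<StackMapInformation> {
        // `sort_unstable_by_key` skips the stable sort's temporary allocation;
        // the relative order of entries with equal offsets doesn't matter here.
        self.infos.sort_unstable_by_key(|info| info.code_offset);
        self.infos
    }
}
```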
crates/runtime/src/externref.rs (outdated):

    ///
    /// Unsafe to drop earlier than its module is dropped. See
    /// `StackMapRegistry::register_stack_maps` for details.
    pub struct StackMapRegistration {

Maybe panic in the drop implementation instead and add an unsafe `finish` function?
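A minimal sketch of what that suggestion could look like (hypothetical shape, not the actual wasmtime type): the registration panics if dropped without an explicit, `unsafe` acknowledgement that the stack maps are no longer needed:

```rust
pub struct StackMapRegistration {
    finished: bool,
    // ... handle to the registry's entries elided ...
}

impl StackMapRegistration {
    /// Unsafe: the caller asserts that no frame using this module's stack
    /// maps can still be live once the registration goes away.
    pub unsafe fn finish(mut self) {
        // ... unregister the stack maps from the registry here ...
        self.finished = true;
        // `self` is dropped at the end of this function; the flag set above
        // suppresses the panic in `Drop`.
    }
}

impl Drop for StackMapRegistration {
    fn drop(&mut self) {
        if !self.finished {
            panic!("StackMapRegistration dropped without calling `finish`");
        }
    }
}
```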
Very exciting to see this! \o/ One question: does this have any implications for deterministic behavior? ISTM that as long as we don't have weak references anywhere in the system it probably doesn't, right?
Very nice! Before I dig too closely into the internal details I wanted to get a grasp of how this is all organized and such. The main thought I have is that this is adding a new registry and a lot of separate points where we pass around registries and such. I'm hoping we can perhaps unify all these with the existing registries we have? Internally it feels better if we don't have an `Arc` and `RwLock` per thing that we need to keep track of.
One other thing I'm remembering now which I think would be good to happen here: I think this should include an implementation of `WasmTy` for `ExternRef`. We'll want to enable closures with `Func` that take `ExternRef` as arguments and such, and it'd be cool to see what the monomorphize logic looks like for functions that take a number of `ExternRef` or produce them.
Finally I wanted to write some thoughts about the GC aspect. It feels a bit weird to me that we're using reference counting but still have to have explicit GC points. To make sure I understand, GC only happens right now automatically when you call into a function, right? Or are there other auto-inserted points where GC happens? At a minimum I think we need to make sure that all long-running code is eventually GC'd, so I think both entry and exit need GC'ing (calling into a host and returning back to wasm may already do the GC, I likely missed it!).
I wanted to dig a bit more into the decision though to use deferred reference counting rather than explicit. Is it possible to somewhat easily get performance number comparisons? For example do we have a handle on what we predict the overhead will be? I'd imagine there are possible optimizations where local.get/local.set don't do reference counting but calling a function does.
I'm personally always a bit wary of things like stack maps and such because of how strictly precise they must be, but how the surrounding bits are often "mostly precise", like iterating the stack and/or getting the stack pointer. I don't think the implementation here is incorrect by any means, mostly just that maintaining this over time and porting it to all sorts of new platforms is likely to get hairier over time.
crates/wasmtime/src/module.rs (outdated):

    @@ -82,6 +83,7 @@ struct ModuleInner {
         store: Store,
         compiled: CompiledModule,
         frame_info_registration: Mutex<Option<Option<Arc<GlobalFrameInfoRegistration>>>>,
    +    stack_map_registration: Mutex<Option<Option<Arc<StackMapRegistration>>>>,

Sort of continuing my comment from earlier, but this is an example where it would be great to not have this duplication. Ideally there'd be one "register stuff" call, although I'm not sure yet if this is fundamentally required to be two separate steps.
This has deterministic (albeit perhaps surprising if you don't know the implementation) behavior. We only GC when either the embedder explicitly asks for it or when the activations table fills up, FWIW.
I originally did add this to the existing global frame info, but after our discussions about not wanting implicit global context, I moved it out to its own registry.
Yeah, I have this on my TODO list for a follow up PR. Felt like this was big enough as is.
Regarding when GC happens, see my reply to Till: #1832 (comment). For long-running Wasm, in the absence of explicit GC calls, a collection is triggered when the activations table fills up.

Regarding the decision to go with deferred reference counting, let's look at the alternative: we would have to do explicit reference count increments and decrements for references inside Wasm frames, rather than deferring them. First, to get this working at all, this would require extending Cranelift with support for emitting reference count operations. Note that we can't only do reference counting at function call boundaries because we need to handle the case where a Wasm function takes a reference and drops it. We need to decrement the reference count at the drop site, which is what instrumenting the spec stack pops does.

So that is a bunch of infrastructural work on Cranelift to get something that will perform worse but at least works, followed by more work to start moving towards fewer ref count operations (but never getting as few as deferred reference counting: zero, albeit with occasional stack walking pauses). On the other hand, we already have stack maps produced by Cranelift, which are the necessary bits required for implementing deferred reference counting. We don't need to build any new Cranelift infrastructure, just the Wasmtime integration. However, yes, this does come with occasional stack-walking pauses and a reliance on precise stack maps and stack walking.

So that was basically the calculus: an easier path to getting a better implementation (or at least an easier path to getting a good implementation, if we are comparing against the hypothetical best version of the ref counting coalescing static analysis).

Unfortunately, I don't have benchmarks or performance numbers. It would be hard to get this information without implementing both approaches. And it is hard to know how many reference counting operations we could coalesce with the hypothetical static analysis. But I don't really have any doubt that naively doing all the increments and decrements for on-stack references would be quite slow: this is Well Known(tm) in the memory management world.
Great, thank you for the explanation! One additional question: will it be possible to use the same infrastructure for stack tracing once we start implementing one of the GC proposals? Seems like that should be the case, but I might well be missing something.
Yep, doing that should be a lot easier after this lands.
Sorry I haven't had a chance to read and fully digest your response yet @fitzgen (will do later today), but I wanted to comment here with a thought before I forget it. In #1845 it was found that we can't actually backtrace through other host JIT code (e.g. the CLR) on all platforms. I think that this GC implementation requires getting a full backtrace at all times, right? If, for example, a host call in the CLR triggered a GC it would get a smaller view of the world than actually exists and could cause a use-after-free?
Ok now to respond in full!

One thing I'm still a bit murky and/or uncomfortable on is when GC is expected to happen. I understand that it automatically happens when tables fill up or if you explicitly call it, but my question was largely: do we expect wasmtime to grow some other time it automatically calls it? For example, if an application never calls it explicitly, is GC still guaranteed to happen eventually?

Another question is how we'd document this. For embedders using anyref, what would be the recommended way to call this? I think "call it after the 'main call'" is pretty sound advice, but that hinges on the previous question of whether it should ever be required to call it manually.

For alternative strategies of reference counting, to clarify, I don't think that we should be reverting this and switching to explicit reference counting. I'm mostly probing because the rationale against reference counting feels a bit hand-wavy to me, and given the cost of relying on stack unwinding and stack maps I'd just want to make sure we're set. To play devil's advocate a bit, I'm not convinced that function-boundary reference counting is impossible. I agree that this may need some intrinsic work in Cranelift, but I don't think that this would require optimizations about coalescing reference counts. The main idea would be that reference counts are only touched at function-call boundaries, not for every `local.get`/`local.set`.

Thinking more on this though, unwinds are a really important thing to account for here too. I believe the stack walking strategy perfectly handles unwinds (since you'll just GC later and realize that rooted things aren't on the stack any more), but anything with explicit code generation will still need to do something on unwinding. And that 'something' is arguably just as hard to get right as stack maps themselves.

Again though, to be clear, I'm trying to get it straight in my head why we're using deferred reference counting rather than explicit reference counting. I don't mean to seem like I'm challenging the conclusion, it's mostly that I just want to feel confident that we don't hand-wave too much by accident. For me, though, the nail in the coffin is that in a world of explicit reference counting unwinding needs to be handled somehow, and the solution seems like it'd be very similar if not just stack maps in one form or another.
Deferred reference counting is more performant.
Being this short and this absolute is not helpful. It isn't a 100% cut-and-dried issue, as the lengthy discussion weighing its trade-offs shows. Please engage with nuance in the future. Thanks.
Ok, sorry.
This is worrying. You are correct that if we miss stack frames during GC, then we can accidentally think a reference is no longer in use even though it still is, which can then lead to use-after-free. Soundness relies on walking the whole stack. I can think of two workarounds:

1. Get the host runtime's own walker (e.g. the CLR's) to walk the frames that we can't unwind through ourselves.
2. Do what SpiderMonkey does: record enough context at wasm-to-host and host-to-wasm transition points that we can walk just the Wasm frames ourselves, without unwinding through foreign frames.
How does SpiderMonkey unwind frames within a single JitActivation? Do they have a custom unwinder which can sort of start halfway down the stack? Otherwise, another possible alternative is we could perhaps have a flag indicating that we've called out into host code we may not be able to unwind through, and simply skip GC while it is set?
Yes, they have their own unwinder that understands only their own JIT'd frames.
This would avoid the unsoundness, but wouldn't let us clean up the garbage, so it isn't super attractive...
.NET Core's walker isn't readily available to use outside of the CLR and would be burdensome to liberate. Additionally, this actually isn't a problem on Windows, because .NET Core uses the Windows unwind information format to internally represent its JIT code, and hence complete walks with the system unwinder are possible (but only on Windows). That said, there might be other JIT-based runtimes out there that don't register their code with system unwinders, so a general solution would probably be warranted.

I think your other proposed solution makes the most sense: similar to SpiderMonkey, record enough context at wasm-to-host and host-to-wasm transition points to enable a wasm-specific walk within Wasmtime itself rather than relying on a system unwinder for a complete trace (we could still rely on a system unwinder for walking the wasm frames themselves, as we support both libunwind and the Windows unwind information).

This would also solve the trap backtrace issue more generally, such that we can then guarantee correct wasm traces regardless of encountering a frame that has no system unwind information registered. It might also be useful to enhance our trap traces so we can clearly indicate to users where such transitions occur. For instance, that would be useful to show that a trap came from the host rather than user wasm code (e.g. show a top frame representing the host call).
We could add a timer-based GC, but this does make the timing of when destructors are called non-deterministic. The only situation where there could be long-running wasm without any GC is if the Wasm goes into a long-running loop that doesn't use any references. If it did use references (i.e. put them in tables/globals), then the activations table would eventually fill up and trigger a GC. I don't think recommending an explicit GC call after the "main call" should ever be required for correctness.
I see more where you are going with this now. Cranelift already does this to some degree when generating stack maps for safepoints: it asks the register allocator which values are still live after this point, filters for just the references, and generates a stack map for them based on their locations. This is really late in the compilation pipeline though: it requires cooperation with the register allocator. I don't think we can do exactly that same approach for ref counting operations, because it is probably too late to insert new instructions (let alone whole new blocks for checking if the refcount reached zero inline, and only calling out to a VM function if so). I think we could do something similar in an earlier phase of the pipeline, before register allocation, though.
I'm not sure what you're talking about here. What explicit code generation? What code generation do we do at all that isn't handled by Cranelift?
Yes, if we use non-deferred reference counting for on-stack references, traps will have to decrement reference counts as they unwind through Wasm frames. I hadn't thought about this either. This does end up looking very similar to stack maps, but also with personality routines thrown into the mix. The difference is that failure to unwind stacks properly leads to leaks in this case, rather than unsafety. Ultimately, I'm not 100% sure whether it makes more sense to keep investing in this stack maps-based approach to deferred reference counting, or to start fresh and try non-deferred reference counting. I had been leaning towards exploring non-deferred reference counting, but I hadn't thought of the need to decrement reference counts when unwinding through wasm. When considering the effort required to do that properly, I'm less sure now.
Oh, one other thing I wanted to mention: if we do keep going with this PR's approach and introduce a wasm-to-host transitions stack, the GC's stack walk would start from the most recent recorded transition. In order to keep using the existing backtrace machinery for the walk itself, we'd need a way to start unwinding from that captured context rather than from wherever the GC was triggered.
I assume we can capture the context at the wasm-to-host transition and use it as the starting point for the walk?

If the wasm-to-host transitions stack is empty (i.e. there's only Wasm frames on top, or a Wasmtime function such as the signal or GC handler), then we can assume the walk from the current context, skipping any initial non-wasm frames, perhaps? There shouldn't be any foreign unregistered frames that would prevent the walk in that case.
Yes, we could do this. Adding an out-of-line function call to capture the context at each transition does add some overhead, though. AFAICT, there is also no blessed way of re-initializing an existing backtrace/unwind from a previously captured context.
Agreed, if we go with such a design, I would want to limit the context capturing only to host functions not defined by Wasmtime (and perhaps only via the C API, where there is a chance of another JIT runtime being used). Perhaps it could even be something embedders opt in to when defining a function. The bottom line is that we shouldn't have to pay the cost for capturing context when calling into Wasmtime's WASI implementation especially, as we are guaranteed to be able to unwind through those frames for both libunwind and Windows.
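To make the transition-tracking idea concrete, here is a rough sketch under assumed names (none of this is wasmtime's actual API): each wasm-to-host call pushes enough context to resume a walk of the Wasm frames below any foreign frames, and the GC starts its walk from the innermost recorded transition:

```rust
/// One wasm-to-host transition: enough saved context to resume walking the
/// Wasm frames that sit below any foreign (e.g. CLR) frames.
struct TransitionContext {
    stack_pointer: usize,
    frame_pointer: usize,
    return_address: usize,
}

#[derive(Default)]
struct TransitionStack {
    entries: Vec<TransitionContext>,
}

impl TransitionStack {
    /// Called on a wasm-to-host transition (ideally only for host functions
    /// Wasmtime doesn't control, to keep the common path cheap).
    fn push(&mut self, ctx: TransitionContext) {
        self.entries.push(ctx);
    }

    /// Called when the host function returns back into Wasm.
    fn pop(&mut self) {
        self.entries.pop();
    }

    /// Where a GC-triggered stack walk should begin: the innermost recorded
    /// transition if there is one, otherwise the current registers (in that
    /// case only Wasm/Wasmtime frames are on top and a full walk is safe).
    fn walk_start(&self) -> Option<&TransitionContext> {
        self.entries.last()
    }
}
```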
All that said, I'm comfortable with landing a stack-walking-based GC before we fix the issue that currently only affects .NET hosts (the .NET API doesn't support reference types yet anyway).
Maybe there is some way for the .NET Wasmtime implementation to insert a frame whose unwind info doesn't do normal unwinding, but sets the sp to the value before entering the .NET JITted code, such that all .NET JITted code is skipped. Or it could have a special personality function that invokes the .NET unwinder and then sets the register state as the .NET unwinder gives back when unwinding past the JITted code.
While I think that might be fine to do, I'm actually quite glad we have this embedding and thus caught this issue. What's more, even if we find a workaround specifically for .NET, I think it might make sense to take a closer look at whether we can then have sufficient confidence in our ability to make this approach work in all cases that might become relevant. I.e., would landing this begin painting ourselves into a corner that's increasingly hard to get out of at some later time when we realize we need to?
Oh, by "explicit code generation" I mean "the thing that isn't deferred reference counting".

So overall I feel like my thinking above is still somewhat attractive. That scheme is where, on a call out to host code, we note that we may not be able to walk the whole stack and simply skip GC until we're back in wasm. The downside of that is that runtimes like .NET will GC less, but that seems like it's not really that much of an issue given that our "GC" here is very small. The GC'd aspect is you gave a value to wasm and then it became unreachable during the execution of wasm. If we don't eagerly clean those up it doesn't seem like it's the end of the world, especially because we'll want to eventually fix this.

I think that explicit reference counting is a defunct strategy given our unwinding strategy. I don't really see how implementing unwinding correctly with explicit reference counting is going to be any less hazardous or any less work than this already was. If that's taken as an assumption, then the final appearance of this feature will be exactly this PR, except with more changes to the code generator.

Personally I see a few directions to go from here:

1. Land this PR with a way to detect when we can't walk the whole stack, skipping GC in that case, and improve stack walking as a follow-up.
2. Implement the precise wasm-to-host transition tracking as part of this PR.
3. Switch to explicit reference counting.
I don't think (3) is feasible and I think 1/2 are pretty close. I think it might be good to sketch out in more detail how the wasm-to-host transition tracking would work, though.
I find this convincing, and support (1), given that @peterhuene also voiced support for it.
Ok so we talked about this in today's wasmtime meeting. The conclusion we came to was to implement a canary stack frame in this PR for now, so we can detect situations where libunwind cannot walk the full stack, and then we can skip GC rather than get potential unsoundness. But long term, we want to do the precise wasm-to-host transition tracking. Thanks everyone for the discussion and design input!
Ok looking great to me! Apart from the inline comments the only other thing I'd say is that I think it'd be good to look into not having two registrations per module (`GlobalFrameInfoRegistration` and `StackMapRegistration`) and trying to lump that all into one (also deduplicating the `BTreeMap` lookups and such).
crates/runtime/src/instance.rs (outdated):

    ///
    /// The `vmctx` also holds a raw pointer to the registry and relies on this
    /// member to keep it alive.
    pub(crate) stack_map_registry: Arc<StackMapRegistry>,
I'd personally still prefer to avoid having this here if its sole purpose is to keep the registry alive. I think that memory management should be deferred to `Store`.
(same with `externref_activations_table` above too)
Same question as above regarding `*mut` vs `Arc`: https://github.com/bytecodealliance/wasmtime/pull/1832/files?file-filters%5B%5D=.rs#r439681507
Ah, I just remembered why it has to be this way: the `vmctx` holds a raw pointer to the registry, so the instance has to hold something that keeps the registry alive.
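For illustration, the keep-alive pattern being described might look roughly like this (names are made up; the real fields live on `Instance`/`VMContext`):

```rust
use std::sync::Arc;

struct StackMapRegistry {
    // ... registered stack maps elided ...
}

struct InstanceLike {
    /// Owning handle: its only job is to keep the registry alive for as long
    /// as this instance (and therefore its `vmctx`) exists.
    stack_map_registry: Arc<StackMapRegistry>,
    /// What compiled code actually reads through `vmctx`: a raw pointer into
    /// the same registry, valid because of the `Arc` above.
    vmctx_registry_ptr: *const StackMapRegistry,
}

impl InstanceLike {
    fn new(registry: Arc<StackMapRegistry>) -> Self {
        let vmctx_registry_ptr = Arc::as_ptr(&registry);
        InstanceLike {
            stack_map_registry: registry,
            vmctx_registry_ptr,
        }
    }
}
```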
Adding a canary stack frame allows us to detect when stack walking has failed to walk the whole stack, meaning we are potentially missing on-stack roots, and therefore it would be unsafe to do a GC because we could free objects too early, leading to use-after-free. When we detect this scenario, we skip the GC.
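A simplified sketch of that check, with hypothetical names: the walk only trusts its results if it actually reached a known canary frame; otherwise the GC is skipped:

```rust
/// Recorded when the outermost host-to-wasm trampoline is entered; a complete
/// stack walk must eventually reach a frame at this address.
struct StackCanary {
    expected_frame: usize,
}

/// Walk the given frame addresses (innermost first) and collect the roots
/// found at each frame. Returns `None` if the walk ended before reaching the
/// canary, meaning the trace is incomplete and must not be trusted for GC.
fn try_collect_roots(canary: &StackCanary, frames: &[usize]) -> Option<Vec<usize>> {
    let mut roots = Vec::new();
    for &frame in frames {
        // In the real implementation, this is where the frame's stack map
        // would be consulted; here we just record the frame address itself.
        roots.push(frame);
        if frame == canary.expected_frame {
            // We made it all the way down to the canary: the walk is complete.
            return Some(roots);
        }
    }
    // Missing frames could hide live references; collecting anyway could lead
    // to use-after-free, so the caller skips this GC.
    None
}
```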
Ok I'll take a TODO item to clean up the registration stuff later. Let's go ahead and land with this current design. There's one safety issue below that needs fixing but otherwise I think this is basically good to go with a green CI!
Cargo.lock (outdated):

    -version = "0.3.46"
     source = "registry+https://github.com/rust-lang/crates.io-index"
     checksum = "b1e692897359247cc6bb902933361652380af0f1b7651ae5c5013407f30e109e"
    +version = "0.3.48"
FWIW this is likely to break CI until this is fixed
I had to
Confused why CI is failing here.
Gah, it's because the tests use reference types, which Cranelift only supports on x64.
Cranelift does not support reference types on other targets.
    @@ -18,3 +18,7 @@ mod table;
     mod traps;
     mod use_after_drop;
     mod wast;
    +
    +// Cranelift only supports reference types on x64.
    +#[cfg(target_arch = "x86_64")]
For future instances of this, mind tagging this with a FIXME and an issue number so when we get around to aarch64 reference types we can make sure we run all the tests?
    ("reference_types", "externref_id_function") => {
        // Ignore if this isn't x64, because Cranelift only supports
        // reference types on x64.
        return env::var("CARGO_CFG_TARGET_ARCH").unwrap() != "x86_64";
Same comment here for the aarch64 testing
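Presumably something like the following is what's being asked for; the helper and the issue number are placeholders sketching the build-script check, not the actual code:

```rust
use std::env;

// Sketch of the build-script check with the requested FIXME marker; the
// function and the issue number are placeholders, not the actual build.rs.
fn should_ignore(testsuite: &str, testname: &str) -> bool {
    match (testsuite, testname) {
        ("reference_types", "externref_id_function") => {
            // FIXME(#NNNN): Cranelift only supports reference types on
            // x86_64 today; stop ignoring this once aarch64 support lands.
            env::var("CARGO_CFG_TARGET_ARCH").unwrap() != "x86_64"
        }
        _ => false,
    }
}
```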
For host VM code, we use plain reference counting, where cloning increments the reference count, and dropping decrements it. We can avoid many of the on-stack increment/decrement operations that typically plague the performance of reference counting via Rust's ownership and borrowing system. Moving a `VMExternRef` avoids mutating its reference count, and borrowing it either avoids the reference count increment or delays it until if/when the `VMExternRef` is cloned.

When passing a `VMExternRef` into compiled Wasm code, we don't want to do reference count mutations for every compiled `local.{get,set}`, nor for every function call. Therefore, we use a variation of **deferred reference counting**, where we only mutate reference counts when storing `VMExternRef`s somewhere that outlives the activation: into a global or table. Simultaneously, we over-approximate the set of `VMExternRef`s that are inside Wasm function activations. Periodically, we walk the stack at GC safe points, and use stack map information to precisely identify the set of `VMExternRef`s inside Wasm activations. Then we take the difference between this precise set and our over-approximation, and decrement the reference count for each of the `VMExternRef`s that are in our over-approximation but not in the precise set. Finally, the over-approximation is replaced with the precise set.

The `VMExternRefActivationsTable` implements the over-approximated set of `VMExternRef`s referenced by Wasm activations. Calling a Wasm function and passing it a `VMExternRef` moves the `VMExternRef` into the table, and the compiled Wasm function logically "borrows" the `VMExternRef` from the table. Similarly, `global.get` and `table.get` operations clone the gotten `VMExternRef` into the `VMExternRefActivationsTable` and then "borrow" the reference out of the table.

When a `VMExternRef` is returned to host code from a Wasm function, the host increments the reference count (because the reference is logically "borrowed" from the `VMExternRefActivationsTable` and the reference count from the table will be dropped at the next GC).

For more general information on deferred reference counting, see *An Examination of Deferred Reference Counting and Cycle Detection* by Quinane: https://openresearch-repository.anu.edu.au/bitstream/1885/42030/2/hon-thesis.pdf
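As a rough illustration of the reconciliation step described above (plain integers stand in for `VMExternRef` pointers; none of these names are the real wasmtime API):

```rust
use std::collections::HashSet;

/// Over-approximation of the references currently held by Wasm activations
/// (standing in for `VMExternRefActivationsTable`).
type ActivationsTable = HashSet<u64>;

/// One GC step: `precise_on_stack` is the set discovered by walking the stack
/// and consulting stack maps at each frame, and `dec_ref` decrements a
/// reference count.
fn gc(table: &mut ActivationsTable, precise_on_stack: HashSet<u64>, dec_ref: impl Fn(u64)) {
    // Anything we were conservatively keeping alive that no activation still
    // references gets its count decremented.
    for &stale in table.difference(&precise_on_stack) {
        dec_ref(stale);
    }
    // The precise set becomes the new over-approximation until the next GC.
    *table = precise_on_stack;
}
```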
cc #929
Fixes #1804
Depends on rust-lang/backtrace-rs#341