Add copy-on-write based instance reuse mechanism #3691
Conversation
```rust
let vmmemory = memory.vmmemory();
instance.set_memory(index, vmmemory);
```
I'm pretty sure this is necessary for correctness, but when I comment it out nothing fails. Any idea how I could write a test which would verify that this is actually needed?
```rust
let mut page_first_index = None;
unsafe {
    let mut fp = std::fs::File::open("/proc/self/pagemap")
        .context("failed to open /proc/self/pagemap")?;
```
This doesn't work if `/proc` isn't mounted. In addition, it doesn't check that `/proc` is actually a mounted procfs. Rustix has code to check whether the `/proc` is sane, but it doesn't seem to be exported.
Good point. I've made a PR to rustix to export it: bytecodealliance/rustix#174
I guess ideally we should disable them when running under qemu-user?
Hi @koute -- thanks so much for this PR and for bringing up the ideas behind it (in particular, the memfd mechanism)!

Guilty admission on my part: after you mentioned memfd recently on Zulip, and madvise to reset a private mapping (throw away a CoW overlay), I threw together my own implementation as well and did a lot of internal experimentation. (I've been hacking on the pooling allocator and on performance-related things recently as well, and your idea was a huge epiphany for me.) I need to clean it up a bit still but will put it up soon (with due credit to you for the memfd/madvise/CoW ideas!). Perhaps we can get the best ideas out of both of these PRs :-)

One additional realization I had was that, for performance, we don't want to do any

Anyway, a few thoughts on this PR:
I think these are some things we should talk through after I've put up my PR and we can do comparisons. I'm really grateful for the ton of effort you put into this and look forward to comparing the approaches in more detail!
Agreed with @cfallin that we should disentangle snapshots and instantiation here, and focus on a relatively transparent extension of the pooling instance allocator for now. That said, a snapshotting feature could be very useful for doing neat things like
@koute would using the pooling instance allocator work for your embedding's use case? @cfallin's work right now is, I believe, entirely focused on that, which means that by default Wasmtime wouldn't have copy-on-write re-instantiation, because Wasmtime by default (as you've seen and modified here) uses the on-demand instance allocator. If your embedding doesn't work well with the pooling instance allocator then I think we'll need to brainstorm a solution which "merges" your work here with @cfallin's on the pooling allocator, taking into account the feedback around snapshots (which I personally agree is best to separate, ideally making the copy-on-write business a simple "go faster" config option).
This is technically optional and done entirely for performance, so it could be allowed to fail. Even without using
That is, AFAIK, only applicable to the lower bits, which we don't need in this case. The higher bits (which we need) should always be readable when reading our own process' pagemap.
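For context, a minimal sketch (mine, not the PR's code) of reading the high bits of `/proc/self/pagemap`; a 4 KiB page size is assumed for brevity:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Checks whether the page containing `addr` is resident in RAM by reading
/// its 8-byte pagemap entry. Bit 63 ("page present") is readable without
/// privileges; the low PFN bits (0-54) read as zero for unprivileged
/// processes, which is exactly the limitation discussed above.
fn page_is_present(addr: usize) -> std::io::Result<bool> {
    let page_size = 4096; // assumed; query sysconf(_SC_PAGESIZE) in real code
    let mut fp = File::open("/proc/self/pagemap")?;
    // One u64 entry per virtual page, indexed by virtual page number.
    fp.seek(SeekFrom::Start((addr / page_size) as u64 * 8))?;
    let mut buf = [0u8; 8];
    fp.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf) & (1u64 << 63) != 0)
}

fn main() -> std::io::Result<()> {
    let value = Box::new(42u8); // freshly written heap page, so surely mapped
    let addr = &*value as *const u8 as usize;
    println!("present: {}", page_is_present(addr)?);
    Ok(())
}
```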
In my initial prototype implementation I actually tried to do this in the same vein as the current pooling allocator, but in the end decided to go with the current approach. Let me explain. Basically we have three main requirements:

1. instantiation has to be as fast as possible,
2. it has to work with arbitrary WASM blobs without manual tuning,
3. it shouldn't require maintaining a separate codepath.
One of the problems with the current pooling allocator (ignoring how it performs) is that it fails at (2) and somewhat at (3), and isn't a simple "go faster" option that you can just blindly toggle. You have to manually specify module and instance limits (and if the WASM blob changes too significantly you need to modify them), and you need to maintain a separate codepath (basically keep another separate

So personally I think it just makes more sense (especially for something so fundamental and low level as

I also considered integrating this into

Basically, in our use case we don't really care about snapshotting at all (it's just an implementation detail to make things go fast); all we need is to be able to instantiate clean instances as fast as possible. Would this be more acceptable to you if, API-wise, we'd make it less like snapshotting and more like an explicit way to pool instances?
Sounds good to me! I'll hook up your PR into our benchmarks so that we can compare the performance.
Ok @koute, so to follow up on comments and work from before: #3733 is the final major optimization for instantiation that we know of to implement. There's probably some very minor wins still remaining, but that's the lion's share of the improvements that we're going to get into wasmtime (that plus memfd). Could you try re-running your benchmark numbers with the old strategy y'all are currently using, your proposal in this PR, and then #3733 as a PR?

Note that with #3733 using the on-demand allocator, while maybe a little bit slower than the pooling allocator, should still be fine. The intention with #3733 is that it's fast enough that you won't need to maintain a pool of instances, and each "reuse" of an instance can perform the full re-instantiation process. Note that for re-instantiation it's recommended to start from an `InstancePre`.

My prediction is that the time-to-instantiate with #3733 is likely quite close to the strategy outlined in this PR. It will probably look a little different one way or another, but what I'm curious to see is whether #3733 works for your use case in terms of robustness and performance.

If you're up for it then it might be interesting to test the pooling allocator as well. I realize that the pooling allocator as-is isn't a great fit for your use case due to it being too constrained, but as a one-off measurement of numbers it might help give an idea of the performance tradeoff between the pooling allocator and the on-demand allocator. Also, unless you're specifically interested in concurrent instantiation performance, it's fine to only get single-threaded instantiation numbers. It's expected that #3733 does not scale well with cores (like this PR) due to the IPIs necessary at the kernel level with all the calls to `madvise`.
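As a reference point, re-instantiating from an `InstancePre` looks roughly like this (my sketch, not from this thread; it assumes a wasmtime version of this era where `Linker::instantiate_pre` still took a store, and a hypothetical `module.wasm` exporting a `run` function):

```rust
use wasmtime::{Engine, Linker, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::from_file(&engine, "module.wasm")?; // assumed path
    let linker: Linker<()> = Linker::new(&engine);

    // Resolve and type-check the module's imports once, up front.
    let mut store = Store::new(&engine, ());
    let pre = linker.instantiate_pre(&mut store, &module)?;

    for _ in 0..1_000 {
        // Each "reuse" is a full re-instantiation, which with memfd-style
        // initialization is cheap enough to do every time.
        let mut store = Store::new(&engine, ());
        let instance = pre.instantiate(&mut store)?;
        let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;
        run.call(&mut store, ())?;
    }
    Ok(())
}
```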
Oh, and for now memfd is disabled-by-default, so with #3733 you'll need to execute
@alexcrichton Got it! I'll rerun all of the benchmarks next week and I'll get back to you.
@alexcrichton Sorry for the delay! I updated to the newest

In table form:
When a lot of memory is dirtied it is indeed competitive now, and when not a lot of memory is touched it also performs quite well now! Of course, this is assuming both memfd and pooling are enabled.

So I'd like to ask here: are there any plans to make the pooling less painful to use? Something like this would be ideal:
Basically completely get rid of the
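Purely as my own illustration of the shape being asked for (these names do not exist in wasmtime; this is hypothetical):

```rust
use wasmtime::Config;

fn main() {
    let mut config = Config::new();
    // Imagined knob: cap the pooling allocator by total memory instead of
    // requiring ModuleLimits/InstanceLimits tuned to one particular blob.
    // (No such method exists; this only sketches what "just works" might
    // look like.)
    //
    // config.allocation_strategy(InstanceAllocationStrategy::Pooling {
    //     max_total_size: 4 * 1024 * 1024 * 1024, // 4 GiB pool budget
    // });
    let _ = &mut config;
}
```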
Hm, so actually the "only memfd" line, which I'm assuming is using the default on-demand allocator, is performing much worse than expected. The cost of the on-demand allocator is an extra mmap-or-two, and it's not quite as efficient on reuse (a few

Otherwise though, I definitely think we can improve the story around the usability of the pooling allocator. The reason that the module limits and such exist is so that we can create appropriate space for the internal

Again though, I'm surprised that the on-demand allocator is performing as badly as it did in your measurements, as my suspicion was that it would be sufficient for your use case. I think configuring the pooling allocator by allocation size is probably good to do no matter what, though.
Actually, thinking about this some more: my measurements and impressions about the relative cost of these are primarily for the single-threaded case; I haven't looked too closely at the multithreaded bits. Does your use case stress the multi-threaded aspect heavily? If so we can try to dig in some more, but I'm not sure if you're measuring the multi-threaded performance at the behest of an old request of ours or out of your own project's motivations as well. As a point of comparison, for a 16-threaded scenario I get:
(hence some of my surprise, but again I haven't dug into these numbers much myself, especially the discrepancy between pooling/on-demand and how the speedup changes depending on the allocation strategy)
Sure. This is the crate where we encapsulated our use of `wasmtime`: https://github.com/koute/substrate/tree/master_wasmtime_benchmarks_2/client/executor/wasmtime

Let me give you a quick step-by-step walkthrough of the code:
And my benchmarks basically just run step (9) over and over again.
In certain cases we do call into WASM from multiple threads at the same time, so we do care about the multi-threaded aspect, but an instantiated instance never leaves the thread on which the instantiation happened. (Basically the whole of step (9) is always executed in one go, without the concrete instance being sent to any other thread or being called twice. [Unless our legacy instance reuse mechanism is used, but we want to get rid of that.])
Okay, so since the current memfd + pooling allocation strategy is fast enough (our primary goal was to remove our legacy instantiation scheme without compromising on performance, and ideally to also get a speedup if possible) I'm going to close this PR now. As I've said previously, I'm not exactly thrilled by the API needed to use the pooling allocator without introducing arbitrary hard limits, but ultimately that's not a dealbreaker and we can hack around it. (: (I'm of course still happy to answer questions and help out if necessary, so please feel free to ping me if needed.)
Ok cool, thanks for the links; it looks like nothing is amiss there. I also forgot that the machine I'm working on is an 80-core arm64 machine where IPIs are likely more expensive than on smaller-core-count machines, so that would likely help explain the discrepancy.

If it helps, I put up #3837 which removes

That hopefully makes things a bit more usable!

Oh, also, for imported memories: it's true that right now the copy-on-write-based initialization is not applied to imported memories, only to locally defined memories. All of the "interesting" modules we've been optimizing and whose runtime we care about define their own memory and export it, but it's not impossible to support imported memories, and if you've got a use case we can look into implementing support for them.
That could potentially affect things, yes; I was testing this on a mere 32-core machine after all. (:
It is indeed a significant improvement! Thanks!
In our use case at this point we're fine with the way it is currently. We need to support WASM blobs of either type, but we can just easily patch one into the other if necessary so that the memory's always defined internally and exported.
This PR adds a new copy-on-write based instance reuse mechanism on Linux.
Usage
The general idea is - you instantiate your instance once, and then you can reset its state back to how it was when it was initially instantiated.
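A hypothetical sketch of that usage pattern (the `reset` method name is illustrative rather than this PR's exact API, and the module path and `run` export are assumed):

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::from_file(&engine, "module.wasm")?; // assumed path
    let mut store = Store::new(&engine, ());

    // Instantiate once; a snapshot of the pristine state is captured here.
    let instance = Instance::new(&mut store, &module, &[])?;
    let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;

    for _ in 0..1_000 {
        run.call(&mut store, ())?;
        // Roll memory, tables and globals back to the snapshot instead of
        // re-instantiating. (Illustrative method name only.)
        // instance.reset(&mut store)?;
    }
    Ok(())
}
```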
How does it work?
After the instance is instantiated a snapshot of its state is taken. Tables and globals are simply cloned into a spare `Vec`, while the memory is copied into an `memfd` and remapped in place in a copy-on-write fashion. Then when `reset` is called the tables and the globals are restored by simply copying them back over, and the memory is reset using `madvise(MADV_DONTNEED)`, which either clears the memory (for those pages which are not mapped to the `memfd`) or restores the original contents (for those pages which are mapped to the `memfd`).
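The underlying trick, stripped of all wasmtime specifics, looks roughly like this (a standalone sketch of mine using the `libc` crate, with a 4-page region for illustration):

```rust
fn main() {
    const LEN: usize = 4096 * 4;
    unsafe {
        // 1. Create an anonymous in-memory file holding the snapshot.
        let fd = libc::memfd_create(b"snapshot\0".as_ptr().cast(), 0);
        assert!(fd >= 0);
        let snapshot = vec![0xAAu8; LEN];
        assert_eq!(libc::ftruncate(fd, LEN as i64), 0);
        assert_eq!(libc::write(fd, snapshot.as_ptr().cast(), LEN), LEN as isize);

        // 2. Map it MAP_PRIVATE: reads come from the memfd, while writes go
        //    to private copy-on-write pages shadowing it.
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            LEN,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE,
            fd,
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED);
        let mem = ptr as *mut u8;

        // 3. Dirty a page; only that page gets a private copy.
        *mem = 0xFF;
        assert_eq!(*mem, 0xFF);

        // 4. "reset": discard the private pages. The next read faults the
        //    original snapshot back in from the memfd.
        assert_eq!(libc::madvise(ptr, LEN, libc::MADV_DONTNEED), 0);
        assert_eq!(*mem, 0xAA);

        libc::munmap(ptr, LEN);
        libc::close(fd);
    }
}
```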
Benchmarks
In our benchmarks this is currently the fastest way to call into a WASM module assuming you rarely need to instantiate it from scratch.
Legend:

- `instance_pooling_with_uffd`: create a fresh instance with the `InstanceAllocationStrategy::Pooling` strategy with `uffd` turned on
- `instance_pooling_without_uffd`: create a fresh instance with the `InstanceAllocationStrategy::Pooling` strategy without `uffd` turned on
- `recreate_instance`: create a fresh instance with the `InstanceAllocationStrategy::OnDemand` strategy
- `native_instance_reuse`: this PR
- `interpreted`: just for our own reference; an instance created through the `wasmi` crate
- `legacy_instance_reuse`: just for our own reference; this is what we're currently using. It is an instance spawned with the `InstanceAllocationStrategy::OnDemand` strategy and then reused after manually clearing its memory and restoring its data segments and globals.

The two benchmarks shown here are:

- `call_empty_function`: an empty function is called in a loop, resetting (or recreating) the instance after each call
- `dirty_1mb_of_memory`: a function which dirties 1MB of memory and then returns is called in a loop, resetting (or recreating) the instance after each call

The measurements are only for the main thread; the thread count on the bottom signifies how many other threads were running in the background doing exactly the same thing as the main thread, e.g. for 4 threads there was 1 thread (the main thread) being benchmarked while the other 3 threads were running in the background.
For your reference the benchmarks used to generate these graphs can be found here:
https://github.com/koute/substrate/tree/master_wasmtime_benchmarks
They can be run like this after cloning the repository:
(The rustc version is just for reference as to what I used. Also, please forgive the hacky way the benchmarks have to be launched for instance pooling; we don't intend to keep this codepath, so I quickly hacked it in only for the benchmarks.)
The benchmarks were run on the following machine:

- AMD Threadripper 3970x (32-core)
- 64GB of RAM
- Linux 5.14.16
Possible future work (missing features)