
wasmtime: In-process sampling profiler #6282

Merged · 11 commits · May 2, 2023

Conversation

jameysharp (Contributor):

This isn't ready for detailed review but the basic functionality seems to work so I'm putting it up in case anyone wants to play with it.

Unlike the existing profiling options, this works on all platforms and does not rely on any external profiling tools like perf or VTune. On the other hand, it can only profile time spent in the WebAssembly guest, not in Wasmtime itself or other host code. Also it can't measure time as precisely as platform-native tools can.

jameysharp added a commit to jameysharp/wasmtime that referenced this pull request Apr 25, 2023
These audits are necessary for in-process guest profiling support,
currently under development in PR bytecodealliance#6282.
alexcrichton pushed a commit that referenced this pull request Apr 25, 2023
These audits are necessary for in-process guest profiling support,
currently under development in PR #6282.
jameysharp marked this pull request as ready for review April 26, 2023 00:31
jameysharp requested review from a team as code owners April 26, 2023 00:31
jameysharp requested review from fitzgen and removed request for a team April 26, 2023 00:31

jameysharp (Contributor, Author):

This PR now produces output that works as expected with https://profiler.firefox.com/, and I don't think it's fundamentally missing anything.

Before merging, I'd like to move the GuestProfiler struct somewhere in the wasmtime crate, instead of leaving it in wasmtime-cli where only wasmtime run can use it.

I don't think any of the implementation will change, though; it will just move elsewhere. So I'm marking this as ready for review.

elliottt removed the request for review from a team April 26, 2023 01:20
github-actions bot added the "wasi" and "wasmtime:api" labels Apr 26, 2023

alexcrichton (Member) left a comment:

I'm ok landing this personally, but I think we should be clear in the type's documentation that it's not at the level of implementation quality of a high-fidelity, general-purpose guest profiler. For example, some drawbacks I can think of are:

  • Requires epochs as the driver for stack collection, meaning the granularity of samples is function entries and loop headers. Additionally this means that you're not necessarily measuring the "raw" performance since it's code-based instrumentation.
  • Stacks are only ever collected from wasm, meaning that any time spent in the host is not accounted for. Samples can be lost while in the host, and I think it's also unclear whether the host was blocked or doing compute during that time (although perhaps this is fixable).

I suspect that a large number of use cases are totally ok with these drawbacks, but at the same time I think it's important to be clear that we're not looking to get into the business of a "real" profiler at this time with off-thread stack sampling, host integration, etc.

Additionally, would you also be up for updating the documentation in docs/*.md? Currently we have examples-profiling.md, which delegates to perf and VTune, and I think it'd be good to add an entry for this profiling strategy. Ideally there could be at least a short paragraph explaining why one might choose one profiling strategy over another (e.g. perf captures everything but is Linux-specific and not always easy to interpret, while this is cross-platform but not as "high fidelity" as perf).

Comment on lines 18 to 75
/// Collects profiling data for a single WebAssembly guest.
///
/// To use this, you'll need to arrange to call [`GuestProfiler::sample`] at
/// regular intervals while the guest is on the stack. The most straightforward
/// way to do that is to call it from a callback registered with
/// [`Store::epoch_deadline_callback()`](crate::Store::epoch_deadline_callback).
///
/// This implementation requires that the `Engine` was constructed with
/// [`Config::generate_address_map(true)`](crate::Config::generate_address_map).
pub struct GuestProfiler {

Member:

I think this'd be a good place to have a doc-block example if you're up for it about how to configure a Store and such.
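
For reference, a rough sketch of what such a doc-block example might look like, modeled on what run.rs does in this PR: the profiler lives behind an Arc<Mutex<..>>, an epoch-deadline callback takes a sample on every tick, and a background thread advances the epoch at the sampling interval. The GuestProfiler constructor, sample/finish signatures, and the callback's exact shape below are assumptions based on the code under review and vary across wasmtime versions:

use std::sync::{Arc, Mutex};
use std::time::Duration;
use wasmtime::{Config, Engine, GuestProfiler, Store};

fn run_with_profiling() -> anyhow::Result<()> {
    let interval = Duration::from_millis(1); // hypothetical sampling interval
    let mut config = Config::new();
    config.epoch_interruption(true); // epochs drive stack sampling
    config.generate_address_map(true); // required to symbolize guest PCs
    let engine = Engine::new(&config)?;

    // Constructor shape assumed from this PR's run.rs: module name + interval.
    let profiler = Arc::new(Mutex::new(GuestProfiler::new("my-module", interval)));

    let mut store = Store::new(&engine, ());
    store.set_epoch_deadline(1);
    let cb = profiler.clone();
    store.epoch_deadline_callback(move |_ctx| {
        // Take one stack sample per epoch tick, then extend the deadline by
        // one more tick. (The callback's parameter and return type differ
        // between wasmtime versions.)
        cb.lock().unwrap().sample();
        Ok(1)
    });

    // Advance the epoch once per sampling interval from a background thread.
    let ticker = engine.clone();
    std::thread::spawn(move || loop {
        std::thread::sleep(interval);
        ticker.increment_epoch();
    });

    // ... instantiate the module and call into the guest here ...

    // Write the profile, a JSON document for https://profiler.firefox.com/.
    let out = std::io::BufWriter::new(std::fs::File::create("profile.json")?);
    profiler.lock().unwrap().finish(out)?;
    Ok(())
}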

Comment on lines +12 to +19
// - On non-Windows, measure thread-local CPU usage between events with
// rustix::time::clock_gettime(ClockId::ThreadCPUTime)

Member:

This was actually a question I thought of: what happens if you block in a hostcall for a second?

I think you'd get a stack sample, much later, in probably the wrong location (e.g. the next function call after the host import, or the loop that's calling the host import). I'm not sure how those really-far-apart stack samples are interpreted by the Firefox profiler, but perhaps this CPU time measurement would help?
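
For what it's worth, here's a minimal non-Windows sketch of the to-do item quoted above, using rustix's thread CPU-time clock (assuming rustix with its time feature enabled). The CPU-time delta between two reads excludes time the thread spent blocked, so a sample taken after a long hostcall could be weighted by CPU time rather than wall time; how this would be wired into the profiler is left open here:

use std::time::Duration;
use rustix::time::{clock_gettime, ClockId};

// Read the calling thread's CPU-time clock. Time spent blocked (e.g. in a
// slow hostcall doing I/O) does not advance this clock, unlike wall time.
fn thread_cpu_time() -> Duration {
    let ts = clock_gettime(ClockId::ThreadCPUTime);
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

fn main() {
    let before = thread_cpu_time();
    // ... guest runs here, possibly calling a blocking host import ...
    let after = thread_cpu_time();
    // A large wall-clock gap with a small CPU-time delta indicates the
    // thread was blocked rather than computing.
    println!("thread CPU time used: {:?}", after - before);
}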

fn setup_epoch_handler(&self, store: &mut Store<Host>, module_name: &str) -> Box<dyn FnOnce()> {
    if let Some(path) = &self.profile_guest {
        let interval = self.profile_guest_interval;
        let profiler = Arc::new(Mutex::new(GuestProfiler::new(module_name, interval)));

Member:

API question: instead of requiring an Arc and Mutex here, which technically aren't necessary since there aren't any threads, could this instead be stored in the T of Store<T>? As-is the GuestProfiler::sample method would not work since it would be stored in T but additionally takes a store as a parameter, but I think that could be fixed by taking a &WasmBacktrace as an argument since that's all the store is used for internally.

jameysharp (Contributor, Author):

"Wait, I could just keep it in the Store," I thought, as I fell asleep last night; then wake up to find this comment. 😆

I don't want to make WasmBacktrace part of the public API here because I want to re-implement this in terms of the lower-level wasmtime-runtime stack-walking API. That doesn't directly need access to the store at all, but I do need some information from the store at some point to figure out how native PCs map to individual modules and to functions within those modules.

It might work though to pass in the store when constructing a GuestProfiler, collect just the necessary information about all the modules and symbols from it, and then not need it in sample.

Otherwise, the only alternative I've thought of is to define fn sample(store: impl AsContextMut, get_self: impl FnOnce(&mut Store) -> &mut Self). Not the nicest interface but I think it should work, right?

Member:

How about:

impl<T> Store<T> {
    pub fn sample(&mut self, get_profiler: impl FnOnce(&mut T) -> &mut GuestProfiler) { ... }
}

with similar impls for StoreContextMut?

which is more-or-less the same thing, just as a method on the store instead of a function on GuestProfiler, now that I reread your comment.

Comment on lines 94 to 95
/// When the guest finishes running, call this function to write the
/// profile to the given `output`.

Member:

Could this be expanded with more information about what to do with the output? For example, I think it would be good to explain that it's a JSON blob to upload to https://profiler.firefox.com.

fitzgen (Member) left a comment:

Really excited for this!

When profiling is enabled in the CLI, it probably makes sense to write some output after running the Wasm that repeats where the profile file was written and how to upload it to https://profiler.firefox.com/.

Comment on lines +20 to +33
/// To use this, you'll need to arrange to call [`GuestProfiler::sample`] at
/// regular intervals while the guest is on the stack. The most straightforward
/// way to do that is to call it from a callback registered with
/// [`Store::epoch_deadline_callback()`](crate::Store::epoch_deadline_callback).

Member:

It could be good to have a full example here, if we are just going to expose building blocks rather than a single enable-profiling-and-dump-profiles-to-this-file kind of API.


/// When the guest finishes running, call this function to write the
/// profile to the given `output`.
pub fn finish(&mut self, output: impl std::io::Write) -> Result<()> {

Member:

Should this method take self? Would it be valid to add more samples after calling finish?

jameysharp (Contributor, Author):

If you call finish again it'll just update the thread end-time and write out the whole profile again, including any new samples. So it's probably not useful to add more samples after calling finish, but it should be harmless.

That said, I only did it this way because there's still an extra reference to the GuestProfiler in the epoch callback after the instance finishes, and I didn't want to have to clone the profile. But if I can make Alex's suggestion work, of keeping the GuestProfiler in the store instead of in an Arc, then that won't be an issue and I could make the change you're suggesting too.

Comment on lines 431 to 449
return Box::new(move || {
    let mut profiler = profiler.lock().unwrap();
    if let Err(e) = std::fs::File::create(&path)
        .map_err(anyhow::Error::new)
        .and_then(|output| profiler.finish(std::io::BufWriter::new(output)))
    {
        eprintln!("failed writing profile at {}: {e:#}", path.display());
    }
});

Member:

Doesn't matter a ton in this case, but one pattern I've used to make APIs like this slightly harder to get wrong is to return an impl Drop:

struct FinishEpochHandler<F: FnOnce()>(Option<F>);
impl<F: FnOnce()> Drop for FinishEpochHandler<F> {
    fn drop(&mut self) {
        if let Some(f) = self.0.take() {
            f();
        }
    }
}
return FinishEpochHandler(Some(move || {
    // what you have now...
}));

This way, callers don't need to actually remember to call the resulting function, at most they just need to add a scope.

But I guess if you do Alex's suggestion of moving the profiler into Store, then this result function would require a store argument, and the RAII/impl Drop approach won't really work anymore...
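
Setting that caveat aside, here's a tiny self-contained demo of how the guard behaves; the closure just prints where the real code would write the profile:

struct FinishEpochHandler<F: FnOnce()>(Option<F>);

impl<F: FnOnce()> Drop for FinishEpochHandler<F> {
    fn drop(&mut self) {
        if let Some(f) = self.0.take() {
            f();
        }
    }
}

fn main() {
    {
        // The closure runs when the guard leaves scope, even on an early
        // return or a `?` propagating an error out of the block.
        let _finish = FinishEpochHandler(Some(|| println!("profile written")));
        println!("running guest...");
    } // prints "profile written" here
}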

eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Apr 28, 2023
These audits are necessary for in-process guest profiling support,
currently under development in PR bytecodealliance#6282.

jameysharp (Contributor, Author):

I got rid of the need for Store in sample(). Then I discovered that I can't put a GuestProfiler directly in the Host structure because some of the WASI extensions require that Host must be cloneable.

As far as I can tell I end up needing an Arc and Mutex anyway. I would love for somebody to tell me how to avoid that.

It was still worth doing this because now any particular PC is only looked up once per profile and cached by fxprof_processed_profile. Different PCs within the same function need to be looked up separately but since we only collect stack traces at epoch interruption points there aren't so many distinct PCs. This should significantly reduce profiling overhead but I haven't measured how much impact it actually has.

I struggled with this a bunch because I was trying to make it automatically use all modules which have been registered within the current Store's ModuleRegistry. But for WASI commands the module isn't registered in the store until you actually try to invoke it, and I tried all sorts of ways to hook into that registration before realizing that the caller knows which modules they care about and I can just make it their problem to tell me.

Then I decided that it's a feature to be able to exclude modules from profiles: if you have some proprietary module that you're linking with your customer's modules, you may not want to reveal anything about the implementation of your module when providing them with a profile.
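
To illustrate that last point, here's a minimal sketch of the caller-supplies-the-modules approach, assuming the constructor shape that came out of this change (module name, sampling interval, and an explicit list of named modules); the file names are made up:

use std::time::Duration;
use wasmtime::{Engine, GuestProfiler, Module};

fn make_profiler(engine: &Engine) -> anyhow::Result<GuestProfiler> {
    let customer = Module::from_file(engine, "customer.wasm")?;
    let _proprietary = Module::from_file(engine, "proprietary.wasm")?;

    // Only the modules passed in here are symbolized and show up in the
    // profile; deliberately leaving the proprietary module out keeps its
    // implementation details private.
    Ok(GuestProfiler::new(
        "customer.wasm",
        Duration::from_millis(1),
        vec![("customer".to_string(), customer)],
    ))
}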

alexcrichton (Member):

As far as I can tell I end up needing an Arc and Mutex anyway.

Ah, this is due to the wasi-threads integration, where spawning a thread is modeled by cloning the store's context to duplicate it on the new thread. Otherwise the cloning is not required. You could take a similar route to the WasiNnCtx integration, for example, which also does not implement Clone: it's bundled into an Arc, and whenever mutable access is required it calls Arc::get_mut. That fails if there's more than one clone, i.e. more than one thread, and the error message says "this doesn't support threads yet". I think profiling will fall into the same bucket?
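
For illustration, a minimal sketch of that pattern (all names hypothetical): the context stays Clone because it only holds an Arc, but mutable access goes through Arc::get_mut and is refused once a second clone, i.e. a second thread, exists:

use std::sync::Arc;
use anyhow::{bail, Result};

struct ProfilerState { /* samples, symbol tables, ... */ }

#[derive(Clone)]
struct ProfilerCtx(Arc<ProfilerState>);

impl ProfilerCtx {
    fn get_mut(&mut self) -> Result<&mut ProfilerState> {
        match Arc::get_mut(&mut self.0) {
            Some(state) => Ok(state),
            // wasi-threads duplicates the host context by cloning it, so a
            // second Arc reference exists and mutation is refused.
            None => bail!("in-process profiling does not support threads yet"),
        }
    }
}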

In the long-term future I could imagine that there's a standalone profiler struct where you acquire thread-specific profilers from that structure. The Host could store the standalone profiler (which internally probably has an arc and a mutex) and additionally thread-specific profilers. The Clone for Host impl would then use the standalone profiler to get a new thread-specific profiler, allowing each thread to have mutable access to its profiler but no others. Something like that. No need to do here of course, mostly just modeling how this may look in the future with support for threads.

I can just make it their problem to tell me.

This seems like a neat idea to me! I like the idea of precomputing symbol tables. I do think it has the downside of requiring more setup, but that's also perhaps not a bad thing, and it's not exactly onerous at this time, so it seems fine by me.

I got rid of the need for Store in sample().

Personally I'm a bit wary of this because I wouldn't want to commit long-term to being able to capture a stack trace without any input. I'm not sure how we'll ever move away from those thread locals, but I sure would love to one day if someone gets a good idea. In such a situation there's no way to start the backtrace without a contextual store being passed in. All that being said, it's not like we need to get the API 100% perfect to begin with; that's why we have major version bumps, after all. I mostly wanted to put this out there, and I think it's fine to land as-is with your current strategy. If and when we remove TLS (fingers crossed) we can figure out how to change this API at that point. I realize that passing the store into something stored within the store is not easy, so let's stick with what you've got.

jameysharp added 8 commits May 1, 2023 12:57
Unlike the existing profiling options, this works on all platforms and
does not rely on any external profiling tools like perf or VTune. On the
other hand, it can only profile time spent in the WebAssembly guest, not
in Wasmtime itself or other host code. Also it can't measure time as
precisely as platform-native tools can.

The profile is saved in the Firefox processed format, which can be
viewed using https://profiler.firefox.com/.
jameysharp added 2 commits May 1, 2023 14:53
Use fxprof_processed_profile's support for looking up symbols to avoid
looking up the same PC more than once per profile.
Also extend the documentation based on review comments.

jameysharp (Contributor, Author):

Thanks for the tips, Alex: that worked!

Okay, I've addressed most of the issues that you all have raised. What I haven't done is added documentation to the Wasmtime book, or API usage examples to the doc-comments. I'd kind of like to merge the implementation and then come back to that, if that's okay. I hope the doc-comments I've written so far will be a good start for the book, and the usage in run.rs will be a good start for the usage examples…

There are also still several to-do items, like measuring CPU time and noting wasm entry/leave events, which I also want to defer to future work.

alexcrichton (Member) left a comment:

Sounds like a good plan to me 👍

jameysharp added this pull request to the merge queue May 2, 2023
Merged via the queue into bytecodealliance:main with commit fea6ab9 May 2, 2023
jameysharp deleted the guest-profiler branch May 2, 2023 16:25