
wasmtime: In-process sampling profiler #6282

Merged · 11 commits · May 2, 2023

Conversation

jameysharp (Contributor):

This isn't ready for detailed review but the basic functionality seems to work so I'm putting it up in case anyone wants to play with it.

Unlike the existing profiling options, this works on all platforms and does not rely on any external profiling tools like perf or VTune. On the other hand, it can only profile time spent in the WebAssembly guest, not in Wasmtime itself or other host code. Also it can't measure time as precisely as platform-native tools can.

jameysharp added a commit to jameysharp/wasmtime that referenced this pull request Apr 25, 2023
These audits are necessary for in-process guest profiling support,
currently under development in PR bytecodealliance#6282.
alexcrichton pushed a commit that referenced this pull request Apr 25, 2023
These audits are necessary for in-process guest profiling support,
currently under development in PR #6282.
jameysharp marked this pull request as ready for review April 26, 2023 00:31
jameysharp requested review from a team as code owners April 26, 2023 00:31
jameysharp requested review from fitzgen and removed request for a team April 26, 2023 00:31

jameysharp (Contributor, Author):

This PR now produces output that works as expected with https://profiler.firefox.com/, and I don't think it's fundamentally missing anything.

Before merging, I'd like to move the GuestProfiler struct somewhere in the wasmtime crate, instead of leaving it in wasmtime-cli where only wasmtime run can use it.

I don't think any of the implementation will change, though; it will just move elsewhere. So I'm marking this as ready for review.

elliottt removed the request for review from a team April 26, 2023 01:20
github-actions bot added the "wasi" and "wasmtime:api" labels Apr 26, 2023

alexcrichton (Member) left a comment:

I'm ok landing this personally, but I think we should be clear in the type's documentation that it's not at the level of implementation quality of a high-fidelity, general-purpose guest profiler. For example, some drawbacks I can think of are:

  • Requires epochs as the driver for stack collection, meaning the granularity of samples is function entries and loop headers. Additionally this means that you're not necessarily measuring the "raw" performance since it's code-based instrumentation.
  • Stacks are only ever collected from wasm, meaning that any time spent in the host is not accounted for. Samples can be lost while in the host, and I think it's also unclear whether the host was blocked or doing compute during that time (although perhaps this is fixable).

I suspect that a large number of use cases are totally ok with these drawbacks, but at the same time I think it's important to be clear that we're not looking to get into the business of a "real" profiler at this time with off-thread stack sampling, host integration, etc.

Additionally, would you also be up for updating the documentation in docs/*.md? Currently we have examples-profiling.md, which delegates to perf and VTune, and I think it'd be good to add an entry for this profiling strategy. Ideally there could be at least a short paragraph explaining why one might choose one profiling strategy over another (e.g. perf captures everything but is Linux-specific and not always easy to interpret, while this is cross-platform but not as "high fidelity" as perf).

Comment on lines 18 to 75
/// Collects profiling data for a single WebAssembly guest.
///
/// To use this, you'll need to arrange to call [`GuestProfiler::sample`] at
/// regular intervals while the guest is on the stack. The most straightforward
/// way to do that is to call it from a callback registered with
/// [`Store::epoch_deadline_callback()`](crate::Store::epoch_deadline_callback).
///
/// This implementation requires that the `Engine` was constructed with
/// [`Config::generate_address_map(true)`](crate::Config::generate_address_map).
pub struct GuestProfiler {

Member:

I think this'd be a good place to have a doc-block example if you're up for it about how to configure a Store and such.
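
For reference, a rough sketch of what such a doc-block example might look like, modeled on what run.rs does in this PR: the profiler lives behind an Arc<Mutex<..>>, an epoch-deadline callback takes a sample on every tick, and a background thread advances the epoch at the sampling interval. The GuestProfiler constructor, sample/finish signatures, and the callback's exact shape below are assumptions based on the code under review and vary across wasmtime versions:

use std::sync::{Arc, Mutex};
use std::time::Duration;
use wasmtime::{Config, Engine, GuestProfiler, Store};

fn run_with_profiling() -> anyhow::Result<()> {
    let interval = Duration::from_millis(1); // hypothetical sampling interval
    let mut config = Config::new();
    config.epoch_interruption(true); // epochs drive stack sampling
    config.generate_address_map(true); // required to symbolize guest PCs
    let engine = Engine::new(&config)?;

    // Constructor shape assumed from this PR's run.rs: module name + interval.
    let profiler = Arc::new(Mutex::new(GuestProfiler::new("my-module", interval)));

    let mut store = Store::new(&engine, ());
    store.set_epoch_deadline(1);
    let cb = profiler.clone();
    store.epoch_deadline_callback(move |_ctx| {
        // Take one stack sample per epoch tick, then extend the deadline by
        // one more tick. (The callback's parameter and return type differ
        // between wasmtime versions.)
        cb.lock().unwrap().sample();
        Ok(1)
    });

    // Advance the epoch once per sampling interval from a background thread.
    let ticker = engine.clone();
    std::thread::spawn(move || loop {
        std::thread::sleep(interval);
        ticker.increment_epoch();
    });

    // ... instantiate the module and call into the guest here ...

    // Write the profile, a JSON document for https://profiler.firefox.com/.
    let out = std::io::BufWriter::new(std::fs::File::create("profile.json")?);
    profiler.lock().unwrap().finish(out)?;
    Ok(())
}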

Comment on lines +12 to +19
// - On non-Windows, measure thread-local CPU usage between events with
// rustix::time::clock_gettime(ClockId::ThreadCPUTime)

Member:

This was actually a question I thought of: what happens if you block in a hostcall for a second?

I think you'd get a stack sample, much later, in probably the wrong location (e.g. the next function call after the host import, or the loop that's calling the host import). I'm not sure how those really-far-apart stack samples are interpreted by the Firefox profiler, but perhaps this CPU time measurement would help?
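
For what it's worth, here's a minimal non-Windows sketch of the to-do item quoted above, using rustix's thread CPU-time clock (assuming rustix with its time feature enabled). The CPU-time delta between two reads excludes time the thread spent blocked, so a sample taken after a long hostcall could be weighted by CPU time rather than wall time; how this would be wired into the profiler is left open here:

use std::time::Duration;
use rustix::time::{clock_gettime, ClockId};

// Read the calling thread's CPU-time clock. Time spent blocked (e.g. in a
// slow hostcall doing I/O) does not advance this clock, unlike wall time.
fn thread_cpu_time() -> Duration {
    let ts = clock_gettime(ClockId::ThreadCPUTime);
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

fn main() {
    let before = thread_cpu_time();
    // ... guest runs here, possibly calling a blocking host import ...
    let after = thread_cpu_time();
    // A large wall-clock gap with a small CPU-time delta indicates the
    // thread was blocked rather than computing.
    println!("thread CPU time used: {:?}", after - before);
}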

fn setup_epoch_handler(&self, store: &mut Store<Host>, module_name: &str) -> Box<dyn FnOnce()> {
    if let Some(path) = &self.profile_guest {
        let interval = self.profile_guest_interval;
        let profiler = Arc::new(Mutex::new(GuestProfiler::new(module_name, interval)));

Member:

API question: instead of requiring an Arc and Mutex here, which technically aren't necessary since there aren't any threads, could this instead be stored in the T of Store<T>? As-is the GuestProfiler::sample method would not work since it would be stored in T but additionally takes a store as a parameter, but I think that could be fixed by taking a &WasmBacktrace as an argument since that's all the store is used for internally.

jameysharp (Contributor, Author):

"Wait, I could just keep it in the Store," I thought, as I fell asleep last night; then wake up to find this comment. 😆

I don't want to make WasmBacktrace part of the public API here because I want to re-implement this in terms of the lower-level wasmtime-runtime stack-walking API. That doesn't directly need access to the store at all, but I do need some information from the store at some point to figure out how native PCs map to individual modules and to functions within those modules.

It might work though to pass in the store when constructing a GuestProfiler, collect just the necessary information about all the modules and symbols from it, and then not need it in sample.

Otherwise, the only alternative I've thought of is to define fn sample(store: impl AsContextMut, get_self: impl FnOnce(&mut Store) -> &mut Self). Not the nicest interface but I think it should work, right?

Member:

How about:

impl<T> Store<T> {
    pub fn sample(&mut self, get_profiler: impl FnOnce(&mut T) -> &mut GuestProfiler) { ... }
}

with similar impls for StoreContextMut?

which is more-or-less the same thing, just as a method on the store instead of a function on GuestProfiler, now that I reread your comment.

Comment on lines 94 to 95
/// When the guest finishes running, call this function to write the
/// profile to the given `output`.

Member:

Could this be expanded with more information about what to do with the output? For example, I think it would be good to explain that it's a JSON blob to upload to https://profiler.firefox.com.

fitzgen (Member) left a comment:

Really excited for this!

When profiling is enabled in the CLI, it probably makes sense to write some output after running the Wasm that repeats where the profile file was written and how to upload it to https://profiler.firefox.com/.

Comment on lines +20 to +33
/// To use this, you'll need to arrange to call [`GuestProfiler::sample`] at
/// regular intervals while the guest is on the stack. The most straightforward
/// way to do that is to call it from a callback registered with
/// [`Store::epoch_deadline_callback()`](crate::Store::epoch_deadline_callback).

Member:

It could be good to have a full example here, if we are just going to expose building blocks rather than a single enable-profiling-and-dump-profiles-to-this-file kind of API.


/// When the guest finishes running, call this function to write the
/// profile to the given `output`.
pub fn finish(&mut self, output: impl std::io::Write) -> Result<()> {

Member:

Should this method take self? Would it be valid to add more samples after calling finish?

jameysharp (Contributor, Author):

If you call finish again it'll just update the thread end-time and write out the whole profile again, including any new samples. So it's probably not useful to add more samples after calling finish, but it should be harmless.

That said, I only did it this way because there's still an extra reference to the GuestProfiler in the epoch callback after the instance finishes, and I didn't want to have to clone the profile. But if I can make Alex's suggestion work, of keeping the GuestProfiler in the store instead of in an Arc, then that won't be an issue and I could make the change you're suggesting too.

Comment on lines 431 to 449
return Box::new(move || {
    let mut profiler = profiler.lock().unwrap();
    if let Err(e) = std::fs::File::create(&path)
        .map_err(anyhow::Error::new)
        .and_then(|output| profiler.finish(std::io::BufWriter::new(output)))
    {
        eprintln!("failed writing profile at {}: {e:#}", path.display());
    }
});

Member:

Doesn't matter a ton in this case, but one pattern I've used to make APIs like this slightly harder to get wrong is to return an impl Drop:

struct FinishEpochHandler<F: FnOnce()>(Option<F>);
impl<F: FnOnce()> Drop for FinishEpochHandler<F> {
    fn drop(&mut self) {
        if let Some(f) = self.0.take() {
            f();
        }
    }
}
return FinishEpochHandler(Some(move || {
    // what you have now...
}));

This way, callers don't need to actually remember to call the resulting function, at most they just need to add a scope.

But I guess if you do Alex's suggestion of moving the profiler into Store, then this result function would require a store argument, and the RAII/impl Drop approach won't really work anymore...
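
Setting that caveat aside, here's a tiny self-contained demo of how the guard behaves; the closure just prints where the real code would write the profile:

struct FinishEpochHandler<F: FnOnce()>(Option<F>);

impl<F: FnOnce()> Drop for FinishEpochHandler<F> {
    fn drop(&mut self) {
        if let Some(f) = self.0.take() {
            f();
        }
    }
}

fn main() {
    {
        // The closure runs when the guard leaves scope, even on an early
        // return or a `?` propagating an error out of the block.
        let _finish = FinishEpochHandler(Some(|| println!("profile written")));
        println!("running guest...");
    } // prints "profile written" here
}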

eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Apr 28, 2023
These audits are necessary for in-process guest profiling support,
currently under development in PR bytecodealliance#6282.

jameysharp (Contributor, Author):

I got rid of the need for Store in sample(). Then I discovered that I can't put a GuestProfiler directly in the Host structure because some of the WASI extensions require that Host must be cloneable.

As far as I can tell I end up needing an Arc and Mutex anyway. I would love for somebody to tell me how to avoid that.

It was still worth doing this because now any particular PC is only looked up once per profile and cached by fxprof_processed_profile. Different PCs within the same function need to be looked up separately but since we only collect stack traces at epoch interruption points there aren't so many distinct PCs. This should significantly reduce profiling overhead but I haven't measured how much impact it actually has.

I struggled with this a bunch because I was trying to make it automatically use all modules which have been registered within the current Store's ModuleRegistry. But for WASI commands the module isn't registered in the store until you actually try to invoke it, and I tried all sorts of ways to hook into that registration before realizing that the caller knows which modules they care about and I can just make it their problem to tell me.

Then I decided that it's a feature to be able to exclude modules from profiles: if you have some proprietary module that you're linking with your customer's modules, you may not want to reveal anything about the implementation of your module when providing them with a profile.
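
To illustrate that last point, here's a minimal sketch of the caller-supplies-the-modules approach, assuming the constructor shape that came out of this change (module name, sampling interval, and an explicit list of named modules); the file names are made up:

use std::time::Duration;
use wasmtime::{Engine, GuestProfiler, Module};

fn make_profiler(engine: &Engine) -> anyhow::Result<GuestProfiler> {
    let customer = Module::from_file(engine, "customer.wasm")?;
    let _proprietary = Module::from_file(engine, "proprietary.wasm")?;

    // Only the modules passed in here are symbolized and show up in the
    // profile; deliberately leaving the proprietary module out keeps its
    // implementation details private.
    Ok(GuestProfiler::new(
        "customer.wasm",
        Duration::from_millis(1),
        vec![("customer".to_string(), customer)],
    ))
}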

alexcrichton (Member):

As far as I can tell I end up needing an Arc and Mutex anyway.

Ah, this is due to the wasi-threads integration, where spawning a thread is modeled by cloning the store's context to duplicate it on the new thread. Otherwise the cloning is not required. You could take a similar route to the WasiNnCtx integration, for example, which also does not implement Clone: it's bundled into an Arc, and whenever mutable access is required it calls Arc::get_mut. That fails if there's more than one clone, i.e. more than one thread, and the error message says "this doesn't support threads yet". I think profiling will fall into the same bucket?
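
For illustration, a minimal sketch of that pattern (all names hypothetical): the context stays Clone because it only holds an Arc, but mutable access goes through Arc::get_mut and is refused once a second clone, i.e. a second thread, exists:

use std::sync::Arc;
use anyhow::{bail, Result};

struct ProfilerState { /* samples, symbol tables, ... */ }

#[derive(Clone)]
struct ProfilerCtx(Arc<ProfilerState>);

impl ProfilerCtx {
    fn get_mut(&mut self) -> Result<&mut ProfilerState> {
        match Arc::get_mut(&mut self.0) {
            Some(state) => Ok(state),
            // wasi-threads duplicates the host context by cloning it, so a
            // second Arc reference exists and mutation is refused.
            None => bail!("in-process profiling does not support threads yet"),
        }
    }
}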

In the long-term future I could imagine that there's a standalone profiler struct where you acquire thread-specific profilers from that structure. The Host could store the standalone profiler (which internally probably has an arc and a mutex) and additionally thread-specific profilers. The Clone for Host impl would then use the standalone profiler to get a new thread-specific profiler, allowing each thread to have mutable access to its profiler but no others. Something like that. No need to do here of course, mostly just modeling how this may look in the future with support for threads.

I can just make it their problem to tell me.

This seems like a neat idea to me! I like the idea of precomputing symbol tables. I do think it has the downside of requiring more setup, but that's also perhaps not a bad thing, and it's not exactly onerous at this time, so it seems fine by me.

I got rid of the need for Store in sample().

Personally I'm a bit wary of this because I wouldn't want to commit long-term to being able to capture a stack trace without any input. I'm not sure how we'll ever move away from those thread locals, but I sure would love to one day if someone gets a good idea. In such a situation there's no way to start the backtrace without a contextual store being passed in. All that being said, it's not like we need to get the API 100% perfect to begin with; that's why we have major version bumps, after all. I mostly wanted to put this out there, and I think it's fine to land as-is with your current strategy. If and when we remove TLS (fingers crossed) we can figure out how to change this API at that point. I realize that passing the store into something stored within the store is not easy, so let's stick with what you've got.

jameysharp added 8 commits May 1, 2023 12:57
Unlike the existing profiling options, this works on all platforms and
does not rely on any external profiling tools like perf or VTune. On the
other hand, it can only profile time spent in the WebAssembly guest, not
in Wasmtime itself or other host code. Also it can't measure time as
precisely as platform-native tools can.

The profile is saved in the Firefox processed format, which can be
viewed using https://profiler.firefox.com/.
jameysharp added 2 commits May 1, 2023 14:53
Use fxprof_processed_profile's support for looking up symbols to avoid
looking up the same PC more than once per profile.
Also extend the documentation based on review comments.

jameysharp (Contributor, Author):

Thanks for the tips, Alex: that worked!

Okay, I've addressed most of the issues that you all have raised. What I haven't done is added documentation to the Wasmtime book, or API usage examples to the doc-comments. I'd kind of like to merge the implementation and then come back to that, if that's okay. I hope the doc-comments I've written so far will be a good start for the book, and the usage in run.rs will be a good start for the usage examples…

There are also still several to-do items, like measuring CPU time and noting wasm entry/leave events, which I also want to defer to future work.

alexcrichton (Member) left a comment:

Sounds like a good plan to me 👍

jameysharp added this pull request to the merge queue May 2, 2023
Merged via the queue into bytecodealliance:main with commit fea6ab9 May 2, 2023
jameysharp deleted the guest-profiler branch May 2, 2023 16:25