-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wasmtime: In-process sampling profiler #6282
Conversation
These audits are necessary for in-process guest profiling support, currently under development in PR bytecodealliance#6282.
b362390
to
f616e5d
Compare
This PR now produces output that works as expected with https://profiler.firefox.com/, and I don't think it's fundamentally missing anything. Before merging, I'd like to move the I don't think any of the implementation will change though, just move elsewhere, so I'm marking this as ready for review. |
Subscribe to Label Actioncc @kubkon, @peterhuene
This issue or pull request has been labeled: "wasi", "wasmtime:api"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok landing this personally, but I think we should be clear in the documentation of the type that it's not really at a state of implementation quality to be a high-fidelity general purpose guest profiler. For example some drawbacks I can think of are:
- Requires epochs as the driver for stack collection, meaning the granularity of samples is function entries and loop headers. Additionally this means that you're not necessarily measuring the "raw" performance since it's code-based instrumentation.
- Stacks are always collected from wasm meaning that any time spent in the host is not accounted for. This means that samples can be lost while in the host and additionally whether or not the host blocked or did compute I think is unclear. (although perhaps this is fixable)
I suspect that a large number of use cases are totally ok with these drawbacks, but at the same time I think it's important to be clear that we're not looking to get into the business of a "real" profiler at this time with off-thread stack sampling, host integration, etc.
Additionally would you also be up for updating the documentation on docs/*.md
? Currently we have examples-profiling.md
which delegates to perf
and vtune
and I think it'd be good to add an entry for this strategy of profiling. Ideally there could be at least a small paragraph of words explaining why one might choose one profiling strategy over another (e.g. perf
captures everything but is Linux-specific and not always easy to interpret, while this is cross-platform but not as "high fidelity" as perf
)
crates/wasmtime/src/profiling.rs
Outdated
/// Collects profiling data for a single WebAssembly guest. | ||
/// | ||
/// To use this, you'll need to arrange to call [`GuestProfiler::sample`] at | ||
/// regular intervals while the guest is on the stack. The most straightforward | ||
/// way to do that is to call it from a callback registered with | ||
/// [`Store::epoch_deadline_callback()`](crate::Store::epoch_deadline_callback). | ||
/// | ||
/// This implementation requires that the `Engine` was constructed with | ||
/// [`Config::generate_address_map(true)`](crate::Config::generate_address_map). | ||
pub struct GuestProfiler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this'd be a good place to have a doc-block example if you're up for it about how to configure a Store
and such.
// - On non-Windows, measure thread-local CPU usage between events with | ||
// rustix::time::clock_gettime(ClockId::ThreadCPUTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually a question I thought of, what happens if you block in a hostcall for a second?
I think you'd get a stack sample, much later, in probably the wrong location (e.g. the next function call after the host import or the loop that's calling the host import). I'm not sure how those really-far-apart stack samples are interpreted by the firefox profiler, but perhaps this CPU time measurement would help?
src/commands/run.rs
Outdated
fn setup_epoch_handler(&self, store: &mut Store<Host>, module_name: &str) -> Box<dyn FnOnce()> { | ||
if let Some(path) = &self.profile_guest { | ||
let interval = self.profile_guest_interval; | ||
let profiler = Arc::new(Mutex::new(GuestProfiler::new(module_name, interval))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API question: instead of requiring an Arc
and Mutex
here, which technically aren't necessary since there aren't any threads, could this instead be stored in the T
of Store<T>
? As-is the GuestProfiler::sample
method would not work since it would be stored in T
but additionally takes a store as a parameter, but I think that could be fixed by taking a&WasmBacktrace
as an argument since that's all the store is used for internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Wait, I could just keep it in the Store," I thought, as I fell asleep last night; then wake up to find this comment. 😆
I don't want to make WasmBacktrace
part of the public API here because I want to re-implement this in terms of the lower-level wasmtime-runtime
stack-walking API. That doesn't directly need access to the store at all, but I do need some information from the store at some point to figure out how native PCs map to individual modules and to functions within those modules.
It might work though to pass in the store when constructing a GuestProfiler
, collect just the necessary information about all the modules and symbols from it, and then not need it in sample
.
Otherwise, the only alternative I've thought of is to define fn sample(store: impl AsContextMut, get_self: impl FnOnce(&mut Store) -> &mut Self)
. Not the nicest interface but I think it should work, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
impl<T> Store<T> {
pub fn sample(&mut self, get_profiler: impl FnOnce(&mut T) -> &mut GuestProfiler) { ... }
}
with similar impls for StoreContextMut
?
which is more-or-less the same thing, but just as a method instead of a function on GuestProfiler
now that I reread your comment
crates/wasmtime/src/profiling.rs
Outdated
/// When the guest finishes running, call this function to write the | ||
/// profile to the given `output`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be expanded with more information about what to do with the output? For example explaining that it's a JSON blob to upload to https://profiler.firefox.com I think would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really excited for this!
When profiling is enabled in the CLI, it probably makes sense to write some output after running the Wasm that repeats where the profile file was written to and how to upload it to https://profiler.firefox.com/
/// To use this, you'll need to arrange to call [`GuestProfiler::sample`] at | ||
/// regular intervals while the guest is on the stack. The most straightforward | ||
/// way to do that is to call it from a callback registered with | ||
/// [`Store::epoch_deadline_callback()`](crate::Store::epoch_deadline_callback). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be good to have a full example here, if we are just going to expose building blocks rather than a single enable-profiling-and-get-dump-profiles-to-this-file kind of API.
crates/wasmtime/src/profiling.rs
Outdated
|
||
/// When the guest finishes running, call this function to write the | ||
/// profile to the given `output`. | ||
pub fn finish(&mut self, output: impl std::io::Write) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this method take self
? Would it be valid to add more samples after calling finish
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you call finish
again it'll just update the thread end-time and write out the whole profile again, including any new samples. So it's probably not useful to add more samples after calling finish
, but it should be harmless.
That said, I only did it this way because there's still an extra reference to the GuestProfiler
in the epoch callback after the instance finishes, and I didn't want to have to clone the profile. But if I can make Alex's suggestion work, of keeping the GuestProfiler
in the store instead of in an Arc
, then that won't be an issue and I could make the change you're suggesting too.
src/commands/run.rs
Outdated
return Box::new(move || { | ||
let mut profiler = profiler.lock().unwrap(); | ||
if let Err(e) = std::fs::File::create(&path) | ||
.map_err(anyhow::Error::new) | ||
.and_then(|output| profiler.finish(std::io::BufWriter::new(output))) | ||
{ | ||
eprintln!("failed writing profile at {}: {e:#}", path.display()); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't matter a ton in this case, but one pattern I've used to make APIs like this slightly harder to get wrong is to return an impl Drop
:
struct FinishEpochHandler<F: FnOnce()>(Option<F>);
impl<F: FnOnce()> Drop for FinishEpochHandler<F> {
fn drop(&mut self) {
if let Some(f) = self.0.take() {
f();
}
}
}
return FinishEpochHandler(Some(move || {
// what you have now...
}));
This way, callers don't need to actually remember to call the resulting function, at most they just need to add a scope.
But I guess if you do Alex's suggestion of moving the profiler into Store
, then this result function would require a store argument, and the RAII/impl Drop
approach won't really work anymore...
These audits are necessary for in-process guest profiling support, currently under development in PR bytecodealliance#6282.
I got rid of the need for As far as I can tell I end up needing an It was still worth doing this because now any particular PC is only looked up once per profile and cached by I struggled with this a bunch because I was trying to make it automatically use all modules which have been registered within the current Then I decided that it's a feature to be able to exclude modules from profiles: if you have some proprietary module that you're linking with your customer's modules, you may not want to reveal anything about the implementation of your module when providing them with a profile. |
Ah this is due to the integration of In the long-term future I could imagine that there's a standalone profiler struct where you acquire thread-specific profilers from that structure. The
This seems like a neat idea to me! I like the idea of precomputing symbol tables. I do think this has the downside of that it has more setup required, but that's also perhaps not a bad thing and it's not exactly onerous at this time, so it seems fine by me.
Personally I'm a bit wary of this because to me I wouldn't want to commit long-term to being able to capture a stack trace without any input. I'm not sure how we'll ever move away from those thread locals but I sure would love to one day if someone gets a good idea. In such a situation there's no way to start the backtrace without a contextual store being passed in. All that being said though it's not like we need to get the API 100% prefect to begin with, that's why we have major version bumps after all. I mostly wanted to put this out there, but I additionally think it's fine to land as-is with your current strategy. If and when we remove TLS (fingers crossed) we can figure out how to change this API at this point. I realize that passing the store into something stored within the store is not easy, so let's stick with what you've got. |
Unlike the existing profiling options, this works on all platforms and does not rely on any external profiling tools like perf or VTune. On the other hand, it can only profile time spent in the WebAssembly guest, not in Wasmtime itself or other host code. Also it can't measure time as precisely as platform-native tools can. The profile is saved in the Firefox processed format, which can be viewed using https://profiler.firefox.com/.
Use fxprof_processed_profile's support for looking up symbols to avoid looking up the same PC more than once per profile.
Also extend the documentation based on review comments.
Thanks for the tips, Alex: that worked! Okay, I've addressed most of the issues that you all have raised. What I haven't done is added documentation to the Wasmtime book, or API usage examples to the doc-comments. I'd kind of like to merge the implementation and then come back to that, if that's okay. I hope the doc-comments I've written so far will be a good start for the book, and the usage in There are also still several to-do items, like measuring CPU time and noting wasm entry/leave events, which I also want to defer to future work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good plan to me 👍
This isn't ready for detailed review but the basic functionality seems to work so I'm putting it up in case anyone wants to play with it.
Unlike the existing profiling options, this works on all platforms and does not rely on any external profiling tools like perf or VTune. On the other hand, it can only profile time spent in the WebAssembly guest, not in Wasmtime itself or other host code. Also it can't measure time as precisely as platform-native tools can.