Feature
To use Wasmtime in production, we will need to gather statistics from the runtime to feed into monitoring systems.
We have inserted some metrics in a private fork of Lucet in order to support our current production use, and while this has proven useful, we would like to have first-class, open source support for such things in Wasmtime.
Benefit
The ability to monitor the runtime performance and load of Wasmtime is a requirement for it to be used in many production environments.
Implementation
I'll describe the stats we gather from Lucet. I'm less sure how best to fit them into the Wasmtime API, but I have described the kind of callback interface I would like to provide as a client of Wasmtime.
The ones in bold are ones that we've found very important for monitoring platform health and performance. The others would be nice to have, but are less critical. This list also shouldn't rule out other opportunities for stat gathering; it is solely what we've found useful in Lucet.
Counters
Sometimes we just need to count how many times something has happened. For Lucet, this is a handful of internal error conditions that we are able to handle without presenting an error to the end user, but want to keep track of internally nonetheless. There is certainly room for more of these:
Number of retries needed on userfaultfd operations due to ENOENT errors
Number of EEXIST errors that the userfaultfd fault handler saw and tracked
Number of times the userfaultfd fault handler got a read event on its file descriptor, but was not able to read an event.
This would be a simple callback per event:
    fn record_event(&self) {
        // Bump a counter
    }
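As a rough illustration of what an embedder might plug into such a callback, here is a minimal sketch. The `EventSink` trait, `CounterSink` type, and `simulate_retries` function are hypothetical names invented for this example; none of them exist in Wasmtime or Lucet.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical embedder-facing hook; not part of the Wasmtime API.
pub trait EventSink {
    /// Called by the runtime each time the tracked event occurs.
    fn record_event(&self);
}

/// A sink that simply bumps an atomic counter.
#[derive(Default)]
pub struct CounterSink {
    count: AtomicU64,
}

impl CounterSink {
    /// Number of events recorded so far.
    pub fn get(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }
}

impl EventSink for CounterSink {
    fn record_event(&self) {
        // Relaxed ordering is enough for a standalone statistic.
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

/// Stand-in for runtime code that reports one event per retry.
pub fn simulate_retries(sink: &dyn EventSink, retries: u32) {
    for _ in 0..retries {
        sink.record_event();
    }
}
```

Because the callback takes `&self`, the sink can be shared across threads without a lock, and the embedder is free to forward the event to whatever monitoring system it already uses instead of counting locally.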
Gauges and timers
For measuring operations with distinct start and end points, we use gauges and timing histograms. A gauge is a number that, if incremented, usually has a corresponding decrement at some point in the future. A timer adds a timing component to this, so that the time between the beginning and end of an operation can be measured. These could be implemented with a callback that returns an RAII-style guard:
    fn start_operation(&self) -> Guard {
        // Increment gauge
        // Create `Guard` with initial timestamp
    }

    struct Guard;

    impl Guard {
        fn finish(self) {
            // The operation finished normally, so record timing information
        }
    }

    impl Drop for Guard {
        fn drop(&mut self) {
            // Decrement gauge
            // Optionally record timing information
        }
    }
Using the drop for the gauge provides some assurance that the gauge remains accurate even if an error occurs between the start and end of an operation. For the timing information, though, we do not necessarily want the timing of errors to be recorded, so having an explicit finish method lets us know that the operation was successful.
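To make the split between `drop` and `finish` concrete, here is one possible filled-in version of that guard. It is only a sketch: `OperationStats` and its fields are hypothetical, and a running total of nanoseconds stands in for the timing histogram a real monitoring backend would provide.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical stats state owned by the embedder; names are illustrative.
#[derive(Default)]
pub struct OperationStats {
    in_flight: AtomicI64,   // the gauge
    completed: AtomicU64,   // number of successful operations timed
    total_nanos: AtomicU64, // sum of successful durations (histogram stand-in)
}

impl OperationStats {
    /// Increment the gauge and return a guard holding the start timestamp.
    pub fn start_operation(&self) -> Guard<'_> {
        self.in_flight.fetch_add(1, Ordering::Relaxed);
        Guard { stats: self, start: Instant::now() }
    }

    pub fn in_flight(&self) -> i64 {
        self.in_flight.load(Ordering::Relaxed)
    }

    pub fn completed(&self) -> u64 {
        self.completed.load(Ordering::Relaxed)
    }

    pub fn total_time(&self) -> Duration {
        Duration::from_nanos(self.total_nanos.load(Ordering::Relaxed))
    }
}

pub struct Guard<'a> {
    stats: &'a OperationStats,
    start: Instant,
}

impl Guard<'_> {
    /// The operation finished normally, so record timing information.
    pub fn finish(self) {
        let elapsed = self.start.elapsed();
        self.stats.completed.fetch_add(1, Ordering::Relaxed);
        self.stats
            .total_nanos
            .fetch_add(elapsed.as_nanos() as u64, Ordering::Relaxed);
        // `self` goes out of scope here, so `drop` still decrements the gauge.
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        // Runs on success, early return, `?`, or unwinding alike.
        self.stats.in_flight.fetch_sub(1, Ordering::Relaxed);
    }
}
```

In this sketch, a guard that is dropped without calling `finish` (for example because of an early return on error) still decrements the gauge but contributes nothing to the timing totals.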
In Lucet we currently use gauges and timers to measure:
Evaluating a future on behalf of a Wasm program (similarly to RFC 2)
Instantiating a module (setting up memory protections, copying in initial heap values)
Freeing an instance (resetting memory protections, freeing other resources)
Expanding a Wasm heap on behalf of an instance
Acquiring an instance slot from a memory region
Returning a freed instance slot to a region
Alternatives
Most of the stats we gather for production are taken from outside the boundaries of the Lucet runtime API. To the extent that these operations can be exposed as discrete steps that the library client can measure, we do not need to add invasive stats interfaces. The stats described here are the ones where, in Lucet, exposing the underlying operations as discrete measurable steps would require a significant API refactoring that could be undesirable for safety or ergonomics.
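For operations that are already exposed as discrete steps, client-side measurement needs no support from the runtime at all. A minimal sketch, using a hypothetical `timed` helper written for this example:

```rust
use std::time::{Duration, Instant};

/// Run `op` and return its result together with how long it took.
/// Hypothetical helper for illustration; not part of Lucet or Wasmtime.
pub fn timed<T>(op: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = op();
    (result, start.elapsed())
}
```

The embedder would wrap each public runtime call in `timed` and forward the duration to its own stats system. This only works when the interesting operation is a single call the client makes itself, which is not the case for the operations listed above.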
Instead of a callback-based approach, we could maintain stats internally within Wasmtime and let them be queried by the embedding application. This would put more of a maintenance and design burden on Wasmtime, however, and would limit the flexibility of the client's stat-gathering interfaces.
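For comparison, the query-based alternative might look roughly like the sketch below. `RuntimeStats`, `StatsSnapshot`, and the two example counters are hypothetical and chosen only to illustrate the shape of the interface.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counters kept inside the runtime; not an existing Wasmtime type.
#[derive(Default)]
pub struct RuntimeStats {
    instantiations: AtomicU64,
    heap_expansions: AtomicU64,
}

/// Plain-data copy handed to the embedding application on request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StatsSnapshot {
    pub instantiations: u64,
    pub heap_expansions: u64,
}

impl RuntimeStats {
    /// Called internally by the runtime when a module is instantiated.
    pub fn note_instantiation(&self) {
        self.instantiations.fetch_add(1, Ordering::Relaxed);
    }

    /// Called internally by the runtime when a Wasm heap grows.
    pub fn note_heap_expansion(&self) {
        self.heap_expansions.fetch_add(1, Ordering::Relaxed);
    }

    /// Polled by the embedder, for example on a fixed reporting interval.
    pub fn snapshot(&self) -> StatsSnapshot {
        StatsSnapshot {
            instantiations: self.instantiations.load(Ordering::Relaxed),
            heap_expansions: self.heap_expansions.load(Ordering::Relaxed),
        }
    }
}
```

Here the set of fields in `StatsSnapshot` is fixed by the runtime, which is the loss of flexibility described above: the embedder can only observe what the runtime chose to track, in the form it chose to track it.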
It looks like `metrics` ticks many of the boxes for the requirements we have, but it doesn't appear to have an RAII interface for gauges. In practice we have found that to be very useful for avoiding missed decrements due to surprise control flow. Maybe they'd be open to an upstream contribution, though?