-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code lgtm.
Left a few questions and suggestions. One thing I wonder is if it's easy to add a histogram over payload sizes (both in and out)?
client/rpc-servers/src/middleware.rs
Outdated
GaugeVec::new( | ||
Opts::new( | ||
"rpc_calls_finished", | ||
"Number of processed RPC calls (unique un-batched requests)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that one batch call of 10 calls counts as 11 calls? Or 10? It'd be good to count batch calls as well maybe, so that a batch of 10 calls would increase the call count by 10 and also a separate batch call count by 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The middleware here uses on_call
hook (unlike previously on_request
), so here we don't deal with Batch/Single requests, but rather with individual calls that were part of that request. So if we have a batch of 10 calls it will not be distinguishable from 10 separate requests. I can add separate metric for batch calls, but I'm not sure exactly if it's that useful:
- correlating calls with batches will be quite hard with current library design (i.e. shared state between
on_request
andon_call
hooks) - I don't think it's used often in polkadot.js/api
- I feel like the imposed load should be exactly the same - batch is there just to minimize the number of requests from the client perspective, but on the server side it should be roughly the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I wonder is if it's easy to add a histogram over payload sizes (both in and out)?
That will be a bit harder / computationally expensive without changes in jsonrpc
library. On this level we deal with deserialized calls already, so the only way to asses the size of the payloads would be to serialize them again.
Is there a way to enable a metric only conditionally? I could maybe use some hacks around reporting a metric only if some log level/target is enabled?
client/rpc-servers/src/middleware.rs
Outdated
GaugeVec::new( | ||
Opts::new( | ||
"rpc_calls_finished", | ||
"Number of processed RPC calls (unique un-batched requests)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The middleware here uses on_call
hook (unlike previously on_request
), so here we don't deal with Batch/Single requests, but rather with individual calls that were part of that request. So if we have a batch of 10 calls it will not be distinguishable from 10 separate requests. I can add separate metric for batch calls, but I'm not sure exactly if it's that useful:
- correlating calls with batches will be quite hard with current library design (i.e. shared state between
on_request
andon_call
hooks) - I don't think it's used often in polkadot.js/api
- I feel like the imposed load should be exactly the same - batch is there just to minimize the number of requests from the client perspective, but on the server side it should be roughly the same.
@dvdplm added tracking |
If I'm not mistaken by looking at the code, these will be labeled by |
@lovelaced indeed. We don't correlate sessions though, so what this does not give us is getting some understanding of regular usage pattern (we can only average over all sessions) or finding per-session anomalies. Session-correlation would require a bit more work (changing the metadata/session to contain some unique session id or exposing one from the transport crates), but is totally doable too. |
client/rpc-servers/src/lib.rs
Outdated
session_opened: register( | ||
Gauge::new( | ||
"rpc_sessions_opened", | ||
"Number of persistent RPC sessions opened", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general it is more suitable to use a counter
instead of a gauge
in case the value only ever is supposed to increase. this will make prometheus interpret resets in the "right" way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, fixed, thanks!
It's probably because a lot of calls runs sub millisecond. I changed the histogram to use microseconds now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks very good to me. thank you!
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
bot merge |
Trying merge. |
* Better RPC prometehus metrics. * Add session metrics. * Add counting requests as well. * Fix type for web build. * Fix browser-node * Filter out unknown method names. * Change Gauge to Counters * Use micros instead of millis. * cargo fmt * Update client/rpc-servers/src/lib.rs Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com> * move log to separate lines. * Fix compilation. * cargo +nightly fmt --all Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Resolves: #8677 (hopefuly)
This PR extends the existing
rpc_calls_total
metric to few that are more detailed (based on calls rather than requests and including method names).I'm not super prometheus-savvy, so maybe there are better ways to structure that, but afaict it would now be possible to figure out:
started
).On top of this there is
rpc_metrics
logging target added which outputs processing time per method name - similar torpc
target but without the fullresponse
bloat.CC @gabreal @lovelaced let me know if that's useful enough for you guys.