Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Better RPC prometheus metrics. #9358

Merged
18 commits merged into from
Aug 24, 2021
Merged

Better RPC prometheus metrics. #9358

18 commits merged into from
Aug 24, 2021

Conversation

tomusdrw
Copy link
Contributor

Resolves: #8677 (hopefuly)

This PR extends the existing rpc_calls_total metric to few that are more detailed (based on calls rather than requests and including method names).

I'm not super prometheus-savvy, so maybe there are better ways to structure that, but afaict it would now be possible to figure out:

  1. Histogram of time (in millis) required to process RPC calls, divided by method name&transport.
  2. In flight requests (started - finished)
  3. Sudden growth of requests (increase in started).

On top of this there is rpc_metrics logging target added which outputs processing time per method name - similar to rpc target but without the full response bloat.

CC @gabreal @lovelaced let me know if that's useful enough for you guys.

@tomusdrw tomusdrw added A0-please_review Pull request needs code review. B3-apinoteworthy C1-low PR touches the given topic and has a low impact on builders. D2-notlive 💤 PR contains changes in a runtime directory that is not deployed to a chain that requires an audit. labels Jul 15, 2021
Copy link
Contributor

@dvdplm dvdplm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code lgtm.

Left a few questions and suggestions. One thing I wonder is if it's easy to add a histogram over payload sizes (both in and out)?

GaugeVec::new(
Opts::new(
"rpc_calls_finished",
"Number of processed RPC calls (unique un-batched requests)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that one batch call of 10 calls counts as 11 calls? Or 10? It'd be good to count batch calls as well maybe, so that a batch of 10 calls would increase the call count by 10 and also a separate batch call count by 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The middleware here uses on_call hook (unlike previously on_request), so here we don't deal with Batch/Single requests, but rather with individual calls that were part of that request. So if we have a batch of 10 calls it will not be distinguishable from 10 separate requests. I can add separate metric for batch calls, but I'm not sure exactly if it's that useful:

  1. correlating calls with batches will be quite hard with current library design (i.e. shared state between on_request and on_call hooks)
  2. I don't think it's used often in polkadot.js/api
  3. I feel like the imposed load should be exactly the same - batch is there just to minimize the number of requests from the client perspective, but on the server side it should be roughly the same.

client/rpc-servers/src/middleware.rs Show resolved Hide resolved
Copy link
Contributor Author

@tomusdrw tomusdrw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I wonder is if it's easy to add a histogram over payload sizes (both in and out)?

That will be a bit harder / computationally expensive without changes in jsonrpc library. On this level we deal with deserialized calls already, so the only way to asses the size of the payloads would be to serialize them again.

Is there a way to enable a metric only conditionally? I could maybe use some hacks around reporting a metric only if some log level/target is enabled?

client/rpc-servers/src/middleware.rs Show resolved Hide resolved
GaugeVec::new(
Opts::new(
"rpc_calls_finished",
"Number of processed RPC calls (unique un-batched requests)"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The middleware here uses on_call hook (unlike previously on_request), so here we don't deal with Batch/Single requests, but rather with individual calls that were part of that request. So if we have a batch of 10 calls it will not be distinguishable from 10 separate requests. I can add separate metric for batch calls, but I'm not sure exactly if it's that useful:

  1. correlating calls with batches will be quite hard with current library design (i.e. shared state between on_request and on_call hooks)
  2. I don't think it's used often in polkadot.js/api
  3. I feel like the imposed load should be exactly the same - batch is there just to minimize the number of requests from the client perspective, but on the server side it should be roughly the same.

@tomusdrw
Copy link
Contributor Author

@dvdplm added tracking requests along with calls and also the WS&IPC servers are now reporting open sessions. There is no distinction between these two, but for our purposes it's fine, since we only use WS.

@lovelaced
Copy link

If I'm not mistaken by looking at the code, these will be labeled by method as well as procotol so we'll be able to ascertain how many of which RPC method were called and how? For example, if someone called rotateKeys via ws that would be filterable?

@tomusdrw
Copy link
Contributor Author

@lovelaced indeed. We don't correlate sessions though, so what this does not give us is getting some understanding of regular usage pattern (we can only average over all sessions) or finding per-session anomalies. Session-correlation would require a bit more work (changing the metadata/session to contain some unique session id or exposing one from the transport crates), but is totally doable too.

session_opened: register(
Gauge::new(
"rpc_sessions_opened",
"Number of persistent RPC sessions opened",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general it is more suitable to use a counter instead of a gauge in case the value only ever is supposed to increase. this will make prometheus interpret resets in the "right" way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed, thanks!

@PierreBesson
Copy link
Contributor

When I try to get the histogram quantile of the RPC metrics. I get a lot of NaN values.

image

I think it might be better not to return the timeseries if the bucket value is equals to 0.

In any case, I am able to filter out the 0 value buckets with:

image

@tomusdrw
Copy link
Contributor Author

When I try to get the histogram quantile of the RPC metrics. I get a lot of NaN values.

It's probably because a lot of calls runs sub millisecond. I changed the histogram to use microseconds now.

@tomusdrw tomusdrw requested review from dvdplm and gabreal July 26, 2021 11:01
Copy link
Contributor

@gabreal gabreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks very good to me. thank you!

client/rpc-servers/src/lib.rs Outdated Show resolved Hide resolved
client/rpc-servers/src/lib.rs Outdated Show resolved Hide resolved
client/rpc-servers/src/middleware.rs Outdated Show resolved Hide resolved
client/rpc-servers/src/middleware.rs Outdated Show resolved Hide resolved
client/rpc-servers/src/middleware.rs Outdated Show resolved Hide resolved
tomusdrw and others added 3 commits July 28, 2021 11:27
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
@tomusdrw
Copy link
Contributor Author

bot merge

@ghost
Copy link

ghost commented Aug 24, 2021

Trying merge.

@ghost ghost merged commit 72aaab6 into master Aug 24, 2021
@ghost ghost deleted the td-prometheus branch August 24, 2021 10:48
Wizdave97 pushed a commit to Wizdave97/substrate that referenced this pull request Aug 25, 2021
* Better RPC prometehus metrics.

* Add session metrics.

* Add counting requests as well.

* Fix type for web build.

* Fix browser-node

* Filter out unknown method names.

* Change Gauge to Counters

* Use micros instead of millis.

* cargo fmt

* Update client/rpc-servers/src/lib.rs

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* move log to separate lines.

* Fix compilation.

* cargo +nightly fmt --all

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. C1-low PR touches the given topic and has a low impact on builders. D2-notlive 💤 PR contains changes in a runtime directory that is not deployed to a chain that requires an audit.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rpc metrics
6 participants