Better RPC prometheus metrics. #9358

tomusdrw · 2021-07-15T23:47:39Z

Resolves: #8677 (hopefully)

This PR extends the existing rpc_calls_total metric into a few more detailed ones (based on calls rather than requests, and including method names).

I'm not super prometheus-savvy, so maybe there are better ways to structure that, but afaict it would now be possible to figure out:

Histogram of time (in millis) required to process RPC calls, divided by method name and transport.
In-flight requests (started - finished).
Sudden growth of requests (increase in started).
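
A minimal, standard-library-only sketch of the bookkeeping behind the in-flight figure. The names `CallCounters`, `on_start`, `on_finish` and `in_flight` are hypothetical stand-ins, not the PR's code (the PR registers Prometheus metrics); the sketch only illustrates the started-minus-finished arithmetic per method name and transport:

```rust
use std::collections::HashMap;

/// (method name, transport) pair; a stand-in for Prometheus labels.
type Labels = (String, String);

/// Hypothetical container for two monotonically increasing counters.
#[derive(Default)]
struct CallCounters {
    started: HashMap<Labels, u64>,
    finished: HashMap<Labels, u64>,
}

impl CallCounters {
    fn key(method: &str, transport: &str) -> Labels {
        (method.to_owned(), transport.to_owned())
    }

    /// Record that a call has started processing.
    fn on_start(&mut self, method: &str, transport: &str) {
        *self.started.entry(Self::key(method, transport)).or_insert(0) += 1;
    }

    /// Record that a call has finished processing.
    fn on_finish(&mut self, method: &str, transport: &str) {
        *self.finished.entry(Self::key(method, transport)).or_insert(0) += 1;
    }

    /// In-flight calls are derived, not stored: started minus finished.
    fn in_flight(&self, method: &str, transport: &str) -> u64 {
        let key = Self::key(method, transport);
        let started = self.started.get(&key).copied().unwrap_or(0);
        let finished = self.finished.get(&key).copied().unwrap_or(0);
        started.saturating_sub(finished)
    }
}
```

Keeping two ever-increasing counters and deriving the difference at query time means a sudden rise in `started` remains visible on its own.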

On top of this, an rpc_metrics logging target is added, which outputs the processing time per method name. It is similar to the rpc target but without the full response bloat.

CC @gabreal @lovelaced let me know if that's useful enough for you guys.

dvdplm

Code lgtm.

Left a few questions and suggestions. One thing I wonder is if it's easy to add a histogram over payload sizes (both in and out)?

dvdplm · 2021-07-16T07:31:36Z

client/rpc-servers/src/middleware.rs

+					GaugeVec::new(
+						Opts::new(
+							"rpc_calls_finished",
+							"Number of processed RPC calls (unique un-batched requests)"


Does this mean that one batch call of 10 calls counts as 11 calls? Or 10? It'd be good to count batch calls as well maybe, so that a batch of 10 calls would increase the call count by 10 and also a separate batch call count by 1?


tomusdrw

One thing I wonder is if it's easy to add a histogram over payload sizes (both in and out)?

That will be a bit harder / computationally expensive without changes in the jsonrpc library. On this level we already deal with deserialized calls, so the only way to assess the size of the payloads would be to serialize them again.

Is there a way to enable a metric only conditionally? I could maybe use some hacks around reporting a metric only if some log level/target is enabled?
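
One possible shape for such a conditional metric, as a standard-library-only sketch. `PayloadSizes` and `observe_with` are hypothetical names, and the `enabled` flag stands in for whatever switch (for example a log target check) would be used in practice. The point is that the re-serialization is passed as a closure, so its cost is only paid when the metric is enabled:

```rust
/// Hypothetical collector for payload sizes that can be switched off.
struct PayloadSizes {
    /// Stand-in for a runtime switch such as "is this log target enabled".
    enabled: bool,
    /// Observed payload sizes in bytes.
    observed: Vec<usize>,
}

impl PayloadSizes {
    /// Runs `serialize` and records the resulting length only when enabled,
    /// so a disabled metric never pays for the extra serialization.
    fn observe_with<F: FnOnce() -> String>(&mut self, serialize: F) {
        if self.enabled {
            self.observed.push(serialize().len());
        }
    }
}
```

With `enabled` set to `false` the closure is never invoked, which is the property that would matter if serializing again is expensive.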


tomusdrw · 2021-07-16T12:43:49Z

client/rpc-servers/src/middleware.rs

+					GaugeVec::new(
+						Opts::new(
+							"rpc_calls_finished",
+							"Number of processed RPC calls (unique un-batched requests)"


The middleware here uses the on_call hook (unlike the previous on_request), so we don't deal with Batch/Single requests here, but rather with the individual calls that were part of a request. A batch of 10 calls is therefore not distinguishable from 10 separate requests. I can add a separate metric for batch calls, but I'm not sure it's that useful:

correlating calls with batches will be quite hard with the current library design (it would need shared state between the on_request and on_call hooks)

I don't think batching is used often in polkadot.js/api

I feel like the imposed load should be about the same: batching is there just to minimize the number of requests from the client's perspective, but on the server side the work should be roughly the same.
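
For illustration, the counting scheme asked about in the review (a batch of 10 calls adds 10 to the call count and 1 to a separate batch count) could look like the sketch below. `Request`, `Counts` and `record` are simplified, hypothetical stand-ins. The sketch assumes the whole request is visible at the point of counting, which, as noted above, is not the case inside the on_call hook:

```rust
/// Simplified stand-in for a JSON-RPC request: one call or a batch of calls.
/// Each call is represented only by its method name.
enum Request {
    Single(String),
    Batch(Vec<String>),
}

/// Hypothetical set of counters kept per server.
#[derive(Debug, Default, PartialEq)]
struct Counts {
    requests: u64,
    batches: u64,
    calls: u64,
}

/// Counts one request, and every call it contains.
fn record(counts: &mut Counts, request: &Request) {
    counts.requests += 1;
    match request {
        Request::Single(_) => counts.calls += 1,
        Request::Batch(calls) => {
            counts.batches += 1;
            counts.calls += calls.len() as u64;
        }
    }
}
```

Under this scheme a batch of 10 calls followed by one single call yields 2 requests, 1 batch and 11 calls.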

tomusdrw · 2021-07-16T14:32:36Z

@dvdplm I added request tracking alongside calls, and the WS and IPC servers now report open sessions. The metric does not distinguish between these two transports, but for our purposes that's fine, since we only use WS.

lovelaced · 2021-07-19T13:08:40Z

If I'm not mistaken from looking at the code, these will be labeled by method as well as protocol, so we'll be able to ascertain how many calls of each RPC method were made and over which transport? For example, if someone called rotateKeys via ws, would that be filterable?

tomusdrw · 2021-07-19T13:17:31Z

@lovelaced indeed. We don't correlate sessions though, so what this does not give us is an understanding of regular usage patterns (we can only average over all sessions) or a way to find per-session anomalies. Session correlation would require a bit more work (changing the metadata/session to contain some unique session id, or exposing one from the transport crates), but is totally doable too.
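
A rough sketch of what session correlation could involve, using only the standard library. `NEXT_SESSION_ID`, `SessionMeta` and `PerSessionCalls` are hypothetical and not part of this PR or of the transport crates. It assumes a process-wide atomic counter is an acceptable source of unique ids:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide source of session ids (hypothetical).
static NEXT_SESSION_ID: AtomicU64 = AtomicU64::new(0);

/// Stand-in for per-connection metadata carrying a unique id.
struct SessionMeta {
    id: u64,
}

impl SessionMeta {
    /// Assigns the next free id when a session is opened.
    fn open() -> Self {
        SessionMeta { id: NEXT_SESSION_ID.fetch_add(1, Ordering::Relaxed) }
    }
}

/// Number of calls seen per session, keyed by session id.
#[derive(Default)]
struct PerSessionCalls {
    calls: HashMap<u64, u64>,
}

impl PerSessionCalls {
    fn on_call(&mut self, session: &SessionMeta) {
        *self.calls.entry(session.id).or_insert(0) += 1;
    }

    fn calls_for(&self, session: &SessionMeta) -> u64 {
        self.calls.get(&session.id).copied().unwrap_or(0)
    }
}
```

Exporting a session id directly as a Prometheus label would create one time series per session, so data like this is probably better suited to logs or to aggregated statistics.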

gabreal · 2021-07-19T19:06:56Z

client/rpc-servers/src/lib.rs

+			session_opened: register(
+				Gauge::new(
+					"rpc_sessions_opened",
+					"Number of persistent RPC sessions opened",


In general it is more suitable to use a counter instead of a gauge when the value is only ever supposed to increase. This will make Prometheus interpret resets in the "right" way.

tomusdrw · 2021-07-20

Good point, fixed, thanks!
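
To make the point about resets concrete, here is a small sketch of how a reset-aware increase can be computed from raw counter samples. `increase_with_resets` is a hypothetical helper, not a Prometheus API. It mirrors the general idea that any drop in a counter is read as a restart from zero, and it leaves out the extrapolation Prometheus applies at window boundaries:

```rust
/// Total increase of a counter over a series of samples, treating any drop
/// in value as a reset to zero (for example after a process restart).
fn increase_with_resets(samples: &[u64]) -> u64 {
    let mut total = 0;
    for pair in samples.windows(2) {
        let (prev, cur) = (pair[0], pair[1]);
        // After a reset the counter restarted from 0, so the whole current
        // value counts as new increase.
        total += if cur >= prev { cur - prev } else { cur };
    }
    total
}
```

For the samples 5, 8, 2, 4 the drop from 8 to 2 is read as a restart, giving 3 + 2 + 2 = 7. A gauge carries no such guarantee, because a decrease in a gauge is a legitimate value change.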

PierreBesson · 2021-07-20T12:55:34Z

When I try to get the histogram quantile of the RPC metrics, I get a lot of NaN values.

I think it might be better not to return the time series if the bucket value equals 0.

In any case, I am able to filter out the 0 value buckets with:

tomusdrw · 2021-07-21T15:22:18Z

When I try to get the histogram quantile of the RPC metrics, I get a lot of NaN values.

It's probably because a lot of calls run in under a millisecond. I changed the histogram to use microseconds now.
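
A small sketch of why the unit matters for sub-millisecond calls. `bucket_index` and `histogram` are hypothetical helpers, the bucket bounds are made up for illustration, and the counts are per bucket rather than cumulative as in Prometheus:

```rust
use std::time::Duration;

/// Index of the first bucket whose upper bound is >= `value`, or
/// `bounds.len()` for the implicit +Inf bucket.
fn bucket_index(value: u128, bounds: &[u128]) -> usize {
    bounds.iter().position(|&b| value <= b).unwrap_or(bounds.len())
}

/// Per-bucket (non-cumulative) observation counts for `durations`, after
/// converting each duration to an integer with `to_unit`.
fn histogram(
    durations: &[Duration],
    bounds: &[u128],
    to_unit: fn(&Duration) -> u128,
) -> Vec<u64> {
    let mut counts = vec![0u64; bounds.len() + 1];
    for d in durations {
        counts[bucket_index(to_unit(d), bounds)] += 1;
    }
    counts
}
```

With calls taking 120, 450 and 900 microseconds, `Duration::as_millis` truncates every one of them to 0, so they all land in the lowest millisecond bucket and the distribution is lost. `Duration::as_micros` keeps them apart.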

gabreal

looks very good to me. thank you!


tomusdrw · 2021-08-24T10:48:18Z

bot merge

ghost · 2021-08-24T10:48:22Z

Trying merge.

* Better RPC prometehus metrics.
* Add session metrics.
* Add counting requests as well.
* Fix type for web build.
* Fix browser-node
* Filter out unknown method names.
* Change Gauge to Counters
* Use micros instead of millis.
* cargo fmt
* Update client/rpc-servers/src/lib.rs
  Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* move log to separate lines.
* Fix compilation.
* cargo +nightly fmt --all
  Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

Better RPC prometehus metrics.

fd05925

tomusdrw added the A0-please_review, B3-apinoteworthy, C1-low, and D2-notlive 💤 labels Jul 15, 2021

dvdplm approved these changes Jul 16, 2021


tomusdrw commented Jul 16, 2021


tomusdrw added 2 commits July 16, 2021 16:08

Add session metrics.

c2e26a3

Add counting requests as well.

fd8f3d8

tomusdrw added 3 commits July 16, 2021 16:38

Fix type for web build.

eddc6df

Fix browser-node

e90f9f0

Filter out unknown method names.

40a19b3

gabreal reviewed Jul 19, 2021


Change Gauge to Counters

b3fbfcf

tomusdrw added 3 commits July 21, 2021 17:13

Merge branch 'master' into td-prometheus

6cba90f

Use micros instead of millis.

c52fc2c

cargo fmt

68702b4

tomusdrw requested review from dvdplm and gabreal July 26, 2021 11:01

Merge branch 'master' into td-prometheus

2dd167a

gabreal approved these changes Jul 27, 2021


bkchr reviewed Jul 28, 2021


tomusdrw and others added 3 commits July 28, 2021 11:27

Update client/rpc-servers/src/lib.rs

76659a1

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

Apply suggestions from code review

2d7b9ee

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

Merge branch 'master' into td-prometheus

7071b00

move log to separate lines.

db11dd8

bkchr approved these changes Jul 28, 2021


tomusdrw added 3 commits July 28, 2021 11:40

Fix compilation.

c223047

Merge branch 'master' into td-prometheus

c8eb9e7

cargo +nightly fmt --all

6ac9cf6

ghost merged commit 72aaab6 into master Aug 24, 2021

ghost deleted the td-prometheus branch August 24, 2021 10:48

dvdplm mentioned this pull request Sep 22, 2021

Allow substrate-based chains to define their own rpc middleware #9458

Open

github-actions bot mentioned this pull request Oct 11, 2021

Update substrate/polkadot/cumulus from v0.9.10 to v0.9.11 moonbeam-foundation/moonbeam#892

Closed

This pull request was closed.
