feat: Out of the box metrics for resource manager #54

MarcoPolo · 2022-06-21T01:23:00Z

This adds metrics for resource manager that allows an end user to see what's happening with their go-libp2p application.

This hooks into the existing tracing setup to emit metrics and aggregate them when appropriate.

Possible improvements:

Record limits

Example graphs:

Note the histogram is not rendered correctly in this snapshot. See the screenshot below for how it should look.
https://snapshots-origin.grafana.net/dashboard/snapshot/nDCjmTCnHpmEpXgtcWO7ThEDS7B5DLLn?orgId=2

How the histogram should look like:

I'll see if I can spin up a public ipfs dashboard to demo.

MarcoPolo · 2022-06-21T02:57:35Z

obs/grafana-dashboards/resource-manager.json

+            "exemplar": true,
+            "expr": "histogram_quantile(0.50, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket)) - 0.1",
+            "interval": "",
+            "legendFormat": "p50 {{dir}} connections per peer",


Typo should be streams

mxinden · 2022-06-21T05:48:52Z

obs/stats.go

+	fibLikeDistribution = []float64{
+		1.1, 2.1, 3.1, 5.1, 8.1, 13.1, 21.1, 34.1, 55.1, 100.1, 200.1,
+	}


Out of curiosity, why this distribution?

fibonaccci with +0.1 because of a quirk in opencensus that does bound non-inclusive bound checking: https://github.com/census-instrumentation/opencensus-go/blob/v0.23.0/stats/view/aggregation_data.go#L195.

But the fib sequence itself is totally arbitrary. Happy to change this to something else if you have a better idea.

No better ideas. The go-to distributions are linear and exponential in the Prometheus ecosystem:

https://github.com/prometheus/client_golang/blob/4ad265f1b4ee8d5c239b3f36081529bbe3a838db/prometheus/histogram.go#L75-L145

marten-seemann · 2022-06-21T09:52:14Z

Looking at the dashboard, the p90 values are below the p50 values. That doesn't seem correct, or am I missing something?

MarcoPolo · 2022-06-21T17:26:49Z

Public dashboard: https://ipfs.marcopolo.io/d/MgmGIjjnk/resource-manager?orgId=1

MarcoPolo · 2022-06-21T22:32:07Z

Looking at the dashboard, the p90 values are below the p50 values. That doesn't seem correct, or am I missing something?

Resolved out of band. The aggregated number of peers was misinterpreted. I changed the color to make it more obvious this isn't p50

vyzo

Can we make sure this is optional, enabled with a constructor option?
We shouldn't be forcing down the throat of developers any metrics solution.

Note that with regards to UX, this can be easily aggregated in a go-libp2p constructor metrics option, which initializes various components (including the rcmgr) with metrics.

MarcoPolo · 2022-06-27T15:02:45Z

Can we make sure this is optional, enabled with a constructor option?
We shouldn't be forcing down the throat of developers any metrics solution.

Yep. There's little value to these if they aren't hooked up to an exporter like prometheus.

Note that with regards to UX, this can be easily aggregated in a go-libp2p constructor metrics option, which initializes various components (including the rcmgr) with metrics.

Good idea!

MarcoPolo · 2022-06-27T22:00:54Z

obs/stats.go

+var log = logging.Logger("rcmgrObs")
+
+var (
+	systemOutboundConns = stats.Int64("system/outbound/conn", "Number of outbound Connections", stats.UnitDimensionless)


MarcoPolo · 2022-06-27T23:40:28Z

I tried getting exemplars through, but while you can attach an exemplar via opencensus the opencensus prometheus exporter doesn't support emitting it.

marten-seemann

We should probably unify the scope name parsing at some point (doesn't need to be this PR).
#52 added a lot of similar logic for generating trace events.

marten-seemann · 2022-06-29T09:55:55Z

trace.go

+func WithTraceReporter(reporter TraceReporter) Option {
+	return func(r *resourceManager) error {
+		if r.trace == nil {
+			r.trace = &trace{reporters: []TraceReporter{reporter}}


Suggested change

r.trace = &trace{reporters: []TraceReporter{reporter}}

r.trace = &trace{}

Otherwise you'll add this reporter twice.

marten-seemann · 2022-06-29T09:59:17Z

obs/stats.go

+
+	StreamView             = &view.View{Measure: streams, Aggregation: view.Sum(), TagKeys: []tag.Key{directionTag, scopeTag, serviceTag, protocolTag}}
+	PeerStreamsView        = &view.View{Measure: peerStreams, Aggregation: view.Distribution(oneTenThenExpDistribution...), TagKeys: []tag.Key{directionTag}}
+	PeerStreamNegativeView = &view.View{Measure: peerStreamsNegative, Aggregation: view.Distribution(oneTenThenExpDistribution...), TagKeys: []tag.Key{directionTag}}


I realize we've talked about this before, but I'm still confused by it. Seems weird that it takes this convoluted way to get a bucketed view. Is this standard practice?

I think so. From https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations

In principle, however, you can use summaries and histograms to observe negative values (e.g. temperatures in centigrade). In that case, the sum of observations can go down, so you cannot apply rate() to it anymore. In those rare cases where you need to apply rate() and cannot avoid negative observations, you can use two separate summaries, one for positive and one for negative observations (the latter with inverted sign), and combine the results later with suitable PromQL expressions.

Our metrics don't have a notion of removing a previous observation. The best we can do is add a new negative observation that subtracts the previous observation, then add a new observation.

So we need to make two observations:

We want to record the old number of streams this peer had (this is negative).

We want to record the current number of streams this peer has.

In order to make a negative observation with histograms we need to record it in a separate measurement, then combine it later.

That's my understanding, but it's possible I'm misunderstanding something. @mxinden does the above seem correct to you?

Our metrics don't have a notion of removing a previous observation.

I am not familiar enough with OpenCensus, thus I am not sure what their best practice is.

From https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations

In principle, however, you can use summaries and histograms to observe negative values (e.g. temperatures in centigrade). In that case, the sum of observations can go down, so you cannot apply rate() to it anymore. In those rare cases where you need to apply rate() and cannot avoid negative observations, you can use two separate summaries, one for positive and one for negative observations (the latter with inverted sign), and combine the results later with suitable PromQL expressions.

I don't think this is referring to undoing a previous observation, but only to record both negative and positive observations. That is not to say that your trick here won't work. Mind sharing the queries you are using here? Are you actually using the histogram_quantile function? Note that histogram_quantile calculates the quantiles over time, thus I am not sure the above undo logic is compatible with it.

To come back to the original goal, do I understand correctly that you would like to answer the following question:

How many currently connected peers have 1, 10, 100, 1000, 10000 (arbitrary distribution) streams?

In case you only want to visualize the above with a bar chart, but not calculate quantiles, how about exposing a Prometheus Gauge for each of these buckets. You can inc and dec a Prometheus Gauge.

Note that histogram_quantile calculates the quantiles over time

To be clear https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile uses an instant-vector so it calculates the quantile at each point in time. As opposed to something like rate which uses a range vector that can look at multiple timestamps for each point. (I'm sure you know this, just adding it here for completeness).

Mind sharing the queries you are using here? Are you actually using the histogram_quantile function?

histogram_quantile(0.90, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket))

This is similar to the commonly used example: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])). Except instead of using rate to get the rate of increase (by taking the current value and comparing it with a previous value) and calculating quantiles from that I'm using these two histograms to create a histogram of the current state and calculating quantiles with that.

Let me rephrase:

We agree this example is correct: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])). Since it's from the docs.

This example calculates quantiles on a new histogram. This new histogram is generated by doing http_request_duration_seconds_bucket - http_request_duration_seconds_bucket[previous_time] with the rate function (https://github.com/prometheus/prometheus/blob/main/promql/functions.go#L74)

What we're doing is keeping track of our last observation (the negative bucket), and generating a new histogram with our current state of the world with rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_bucket@previous_observation (aka negative bucket). Which I'm arguing is the same as 2, except we are explicit in what our previous observation was.
a. Aside, we could use rate here as well and only use the one bucket, but then we'd be only seeing the differences of when a peer moves to a different bucket which is not as useful as seeing the current distribution of streams per peer. There might be a way with promql to do what I want, but I haven't figured it out (essentially what's the difference between the current value and the last different value).

Or another way to think about this is:

Assume rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket generates a valid histogram, let's call it H.

We should be able to caluclate the quantiles from this histogram H with the histogram_quantile function. (Which is actually quite simple: https://github.com/prometheus/prometheus/blob/main/promql/quantile.go#L101.)

So the question becomes, is our assumption of 1 here correct? I'm arguing it is because it is the histogram of our current state of streams per peer. It also looks exactly like a histogram metric promql would expect (cumulative le distributions).

I know that's quite a lot, and I appreciate your input here! Thank you for thinking about this and checking my work :)

Note that histogram_quantile calculates the quantiles over time

To be clear https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile uses an instant-vector so it calculates the quantile at each point in time. As opposed to something like rate which uses a range vector that can look at multiple timestamps for each point. (I'm sure you know this, just adding it here for completeness).

Mind sharing the queries you are using here? Are you actually using the histogram_quantile function?

histogram_quantile(0.90, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket))

This is similar to the commonly used example: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])). Except instead of using rate to get the rate of increase (by taking the current value and comparing it with a previous value) and calculating quantiles from that I'm using these two histograms to create a histogram of the current state and calculating quantiles with that.

This is really clever! To be honest, I didn't think about not using rate with a histogram_quantile. Thanks for the explainer! 🙇

3. Aside, we could use rate here as well and only use the one bucket, but then we'd be only seeing the differences of when a peer moves to a different bucket which is not as useful as seeing the current distribution of streams per peer.

👍

3. There might be a way with promql to do what I want, but I haven't figured it out (essentially what's the difference between the current value and the last different value).

Isn't this impossible at the data level as one would need to know which decrease in one bucket matches a corresponding increase in another bucket and which increase is a new peer, thus does not have a decrease in another bucket?

@mxinden does the above seem correct to you?

The above seems correct to me!

In case you can influence the Prometheus metric # HELP text, I think it is worth a mention of the concept you are using here.

obs/stats.go

rcmgr.go

marten-seemann

CI is still unhappy about this PR though.

This reverts commit 81fb49d.

MarcoPolo · 2022-07-01T19:02:40Z

(rebased off master to fix conflict)

MarcoPolo requested a review from marten-seemann June 21, 2022 01:23

MarcoPolo force-pushed the marco/50-obs branch from 269f388 to ed83337 Compare June 21, 2022 01:26

MarcoPolo linked an issue Jun 21, 2022 that may be closed by this pull request

feat: Usable out of the box metrics #50

Closed

MarcoPolo commented Jun 21, 2022

View reviewed changes

mxinden reviewed Jun 21, 2022

View reviewed changes

vyzo reviewed Jun 27, 2022

View reviewed changes

MarcoPolo commented Jun 27, 2022

View reviewed changes

BigLep mentioned this pull request Jun 28, 2022

go-libp2p v0.21.0 libp2p/go-libp2p#1514

Closed

41 tasks

marten-seemann reviewed Jun 29, 2022

View reviewed changes

marten-seemann approved these changes Jul 1, 2022

View reviewed changes

MarcoPolo added 15 commits July 1, 2022 12:00

Add multimetrics reporter

1c12460

Add trace reporter to expose traces to users

35abda3

Flip sign of delta

9ce3760

Add scope name helpers

7efd81c

Add obs to resource manager

19b6d58

Add dep to go.mod/sum

3a06e88

Revert "Add multimetrics reporter"

1c172ac

This reverts commit 81fb49d.

Remove connid from scopename in blocked conns

2385a33

Add Grafana dashboard

c9413d7

Fix merge conflicts

04b1793

Update dashboard json

bc012d3

Update dashboard blocked resources

04e1088

Update bucket distribution and blocked resources

d34b9cb

Use bytes rather than decbytes (power of 2)

6d5898a

Unexport tag names, remove unused var

e70da09

MarcoPolo added 6 commits July 1, 2022 12:02

Filter out spans

5abae10

Rewrite constants to be more obvious

ac716c1

Don't add reporter twice

662c094

Update help text

ccb0c4d

Remove unused var

0f38042

Fix go vet issue

836d53a

MarcoPolo force-pushed the marco/50-obs branch from 21e9efe to 836d53a Compare July 1, 2022 19:02

MarcoPolo merged commit b8369c6 into master Jul 1, 2022

MarcoPolo deleted the marco/50-obs branch July 1, 2022 19:33

ajnavarro mentioned this pull request Aug 24, 2022

Release 0.15 ipfs/kubo#9152

Closed

72 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Out of the box metrics for resource manager #54

feat: Out of the box metrics for resource manager #54

MarcoPolo commented Jun 21, 2022 •

edited

Loading

MarcoPolo Jun 21, 2022

mxinden Jun 21, 2022

MarcoPolo Jun 21, 2022

mxinden Jun 23, 2022

marten-seemann commented Jun 21, 2022

MarcoPolo commented Jun 21, 2022

MarcoPolo commented Jun 21, 2022

vyzo left a comment

MarcoPolo commented Jun 27, 2022

MarcoPolo Jun 27, 2022

MarcoPolo commented Jun 27, 2022

marten-seemann left a comment

marten-seemann Jun 29, 2022

marten-seemann Jun 29, 2022

MarcoPolo Jun 29, 2022

mxinden Jun 30, 2022

MarcoPolo Jun 30, 2022

mxinden Jul 1, 2022

marten-seemann left a comment

MarcoPolo commented Jul 1, 2022

	r.trace = &trace{reporters: []TraceReporter{reporter}}
	r.trace = &trace{}

feat: Out of the box metrics for resource manager #54

feat: Out of the box metrics for resource manager #54

Conversation

MarcoPolo commented Jun 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marten-seemann commented Jun 21, 2022

MarcoPolo commented Jun 21, 2022

MarcoPolo commented Jun 21, 2022

vyzo left a comment

Choose a reason for hiding this comment

MarcoPolo commented Jun 27, 2022

Choose a reason for hiding this comment

MarcoPolo commented Jun 27, 2022

marten-seemann left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marten-seemann left a comment

Choose a reason for hiding this comment

MarcoPolo commented Jul 1, 2022

MarcoPolo commented Jun 21, 2022 •

edited

Loading