-
Notifications
You must be signed in to change notification settings - Fork 4
feat: Out of the box metrics for resource manager #54
Conversation
"exemplar": true, | ||
"expr": "histogram_quantile(0.50, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket)) - 0.1", | ||
"interval": "", | ||
"legendFormat": "p50 {{dir}} connections per peer", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo should be streams
obs/stats.go
Outdated
fibLikeDistribution = []float64{ | ||
1.1, 2.1, 3.1, 5.1, 8.1, 13.1, 21.1, 34.1, 55.1, 100.1, 200.1, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why this distribution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fibonaccci with +0.1 because of a quirk in opencensus that does bound non-inclusive bound checking: https://github.com/census-instrumentation/opencensus-go/blob/v0.23.0/stats/view/aggregation_data.go#L195.
But the fib sequence itself is totally arbitrary. Happy to change this to something else if you have a better idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No better ideas. The go-to distributions are linear and exponential in the Prometheus ecosystem:
Looking at the dashboard, the p90 values are below the p50 values. That doesn't seem correct, or am I missing something? |
Public dashboard: https://ipfs.marcopolo.io/d/MgmGIjjnk/resource-manager?orgId=1 |
Resolved out of band. The aggregated number of peers was misinterpreted. I changed the color to make it more obvious this isn't p50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure this is optional, enabled with a constructor option?
We shouldn't be forcing down the throat of developers any metrics solution.
Note that with regards to UX, this can be easily aggregated in a go-libp2p constructor metrics option, which initializes various components (including the rcmgr) with metrics.
Yep. There's little value to these if they aren't hooked up to an exporter like prometheus.
Good idea! |
obs/stats.go
Outdated
var log = logging.Logger("rcmgrObs") | ||
|
||
var ( | ||
systemOutboundConns = stats.Int64("system/outbound/conn", "Number of outbound Connections", stats.UnitDimensionless) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused?
I tried getting exemplars through, but while you can attach an exemplar via opencensus the opencensus prometheus exporter doesn't support emitting it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably unify the scope name parsing at some point (doesn't need to be this PR).
#52 added a lot of similar logic for generating trace events.
trace.go
Outdated
func WithTraceReporter(reporter TraceReporter) Option { | ||
return func(r *resourceManager) error { | ||
if r.trace == nil { | ||
r.trace = &trace{reporters: []TraceReporter{reporter}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r.trace = &trace{reporters: []TraceReporter{reporter}} | |
r.trace = &trace{} |
Otherwise you'll add this reporter twice.
|
||
StreamView = &view.View{Measure: streams, Aggregation: view.Sum(), TagKeys: []tag.Key{directionTag, scopeTag, serviceTag, protocolTag}} | ||
PeerStreamsView = &view.View{Measure: peerStreams, Aggregation: view.Distribution(oneTenThenExpDistribution...), TagKeys: []tag.Key{directionTag}} | ||
PeerStreamNegativeView = &view.View{Measure: peerStreamsNegative, Aggregation: view.Distribution(oneTenThenExpDistribution...), TagKeys: []tag.Key{directionTag}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize we've talked about this before, but I'm still confused by it. Seems weird that it takes this convoluted way to get a bucketed view. Is this standard practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. From https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations
In principle, however, you can use summaries and histograms to observe negative values (e.g. temperatures in centigrade). In that case, the sum of observations can go down, so you cannot apply rate() to it anymore. In those rare cases where you need to apply rate() and cannot avoid negative observations, you can use two separate summaries, one for positive and one for negative observations (the latter with inverted sign), and combine the results later with suitable PromQL expressions.
Our metrics don't have a notion of removing a previous observation. The best we can do is add a new negative observation that subtracts the previous observation, then add a new observation.
So we need to make two observations:
- We want to record the old number of streams this peer had (this is negative).
- We want to record the current number of streams this peer has.
In order to make a negative observation with histograms we need to record it in a separate measurement, then combine it later.
That's my understanding, but it's possible I'm misunderstanding something. @mxinden does the above seem correct to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our metrics don't have a notion of removing a previous observation.
I am not familiar enough with OpenCensus, thus I am not sure what their best practice is.
From https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations
In principle, however, you can use summaries and histograms to observe negative values (e.g. temperatures in centigrade). In that case, the sum of observations can go down, so you cannot apply rate() to it anymore. In those rare cases where you need to apply rate() and cannot avoid negative observations, you can use two separate summaries, one for positive and one for negative observations (the latter with inverted sign), and combine the results later with suitable PromQL expressions.
I don't think this is referring to undoing a previous observation, but only to record both negative and positive observations. That is not to say that your trick here won't work. Mind sharing the queries you are using here? Are you actually using the histogram_quantile
function? Note that histogram_quantile
calculates the quantiles over time, thus I am not sure the above undo logic is compatible with it.
To come back to the original goal, do I understand correctly that you would like to answer the following question:
How many currently connected peers have 1, 10, 100, 1000, 10000 (arbitrary distribution) streams?
In case you only want to visualize the above with a bar chart, but not calculate quantiles, how about exposing a Prometheus Gauge for each of these buckets. You can inc
and dec
a Prometheus Gauge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that histogram_quantile calculates the quantiles over time
To be clear https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
uses an instant-vector so it calculates the quantile at each point in time. As opposed to something like rate
which uses a range vector that can look at multiple timestamps for each point. (I'm sure you know this, just adding it here for completeness).
Mind sharing the queries you are using here? Are you actually using the histogram_quantile function?
histogram_quantile(0.90, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket))
This is similar to the commonly used example: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
. Except instead of using rate to get the rate of increase (by taking the current value and comparing it with a previous value) and calculating quantiles from that I'm using these two histograms to create a histogram of the current state and calculating quantiles with that.
Let me rephrase:
- We agree this example is correct:
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
. Since it's from the docs. - This example calculates quantiles on a new histogram. This new histogram is generated by doing
http_request_duration_seconds_bucket - http_request_duration_seconds_bucket[previous_time]
with therate
function (https://github.com/prometheus/prometheus/blob/main/promql/functions.go#L74) - What we're doing is keeping track of our last observation (the negative bucket), and generating a new histogram with our current state of the world with
rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_bucket@previous_observation (aka negative bucket)
. Which I'm arguing is the same as 2, except we are explicit in what our previous observation was.
a. Aside, we could use rate here as well and only use the one bucket, but then we'd be only seeing the differences of when a peer moves to a different bucket which is not as useful as seeing the current distribution of streams per peer. There might be a way with promql to do what I want, but I haven't figured it out (essentially what's the difference between the current value and the last different value).
Or another way to think about this is:
- Assume
rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket
generates a valid histogram, let's call itH
. - We should be able to caluclate the quantiles from this histogram
H
with thehistogram_quantile
function. (Which is actually quite simple: https://github.com/prometheus/prometheus/blob/main/promql/quantile.go#L101.)
So the question becomes, is our assumption of 1 here correct? I'm arguing it is because it is the histogram of our current state of streams per peer. It also looks exactly like a histogram metric promql would expect (cumulative le distributions).
I know that's quite a lot, and I appreciate your input here! Thank you for thinking about this and checking my work :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that histogram_quantile calculates the quantiles over time
To be clear
https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
uses an instant-vector so it calculates the quantile at each point in time. As opposed to something likerate
which uses a range vector that can look at multiple timestamps for each point. (I'm sure you know this, just adding it here for completeness).Mind sharing the queries you are using here? Are you actually using the histogram_quantile function?
histogram_quantile(0.90, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket))
This is similar to the commonly used example:
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
. Except instead of using rate to get the rate of increase (by taking the current value and comparing it with a previous value) and calculating quantiles from that I'm using these two histograms to create a histogram of the current state and calculating quantiles with that.
This is really clever! To be honest, I didn't think about not using rate
with a histogram_quantile
. Thanks for the explainer! 🙇
3. Aside, we could use rate here as well and only use the one bucket, but then we'd be only seeing the differences of when a peer moves to a different bucket which is not as useful as seeing the current distribution of streams per peer.
👍
3. There might be a way with promql to do what I want, but I haven't figured it out (essentially what's the difference between the current value and the last different value).
Isn't this impossible at the data level as one would need to know which decrease in one bucket matches a corresponding increase in another bucket and which increase is a new peer, thus does not have a decrease in another bucket?
@mxinden does the above seem correct to you?
The above seems correct to me!
In case you can influence the Prometheus metric # HELP
text, I think it is worth a mention of the concept you are using here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI is still unhappy about this PR though.
This reverts commit 81fb49d.
(rebased off master to fix conflict) |
This adds metrics for resource manager that allows an end user to see what's happening with their go-libp2p application.
This hooks into the existing tracing setup to emit metrics and aggregate them when appropriate.
Possible improvements:
Example graphs:
Note the histogram is not rendered correctly in this snapshot. See the screenshot below for how it should look.
https://snapshots-origin.grafana.net/dashboard/snapshot/nDCjmTCnHpmEpXgtcWO7ThEDS7B5DLLn?orgId=2
How the histogram should look like:
I'll see if I can spin up a public ipfs dashboard to demo.