Expose metrics via Prometheus #1422

ghost · 2015-06-25T15:54:34Z

I've been approaching #1418 from an ops perspective, where prometheus looks very promising. (Disclaimer: it comes from my old employer SoundCloud, where we had plenty of pain with Graphite, Ganglia, Statsd, and all kinds of custom metrics libraries and tools.)

This PR is massive because of Godeps (did I do this right?), but the actual integration is uninvasive, complete, and incremental.

uninvasive (dont change all the interfaces everywhere)

complete (allow anything to be able to add metrics)

incremental (dont add everything just yet, add as we need)

Prometheus exposes metrics via a net/http handler, which is then aggregated by the prometheus daemon, which serves graphs and timeseries, and can back proper dashboards like Graphana.

Currently, this handler is registered with the API (http://127.0.0.1:5001/metrics) and could be used by command line tools, in order to get metrics snapshots, e.g. "number of peers right now". These metrics can be tagged with "labels" to add more dimensions, e.g. "http responses with status code 502", "peers connected via cjdns", "metrics from pluto.i.ipfs.io" etc.

GitCop · 2015-06-25T15:54:35Z

There were the following issues with your Pull Request

Commit: 4dd5f7f
- Invalid signoff. Commit message must end with License: MIT
  Signed-off-by: .* <.*>
Commit: f7b373f
- Invalid signoff. Commit message must end with License: MIT
  Signed-off-by: .* <.*>
Commit: 7a9c6c5
- Invalid signoff. Commit message must end with License: MIT
  Signed-off-by: .* <.*>

Guidelines are available to help. Your feedback on GitCop is welcome on this issue

This message was auto-generated by https://gitcop.com

ghost · 2015-06-25T15:59:21Z

I haven't really covered how other use cases mentioned in #1418 could be solved, just wanted to finally get this out here for feedback, and will add more info later tonight.

jbenet · 2015-06-25T21:26:50Z

cmd/ipfs/daemon.go

@@ -12,6 +12,7 @@ import (
 	_ "github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/codahale/metrics/runtime"
 	ma "github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-multiaddr"
 	"github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-multiaddr-net"
+	prom "github.com/prometheus/client_golang/prometheus"


want to run make vendor to make sure imports are re-written. if godeps yells at you, see #1385 (comment)

jbenet · 2015-06-25T21:29:53Z

couple comments above, but this LGTM.

and, prometheus comes well recommended. (and i love the name).

maybe, actually, could put the metrics under

/<prefix like debug or sys>/metrics/prometheus

so that we could add other metrics there if we needed to

GitCop · 2015-06-26T14:27:45Z

There were the following issues with your Pull Request

Commit: 05d80dd
- Invalid signoff. Commit message must end with License: MIT
  Signed-off-by: .* <.*>
Commit: 3b2aa58
- Invalid signoff. Commit message must end with License: MIT
  Signed-off-by: .* <.*>
Commit: d68bbda
- Invalid signoff. Commit message must end with License: MIT
  Signed-off-by: .* <.*>

Guidelines are available to help. Your feedback on GitCop is welcome on this issue

This message was auto-generated by https://gitcop.com

ghost · 2015-06-26T15:13:54Z

changed the handler mount to /debug/metrics/prometheus
moved the ServeOption into its own parameterized function in core/corehttp/prometheus.go
ran make vendor

I agree about the distinct namespace for "internal" handlers.

jbenet · 2015-06-26T20:09:07Z

@lgierth oh uh: https://travis-ci.org/ipfs/go-ipfs/jobs/68493979#L194 this means the metrics clashed when the tests tried to run.

looks like we need to add some uniqueness to the metrics names. This could be done either by prefixing with a ptr, or a monotonically increasing number (global for the proc :/)

ghost · 2015-06-26T22:34:04Z

It's using prometheus.MustRegisterOrGet() now, which gracefully skips over duplicate metric collectors -- build's green now

jbenet · 2015-06-27T05:54:24Z

p2p/net/swarm/swarm.go

@@ -82,6 +90,8 @@ func NewSwarm(ctx context.Context, listenAddrs []ma.Multiaddr,
 	s.cg.SetTeardown(s.teardown)
 	s.SetConnHandler(nil) // make sure to setup our own conn handler.

+	prom.MustRegisterOrGet(peersTotal)


oh now i understand. sorry, i missed this earlier. this is getting initialized on init. I believe this needs to be on the swarm itself (i.e. a gauge per swarm). in many circumstances we create multiple swarms per process.

or if not stored on the swarm, then create a new one per swarm? point it, it should be independent gauges per swarm

That's for another PR, imo. Right now this basically counts "total number of peers over all swarms", which is a good simple metric to get started with.

If the swarm itself is an interesting dimension of the peers count, we can replace the Counter with a CounterVec, which allows supplying labels (i.e. dimensions) with each Inc()/Dec() call. More info on labels.

CounterVec is a Collector that bundles a set of Counters that all share the same Desc, but have different values for their variable labels. This is used if you want to count the same thing partitioned by various dimensions (e.g. number of HTTP requests, partitioned by response code and method). Create instances with NewCounterVec.

peersTotal.GetMetricWithLabelValues(Labels{"swarm": "foo"}).Inc(1)

Labels should only be used with dimensions that are more or less bounded. For example, a unix process ID is a bad label.

"total number of peers over all swarms", which is a good simple metric to get started with.

What would this metric be used for?

If the swarm itself is an interesting dimension of the peers count,

It's not just an interesting dimension. not conditioning on the swarm is incorrect.

not sure you see what i mean: there's no relationship implied between two different swarms in the same process (they will most likely belong to different nodes, but not necessarily), and it will double count any connections, between swarms in the same process, and from the multiple swarms to the same peer. (This happens all the time in tests, for example). so, in an in-proc network of 5 nodes, you'd see "25 Number of connected peers" which is an incorrect statement, because there's "25 individual connections, across only 5 peers".

I don't want to merge things to the surface area (interface) that are either (partially) incorrect or that we'll just have to change right away. anything we merge is stuff that someone might start using. I'd probably start using this metric just because it's there, but it would be misleading.

If want to merge just the vendoring of the lib with a test metric, sure let's do it, but maybe go for something else, that isn't a singleton?

Labels should only be used with dimensions that are more or less bounded. For example, a unix process ID is a bad label.

What do you mean by bounded in this case? (cause unix pids are bounded, but not in the way you mean)

jbenet · 2015-06-27T06:31:34Z

build's green now

but i think it's incorrect. see #1422 (comment)

License: MIT Signed-off-by: Lars Gierth <larsg@systemli.org>

ghost · 2015-06-29T21:52:09Z

Updated, it now adds the local PeerID as a dimension (label in prometheus lingo). With prometheus' automatic labels included, this ends up as:

ipfs_p2p_peers_total{instance="localhost:5001",job="ipfs",monitor="peppy",peer_id="QmW3UhWb9a79PquifrA9r37ji6RAojpLBPkzgf7cmnwjov"} => int

The instance will be something like h.pluto.ipfs.io:5001 on the gateways.

What do you mean by bounded in this case? (cause unix pids are bounded, but not in the way you mean)

I meant something that doesn't end up with a huge cardinality after a short amount of time. PeerIDs are constant on the nodes that we want to monitor. I can imagine use cases with throw-away identities, but these cases will just have to live with the fact that this particular metric might not perform that well.

jbenet · 2015-06-29T22:10:40Z

p2p/net/swarm/swarm.go

 }

 func (n *ps2netNotifee) Disconnected(c *ps.Conn) {
 	n.not.Disconnected(n.net, inet.Conn((*Conn)(c)))
+	if metric, err := peersTotalGauge(n.net.LocalPeer()); err == nil {


wonder, do we want a log.Debug if err != nil ? shrug maybe not.

// An error is returned if the number and names of the Labels are inconsistent
// with those of the VariableLabels in Desc.

I can swap it for the other func that panics 🤘

yeah, we shouldnt ship things with inconsistent labels. im on the fence (a debug log may be ok), but dont care much either way. i think @whyrusleeping would say panic.

jbenet · 2015-06-29T22:10:55Z

ok sounds good!

@lgierth i think this LGTM and is ready for merge. ready?

ghost · 2015-06-29T22:12:48Z

Not yet, this currently increments/decrements once per notifee, which is wrong -- I'll put the increment/decrement into a distinct notifee, which is anyhow the cleanest way. (Also just noticed that Travis is red).

jbenet · 2015-06-29T22:17:52Z

i think it's an intermittent travis failure. we've still to get these to run perfectly right. (it's extra paranoid, which helps, but we really should fix these to be able to just get safe greens)

License: MIT Signed-off-by: Lars Gierth <larsg@systemli.org>

ghost · 2015-06-30T15:01:02Z

Okay, it's good now: 8b164f9 -- the count is reliably the same as ipfs swarm peers | wc -l. I also changed it to panic in case the label is wrong. That'll annoy people so much that they'll fix it, and is imho a good precendence for other metrics.

The Travis failure looks unrelated (mock and mock_notif).

jbenet · 2015-06-30T19:32:20Z

@lgierth LGTM! 👍

Expose metrics via Prometheus

Expose metrics via Prometheus This commit was moved from ipfs/kubo@74a331d

jbenet added the backlog label Jun 25, 2015

jbenet reviewed Jun 25, 2015
View reviewed changes

whyrusleeping mentioned this pull request Jun 26, 2015

Sprint M ipfs/team-mgmt#17

Closed

34 tasks

jbenet reviewed Jun 27, 2015
View reviewed changes

api: add /metrics endpoint for prometheus

ed8d3ae

License: MIT Signed-off-by: Lars Gierth <larsg@systemli.org>

jbenet reviewed Jun 29, 2015
View reviewed changes

Lars Gierth added 2 commits June 30, 2015 16:32

swarm: export ipfs_p2p_peers_total metric

8b164f9

License: MIT Signed-off-by: Lars Gierth <larsg@systemli.org>

godep: add prometheus

e0a4b3a

License: MIT Signed-off-by: Lars Gierth <larsg@systemli.org>

jbenet added a commit that referenced this pull request Jun 30, 2015

Merge pull request #1422 from lgierth/metrics

74a331d

Expose metrics via Prometheus

jbenet merged commit 74a331d into ipfs:master Jun 30, 2015

jbenet removed the backlog label Jun 30, 2015

whyrusleeping mentioned this pull request Jul 1, 2015

Sprint N ipfs/team-mgmt#20

Closed

44 tasks

ghost deleted the metrics branch July 13, 2015 15:36

ghost mentioned this pull request Oct 5, 2015

daemon: instrument the gateway and api HTTP handlers #1799

Merged

autonome mentioned this pull request Mar 8, 2021

Explain what the Prometheus API is, and add some examples. ipfs/ipfs-docs#682

Closed

hacdias pushed a commit to ipfs/boxo that referenced this pull request Jan 27, 2023

Merge pull request ipfs/kubo#1422 from lgierth/metrics

650267e

Expose metrics via Prometheus This commit was moved from ipfs/kubo@74a331d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose metrics via Prometheus #1422

Expose metrics via Prometheus #1422

ghost commented Jun 25, 2015

GitCop commented Jun 25, 2015

ghost commented Jun 25, 2015

jbenet Jun 25, 2015

jbenet commented Jun 25, 2015

GitCop commented Jun 26, 2015

ghost commented Jun 26, 2015

jbenet commented Jun 26, 2015

ghost commented Jun 26, 2015

jbenet Jun 27, 2015

jbenet Jun 27, 2015

ghost Jun 27, 2015

jbenet Jun 27, 2015

jbenet commented Jun 27, 2015

ghost commented Jun 29, 2015

jbenet Jun 29, 2015

ghost Jun 29, 2015

jbenet Jun 29, 2015

jbenet commented Jun 29, 2015

ghost commented Jun 29, 2015

jbenet commented Jun 29, 2015

ghost commented Jun 30, 2015

jbenet commented Jun 30, 2015

Expose metrics via Prometheus #1422

Expose metrics via Prometheus #1422

Conversation

ghost commented Jun 25, 2015

GitCop commented Jun 25, 2015

ghost commented Jun 25, 2015

Choose a reason for hiding this comment

jbenet commented Jun 25, 2015

GitCop commented Jun 26, 2015

ghost commented Jun 26, 2015

jbenet commented Jun 26, 2015

ghost commented Jun 26, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbenet commented Jun 27, 2015

ghost commented Jun 29, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbenet commented Jun 29, 2015

ghost commented Jun 29, 2015

jbenet commented Jun 29, 2015

ghost commented Jun 30, 2015

jbenet commented Jun 30, 2015