Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose metrics via Prometheus #1422

Merged
merged 3 commits into from Jun 30, 2015
Merged

Expose metrics via Prometheus #1422

merged 3 commits into from Jun 30, 2015

Conversation

ghost
Copy link

@ghost ghost commented Jun 25, 2015

I've been approaching #1418 from an ops perspective, where prometheus looks very promising. (Disclaimer: it comes from my old employer SoundCloud, where we had plenty of pain with Graphite, Ganglia, Statsd, and all kinds of custom metrics libraries and tools.)

This PR is massive because of Godeps (did I do this right?), but the actual integration is uninvasive, complete, and incremental.

  • uninvasive (dont change all the interfaces everywhere)
  • complete (allow anything to be able to add metrics)
  • incremental (dont add everything just yet, add as we need)

Prometheus exposes metrics via a net/http handler, which is then aggregated by the prometheus daemon, which serves graphs and timeseries, and can back proper dashboards like Graphana.

Currently, this handler is registered with the API (http://127.0.0.1:5001/metrics) and could be used by command line tools, in order to get metrics snapshots, e.g. "number of peers right now". These metrics can be tagged with "labels" to add more dimensions, e.g. "http responses with status code 502", "peers connected via cjdns", "metrics from pluto.i.ipfs.io" etc.

@jbenet jbenet added the backlog label Jun 25, 2015
@GitCop
Copy link

GitCop commented Jun 25, 2015

There were the following issues with your Pull Request

  • Commit: 4dd5f7f
    • Invalid signoff. Commit message must end with License: MIT
      Signed-off-by: .* <.*>
  • Commit: f7b373f
    • Invalid signoff. Commit message must end with License: MIT
      Signed-off-by: .* <.*>
  • Commit: 7a9c6c5
    • Invalid signoff. Commit message must end with License: MIT
      Signed-off-by: .* <.*>

Guidelines are available to help. Your feedback on GitCop is welcome on this issue


This message was auto-generated by https://gitcop.com

@ghost
Copy link
Author

ghost commented Jun 25, 2015

I haven't really covered how other use cases mentioned in #1418 could be solved, just wanted to finally get this out here for feedback, and will add more info later tonight.

@@ -12,6 +12,7 @@ import (
_ "github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/codahale/metrics/runtime"
ma "github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-multiaddr"
"github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-multiaddr-net"
prom "github.com/prometheus/client_golang/prometheus"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to run make vendor to make sure imports are re-written. if godeps yells at you, see #1385 (comment)

@jbenet
Copy link
Member

jbenet commented Jun 25, 2015

couple comments above, but this LGTM.

and, prometheus comes well recommended. (and i love the name).

maybe, actually, could put the metrics under

/<prefix like debug or sys>/metrics/prometheus

so that we could add other metrics there if we needed to

@GitCop
Copy link

GitCop commented Jun 26, 2015

There were the following issues with your Pull Request

  • Commit: 05d80dd
    • Invalid signoff. Commit message must end with License: MIT
      Signed-off-by: .* <.*>
  • Commit: 3b2aa58
    • Invalid signoff. Commit message must end with License: MIT
      Signed-off-by: .* <.*>
  • Commit: d68bbda
    • Invalid signoff. Commit message must end with License: MIT
      Signed-off-by: .* <.*>

Guidelines are available to help. Your feedback on GitCop is welcome on this issue


This message was auto-generated by https://gitcop.com

@ghost
Copy link
Author

ghost commented Jun 26, 2015

  • changed the handler mount to /debug/metrics/prometheus
  • moved the ServeOption into its own parameterized function in core/corehttp/prometheus.go
  • ran make vendor

I agree about the distinct namespace for "internal" handlers.

@whyrusleeping whyrusleeping mentioned this pull request Jun 26, 2015
34 tasks
@jbenet
Copy link
Member

jbenet commented Jun 26, 2015

@lgierth oh uh: https://travis-ci.org/ipfs/go-ipfs/jobs/68493979#L194 this means the metrics clashed when the tests tried to run.

looks like we need to add some uniqueness to the metrics names. This could be done either by prefixing with a ptr, or a monotonically increasing number (global for the proc :/)

@ghost
Copy link
Author

ghost commented Jun 26, 2015

It's using prometheus.MustRegisterOrGet() now, which gracefully skips over duplicate metric collectors -- build's green now

@@ -82,6 +90,8 @@ func NewSwarm(ctx context.Context, listenAddrs []ma.Multiaddr,
s.cg.SetTeardown(s.teardown)
s.SetConnHandler(nil) // make sure to setup our own conn handler.

prom.MustRegisterOrGet(peersTotal)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh now i understand. sorry, i missed this earlier. this is getting initialized on init. I believe this needs to be on the swarm itself (i.e. a gauge per swarm). in many circumstances we create multiple swarms per process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or if not stored on the swarm, then create a new one per swarm? point it, it should be independent gauges per swarm

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's for another PR, imo. Right now this basically counts "total number of peers over all swarms", which is a good simple metric to get started with.

If the swarm itself is an interesting dimension of the peers count, we can replace the Counter with a CounterVec, which allows supplying labels (i.e. dimensions) with each Inc()/Dec() call. More info on labels.

CounterVec is a Collector that bundles a set of Counters that all share the same Desc, but have different values for their variable labels. This is used if you want to count the same thing partitioned by various dimensions (e.g. number of HTTP requests, partitioned by response code and method). Create instances with NewCounterVec.

peersTotal.GetMetricWithLabelValues(Labels{"swarm": "foo"}).Inc(1)

Labels should only be used with dimensions that are more or less bounded. For example, a unix process ID is a bad label.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"total number of peers over all swarms", which is a good simple metric to get started with.

What would this metric be used for?

If the swarm itself is an interesting dimension of the peers count,

It's not just an interesting dimension. not conditioning on the swarm is incorrect.

not sure you see what i mean: there's no relationship implied between two different swarms in the same process (they will most likely belong to different nodes, but not necessarily), and it will double count any connections, between swarms in the same process, and from the multiple swarms to the same peer. (This happens all the time in tests, for example). so, in an in-proc network of 5 nodes, you'd see "25 Number of connected peers" which is an incorrect statement, because there's "25 individual connections, across only 5 peers".

I don't want to merge things to the surface area (interface) that are either (partially) incorrect or that we'll just have to change right away. anything we merge is stuff that someone might start using. I'd probably start using this metric just because it's there, but it would be misleading.

If want to merge just the vendoring of the lib with a test metric, sure let's do it, but maybe go for something else, that isn't a singleton?

Labels should only be used with dimensions that are more or less bounded. For example, a unix process ID is a bad label.

What do you mean by bounded in this case? (cause unix pids are bounded, but not in the way you mean)

@jbenet
Copy link
Member

jbenet commented Jun 27, 2015

build's green now

but i think it's incorrect. see #1422 (comment)

License: MIT
Signed-off-by: Lars Gierth <larsg@systemli.org>
@ghost
Copy link
Author

ghost commented Jun 29, 2015

Updated, it now adds the local PeerID as a dimension (label in prometheus lingo). With prometheus' automatic labels included, this ends up as:

ipfs_p2p_peers_total{instance="localhost:5001",job="ipfs",monitor="peppy",peer_id="QmW3UhWb9a79PquifrA9r37ji6RAojpLBPkzgf7cmnwjov"} => int

The instance will be something like h.pluto.ipfs.io:5001 on the gateways.

What do you mean by bounded in this case? (cause unix pids are bounded, but not in the way you mean)

I meant something that doesn't end up with a huge cardinality after a short amount of time. PeerIDs are constant on the nodes that we want to monitor. I can imagine use cases with throw-away identities, but these cases will just have to live with the fact that this particular metric might not perform that well.

}

func (n *ps2netNotifee) Disconnected(c *ps.Conn) {
n.not.Disconnected(n.net, inet.Conn((*Conn)(c)))
if metric, err := peersTotalGauge(n.net.LocalPeer()); err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder, do we want a log.Debug if err != nil ? shrug maybe not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// An error is returned if the number and names of the Labels are inconsistent
// with those of the VariableLabels in Desc.

I can swap it for the other func that panics 🤘

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we shouldnt ship things with inconsistent labels. im on the fence (a debug log may be ok), but dont care much either way. i think @whyrusleeping would say panic.

@jbenet
Copy link
Member

jbenet commented Jun 29, 2015

ok sounds good!

@lgierth i think this LGTM and is ready for merge. ready?

@ghost
Copy link
Author

ghost commented Jun 29, 2015

Not yet, this currently increments/decrements once per notifee, which is wrong -- I'll put the increment/decrement into a distinct notifee, which is anyhow the cleanest way. (Also just noticed that Travis is red).

@jbenet
Copy link
Member

jbenet commented Jun 29, 2015

i think it's an intermittent travis failure. we've still to get these to run perfectly right. (it's extra paranoid, which helps, but we really should fix these to be able to just get safe greens)

Lars Gierth added 2 commits June 30, 2015 16:32
License: MIT
Signed-off-by: Lars Gierth <larsg@systemli.org>
License: MIT
Signed-off-by: Lars Gierth <larsg@systemli.org>
@ghost
Copy link
Author

ghost commented Jun 30, 2015

Okay, it's good now: 8b164f9 -- the count is reliably the same as ipfs swarm peers | wc -l. I also changed it to panic in case the label is wrong. That'll annoy people so much that they'll fix it, and is imho a good precendence for other metrics.

The Travis failure looks unrelated (mock and mock_notif).

@jbenet
Copy link
Member

jbenet commented Jun 30, 2015

@lgierth LGTM! 👍

jbenet added a commit that referenced this pull request Jun 30, 2015
Expose metrics via Prometheus
@jbenet jbenet merged commit 74a331d into ipfs:master Jun 30, 2015
@jbenet jbenet removed the backlog label Jun 30, 2015
@whyrusleeping whyrusleeping mentioned this pull request Jul 1, 2015
44 tasks
@ghost ghost deleted the metrics branch July 13, 2015 15:36
hacdias pushed a commit to ipfs/boxo that referenced this pull request Jan 27, 2023
Expose metrics via Prometheus

This commit was moved from ipfs/kubo@74a331d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants