Refactor metrics to be more accurate and useful #68

Closed
banks opened this issue Feb 10, 2016 · 9 comments

Comments

banks (Member) commented Feb 10, 2016

I'm looking into the metrics that Centrifugo exposes via the stats API call as we move to put it into production.

I have a couple of issues that are not show-stoppers but that I think could be handled better:

  • Our typical monitoring setup uses the Diamond collector on localhost to gather metrics from the process once every 5 seconds. The current stats API is awkward because:
    • Output includes all nodes in the cluster, and figuring out which one is the current node means poking around trying to match the node name. That's not just fiddly; it feels like a hack to rely on knowing how Centrifugo chooses to name nodes, and it will break if --name is given at runtime in someone's setup. Relying on --name is marginally better, but it requires that your Centrifugo config and Diamond config agree on the name to look for, and that you coordinate unique names across the cluster somehow. All of this is possible with good config management, but it seems way too hard for something so simple.
    • The counter values reset periodically (every node_metrics_interval). That's OK if you are consuming them directly, but it makes it impossible for an external process like Diamond to track the actual number of requests between polls, i.e. you can't get better than the configured granularity.
    • In addition, some of the actual stats (the times) are not really very useful. Not a show-stopper (I don't need to monitor those right now), but:
    • An exponentially decaying sampled histogram is already a little difficult to reason about - it shows "recently" weighted results but you don't know how recent. But then Centrifugo only shows you the value of a moving, decaying sample once every node_metrics_interval... So what you actually see is the "recently weighted" mean/max value as of up to 1 minute ago.
    • Actually using these for latencies is plain wrong. I went into detail but it's better to just link to https://groups.google.com/forum/#!msg/mechanical-sympathy/I4JfZQ1GYi8/ocuzIyC3N9EJ, which talks directly about the Dropwizard implementation that go-metrics is a port of.

As well as that, there is also a minor race condition:

  • updateMetricsOnce locks the metrics mutex and copies values from counters and histograms
  • then it unlocks
  • then it clears the counters
  • any increment that comes in between reading the counter values and clearing them will be lost (we correctly don't lock the mutex to increment, since go-metrics already handles concurrency); a rough sketch of this window follows
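
To make the window concrete, here is a minimal, self-contained illustration of the sequence above (the names and structure are hypothetical, not the actual Centrifugo code):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Illustrative only: increments use atomics (as go-metrics does internally),
// while the periodic update reads under a lock and then clears afterwards.
type racyMetrics struct {
	mu       sync.Mutex
	requests int64
}

func (m *racyMetrics) inc() { atomic.AddInt64(&m.requests, 1) }

func (m *racyMetrics) updateMetricsOnce() int64 {
	m.mu.Lock()
	snapshot := atomic.LoadInt64(&m.requests) // 1. copy value under lock
	m.mu.Unlock()                             // 2. unlock
	// 3. any inc() that lands here is wiped by the clear below, never reported
	atomic.StoreInt64(&m.requests, 0)
	return snapshot
}

func main() {
	m := &racyMetrics{}
	m.inc()
	fmt.Println(m.updateMetricsOnce()) // prints 1; an inc() racing with the clear would vanish
}
```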

Proposal

Here is one option that I think solves all my issues. It's quite a big change, and there are certainly variations that would still work for our case if there are downsides to this solution that I don't see.

  • Stop using go-metrics. It's useful and has some interesting properties, but we are already losing almost all the value by adding the interval-based update to publicly visible values. In addition, sampled histograms are the wrong way to measure latency, and without those, the rest of what we use can just be replaced with integer counters.
  • Add a local_node_only param to stats, OR create a new API endpoint localstats
  • Get rid of node_metrics_interval and periodic updating - just return current counter values on each call to stats (or each ping message sent to other servers)
    • If others really rely on the 1-minute summary counts we could include both, but simple counters are much more useful in general
    • Just use plain old atomic uint64 for all counters and don't worry about locking (see the sketch after this list)
  • Get rid of the timers. If we really want them we should use something like https://www.godoc.org/github.com/vulcand/vulcan/metrics#RollingHistogram which uses a robust measurement technique (HDRHistogram) and supports a rolling window of these to give an accurate distribution of the last N minutes etc. If we did that, I'd recommend having something like 15 buckets of 1 min, with each histogram having a range of 100µs to 10 mins (600 million µs) with 3 significant figures. That gives you very high definition for the last 15 mins in a bit over 1MB per timer (on my mac). That seems very reasonable if we have ~20 timers on a modern machine. We could make it configurable if we want to support people on super-low-memory VPSes or something. Then expose in the results from stats at least:
    • mean
    • 50%ile
    • 99%ile
    • 99.99%ile
    • max
    • It looks like it's not thread-safe though, so we would need to put a mutex around it, or send results on a lightly buffered channel and have it updated in a single goroutine
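
As a sketch of the plain-counter idea above (field and method names are illustrative, not Centrifugo's actual code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical replacement for the go-metrics counters: plain uint64 fields
// bumped with atomic ops, never reset, and read directly when building the
// stats (or stats_raw) response.
type counters struct {
	numMsgPublished uint64
	numAPIRequests  uint64
}

func (c *counters) incMsgPublished() { atomic.AddUint64(&c.numMsgPublished, 1) }
func (c *counters) incAPIRequest()   { atomic.AddUint64(&c.numAPIRequests, 1) }

// snapshot returns the current monotonically increasing values; an external
// collector (Diamond, Graphite, etc.) derives rates from successive polls.
func (c *counters) snapshot() map[string]uint64 {
	return map[string]uint64{
		"num_msg_published": atomic.LoadUint64(&c.numMsgPublished),
		"num_api_requests":  atomic.LoadUint64(&c.numAPIRequests),
	}
}

func main() {
	c := &counters{}
	c.incMsgPublished()
	c.incAPIRequest()
	fmt.Println(c.snapshot())
}
```
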
FZambia (Member) commented Feb 10, 2016

Hello! Thanks for the detailed proposal!

I agree with almost everything. The only thing: at work we send metrics directly to Graphite and just want to see actual values aggregated over a one-minute interval, so for us it's important to keep the current behaviour of counters. So it seems we need to support both ways.

And I especially agree with your thoughts about the current timer behaviour - absolutely useless :)

banks (Member, Author) commented Feb 10, 2016

send metrics directly to Graphite

Interesting, how does that work? I haven't seen options to configure Graphite output in the code.

Keeping optional time-based counters seems reasonable - I did originally suggest that but my ticket got too long so I simplified ;)

banks (Member, Author) commented Feb 10, 2016

Hm, I see go-metrics supports Graphite output, but I don't see any code that configures that. Plus, if you did it at the go-metrics level it wouldn't use the existing metrics interval stuff anyway...

FZambia (Member) commented Feb 10, 2016

Our main site is written in Django and we use Celery for async and periodic jobs, so we just ask Centrifugo for stats every minute and send the returned metrics to Graphite.

# Assumed context (not shown in the snippet): `app` is the project's Celery
# instance, `Client` is the Centrifugo API client, `settings` holds the Django
# settings, and `send_to_graphite` is a local helper that ships a
# (key, value, timestamp) sample to Graphite.
import time

@app.task
def export_centrifugo_metrics():
    client = Client(settings.CENTRIFUGO_ADDRESS, settings.CENTRIFUGO_SECRET, timeout=2)
    stats, error = client.stats()
    if error:
        raise error

    fields = (
        "cpu_usage", "num_goroutine", "num_msg_published",
        "time_api_max", "num_api_requests", "time_client_max",
        "time_client_mean", "num_unique_clients", "num_msg_queued",
        "memory_sys", "bytes_client_in", "num_cpu", "num_clients",
        "time_api_mean", "num_client_requests", "num_msg_sent",
        "bytes_client_out", "gomaxprocs", "num_channels"
    )

    nodes = stats["nodes"]
    for node_info in nodes:
        name = node_info["name"]
        for field in fields:
            value = node_info.get(field)
            graphite_key = "centrifugo.%s.%s" % (name, field)
            send_to_graphite([graphite_key, value, time.time()])

FZambia (Member) commented Feb 10, 2016

lol, I just thought that we have a problem with counters with this approach :)

UPD: no problem - I forgot how metrics in Centrifugo work :)

banks (Member, Author) commented Feb 10, 2016

Yeah, that is equivalent to what Diamond does, although Diamond already supports remembering the last value of a counter and only submitting the difference every 10 seconds (or however often it runs).

In your setup, any drift between when the Celery job runs and when Centrifugo aggregates could cause races where the same values get submitted for 2 minutes and one real minute gets lost.

This is why plain counters are more general: the external tool should manage figuring out the rate of change based on when it actually executes, something like the sketch below.
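
For example, the collector side could look roughly like this (a hypothetical sketch, not Diamond's actual implementation):

```go
package main

import "fmt"

// Remember the previous raw counter value per metric and submit only the
// difference, so the rate reflects when the poll actually ran rather than
// Centrifugo's aggregation interval.
type deltaTracker struct {
	last map[string]uint64
}

func newDeltaTracker() *deltaTracker {
	return &deltaTracker{last: make(map[string]uint64)}
}

func (d *deltaTracker) delta(name string, raw uint64) uint64 {
	prev := d.last[name]
	d.last[name] = raw
	if raw < prev {
		// counter went backwards (e.g. node restart): report the raw value
		return raw
	}
	return raw - prev
}

func main() {
	t := newDeltaTracker()
	fmt.Println(t.delta("num_msg_published", 100)) // 100 on the first poll
	fmt.Println(t.delta("num_msg_published", 130)) // 30 published since the last poll
}
```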

banks (Member, Author) commented Feb 10, 2016

Anyway, we can keep the 1-minute summaries if you want - it does save you from keeping persistent state elsewhere in your setup.

banks (Member, Author) commented Feb 10, 2016

From gitter discussion:

The first option would be to just keep a single stats method with no params and satisfy the requirements here by:

  • keeping aggregated counts
  • also including raw counts
  • including a top-level key local_node_name or UUID such that a consumer can easily filter to just the local node without more options or relying on coordinated config.

But this has one unpleasant drawback: raw counts from non-local nodes will be subtly different semantically from the local node since they will only update once every ping interval. It's also generally quite messy.

The alternative, which I think is cleaner and which I prefer, is to add a new method stats_raw or similar with the following semantics:

  • Only for local node
  • Outputs raw counter values (plus the other things like num cpus/goroutines etc.)

Either way we should also:

  • remove timers (keep them with a 0 value in stats output for now so as not to upset any code relying on them)
  • remove go-metrics and just store a struct of uint64 counters with atomic increments and loads
    • the metrics struct will need to keep a separate set of counters for aggregation (i.e. remember the total value at the last interval as well as the value we return)
    • each interval we simply atomic.LoadUint64(&counter) the raw counter, store the difference between that and the count we saved last update, then save the new raw value
    • then when we generate the stats response we return only the difference (see the sketch after this list)
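
A minimal sketch of that aggregation step (hypothetical names, not the final implementation):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// The raw counter is never reset; the aggregated value is just the difference
// between successive snapshots taken every node_metrics_interval.
type counter struct {
	raw          uint64 // monotonically increasing, bumped with atomic.AddUint64
	lastRaw      uint64 // raw value captured at the previous aggregation tick
	lastInterval uint64 // delta over the previous interval; what aggregated stats return
}

func (c *counter) inc() { atomic.AddUint64(&c.raw, 1) }

// aggregateOnce runs once per interval; since nothing is reset, increments
// racing with it cannot be lost.
func (c *counter) aggregateOnce() {
	now := atomic.LoadUint64(&c.raw)
	atomic.StoreUint64(&c.lastInterval, now-atomic.LoadUint64(&c.lastRaw))
	atomic.StoreUint64(&c.lastRaw, now)
}

func main() {
	c := &counter{}
	c.inc()
	c.inc()
	c.aggregateOnce()
	fmt.Println(atomic.LoadUint64(&c.raw), atomic.LoadUint64(&c.lastInterval)) // 2 2
}
```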

FZambia (Member) commented Feb 20, 2016

@banks created a separate issue for timers, closing this one, thanks a lot!

FZambia closed this as completed Feb 20, 2016