latencies for http api responses and client api responses #97

Merged - 8 commits merged into master from hdr_histogram on Jul 28, 2016

Conversation

@FZambia (Member) commented Jul 24, 2016

Here is the first (without tests yet) implementation of latencies for HTTP API response times and client API response times. I plan to finish this soon - here I'm just showing the first concept.

There are some caveats - for example, we include latencies for different types of methods (publish, broadcast, presence, etc.) in a single histogram. It's rather hard to separate them all into different histograms. Maybe if we had different API handlers for each method it would be more logical. But anyway, the resulting numbers can be useful for monitoring.

Also, I still need to think about proper synchronization.
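A minimal sketch of this idea, assuming the codahale/hdrhistogram Go package rather than the HDRHistogramRegistry wrapper used in this PR (the wrapper is not reproduced here, and helper names like observe and handleAPIRequest are illustrative): each API call is timed, and the elapsed microseconds are recorded into one shared histogram guarded by a mutex.

package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/codahale/hdrhistogram"
)

// One shared histogram for all API methods (publish, broadcast,
// presence, ...), tracking 1µs .. 10s with 3 significant figures.
var (
	mu           sync.Mutex
	apiHistogram = hdrhistogram.New(1, 10000000, 3)
)

// observe records the latency of a single API call in microseconds.
func observe(elapsed time.Duration) {
	mu.Lock()
	apiHistogram.RecordValue(elapsed.Nanoseconds() / 1000)
	mu.Unlock()
}

// handleAPIRequest stands in for an HTTP API or client API handler.
func handleAPIRequest() {
	started := time.Now()
	time.Sleep(200 * time.Microsecond) // placeholder for real work
	observe(time.Since(started))
}

func main() {
	for i := 0; i < 50; i++ {
		handleAPIRequest()
	}
	mu.Lock()
	defer mu.Unlock()
	fmt.Println("count:", apiHistogram.TotalCount())
	fmt.Println("p50:", apiHistogram.ValueAtQuantile(50), "µs")
	fmt.Println("p99:", apiHistogram.ValueAtQuantile(99), "µs")
}

A plain mutex around RecordValue is the simplest way to make the shared histogram safe for concurrent handlers, which is the synchronization question mentioned above.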

Example of stats:

{
    "uid": "2",
    "method": "stats",
    "error": null,
    "body": {
        "data": {
            "nodes": [
                {
                    "uid": "918600b0-69e8-4380-8d86-644e4eb9068b",
                    "name": "MacAir.local_8000",
                    "num_goroutine": 37,
                    "num_clients": 1,
                    "num_unique_clients": 1,
                    "num_channels": 4,
                    "started_at": 1469740187,
                    "gomaxprocs": 4,
                    "num_cpu": 4,
                    "num_msg_published": 0,
                    "num_msg_queued": 0,
                    "num_msg_sent": 0,
                    "num_api_requests": 0,
                    "num_client_requests": 0,
                    "bytes_client_in": 0,
                    "bytes_client_out": 0,
                    "time_api_mean": 0,
                    "time_client_mean": 0,
                    "time_api_max": 0,
                    "time_client_max": 0,
                    "latencies": {
                        "client_api_15_count": 22,
                        "client_api_15_microseconds_50%ile": 184,
                        "client_api_15_microseconds_90%ile": 1880,
                        "client_api_15_microseconds_99%ile": 3185,
                        "client_api_15_microseconds_99.99%ile": 3185,
                        "client_api_15_microseconds_max": 3185,
                        "client_api_15_microseconds_mean": 560,
                        "client_api_15_microseconds_min": 116,
                        "client_api_1_count": 22,
                        "client_api_1_microseconds_50%ile": 184,
                        "client_api_1_microseconds_90%ile": 1880,
                        "client_api_1_microseconds_99%ile": 3185,
                        "client_api_1_microseconds_99.99%ile": 3185,
                        "client_api_1_microseconds_max": 3185,
                        "client_api_1_microseconds_mean": 560,
                        "client_api_1_microseconds_min": 116,
                        "http_api_15_count": 484579,
                        "http_api_15_microseconds_50%ile": 52,
                        "http_api_15_microseconds_90%ile": 125,
                        "http_api_15_microseconds_99%ile": 1221,
                        "http_api_15_microseconds_99.99%ile": 51295,
                        "http_api_15_microseconds_max": 173567,
                        "http_api_15_microseconds_mean": 130,
                        "http_api_15_microseconds_min": 35,
                        "http_api_1_count": 484579,
                        "http_api_1_microseconds_50%ile": 52,
                        "http_api_1_microseconds_90%ile": 125,
                        "http_api_1_microseconds_99%ile": 1218,
                        "http_api_1_microseconds_99.99%ile": 46687,
                        "http_api_1_microseconds_max": 173567,
                        "http_api_1_microseconds_mean": 130,
                        "http_api_1_microseconds_min": 35
                    },
                    "memory_sys": 0,
                    "cpu_usage": 0
                }
            ],
            "metrics_interval": 60
        }
    }
}
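For reference, the flat keys inside the "latencies" map above (count, min, max, mean, and the requested percentiles per name prefix) can be produced by dumping each histogram's summary values. A rough sketch, again assuming the codahale/hdrhistogram package and treating a prefix like "client_api_1" as given:

package main

import (
	"fmt"

	"github.com/codahale/hdrhistogram"
)

// latencyValues flattens one histogram into keys shaped like the
// "latencies" map above, e.g. "client_api_1_microseconds_50%ile".
// The prefix is taken as given here.
func latencyValues(prefix string, h *hdrhistogram.Histogram) map[string]int64 {
	out := map[string]int64{
		prefix + "_count":             h.TotalCount(),
		prefix + "_microseconds_min":  h.Min(),
		prefix + "_microseconds_max":  h.Max(),
		prefix + "_microseconds_mean": int64(h.Mean()),
	}
	for _, q := range []float64{50, 90, 99, 99.99} {
		out[fmt.Sprintf("%s_microseconds_%g%%ile", prefix, q)] = h.ValueAtQuantile(q)
	}
	return out
}

func main() {
	h := hdrhistogram.New(1, 10000000, 3) // 1µs .. 10s
	for _, v := range []int64{116, 184, 560, 1880, 3185} {
		h.RecordValue(v)
	}
	for k, v := range latencyValues("client_api_1", h) {
		fmt.Println(k, v)
	}
}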

func newMetricsHistogramRegistry() *hdrhistogram.HDRHistogramRegistry {
    quantiles := []float64{50, 90, 99, 99.99}
    var minValue int64 = 1        // latencies are recorded in microseconds, so the minimum resolution is 1µs
    var maxValue int64 = 10000000 // latencies are recorded in microseconds, so the maximum trackable value is 10s
banks (Member) commented:

I suggest a larger max value. Anything over 10 seconds is bad, but if you have a heavily loaded server for a few minutes and all you can see is that 50% of requests took more than 10s, it's hard to know whether that's a bit bad (they all took 11 seconds at the peak) or awful (they all hung and took 5 minutes before the client timed out). HDR histograms are pretty space efficient, so going up to 10 minutes should be negligible. If space is a big deal, you can reduce the low-end resolution instead - nothing is likely to be faster than a few hundred microseconds, so you could start at 10 or 100.

FZambia (Member, Author) replied:

OK, so something like 60s as the upper limit? And at the moment the mean POST request time is about 70µs on my old MacBook, so 1µs looks like a good lower limit.

FZambia (Member, Author):

I don't think that looking at things that took more than 60s makes sense - the moment we have something over 1s, we are already in trouble.

banks (Member):

60 would be OK but I'd still prefer more - it's relatively cheap, and the cases where latency gets high are probably exactly the times when you need to know the details.

For example, ElastiCache Redis can take up to a minute to fail over. I'd like to know whether requests took that long because they were waiting for the failover and then completed, or whether they all hung because they queued up and overloaded the new Redis instance when it came up.

It's quite contrived, and beyond a minute you would probably get clients timing out too. I'm just saying that in general I tend to keep as much data as possible about the extremes, and I think it's not that expensive to do with an HDR histogram... If 60s is significantly cheaper than 10 minutes, fair enough, but my vote is for more where possible - who knows what you might catch, and what insight you might miss when you throw data away...

banks (Member):

Actually, I guess we have a lower timeout on the Redis connection and return an error to the client rather than waiting... so maybe this is moot. The principle stands, but 60s is probably good enough for this case.

FZambia (Member, Author):

Just measured, with 15 histograms:
10s max -> 1.85 MB
60s max -> 2.08 MB
10min max -> 2.58 MB

So if you agree on 60s, I'd prefer to make it 60s.
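For comparison, the per-histogram footprint at different upper bounds can be checked directly. A rough sketch assuming the codahale/hdrhistogram package and its ByteSize method (the registry in this PR wraps its own histogram type, so these numbers will not match the measurements above exactly):

package main

import (
	"fmt"

	"github.com/codahale/hdrhistogram"
)

func main() {
	// Compare the footprint of a single histogram for different maximum
	// trackable values, keeping 1µs resolution and 3 significant figures.
	for _, max := range []int64{
		10 * 1000 * 1000,      // 10s in microseconds
		60 * 1000 * 1000,      // 60s
		10 * 60 * 1000 * 1000, // 10min
	} {
		h := hdrhistogram.New(1, max, 3)
		fmt.Printf("max %d µs: %d bytes each, %d bytes for 15 histograms\n",
			max, h.ByteSize(), 15*h.ByteSize())
	}
}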

@banks (Member) commented Jul 28, 2016

LGTM :) great job.

@FZambia (Member, Author) commented Jul 28, 2016

Updated the stats in the first comment here to reflect the final metric names (added the microseconds unit to the names) so as not to confuse anyone who reads this PR later.

@FZambia merged commit 3621209 into master on Jul 28, 2016
@FZambia deleted the hdr_histogram branch on August 2, 2016