latencies for http api responses and client api responses #97

Merged - 8 commits merged into master from hdr_histogram on Jul 28, 2016

Conversation

@FZambia (Member) commented Jul 24, 2016

Here is the first (without tests yet) implementation of latencies for HTTP API response times and client API response times. I plan to finish this soon - here I'm just showing the first concept.

There are some caveats - for example, we include latencies for different types of methods (publish, broadcast, presence, etc.) in a single histogram. It's rather hard to separate them all into different histograms. Maybe if we had different API handlers for each method it would be more logical. But anyway, the resulting numbers can be useful for monitoring.

Also, I still need to think about proper synchronization.
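A minimal sketch of this idea, assuming the codahale/hdrhistogram Go package rather than the HDRHistogramRegistry wrapper used in this PR (the wrapper is not reproduced here, and helper names like observe and handleAPIRequest are illustrative): each API call is timed, and the elapsed microseconds are recorded into one shared histogram guarded by a mutex.

package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/codahale/hdrhistogram"
)

// One shared histogram for all API methods (publish, broadcast,
// presence, ...), tracking 1µs .. 10s with 3 significant figures.
var (
	mu           sync.Mutex
	apiHistogram = hdrhistogram.New(1, 10000000, 3)
)

// observe records the latency of a single API call in microseconds.
func observe(elapsed time.Duration) {
	mu.Lock()
	apiHistogram.RecordValue(elapsed.Nanoseconds() / 1000)
	mu.Unlock()
}

// handleAPIRequest stands in for an HTTP API or client API handler.
func handleAPIRequest() {
	started := time.Now()
	time.Sleep(200 * time.Microsecond) // placeholder for real work
	observe(time.Since(started))
}

func main() {
	for i := 0; i < 50; i++ {
		handleAPIRequest()
	}
	mu.Lock()
	defer mu.Unlock()
	fmt.Println("count:", apiHistogram.TotalCount())
	fmt.Println("p50:", apiHistogram.ValueAtQuantile(50), "µs")
	fmt.Println("p99:", apiHistogram.ValueAtQuantile(99), "µs")
}

A plain mutex around RecordValue is the simplest way to make the shared histogram safe for concurrent handlers, which is the synchronization question mentioned above.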

Example of stats:

{
    "uid": "2",
    "method": "stats",
    "error": null,
    "body": {
        "data": {
            "nodes": [
                {
                    "uid": "918600b0-69e8-4380-8d86-644e4eb9068b",
                    "name": "MacAir.local_8000",
                    "num_goroutine": 37,
                    "num_clients": 1,
                    "num_unique_clients": 1,
                    "num_channels": 4,
                    "started_at": 1469740187,
                    "gomaxprocs": 4,
                    "num_cpu": 4,
                    "num_msg_published": 0,
                    "num_msg_queued": 0,
                    "num_msg_sent": 0,
                    "num_api_requests": 0,
                    "num_client_requests": 0,
                    "bytes_client_in": 0,
                    "bytes_client_out": 0,
                    "time_api_mean": 0,
                    "time_client_mean": 0,
                    "time_api_max": 0,
                    "time_client_max": 0,
                    "latencies": {
                        "client_api_15_count": 22,
                        "client_api_15_microseconds_50%ile": 184,
                        "client_api_15_microseconds_90%ile": 1880,
                        "client_api_15_microseconds_99%ile": 3185,
                        "client_api_15_microseconds_99.99%ile": 3185,
                        "client_api_15_microseconds_max": 3185,
                        "client_api_15_microseconds_mean": 560,
                        "client_api_15_microseconds_min": 116,
                        "client_api_1_count": 22,
                        "client_api_1_microseconds_50%ile": 184,
                        "client_api_1_microseconds_90%ile": 1880,
                        "client_api_1_microseconds_99%ile": 3185,
                        "client_api_1_microseconds_99.99%ile": 3185,
                        "client_api_1_microseconds_max": 3185,
                        "client_api_1_microseconds_mean": 560,
                        "client_api_1_microseconds_min": 116,
                        "http_api_15_count": 484579,
                        "http_api_15_microseconds_50%ile": 52,
                        "http_api_15_microseconds_90%ile": 125,
                        "http_api_15_microseconds_99%ile": 1221,
                        "http_api_15_microseconds_99.99%ile": 51295,
                        "http_api_15_microseconds_max": 173567,
                        "http_api_15_microseconds_mean": 130,
                        "http_api_15_microseconds_min": 35,
                        "http_api_1_count": 484579,
                        "http_api_1_microseconds_50%ile": 52,
                        "http_api_1_microseconds_90%ile": 125,
                        "http_api_1_microseconds_99%ile": 1218,
                        "http_api_1_microseconds_99.99%ile": 46687,
                        "http_api_1_microseconds_max": 173567,
                        "http_api_1_microseconds_mean": 130,
                        "http_api_1_microseconds_min": 35
                    },
                    "memory_sys": 0,
                    "cpu_usage": 0
                }
            ],
            "metrics_interval": 60
        }
    }
}
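For reference, the flat keys inside the "latencies" map above (count, min, max, mean, and the requested percentiles per name prefix) can be produced by dumping each histogram's summary values. A rough sketch, again assuming the codahale/hdrhistogram package and treating a prefix like "client_api_1" as given:

package main

import (
	"fmt"

	"github.com/codahale/hdrhistogram"
)

// latencyValues flattens one histogram into keys shaped like the
// "latencies" map above, e.g. "client_api_1_microseconds_50%ile".
// The prefix is taken as given here.
func latencyValues(prefix string, h *hdrhistogram.Histogram) map[string]int64 {
	out := map[string]int64{
		prefix + "_count":             h.TotalCount(),
		prefix + "_microseconds_min":  h.Min(),
		prefix + "_microseconds_max":  h.Max(),
		prefix + "_microseconds_mean": int64(h.Mean()),
	}
	for _, q := range []float64{50, 90, 99, 99.99} {
		out[fmt.Sprintf("%s_microseconds_%g%%ile", prefix, q)] = h.ValueAtQuantile(q)
	}
	return out
}

func main() {
	h := hdrhistogram.New(1, 10000000, 3) // 1µs .. 10s
	for _, v := range []int64{116, 184, 560, 1880, 3185} {
		h.RecordValue(v)
	}
	for k, v := range latencyValues("client_api_1", h) {
		fmt.Println(k, v)
	}
}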

func newMetricsHistogramRegistry() *hdrhistogram.HDRHistogramRegistry {
    quantiles := []float64{50, 90, 99, 99.99}
    var minValue int64 = 1        // latencies are recorded in microseconds, so the minimum resolution is 1µs
    var maxValue int64 = 10000000 // latencies are recorded in microseconds, so the maximum trackable value is 10s
banks (Member) commented:

I suggest a larger max value. Anything over 10 seconds is bad, but if you have a heavily loaded server for a few minutes and all you can see is that 50% of requests took more than 10s, it's hard to know whether that's a bit bad (they all took 11 seconds at the peak) or awful (they all hung and took 5 minutes before the client timed out). HDR histograms are pretty space efficient, so going up to 10 minutes should be negligible. If space is a big deal, you can reduce the low-end resolution instead - nothing is likely to be faster than a few hundred microseconds, so you could start at 10 or 100.

FZambia (Member, Author) replied:

OK, so something like 60s as the upper limit? And at the moment the mean POST request time is about 70µs on my old MacBook, so 1µs looks like a good lower limit.

FZambia (Member, Author):

I don't think that looking at things that took more than 60s makes sense - the moment we have something over 1s, we are already in trouble.

banks (Member):

60 would be OK but I'd still prefer more - it's relatively cheap, and the cases where latency gets high are probably exactly the times when you need to know the details.

For example, ElastiCache Redis can take up to a minute to fail over. I'd like to know whether requests took that long because they were waiting for the failover and then completed, or whether they all hung because they queued up and overloaded the new Redis instance when it came up.

It's quite contrived, and beyond a minute you would probably get clients timing out too. I'm just saying that in general I tend to keep as much data as possible about the extremes, and I think it's not that expensive to do with an HDR histogram... If 60s is significantly cheaper than 10 minutes, fair enough, but my vote is for more where possible - who knows what you might catch, and what insight you might miss when you throw data away...

banks (Member):

Actually, I guess we have a lower timeout on the Redis connection and return an error to the client rather than waiting... so maybe this is moot. The principle stands, but 60s is probably good enough for this case.

FZambia (Member, Author):

Just measured, with 15 histograms:
10s max -> 1.85 MB
60s max -> 2.08 MB
10min max -> 2.58 MB

So if you agree on 60s, I'd prefer to make it 60s.
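For comparison, the per-histogram footprint at different upper bounds can be checked directly. A rough sketch assuming the codahale/hdrhistogram package and its ByteSize method (the registry in this PR wraps its own histogram type, so these numbers will not match the measurements above exactly):

package main

import (
	"fmt"

	"github.com/codahale/hdrhistogram"
)

func main() {
	// Compare the footprint of a single histogram for different maximum
	// trackable values, keeping 1µs resolution and 3 significant figures.
	for _, max := range []int64{
		10 * 1000 * 1000,      // 10s in microseconds
		60 * 1000 * 1000,      // 60s
		10 * 60 * 1000 * 1000, // 10min
	} {
		h := hdrhistogram.New(1, max, 3)
		fmt.Printf("max %d µs: %d bytes each, %d bytes for 15 histograms\n",
			max, h.ByteSize(), 15*h.ByteSize())
	}
}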

@banks (Member) commented Jul 28, 2016

LGTM :) great job.

@FZambia (Member, Author) commented Jul 28, 2016

Updated the stats in the first comment here to reflect the final metric names (added the microseconds unit to the names) so as not to confuse anyone who reads this PR later.

@FZambia merged commit 3621209 into master on Jul 28, 2016
@FZambia deleted the hdr_histogram branch on August 2, 2016