-
-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
latencies for http api responses and client api responses #97
Conversation
func newMetricsHistogramRegistry() *hdrhistogram.HDRHistogramRegistry { | ||
quantiles := []float64{50, 90, 99, 99.99} | ||
var minValue int64 = 1 // as we record latencies in microseconds, min resolution 1mks | ||
var maxValue int64 = 10000000 // as we record latencies in microseconds, max resolution 10s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest larger max value - a thing over 10 seconds is bad but if you have heavily loaded server for a few mins and all you can see is that 50% of requests took more than 10s it's hard to know if that's a bit bad (they all took 11 seconds for the peak) or awful (they all hung and took 5 mins before client timed out). HDR histograms are pretty space efficient so going up to 10 mins should be negligible. if space is a big deal then you can reduce low resolution - nothing is likely t be faster than a few hundred microseconds so you could start at 10 or 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so something like 60sec as upper limit? And at moment mean POST request time is about 70mks on my old macbook, so 1mks looks as good lower limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that seeing at things that took more than 60sec makes sense - because at moment we have sth > 1sec - we are already in trouble
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
60 would be ok but I'd still prefer more - it's relatively cheap and the cases where it gets high are probably exactly the times you need to know details.
For example Elasticache redis can take up to a minute to failover. I'd like to know if the requests took that long because they were waiting for failover and then completed or if they all hung because they queued up and overloaded new redis when it came up for example.
It's quite contrived and more than a minute probably you would get clients timing out too, just saying in general is tend to keep as much data as possible about extremes and I think it's not that expensive to do it with hdr histogram... If 60s is significantly cheaper than 10mins fair enough but my vote is for more where possible - who know what you might cat h and what insight you might have never had when you throw data away...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I guess we have lower timeout on redis connect and return error to client rather than waiting... so maybe this is moot. The principle stands but 60s probably is good enough for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just measured, 15 histograms:
10sec -> 1.85mb
60sec -> 2.08mb
10min -> 2.58mb
So if you agree on 60sec I'd prefer to make 60sec
LGTM :) great job. |
Updated stats in first comment here to reflect final metric names (added microseconds quantity to names) and not confuse someone who will read this pr later. |
Here is first (without tests yet) implementation of latencies for HTTP API response times and client API response times. I am planning to finish this soon - here just showing first concept.
There are some caveats - for example we include latencies for different types of methods (publish, broadcast, presence etc) into one histogram. It's rather hard to separate these all into different histograms. Maybe if we had different API handlers for all methods it would be more logical. But anyway those resulting numbers can be useful for monitoring.
Also I should still think about proper synchronization.
Example of stats: