Thread saturation metrics 📈 #489
Conversation
Nice! This is looking good.
Mostly I have questions about the saturationMetric implementation. Originally you had a goroutine to report the times. Did you notice any difference in bias toward idle vs work time by removing it and reporting the time synchronously?
I'm wondering if, by recording both idle time and work time, we reduce the likelihood that the metric value ever reaches 1.0. I'm wondering if we could track only idle time, and assume anything that isn't idle time is time spent doing work. I started to experiment with that approach in this branch, but it's definitely not working yet (the tests fail).
Do you think an approach that only tracks idle time could work? My hope is that it better represents the full saturation and that the metric will go to 1.0 more easily.
My second question about the implementation is whether we are missing some time. We have 256 samples for a 1 second reporting period, which I think means that if the mean sample time is shorter than ~4ms we will clobber some of the earlier samples. If we start the interval with lots of work and then a bunch of fast work comes in afterward, I think we would under-report the saturation.
In the branch I linked I continued to use a fixed-size array, but use a different value for the index.
Do you think an approach of using `now.UnixNano() / int64(10*time.Millisecond) % math.MaxUint8` to find the right sample bucket, and adding the idle time to that bucket, could work? I think that by dividing by 10ms we can track up to 2.56s worth of times. If we're idle for longer than that, we can still report a low saturation on the next cycle because the values won't have been erased yet.
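Roughly something like this, just as a sketch (the package and identifier names here are made up, not taken from either branch):

```go
package saturation

import (
	"math"
	"time"
)

// 255 buckets of 10ms each, covering roughly the last ~2.5s of samples.
var idleBuckets [math.MaxUint8]time.Duration

// recordIdle adds an idle duration to whichever 10ms bucket "now" falls into.
// A separate reporting step (not shown) would sum the buckets into an
// idle/total ratio and clear the ones it has consumed.
func recordIdle(now time.Time, idle time.Duration) {
	idx := now.UnixNano() / int64(10*time.Millisecond) % math.MaxUint8
	idleBuckets[idx] += idle
}
```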
Those are great points @dnephin, thanks! 🙌🏻 I think the approach in your branch makes a lot of sense. It also made me realise that if we kept a simple running accumulator of sleep time (rather than a slice of samples) we could subtract it from the time elapsed since the last report. Here's a rough implementation: 25ba633, let me know what you think!
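The gist of it, as a rough sketch rather than the actual commit (field and method names are invented here):

```go
package saturation

import "time"

// sleepAccumulator sketches the idea: accumulate sleep time and treat
// everything else in the interval as work.
type sleepAccumulator struct {
	lastReport time.Time
	sleepTime  time.Duration
}

// sleeping records time spent waiting for work.
func (s *sleepAccumulator) sleeping(d time.Duration) { s.sleepTime += d }

// report returns the fraction of time since the last report spent working,
// then resets the accumulator for the next interval.
func (s *sleepAccumulator) report(now time.Time) float64 {
	elapsed := now.Sub(s.lastReport)
	sat := 0.0
	if elapsed > 0 {
		sat = 1 - float64(s.sleepTime)/float64(elapsed)
		if sat < 0 {
			sat = 0
		}
	}
	s.lastReport, s.sleepTime = now, 0
	return sat
}
```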
Ahh, interesting! I think this approach is a good simplification of what we had, but I wonder if it points out a potential improvement we could make by using a slice of samples. If we only record elapsed time (no buckets/slice), then every time we report the metric we will always be reporting just the change since the last report. If we continued to use buckets, we could potentially take the saturation for an arbitrary period that includes time before the last report. That might help stabilize the value a little and make it less spiky. I guess with 10ms buckets we don't do much normalization, but if we were to change the buckets to 100ms, then we could easily keep 10s+ of samples. Each time we report the metric, the samples from previous buckets would still be included in the idle/total ratio, which I think would help smooth out the values a bit.
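Something along these lines, just as a sketch (the helper name and arguments are made up for illustration):

```go
package saturation

import "time"

// saturationOverWindow computes idle/total across every retained bucket, not
// just the change since the last report, so each report reuses earlier
// samples and the value is smoother. It assumes the buckets cover the whole
// window; buckets with no recorded idle time read as fully busy here.
func saturationOverWindow(idleBuckets []time.Duration, bucketSize time.Duration) float64 {
	var idle time.Duration
	for _, d := range idleBuckets {
		idle += d
	}
	window := bucketSize * time.Duration(len(idleBuckets))
	if window <= 0 {
		return 0
	}
	sat := 1 - float64(idle)/float64(window)
	if sat < 0 {
		sat = 0
	}
	return sat
}
```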
Nice, yeah I think that would help a lot! Do you think we'd need to weight/bias towards more recent values to prevent it from hiding genuine spikes? I guess if the bucket size is smaller it wouldn't matter as much?
Good question! Ya, it totally depends on the bucket size. As long as the bucket size is small enough, we should have a good balance without having to apply weights, I think.
Ok, cool! I've just pushed a version based on the accumulator approach from my other branch, but keeping 5 previous measurements to smooth out the spikes. When I tried the time-based array indexing, I found that it was possible to end up with an odd mix of old and new measurements, because we wouldn't overwrite the previous measurements uniformly (i.e. we could have a measurement from a previous cycle sitting right next to much newer ones). Sort of hard to explain in words, so here's a rough ASCII diagram 😅
This is really neat! LGTM with minor nits about comments
// Note: it is expected that the caller is single-threaded and is not safe for
// concurrent use by multiple goroutines.
type saturationMetric struct {
	reportInterval time.Duration
nit: could this be simply `interval`?
Sure! I don't feel strongly either way 😄
My GitHub is acting up and left duplicate comments everywhere 😕 Sorry if it confused you as well
No worries, thanks for the review @kisunji 🙌🏻 I really appreciate the clarifications you made to the doc comments.
On top of my previous comment: I think we should push the sampling/bucket logic into go-metrics, which will already do that for us. All we need to do is calculate the saturation for some interval and report that as a sample to go-metrics, which will perform all the necessary aggregations and end up showing a smoothed view if you look at the average instead of the max metric.
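For example, roughly (the metric name and the `saturation` helper here are placeholders, not the PR's actual code):

```go
package saturation

import (
	"time"

	"github.com/armon/go-metrics"
)

// reportEvery computes the saturation for each interval and hands it to
// go-metrics as a sample; the sink then aggregates it, so a smoothed view
// comes for free by looking at the average rather than the max.
func reportEvery(interval time.Duration, saturation func(time.Duration) float64) {
	for range time.Tick(interval) {
		metrics.AddSample(
			[]string{"raft", "thread", "main", "saturation"},
			float32(saturation(interval)),
		)
	}
}
```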
Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits. We keep 256 samples in memory for each metric, and update gauges (at most) once a second, possibly less often if the goroutines are idle. This should be ok because it's unlikely that a goroutine would go from very high saturation to being completely idle (so at worst we'll leave the gauge on the previous low value).
- Much simpler implementation based on an accumulator of sleep time. We no longer drop samples when the buffer is full.
- We now keep 5 previous measurements to smooth out spikes (see the sketch below).
- Rename metrics to `raft.thread.fsm.saturation` and `raft.thread.main.saturation`.
- Remove FSM saturation metric from the `Raft` struct.
a09e93e to 1f8b1ad
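A rough sketch of the "5 previous measurements" smoothing from the list above (the exact bookkeeping in the PR may differ):

```go
package saturation

// smoother keeps the last few per-interval saturation values and reports
// their average, so a single spiky interval doesn't dominate the gauge.
type smoother struct {
	recent [5]float64 // most recent per-interval saturation values
	filled int        // how many slots hold real measurements so far
	next   int        // slot to overwrite next
}

// add records a new measurement and returns the smoothed value to report.
func (s *smoother) add(v float64) float64 {
	s.recent[s.next] = v
	s.next = (s.next + 1) % len(s.recent)
	if s.filled < len(s.recent) {
		s.filled++
	}
	sum := 0.0
	for i := 0; i < s.filled; i++ {
		sum += s.recent[i]
	}
	return sum / float64(s.filled)
}
```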
Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits.
We update gauges (at most) once a second, possibly less often if the goroutines are idle. This should be ok because it's unlikely that a goroutine would go from very high saturation to being completely idle, so at worst we'll leave the gauge on the previous (low) value for a while.