Thread saturation metrics 📈 #489
Conversation
Nice! This is looking good.
Mostly I have questions about the saturationMetric implementation. Originally you had a goroutine to report the times. Did you notice any difference in bias toward idle vs work time by removing it and reporting the time synchronously?
I'm wondering if, by recording both idle time and work time, we reduce the likelihood that the metric value ever reaches 1.0. I'm wondering if we could track only idle time, and assume anything that isn't idle time is time spent doing work. I started to experiment with that approach in this branch, but it's definitely not working yet (the tests fail).
Do you think an approach that only tracks idle time could work? My hope is that it better represents the full saturation and that the metric will go to 1.0 more easily.
My second question about the implementation is whether we are missing some time. We have 256 samples for a 1 second reporting period, which I think means that if the mean sample time is shorter than ~4ms we will clobber some of the earlier samples. If we start the interval with lots of work and then a bunch of fast work comes in afterward, I think we would under-report the saturation.
In the branch I linked I continued to use a fixed-size array, but use a different value for the index.
Do you think an approach of using `now.UnixNano() / int64(10*time.Millisecond) % math.MaxUint8` to find the right sample bucket, and adding the idle time to that bucket, could work? I think that by dividing by 10ms we can track up to 2.56s worth of times. If we're idle for longer than that, we can still report a low saturation on the next cycle because the values won't have been erased yet.
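Roughly something like this, just as a sketch (the package and identifier names here are made up, not taken from either branch):

```go
package saturation

import (
	"math"
	"time"
)

// 255 buckets of 10ms each, covering roughly the last ~2.5s of samples.
var idleBuckets [math.MaxUint8]time.Duration

// recordIdle adds an idle duration to whichever 10ms bucket "now" falls into.
// A separate reporting step (not shown) would sum the buckets into an
// idle/total ratio and clear the ones it has consumed.
func recordIdle(now time.Time, idle time.Duration) {
	idx := now.UnixNano() / int64(10*time.Millisecond) % math.MaxUint8
	idleBuckets[idx] += idle
}
```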
Those are great points @dnephin, thanks! 🙌🏻 I think the approach in your branch makes a lot of sense. It also made me realise that if we kept a simple running accumulator of sleep time (rather than a slice of samples) we could subtract it from the time elapsed since the last report. Here's a rough implementation: 25ba633, let me know what you think!
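The gist of it, as a rough sketch rather than the actual commit (field and method names are invented here):

```go
package saturation

import "time"

// sleepAccumulator sketches the idea: accumulate sleep time and treat
// everything else in the interval as work.
type sleepAccumulator struct {
	lastReport time.Time
	sleepTime  time.Duration
}

// sleeping records time spent waiting for work.
func (s *sleepAccumulator) sleeping(d time.Duration) { s.sleepTime += d }

// report returns the fraction of time since the last report spent working,
// then resets the accumulator for the next interval.
func (s *sleepAccumulator) report(now time.Time) float64 {
	elapsed := now.Sub(s.lastReport)
	sat := 0.0
	if elapsed > 0 {
		sat = 1 - float64(s.sleepTime)/float64(elapsed)
		if sat < 0 {
			sat = 0
		}
	}
	s.lastReport, s.sleepTime = now, 0
	return sat
}
```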
Ahh, interesting! I think this approach is a good simplification of what we had, but I wonder if it points out a potential improvement we could make by using a slice of samples. If we only record elapsed time (no buckets/slice), then every time we report the metric we will always be reporting just the change since the last report. If we continued to use buckets, we could potentially take the saturation for an arbitrary period that includes time before the last report. That might help stabilize the value a little and make it less spiky. I guess with 10ms buckets we don't do much normalization, but if we were to change the buckets to 100ms, then we could easily keep 10s+ of samples. Each time we report the metric, the samples from previous buckets would still be included in the idle/total ratio, which I think would help smooth out the values a bit.
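Something along these lines, just as a sketch (the helper name and arguments are made up for illustration):

```go
package saturation

import "time"

// saturationOverWindow computes idle/total across every retained bucket, not
// just the change since the last report, so each report reuses earlier
// samples and the value is smoother. It assumes the buckets cover the whole
// window; buckets with no recorded idle time read as fully busy here.
func saturationOverWindow(idleBuckets []time.Duration, bucketSize time.Duration) float64 {
	var idle time.Duration
	for _, d := range idleBuckets {
		idle += d
	}
	window := bucketSize * time.Duration(len(idleBuckets))
	if window <= 0 {
		return 0
	}
	sat := 1 - float64(idle)/float64(window)
	if sat < 0 {
		sat = 0
	}
	return sat
}
```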
Nice, yeah I think that would help a lot! Do you think we'd need to weight/bias towards more recent values to prevent it from hiding genuine spikes? I guess if the bucket size is smaller it wouldn't matter as much?
Good question! Ya, it totally depends on the bucket size. As long as the bucket size is small enough, we should have a good balance without having to apply weights, I think.
Ok, cool! I've just pushed a version based on the accumulator approach from my other branch, but keeping 5 previous measurements to smooth out the spikes. When I tried the time-based array indexing, I found that it was possible to end up with an odd mix of old and new measurements, because we wouldn't overwrite the previous measurements uniformly (i.e. we could have a measurement from a previous cycle sitting right next to much newer ones). Sort of hard to explain in words, so here's a rough ASCII diagram 😅
This is really neat! LGTM with minor nits about comments
// Note: it is expected that the caller is single-threaded and is not safe for
// concurrent use by multiple goroutines.
type saturationMetric struct {
	reportInterval time.Duration
nit: could this be simply `interval`?
Sure! I don't feel strongly either way 😄
My GitHub is acting up and left duplicate comments everywhere 😕 Sorry if it confused you as well
No worries, thanks for the review @kisunji 🙌🏻 I really appreciate the clarifications you made to the doc comments.
On top of my previous comment: I think we should push the sampling/bucket logic into go-metrics, which will already do that for us. All we need to do is calculate the saturation for some interval and report that as a sample to go-metrics, which will perform all the necessary aggregations and end up showing a smoothed view if you look at the average instead of the max metric.
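For example, roughly (the metric name and the `saturation` helper here are placeholders, not the PR's actual code):

```go
package saturation

import (
	"time"

	"github.com/armon/go-metrics"
)

// reportEvery computes the saturation for each interval and hands it to
// go-metrics as a sample; the sink then aggregates it, so a smoothed view
// comes for free by looking at the average rather than the max.
func reportEvery(interval time.Duration, saturation func(time.Duration) float64) {
	for range time.Tick(interval) {
		metrics.AddSample(
			[]string{"raft", "thread", "main", "saturation"},
			float32(saturation(interval)),
		)
	}
}
```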
Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits. We keep 256 samples in memory for each metric, and update gauges (at most) once a second, possibly less often if the goroutines are idle. This should be ok because it's unlikely that a goroutine would go from very high saturation to being completely idle (so at worst we'll leave the gauge on the previous low value).
- Much simpler implementation based on an accumulator of sleep time. We no longer drop samples when the buffer is full.
- We now keep 5 previous measurements to smooth out spikes (see the sketch below).
- Rename metrics to `raft.thread.fsm.saturation` and `raft.thread.main.saturation`.
- Remove FSM saturation metric from the `Raft` struct.
a09e93e to 1f8b1ad
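A rough sketch of the "5 previous measurements" smoothing from the list above (the exact bookkeeping in the PR may differ):

```go
package saturation

// smoother keeps the last few per-interval saturation values and reports
// their average, so a single spiky interval doesn't dominate the gauge.
type smoother struct {
	recent [5]float64 // most recent per-interval saturation values
	filled int        // how many slots hold real measurements so far
	next   int        // slot to overwrite next
}

// add records a new measurement and returns the smoothed value to report.
func (s *smoother) add(v float64) float64 {
	s.recent[s.next] = v
	s.next = (s.next + 1) % len(s.recent)
	if s.filled < len(s.recent) {
		s.filled++
	}
	sum := 0.0
	for i := 0; i < s.filled; i++ {
		sum += s.recent[i]
	}
	return sum / float64(s.filled)
}
```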
Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits.
We update gauges (at most) once a second, possibly less often if the goroutines are idle. This should be ok because it's unlikely that a goroutine would go from very high saturation to being completely idle, so at worst we'll leave the gauge on the previous (low) value for a while.