Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread saturation metrics 📈 #489

Merged
merged 5 commits into from
Apr 27, 2022
Merged

Thread saturation metrics 📈 #489

merged 5 commits into from
Apr 27, 2022

Conversation

boxofrad
Copy link
Contributor

@boxofrad boxofrad commented Feb 2, 2022

Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits.

We update gauges (at most) once a second, possibly less if the goroutines are idle. This should be ok because it's unlikely that a goroutine would go from very high saturation to being completely idle; so at worst we'll leave the gauge on the previous (low) value for a while.

@boxofrad boxofrad requested a review from dnephin February 2, 2022 17:27
saturation.go Outdated Show resolved Hide resolved
api.go Outdated Show resolved Hide resolved
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is looking good.

Mostly I have questions about the saturationMetric implementation. Originally you had a goroutine to report the times. Did you notice any difference in bias toward idle vs work time by removing it and reporting the time synchronously?

I'm wondering if by recording both idle time and work time we reduce the likelihood that the metric values goes the 1.0. I'm wondering if we could track only idle time, and assume anything not in idle time is time spent doing work. I started to experiment with that approach in this branch, but it's definitely not working yet (the tests fail).

Do you think an approach that only tracks idle time could work? My hope is that better represents the full saturation and that the metric will go to 1.0 more easily.

My second question about the implementation is if we are missing some time. We have 256 samples for a 1 second reporting period. I think that means if the mean time is shorter than 4s we will clobber some of the earlier samples. If we start the interval with lots of work, then a bunch of fast work comes in afterward. I think we would under-report the saturation.

In the branch I linked I continued to use a fixed size array, but use a different value for the index.
Do you think an approach of using now.UnixNano() / int64(10*time.Millisecond) % math.MaxUint8 to find the right sample bucket, and adding the idle time to that bucket could work? I think that by dividing by 10ms we can track up to 2.56s work of times. If we're idle for longer than that then we can still report a low saturation time on the next cycle because values won't have been erased yet.

api.go Outdated Show resolved Hide resolved
api.go Outdated Show resolved Hide resolved
saturation.go Outdated Show resolved Hide resolved
@boxofrad
Copy link
Contributor Author

boxofrad commented Feb 3, 2022

Those are great points @dnephin, thanks! 🙌🏻

I think the approach in your branch makes a lot of sense. It also made me realise that if we kept a simple running accumulator of sleep time (rather than a slice of samples) we could subtract it from the time elapsed since lastReport, and presumably wouldn't need to lose any time at all?

Here's a rough implementation: 25ba633, let me know what you think!

@dnephin
Copy link
Contributor

dnephin commented Feb 3, 2022

Ahh, interesting! I think this approach is a good simplification of what we had, but I wonder if it points out a potential improvement we could make by using a slice of samples.

If we only record elapsed time (no buckets/slice), then every time we report the metric we will always be reporting just the change since last report.

If we continued to use buckets, we could potentially take the saturation for an arbitrary period that includes time before the last report. That might help stabalize the value a little bit and make it less spiky. I guess with 10ms buckets we don't do much normalization, but if we were to change the buckets to 100ms, then we could easily keep 10s+ of samples. Each time we report the metric the samples from previous buckets would still be included in the idle/total ratio, which I think would help smooth out the values a bit.

@boxofrad
Copy link
Contributor Author

boxofrad commented Feb 7, 2022

Nice, yeah I think that would help a lot! Do you think we'd need to weight/bias towards more recent values to prevent it from hiding genuine spikes? I guess if the bucket-size is smaller it wouldn't matter as much?

@dnephin
Copy link
Contributor

dnephin commented Feb 7, 2022

Good question! Ya, it totally depends on bucket-size. As long we the bucket size is small enough we should have a good balance without having to apply weights I think.

@boxofrad
Copy link
Contributor Author

Ok, cool! I've just pushed a version based on the accumulator approach from my other branch, but keeping 5 previous measurements to smooth out the spikes.

When I tried the time-based array indexing, I found that it was possible to end up with an odd mix of old and new measurements because we wouldn't overwrite the previous measurements uniformly (i.e. we could have a measurement from T1 and T3 but be missing T2) so we couldn't calculate the total time elapsed based on T1.

Sort of hard to explain in words, so here's a rough ASCII diagram 😅

=> Start
buckets: [ | | | | ]

=> Record Sample 1:
now / 10ms % len(buckets) = 3
buckets: [ | | |1| ]

=> Record Sample 2:
now / 10ms % len(buckets) = 0
buckets: [2| | |1| ]

=> Record Sample 3:
now / 10ms % len(buckets) = 0
buckets: [3| | |1| ]

=> Record Sample 4:
now / 10ms % len(buckets) = 4
buckets: [3| | |1|4]

@boxofrad boxofrad requested a review from dnephin February 11, 2022 11:34
@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@kisunji kisunji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really neat! LGTM with minor nits about comments

saturation.go Outdated Show resolved Hide resolved
saturation.go Show resolved Hide resolved
// Note: it is expected that the caller is single-threaded and is not safe for
// concurrent use by multiple goroutines.
type saturationMetric struct {
reportInterval time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could this be simply interval?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I don't feel strongly either way 😄

saturation.go Outdated Show resolved Hide resolved
@kisunji
Copy link
Contributor

kisunji commented Mar 18, 2022

My github is acting up and left duplicate comments everywhere 😕 Sorry if it confused you as well

@boxofrad
Copy link
Contributor Author

No worries, thanks for the review @kisunji 🙌🏻 I really appreciate the clarifications you made to the doc comments.

saturation.go Outdated Show resolved Hide resolved
@mkeeler
Copy link
Member

mkeeler commented Apr 26, 2022

On top of my previous comment, I think we should push sampling/bucket logic into go-metrics which will already do that for us. All we need to do is calculate the saturation for some interval and then report that as a sample to go-metrics which will perform all the necessary aggregations and end up showing a smoothed view if you look at the average instead of the max metric.

Adds metrics suggested in #488, to record the percentage of time the
main and FSM goroutines are busy with work vs available to accept new
work, to give operators an idea of how close they are to hitting
capacity limits.

We keep 256 samples in memory for each metric, and update gauges (at
most) once a second, possibly less if the goroutines are idle. This
should be ok because it's unlikely that a goroutine would go from very
high saturation to being completely idle (so at worst we'll leave the
gauge on the previous low value).
- Much simpler implementation based on an accumulator of sleep time.
  We no longer drop samples when the buffer is full.
- We now keep 5 previous measurements to smooth out spikes.
- Rename metrics to `raft.thread.fsm.saturation` and
  `raft.thread.main.saturation`.
- Remove FSM saturation metric from the `Raft` struct.
@boxofrad boxofrad force-pushed the boxofrad/saturation-metrics branch from a09e93e to 1f8b1ad Compare April 27, 2022 13:22
@boxofrad boxofrad requested review from mkeeler and removed request for dnephin April 27, 2022 13:28
@boxofrad boxofrad merged commit 44124c2 into main Apr 27, 2022
@boxofrad boxofrad deleted the boxofrad/saturation-metrics branch April 27, 2022 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants