
admission,goschedstats: add metrics for non work-conserving CPU behavior #96511

Open · sumeerbhola wants to merge 1 commit into master from work_conserving
Conversation

sumeerbhola (Collaborator)

We have encountered scenarios with a large number of goroutines, which often causes an increase in runnable goroutines while the mean CPU utilization stays low (sometimes as low as 25%). Since there are non-zero runnable goroutines, CPU utilization must be 100% at very short time scales of a few milliseconds. Since admission control (AC) samples the runnable goroutine count every 1ms in order to react at such short time scales, we do see some drop in the slot count in some of these scenarios, along with queueing in the AC queues. The concern that arises when seeing such queueing is whether AC is making the situation worse in its attempt to shift some queueing from the goroutine scheduler into the AC queue. Note that since admission.kv_slot_adjuster.overload_threshold is set to 32, AC does allow for significant queueing in the goroutine scheduler too, in an attempt to be work conserving. But it is still possible that the slot adjustment logic is too slow to react and is not allowing enough concurrency to keep the CPUs busy.
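For context, here is a rough sketch of the slot-adjustment reaction described above. It is illustrative only: the function name `adjustSlots`, its signature, and the use of half the threshold for the underload case are assumptions for this sketch, not the actual logic in pkg/util/admission.

```go
// Rough, simplified sketch of how a slot adjuster driven by the 1ms runnable
// sample might react; names and the exact thresholds are illustrative only.
package admission

// adjustSlots decreases the KV slot count when the runnable goroutines per
// proc exceed the overload threshold (32 by default, per
// admission.kv_slot_adjuster.overload_threshold), and increases it when the
// scheduler looks underloaded, so that the CPUs stay busy.
func adjustSlots(totalSlots, runnable, numProcs, overloadThreshold int) int {
	if numProcs <= 0 {
		return totalSlots
	}
	runnablePerProc := runnable / numProcs
	switch {
	case runnablePerProc >= overloadThreshold:
		// Too much queueing in the goroutine scheduler: shed concurrency so
		// some of that queueing moves into the AC queues instead.
		totalSlots--
	case runnablePerProc <= overloadThreshold/2:
		// Scheduler is not overloaded: allow more concurrency (work conservation).
		totalSlots++
	}
	if totalSlots < 1 {
		totalSlots = 1
	}
	return totalSlots
}
```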

This PR adds two metrics to measure this behavior. They are still subject to sampling error, but they are tied to the 1ms sampling of CPULoad. admission.granter.cpu_non_work_conserving_duration.kv is incremented by the sampling duration multiplied by the number of idle Ps whenever there are requests waiting in the AC KV (CPU) queue. Since we have observed idle Ps even when there are runnable goroutines (which is not the fault of AC), there is another metric, admission.granter.cpu_non_work_conserving_due_to_admission_duration.kv, which discounts the number of idle Ps by the number of runnable goroutines.
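A minimal sketch of the per-sample accounting, assuming hypothetical names (`cpuLoadSample`, its fields, and `nonWorkConservingIncrements`) rather than the actual types and wiring in pkg/util/admission/granter.go:

```go
// Sketch of the per-sample accounting described above; field and function
// names here are placeholders, not the real implementation.
package admission

import "time"

// cpuLoadSample is a hypothetical stand-in for what the ~1ms CPULoad
// callback provides.
type cpuLoadSample struct {
	samplePeriod       time.Duration // ~1ms
	idlePs             int           // procs with nothing to run
	runnableGoroutines int           // goroutines waiting for a P
	kvQueueLen         int           // requests waiting in the AC KV (CPU) queue
}

// nonWorkConservingIncrements returns the durations to add to the two metrics
// for this sample. Both are zero when nothing is waiting in the KV queue,
// since AC cannot be blamed for idleness it is not gating.
func nonWorkConservingIncrements(s cpuLoadSample) (total, dueToAdmission time.Duration) {
	if s.kvQueueLen == 0 || s.idlePs == 0 {
		return 0, 0
	}
	// Total non-work-conserving time: sampling duration * number of idle Ps.
	total = s.samplePeriod * time.Duration(s.idlePs)
	// Discount idle Ps by runnable goroutines: those Ps could have been kept
	// busy by the scheduler regardless of AC, so they are not AC's fault.
	if idleDueToAC := s.idlePs - s.runnableGoroutines; idleDueToAC > 0 {
		dueToAdmission = s.samplePeriod * time.Duration(idleDueToAC)
	}
	return total, dueToAdmission
}
```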

These metrics give a sense of how much CPU capacity we are wasting per second. For example, if the first metric has a value of 0.5s/s and we have 10 CPUs, i.e. 10s/s of capacity, we are wasting 5% of the CPU. If the second metric is 0.3s/s, then 3 percentage points of that wasted CPU can be attributed to AC queueing not behaving well; that is, one may expect CPU utilization to increase by 3% if AC is switched off. These metrics don't tell us the real latency impact of turning off AC, but the expectation is that if the reduction in CPU utilization due to AC, divided by the observed CPU utilization (with AC on), is very small, the latency benefit of turning off AC will also be small.
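To make the arithmetic concrete, a tiny worked example using the figures above (10 CPUs, 0.5s/s and 0.3s/s are the example numbers, not real measurements):

```go
package main

import "fmt"

// wastedCPUFraction converts a non-work-conserving rate (seconds of idle-P
// time per second) into a fraction of total CPU capacity.
func wastedCPUFraction(ratePerSec float64, numCPUs int) float64 {
	return ratePerSec / float64(numCPUs)
}

func main() {
	const numCPUs = 10
	total := wastedCPUFraction(0.5, numCPUs)   // 0.05 -> 5% of capacity idle while work waits
	dueToAC := wastedCPUFraction(0.3, numCPUs) // 0.03 -> 3% attributable to AC queueing
	fmt.Printf("wasted: %.0f%%, due to AC: %.0f%%\n", total*100, dueToAC*100)
}
```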

Epic: none

Fixes: #96495

@sumeerbhola requested a review from a team as a code owner on February 3, 2023 17:20
@cockroach-teamcity (Member)

This change is Reviewable

@sumeerbhola force-pushed the work_conserving branch 3 times, most recently from 37a5a72 to 738829c on February 6, 2023 18:11
@sumeerbhola requested a review from a team on February 6, 2023 18:11
@abarganier (Contributor) left a comment


New metrics in general LGTM, just one small nit re: some additional clarification in the HELP text.

(not commenting on specific ways we're recording/using the metrics).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/util/admission/granter.go line 652 at r1 (raw file):

	}
	// NB: Both the following metrics do not look at the SQL queues, since if
	// the KV queue is empty the only throttling on the SQL queues is via grant

nit: If neither of these metrics looks at the SQL queues, I think it'd be nice to include that information in the help text for both metrics.

Code quote:

	// NB: Both the following metrics do not look at the SQL queues, since if
	// the KV queue is empty the only throttling on the SQL queues is via grant
	// chaining, which should only cause delays if there is no CPU available.

@sumeerbhola force-pushed the work_conserving branch 2 times, most recently from 5ab18e7 to 3167e0a on February 8, 2023 19:39
Successfully merging this pull request may close: admission: CPU metrics for high concurrency scenarios (#96495)