
admission: CPU metrics for high concurrency scenarios #96495

Open
Tracked by #82743
sumeerbhola opened this issue Feb 3, 2023 · 0 comments · May be fixed by #96511
Assignees
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments


sumeerbhola commented Feb 3, 2023

We have encountered scenarios with a large number of goroutines, which often cause an increase in runnable goroutines while the mean CPU utilization stays low (sometimes as low as 25%). Since the runnable goroutine count is non-zero, CPU utilization must be 100% at very short time scales of a few milliseconds. Since admission control (AC) samples the runnable goroutine count every 1ms, in order to react at such short time scales, we do see some drop in the slot count in some of these cases, and at the same time queueing in the AC queues. The concern raised by such queueing is whether AC is making the situation worse in its attempt to shift some queueing from the goroutine scheduler into the AC queue. Note that since admission.kv_slot_adjuster.overload_threshold is set to 32, AC does allow for significant queueing in the goroutine scheduler too, in an attempt to be work conserving.
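For context, the way a 1ms runnable-goroutine sample can drive the slot count could be sketched as follows. This is a simplified illustration only, not the actual kvSlotAdjuster code; the function name and the exact grow/shrink policy are assumptions:

```go
package main

import "fmt"

// adjustSlots is a simplified sketch of a slot adjuster driven by the
// 1ms runnable-goroutine sample: shrink the KV slot count when runnable
// goroutines per CPU exceed the overload threshold, grow it when well
// below the threshold. Names and policy are illustrative, not the
// actual CockroachDB implementation.
func adjustSlots(slots, runnable, numCPUs, overloadThreshold int) int {
	perCPU := runnable / numCPUs
	switch {
	case perCPU >= overloadThreshold && slots > 1:
		// Overloaded: shift some queueing from the goroutine
		// scheduler into the AC queue.
		return slots - 1
	case perCPU < overloadThreshold/2:
		// Underloaded: admit more work to stay work conserving.
		return slots + 1
	default:
		// In between: hold steady.
		return slots
	}
}

func main() {
	// With overload_threshold=32 and 8 CPUs, 300 runnable goroutines
	// (37 per CPU) is over the threshold, so slots shrink.
	fmt.Println(adjustSlots(50, 300, 8, 32)) // 49
}
```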

We try to answer two questions:
Q1. Should such scenarios be considered unreasonable and be fixed outside AC? There are 2 cases we have seen:

Q2. Given that these scenarios are sometimes reasonable, can we add metrics to address the earlier concern, i.e., whether AC is making the situation worse?

The slot mechanism imposes a maximum concurrency. If that maximum leaves some CPU idle, because enough of the admitted work is blocked (on contention or IO), while work is queued in AC, then the AC queueing is not work conserving. We can try to sample this at 1ms intervals, the way we sample numRunnableGoroutines.
If AC is indeed work conserving, then AC queueing while the CPU "seems underutilized" does not happen, since the CPU is fully utilized whenever there is queueing in AC.

Jira issue: CRDB-24153

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Feb 3, 2023
@sumeerbhola sumeerbhola self-assigned this Feb 3, 2023
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 7, 2023
We have encountered scenarios with a large number of goroutines, which often
cause an increase in runnable goroutines while the mean CPU utilization
stays low (sometimes as low as 25%). Since there are non-zero runnable
goroutines, CPU utilization must be 100% at very short time scales of a few
milliseconds. Since admission control (AC) samples the runnable goroutine
count every 1ms, in order to react at such short time scales, we do see some
drop in the slot count in some of these scenarios, and see queueing in the
AC queues. The concern that comes up when seeing such queueing is whether AC
is making the situation worse in its attempt to shift some queueing from the
goroutine scheduler into the AC queue. Note that since
admission.kv_slot_adjuster.overload_threshold is set to 32, AC does allow for
significant queueing in the goroutine scheduler too, in an attempt to be work
conserving. But it is still possible that the slot adjustment logic is being
too slow to react, and not allowing enough concurrency to keep the CPUs busy.

This PR adds two metrics to measure this behavior. These are still subject
to sampling errors, but they are tied to the 1ms sampling of CPULoad.
The admission.granter.cpu_non_work_conserving_duration.kv metric is
incremented by the sampling duration * the number of idle Ps if there are
requests waiting in the AC KV (CPU) queue. Since we have observed idle Ps
even when there are runnable goroutines (which is not the fault of AC),
there is another metric,
admission.granter.cpu_non_work_conserving_due_to_admission_duration.kv,
which discounts the number of idle Ps by the number of runnable goroutines.

These metrics give a sense of how much CPU capacity we are wasting per
second. For example, if the first metric has a value of 0.5s/s and we have
10 CPUs (so 10s/s of capacity), we are wasting 5% of the CPU capacity. If
the second metric is 0.3s/s, then 3% of that CPU wastage can be attributed
to AC queueing not behaving well. That is, one may expect CPU utilization to
increase by 3% if AC is switched off. These metrics don't tell us the real
latency impact of turning off AC, but the expectation is that if the
reduction in CPU utilization due to AC, divided by the observed CPU
utilization (with AC on), is very small, the latency benefit of turning off
AC will be small.

Epic: none
Fixes: cockroachdb#96495
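The wastage arithmetic in the commit message above can be made concrete with a small helper. wastedCPUFraction is an illustrative name, not part of the patch:

```go
package main

import "fmt"

// wastedCPUFraction converts a metric rate (seconds of idle-P time per
// second of wall time) into a fraction of total CPU capacity, given the
// number of CPUs. Illustrative helper, not CockroachDB code.
func wastedCPUFraction(metricRateSecPerSec float64, numCPUs int) float64 {
	return metricRateSecPerSec / float64(numCPUs)
}

func main() {
	// 0.5 s/s of non-work-conserving time on a 10 CPU node => 5% wasted.
	fmt.Println(wastedCPUFraction(0.5, 10)) // 0.05
	// 0.3 s/s attributable to AC queueing => 3% of capacity.
	fmt.Println(wastedCPUFraction(0.3, 10)) // 0.03
}
```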
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 8, 2023