
Recurring device driver stats failure causes zero backoff (and a lot of wasted CPU). #18199

Closed
stswidwinski opened this issue Aug 14, 2023 · 3 comments · Fixed by #18200
stswidwinski commented Aug 14, 2023

Nomad version

tip.

Operating system and Environment details

Linux.

Issue

Repeated failures to obtain device stats cause the backoff to drop to zero and burn CPU; we've seen CPUs entirely pegged because of this. The problem is that the backoff logic here:

https://github.com/hashicorp/nomad/blob/main/client/devicemanager/instance.go#L497-L504

can easily overflow. Here are the logs of repeated failures and the computed backoff:

[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=5s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=1m20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=5m20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=21m20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-1759548h27m28.54775808s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-1914098h15m20.481480704s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-2532297h26m48.2163712s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-280223h58m36.530991104s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-1120895h54m26.123964416s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-2562047h47m16.854775808s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s

[...]
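For illustration, here is a minimal, standalone Go sketch of the same pattern (the 5s base and 30m limit are assumed from the logs above; this is not the Nomad code itself). Because the attempt counter keeps growing after the cap is hit, the shifted value eventually overflows time.Duration, producing the negative and zero backoffs seen in the logs:

package main

import (
	"fmt"
	"time"
)

func main() {
	base := 5 * time.Second   // assumed from the first retry in the logs
	limit := 30 * time.Minute // assumed from the capped retries in the logs

	for attempt := uint64(0); attempt < 33; attempt++ {
		// the buggy pattern: shift first, cap afterwards, and keep
		// incrementing the attempt counter even after hitting the cap
		backoff := (1 << (2 * attempt)) * base
		if backoff > limit {
			backoff = limit
		}
		fmt.Printf("attempt %2d: retry=%v\n", attempt, backoff)
	}
}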

Sadly, this logic is copy-pasted in multiple different places across the codebase. Some examples:

nomad/nomad/worker.go, lines 876 to 883 (at 65d562b):

	w.setWorkloadStatus(WorkloadBackoff)
	backoff := (1 << (2 * w.failures)) * base
	if backoff > limit {
		backoff = limit
	} else {
		w.failures++
	}
	select {

And elsewhere:

	deadline := 1 << (2 * retries) * backoffBase
	if deadline > backoffLimit {
		deadline = backoffLimit
	}
	retryTimer.Reset(deadline)

So any of these places may cause a CPU to spin on repeated failures.

Reproduction steps

Compile any plugin which always fails to return stats, run the Nomad client with that plugin, and give it some time. It will end up producing bad data and eventually burn CPU.

tgross commented Aug 14, 2023

Hi @stswidwinski! Yeah, this looks pretty straightforwardly like a bug. In some quick testing it looks like it overflows after only 24 iterations, and we should be using an approach like this:

backoff := time.Duration(0)
attempt := 0
for {
	// recompute the backoff only while it is still below the cap, so the
	// shift count stops growing once the cap is reached and cannot overflow
	if backoff < statsBackoffLimit {
		backoff = (1 << (2 * uint64(attempt))) * statsBackoffBaseline
		if backoff > statsBackoffLimit {
			backoff = statsBackoffLimit
		}
		attempt++
	}
	// ... wait for the backoff and retry the stats call ...
}

This caps the backoff and avoids the overflow. And you're right, there are a good number of places where we're doing this incorrectly. I can get a patch up for this shortly. Thanks!
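For reference, the fix that followed factors this pattern into a shared helper that checks the cap before the bitshift (see the commit messages below). The sketch here only illustrates that idea; the name and signature are hypothetical, not the actual code from #18200:

// expBackoff is an illustrative helper (hypothetical name and signature):
// it checks the cap before the shift so the computation can never overflow.
func expBackoff(attempt uint64, base, limit time.Duration) time.Duration {
	// Compare the multiplier against limit/base *before* computing the
	// product; the attempt guard also avoids shift counts of 64 or more.
	if attempt >= 32 || time.Duration(1)<<(2*attempt) > limit/base {
		return limit
	}
	return (time.Duration(1) << (2 * attempt)) * base
}

With the values from the logs above (5s base, 30m limit), this returns 5s, 20s, 1m20s, 5m20s, 21m20s, and then 30m for every later attempt, no matter how large the attempt counter grows.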

@tgross tgross self-assigned this Aug 14, 2023
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 14, 2023
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Aug 14, 2023
@tgross tgross added the stage/accepted label Aug 14, 2023
tgross added a commit that referenced this issue Aug 14, 2023
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a check on the cap before the bitshift to avoid overflow in all places
this can occur.

Fixes: #18199
@tgross tgross added this to the 1.6.x milestone Aug 14, 2023
tgross added a commit that referenced this issue Aug 14, 2023
tgross commented Aug 14, 2023

Patch is here: #18200, waiting for review. We'll get that into the next version of Nomad (with backports to 1.6.x, 1.5.x, and 1.4.x).

@stswidwinski
Contributor Author

Thank you! I proposed a somewhat more modular option to help us avoid making a similar mistake in the future (PR linked in your PR). Hope that helps!

Nomad - Community Issues Triage automation moved this from In Progress to Done Aug 15, 2023
tgross added a commit that referenced this issue Aug 15, 2023
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all 
places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
tgross added a commit that referenced this issue Aug 15, 2023
tgross added a commit that referenced this issue Aug 15, 2023
tgross added a commit that referenced this issue Aug 15, 2023
Labels
stage/accepted, type/bug