
Recurring device driver stats failure causes zero backoff (and a lot of wasted CPU). #18199

Closed
stswidwinski opened this issue Aug 14, 2023 · 3 comments · Fixed by #18200
stswidwinski commented Aug 14, 2023

Nomad version

tip.

Operating system and Environment details

Linux.

Issue

Repeated failures to obtain device stats cause the backoff to drop to zero and burn CPU; we've seen CPUs entirely pegged because of this. The problem is that the backoff logic here:

https://github.com/hashicorp/nomad/blob/main/client/devicemanager/instance.go#L497-L504

can easily overflow. Here are the logs of repeated failures and the computed backoff:

[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=5s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=1m20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=5m20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=21m20s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-1759548h27m28.54775808s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-1914098h15m20.481480704s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-2532297h26m48.2163712s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-280223h58m36.530991104s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-1120895h54m26.123964416s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=30m0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=-2562047h47m16.854775808s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s
[ERROR] client.device_mgr: stats returned an error: plugin=PLUGIN error=<nil> retry=0s

[...]
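For illustration, here is a minimal, standalone Go sketch of the same pattern (the 5s base and 30m limit are assumed from the logs above; this is not the Nomad code itself). Because the attempt counter keeps growing after the cap is hit, the shifted value eventually overflows time.Duration, producing the negative and zero backoffs seen in the logs:

package main

import (
	"fmt"
	"time"
)

func main() {
	base := 5 * time.Second   // assumed from the first retry in the logs
	limit := 30 * time.Minute // assumed from the capped retries in the logs

	for attempt := uint64(0); attempt < 33; attempt++ {
		// the buggy pattern: shift first, cap afterwards, and keep
		// incrementing the attempt counter even after hitting the cap
		backoff := (1 << (2 * attempt)) * base
		if backoff > limit {
			backoff = limit
		}
		fmt.Printf("attempt %2d: retry=%v\n", attempt, backoff)
	}
}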

Sadly, this logic is copy-pasted in multiple different places across the codebase. Some examples:

nomad/nomad/worker.go, lines 876 to 883 (at 65d562b):

	w.setWorkloadStatus(WorkloadBackoff)
	backoff := (1 << (2 * w.failures)) * base
	if backoff > limit {
		backoff = limit
	} else {
		w.failures++
	}
	select {

And elsewhere:

	deadline := 1 << (2 * retries) * backoffBase
	if deadline > backoffLimit {
		deadline = backoffLimit
	}
	retryTimer.Reset(deadline)

So any of these places may cause a CPU to spin on repeated failures.

Reproduction steps

Compile any plugin which always fails to return stats, run the Nomad client with that plugin, and give it some time. It will end up producing bad data and eventually burn CPU.

tgross commented Aug 14, 2023

Hi @stswidwinski! Yeah, this looks pretty straightforwardly like a bug. In some quick testing it looks like it overflows after only 24 iterations, and we should be using an approach like this:

backoff := time.Duration(0)
attempt := 0
for {
	// recompute the backoff only while it is still below the cap, so the
	// shift count stops growing once the cap is reached and cannot overflow
	if backoff < statsBackoffLimit {
		backoff = (1 << (2 * uint64(attempt))) * statsBackoffBaseline
		if backoff > statsBackoffLimit {
			backoff = statsBackoffLimit
		}
		attempt++
	}
	// ... wait for the backoff and retry the stats call ...
}

This caps the backoff and avoids the overflow. And you're right, there are a good number of places where we're doing this incorrectly. I can get a patch up for this shortly. Thanks!
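For reference, the fix that followed factors this pattern into a shared helper that checks the cap before the bitshift (see the commit messages below). The sketch here only illustrates that idea; the name and signature are hypothetical, not the actual code from #18200:

// expBackoff is an illustrative helper (hypothetical name and signature):
// it checks the cap before the shift so the computation can never overflow.
func expBackoff(attempt uint64, base, limit time.Duration) time.Duration {
	// Compare the multiplier against limit/base *before* computing the
	// product; the attempt guard also avoids shift counts of 64 or more.
	if attempt >= 32 || time.Duration(1)<<(2*attempt) > limit/base {
		return limit
	}
	return (time.Duration(1) << (2 * attempt)) * base
}

With the values from the logs above (5s base, 30m limit), this returns 5s, 20s, 1m20s, 5m20s, 21m20s, and then 30m for every later attempt, no matter how large the attempt counter grows.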

@tgross tgross self-assigned this Aug 14, 2023
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 14, 2023
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Aug 14, 2023
@tgross tgross added the stage/accepted label Aug 14, 2023
tgross added a commit that referenced this issue Aug 14, 2023
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a check on the cap before the bitshift to avoid overflow in all places
this can occur.

Fixes: #18199
@tgross tgross added this to the 1.6.x milestone Aug 14, 2023
tgross added a commit that referenced this issue Aug 14, 2023
tgross commented Aug 14, 2023

Patch is here: #18200, waiting for review. We'll get that into the next version of Nomad (with backports to 1.6.x, 1.5.x, and 1.4.x).

@stswidwinski
Contributor Author

Thank you! I proposed a somewhat more modular option to help us avoid making a similar mistake in the future (PR linked in your PR). Hope that helps!

Nomad - Community Issues Triage automation moved this from In Progress to Done Aug 15, 2023
tgross added a commit that referenced this issue Aug 15, 2023
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all 
places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
tgross added a commit that referenced this issue Aug 15, 2023
tgross added a commit that referenced this issue Aug 15, 2023
tgross added a commit that referenced this issue Aug 15, 2023
Labels
stage/accepted, type/bug