
fix multiple overflow errors in exponential backoff #18200

Merged: 8 commits merged into main from backoff-overflow, Aug 15, 2023

Conversation

@tgross (Member) commented Aug 14, 2023

We use capped exponential backoff in several places in the code when handling failures. The code we've copy-and-pasted all over has a check to see whether the backoff is greater than the limit, but this check happens after the bitshift, and we always increment the number of attempts. This causes an overflow after a fairly small number of failures (e.g. at one place I tested, it occurs after only 24 iterations), resulting in a negative backoff which then never recovers. The backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC handler or an external API such as Vault. Note this doesn't occur in places where we cap the number of iterations so the loop breaks (usually to return an error), so long as that cap is reasonable.

Introduce a check on the cap before the bitshift to avoid overflow in all places where this can occur.

Fixes: #18199
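
To make the failure mode concrete, here is a minimal, self-contained sketch of the "shift first, cap after" pattern described above; the constants, names, and loop are illustrative only, not the actual Nomad call sites. Because the cap is applied only after the shift and the attempt counter is never bounded, the multiplication eventually overflows `time.Duration` (an int64 of nanoseconds) and the backoff goes negative, so a `time.Sleep` on it returns immediately:

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseBackoff = 1 * time.Second // illustrative values, not Nomad's
	maxBackoff  = 30 * time.Second
)

func main() {
	var attempts uint64
	for i := 0; i < 40; i++ {
		// Shift first...
		backoff := baseBackoff * time.Duration(1<<attempts)
		// ...cap afterwards: too late once the shift has already overflowed.
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
		// The attempt counter is incremented unconditionally and never capped.
		attempts++
		fmt.Printf("attempt %2d: backoff %v\n", attempts, backoff)
	}
}
```

With a one-second base this sketch wraps to a negative duration after roughly 34 attempts; larger bases overflow sooner.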

@tgross tgross added this to the 1.6.x milestone Aug 14, 2023
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 14, 2023
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Aug 14, 2023
@tgross tgross marked this pull request as ready for review August 14, 2023 20:21
@gulducat (Member) left a comment

I have a couple bits (heh) for consideration, which may be a result of my eyes glazing over reading these almost-but-not-quite-identical implementations.

nomad/state/state_store.go (review thread, outdated, resolved)
nomad/worker.go (review thread, outdated, resolved)
@stswidwinski (Contributor) commented

I thought I'd propose a slightly more structured approach: #18201. Perhaps we can use that across the call sites to avoid future issues resulting from repeated logic?

@tgross (Member, Author) commented Aug 15, 2023

I've pulled in the proposed changes from #18201.
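
For reference, a rough sketch of what a shared helper along the lines of #18201 could look like; the name, signature, and exact bounds here are hypothetical, not necessarily the API that was merged. The key point is that the cap is consulted before the shift, so neither the shift nor the multiplication can overflow:

```go
// expBackoff is an illustrative helper, not the merged Nomad API: it returns
// base * 2^attempt, capped at limit, without ever overflowing time.Duration.
func expBackoff(attempt uint64, base, limit time.Duration) time.Duration {
	if base <= 0 || base >= limit {
		return limit
	}
	// Check the cap before shifting: if doubling `attempt` times would already
	// exceed the limit, return the limit instead of computing the product.
	// Bounding the shift count at 62 keeps 1<<attempt within a positive int64.
	if attempt > 62 || base > limit/(time.Duration(1)<<attempt) {
		return limit
	}
	return base * (time.Duration(1) << attempt)
}
```

Callers then only track the attempt count and call the helper each time, instead of repeating the shift-and-cap logic at every call site.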

@gulducat (Member) left a comment

Yay for reusable helper!

I have two concerns in the docker driver, one small and one large.
Other than that, LGTM!

drivers/docker/driver.go (review thread, outdated, resolved)
drivers/docker/driver.go (review thread, resolved)
drivers/docker/driver.go (review thread, resolved)
@tgross tgross merged commit f00bff0 into main Aug 15, 2023
25 checks passed
Nomad - Community Issues Triage automation moved this from In Progress to Done Aug 15, 2023
@tgross tgross deleted the backoff-overflow branch August 15, 2023 18:38
tgross added a commit that referenced this pull request Aug 15, 2023
Introduce a helper with a check on the cap before the bitshift to avoid overflow in all places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
tgross added a commit that referenced this pull request Aug 15, 2023
tgross added a commit that referenced this pull request Aug 15, 2023
@tgross (Member, Author) commented Aug 15, 2023

Backported to 1.6.x, 1.5.x, and 1.4.x

Successfully merging this pull request may close: #18199 Recurring device driver stats failure causes zero backoff (and a lot of wasted CPU).