client: prevent using stale allocs #18601

schmichael · 2023-09-27T21:14:52Z

Similar to #18269, it is possible that, even if Node.GetClientAllocs retrieves fresh allocs, the subsequent Alloc.GetAllocs call retrieves stale allocs. While diffAlloc(existing, updated) properly ignores stale alloc updates, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a subsequent Alloc.GetAllocs call hits a new server which returns results at index 100, the client will stop the alloc created at 123 because it will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh responses are used.

Glossary:

fresh - modified at an index > the query index
stale - modified at an index <= the query index
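
The deletion hazard described above can be illustrated with a minimal sketch. This is not the code from the PR: `allocsResponse`, `isFresh`, and `removedAllocs` are hypothetical names, and the query index of 122 is an assumed value chosen so that the alloc created at index 123 counts as fresh. It only shows why a stale response must be rejected before computing deletions.

```go
package main

import (
	"fmt"
	"sort"
)

// allocsResponse is a hypothetical stand-in for an RPC reply: the index the
// server answered at, plus the allocs it returned (ID -> modify index).
type allocsResponse struct {
	Index  uint64
	Allocs map[string]uint64
}

// isFresh follows the glossary: fresh means an index > the query index,
// stale means an index <= the query index.
func isFresh(respIndex, queryIndex uint64) bool {
	return respIndex > queryIndex
}

// removedAllocs returns the IDs present in existing but missing from resp.
// For a stale response it returns (nil, false) so the caller never treats
// "missing from a stale reply" as "deleted".
func removedAllocs(existing map[string]uint64, resp allocsResponse, queryIndex uint64) ([]string, bool) {
	if !isFresh(resp.Index, queryIndex) {
		return nil, false
	}
	var removed []string
	for id := range existing {
		if _, ok := resp.Allocs[id]; !ok {
			removed = append(removed, id)
		}
	}
	sort.Strings(removed) // deterministic order for callers and tests
	return removed, true
}

func main() {
	// The client already knows about an alloc created at index 123.
	existing := map[string]uint64{"alloc-123": 123}
	const queryIndex = 122 // assumed query index for this illustration

	// A lagging server answers at index 100 and has never seen the alloc.
	stale := allocsResponse{Index: 100, Allocs: map[string]uint64{}}
	removed, ok := removedAllocs(existing, stale, queryIndex)
	fmt.Println("stale response:", removed, ok)

	// An up-to-date server answers at index 124 and still has the alloc.
	fresh := allocsResponse{Index: 124, Allocs: map[string]uint64{"alloc-123": 123}}
	removed, ok = removedAllocs(existing, fresh, queryIndex)
	fmt.Println("fresh response:", removed, ok)
}
```

Without the `isFresh` guard, the stale reply at index 100 would make `alloc-123` look deleted, which is the failure mode this PR describes.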

schmichael · 2023-09-27T21:25:31Z

@stswidwinski I would love you weighing in on this as well if you have time since this is directly downstream from your fix in #18269

stswidwinski · 2023-09-27T21:39:14Z

I think this is entirely analogous to #18269 and I agree that this should address the problem. Previously we didn't address this for two reasons (neither very good...):

we implied that the likelihood of this occurring after the prior responses were correct (here: https://github.com/hashicorp/nomad/pull/18601/files#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R2388) was tiny (well, at least smaller than the issue we were fixing in that PR)
we were focusing on the race with the somewhat larger blast radius (since deletions applied globally rather than to the updated allocs)

I believe that #18267 more broadly describes this as a problem with nearly every RPC we issue that requires a minimum index. That's because a stale response is not an error, which is somewhat compounded by the related #18266.

While squashing the most painful samples seems like the right thing to do in the short term, perhaps it makes sense to mint a few wrappers which help issue queries with an index limit which convert stale responses into errors (such that the type checking helps us handle them). What do you think?

Edit: I think my statement of "nearly every" is somewhat strong. I think it's a relatively common pattern in those places which allow stale responses, but I would hate to make quantitative statements without having audited them. @schmichael has done that though, so a future reader should refer to the issues mentioned above.

schmichael · 2023-09-27T21:48:29Z

> While squashing the most painful samples seems like the right thing to do in the short term, perhaps it makes sense to mint a few wrappers which help issue queries with an index limit which convert stale responses into errors (such that the type checking helps us handle them). What do you think?

Absolutely. I'm auditing all RPCs Clients make as we speak since as you pointed out this is likely part of a broader pattern.

In the past we had audited these calls for properly setting the index, but during that audit we had neglected to account for checking the index after the response.

Once I get the audit complete and see what the damage is, we can decide whether spot fixes or more holistic changes are best.

Given the opportunity to design blocking RPC semantics from scratch, I would have absolutely chosen a default behavior of returning an error instead of a stale response. Giving callers valid-looking data and entrusting them to double-check the index is obviously error-prone.

client/client.go

tgross

LGTM 👍

schmichael · 2023-09-28T18:19:07Z

Skipping changelog entry because #18269 covers it (and sadly I don't think I can link 2 PRs from 1 logline in our changelog building system?)

schmichael · 2023-09-29T21:34:36Z

Backport failed: https://github.com/hashicorp/nomad/actions/runs/6342982782

Manual 1.4.x backport: d2cd6db

Manual 1.5.x backport: b1cf888

Manual 1.6.x backport: f02061f

schmichael requested review from lgfa29 and tgross September 27, 2023 21:14

tgross reviewed Sep 28, 2023

cleanup log style

d2bd436

tgross approved these changes Sep 28, 2023

schmichael marked this pull request as ready for review September 28, 2023 18:19

schmichael added the backport/1.4.x, backport/1.5.x, and backport/1.6.x labels Sep 28, 2023

schmichael merged commit e73026d into main Sep 28, 2023
29 of 31 checks passed

schmichael deleted the b-getallocs-index branch September 28, 2023 18:42

tgross mentioned this pull request Sep 28, 2023

client: prevent watching stale alloc state #18612

Merged

schmichael mentioned this pull request Sep 29, 2023

Nomad Server may instruct Clients to erroneously stop and GC all of their allocations. #18267

Open
