Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: prevent using stale allocs #18601

Merged
merged 2 commits into from
Sep 28, 2023
Merged

client: prevent using stale allocs #18601

merged 2 commits into from
Sep 28, 2023

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Sep 27, 2023

Similar to #18269, it is possible that even if Node.GetClientAllocs retrieves fresh allocs that the subsequent Alloc.GetAllocs call retrieves stale allocs. While diffAlloc(existing, updated) properly ignores stale alloc updates, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a subsequent Alloc.GetAllocs call hits a new server which returns results at index 100, the client will stop the alloc created at 123 because it will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh responses are used.

Glossary:

  • fresh - modified at an index > the query index
  • stale - modified at an index <= the query index

Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
@schmichael
Copy link
Member Author

schmichael commented Sep 27, 2023

@stswidwinski I would love you weighing in on this as well if you have time since this is directly downstream from your fix in #18269

@stswidwinski
Copy link
Contributor

stswidwinski commented Sep 27, 2023

I think this is entirely analogous to #18269 and I agree that this should address the problem. Previously we haven't addressed this because of two reasons (neither very good...):

  1. we have implied that the likelihood of this occuring after the prior responses was correct (here: https://github.com/hashicorp/nomad/pull/18601/files#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R2388) was tiny (well, at least smaller than the issue we were fixing in that PR)
  2. we were focusing on the somewhat larger blast radius race (since deletions applied globally rather than to the updated allocs)

I believe that #18267 more broadly describes this as a problem with nearly every RPC issued such that we require a minimal index. That's because a stale response is not an error which is somewhat compounded by related #18266.

While squashing the most painful samples seems like the right thing to do in the short term, perhaps it makes sense to mint a few wrappers which help issue queries with an index limit which convert stale responses into errors (such that the type checking helps us handle them). What do you think?

Edit: I think my statement of "nearly every" is somewhat strong. I think it's a relatively common pattern in those places which allow stale responses, but I would hate to make quantitative statements without having audited them. @schmichael has done that though, so a future reader should refer to the issues mentioned above.

@schmichael
Copy link
Member Author

While squashing the most painful samples seems like the right thing to do in the short term, perhaps it makes sense to mint a few wrappers which help issue queries with an index limit which convert stale responses into errors (such that the type checking helps us handle them). What do you think?

Absolutely. I'm auditing all RPCs Clients make as we speak since as you pointed out this is likely part of a broader pattern.

In the past we had audited these calls for properly setting the index, but during that audit we had neglected to account for checking the index after the response.

Once I get the audit complete and see what the damage is, we can decide whether spot fixes or more holistic changes are best.

Given the opportunity to design blocking RPC semantics from scratch I would have absolutely choose a default behavior of returning an error instead of a stale response. Giving callers valid looking data and entrusting them to double check the index is obviously error prone.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@schmichael
Copy link
Member Author

Skipping changelog entry because #18269 covers it (and sadly I don't think I can link 2 PRs from 1 logline in our changelog building system?)

@schmichael schmichael marked this pull request as ready for review September 28, 2023 18:19
@schmichael schmichael added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line labels Sep 28, 2023
@schmichael schmichael merged commit e73026d into main Sep 28, 2023
29 of 31 checks passed
@schmichael schmichael deleted the b-getallocs-index branch September 28, 2023 18:42
schmichael added a commit that referenced this pull request Sep 29, 2023
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
schmichael added a commit that referenced this pull request Sep 29, 2023
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
schmichael added a commit that referenced this pull request Sep 29, 2023
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
@schmichael
Copy link
Member Author

Backport failed: https://github.com/hashicorp/nomad/actions/runs/6342982782

Manual 1.4.x backport: d2cd6db

Manual 1.5.x backport: b1cf888

Manual 1.6.x backport: f02061f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants