Ensure that the index processed by the client is at least as new as the last one processed. #18269

Conversation

stswidwinski
Contributor

@stswidwinski stswidwinski commented Aug 21, 2023

A fix for #18267, ensuring that "time travel" of the client state is not possible.

I could not find tests which may exercise this logic meaningfully. Please point me in the right direction and I'm happy to add them.

Please note that this isn't a proper fix for the underlying issue, but rather a stop-gap measure.
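For readers without the diff at hand, the stop-gap amounts to something like the rough sketch below; `lastIndexProcessed` and `shouldApply` are made-up names for illustration, not the identifiers used in the actual patch.

```go
// Minimal sketch of the guard described above (illustrative names only).
var lastIndexProcessed uint64

// shouldApply reports whether a server response at respIndex is at least
// as new as the last one the client processed, and records it if so.
func shouldApply(respIndex uint64) bool {
	if respIndex < lastIndexProcessed {
		// Stale response: ignore it so client state cannot "time travel".
		return false
	}
	lastIndexProcessed = respIndex
	return true
}
```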

…he last index processed so that stale data does not impact the running allocations.
@stswidwinski
Contributor Author

I don't think the UI tests are related here. May I ask someone to re-run the tests or help me understand the failure mode?

@lgfa29
Contributor

lgfa29 commented Aug 21, 2023

Thanks for the PR and detailed issue description @stswidwinski.

I could not find tests which may exercise this logic meaningfully. Please point me in the right direction and I'm happy to add them.

Yeah, this part of the code is very tricky to test reliably since there's a lot going on, causing tests to be flaky sometimes. TestClient_WatchAllocs may be a good place to start. I wonder if, by setting up a test like that and manually messing with the server raft index, we could reliably reproduce and test the problem.

I don't think the UI tests are related here. May I ask someone to re-run the tests or help me understand the failure mode?

I'm not sure what's going on with this failure. I suspect that, because this is an external PR, the Percy token may not be exposed. But we can safely ignore this error as that step only uploads UI visual diff testing results, which this PR wouldn't affect.

The changes look good to me, but I will just chat with the rest of the team to make sure I'm not missing anything.

Thanks again for the detailed work!

Testing for absence is a rather difficult task without hooks into the code to assert that given code paths have been hit. In this test we simply make sure that a transaction with a lower raft index does not make it to the client, relying on the timeout to ensure that this is the case. If the invariant is broken we expect the test to become either flaky or entirely broken.

I acknowledge there may be a better way to test here, but lacking context this seems to be a minimal attempt to ensure that a similar issue doesn't creep back in. Feedback welcome.
1. The error is ignored in WaitForResult.
2. Existing timeouts in blocking queries cannot be overridden and would make the tests always pass.
3. The scheduler seems to ignore raft transactions with lower IDs.

As such, we would need more infrastructure to write such tests. Do you have an idea whether there is a precedent I can use here, so as not to reinvent the wheel? In general we need:

1. To reduce the raft index of the scheduler (purge the state or reassign it): this can be done with a regular assignment into a newly created state.
2. To induce a timeout on the blocking query to ensure that the query result is passed to the client: I see no easy way of achieving this.
@stswidwinski
Contributor Author

stswidwinski commented Aug 22, 2023

Yeah, this part of the code is very tricky to test reliably since there's a lot going on, causing tests to be flaky sometimes. TestClient_WatchAllocs may be a good place to start. I wonder if, by setting up a test like that and manually messing with the server raft index, we could reliably reproduce and test the problem.

I'm beginning to think that these tests are not quite possible without broader changes to timeout settings. Please see the recent commit comments for some more context. Do you know if there is an easy way to override the timeouts used for blocking queries by Agent? What do you think?

@stswidwinski
Contributor Author

@lgfa29 please let me know how you'd like to proceed

@lgfa29
Contributor

lgfa29 commented Aug 23, 2023

Hi @stswidwinski 👋

Apologies for the delay, but I don't have a good answer yet 😅

I'm checking with the rest of the team for any additional ideas and I will let you know if something comes up. But since this is such a tricky behaviour to trigger in a test, we may need to do without one for now.

@stswidwinski
Contributor Author

Hey @lgfa29!

Thank you for the update and sorry to keep bugging you. Let me know if I can help in some way or if any part of the bug report (or PR) makes this more confusing. Otherwise, I'll sit tight :)

Member

@schmichael schmichael left a comment


I think this is a fine short-circuit to add, but I am struggling to see where it could fix misbehavior.

I'm beginning to think that these tests are not quite possible without broader changes to timeout settings.

I think you're correct here. One approach we could take is adding a rpcQueryWaitTime time.Duration to the Client struct and overriding it during tests to at least exercise the "blocking query timeout" case. At (normal) runtime the zero value should get converted to the default in Client.rpc(...), but it's probably best to explicitly set it to structs.DefaultBlockingRPCQueryTime in client.NewClient(...).
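A rough sketch of that suggestion, assuming the hypothetical rpcQueryWaitTime field; apart from structs.DefaultBlockingRPCQueryTime, which the comment above references, the names here are illustrative only.

```go
package client

import (
	"time"

	"github.com/hashicorp/nomad/nomad/structs"
)

// Client is shown with only the hypothetical field relevant here; the
// real struct carries many more fields.
type Client struct {
	// rpcQueryWaitTime bounds how long blocking queries wait on the
	// server. Tests can shrink it to reliably hit the timeout path.
	rpcQueryWaitTime time.Duration
}

// NewClient sets the field explicitly so the zero value never has to be
// interpreted later in the RPC path.
func NewClient() *Client {
	return &Client{rpcQueryWaitTime: structs.DefaultBlockingRPCQueryTime}
}

// queryWait is what the RPC path would consult, falling back to the
// default defensively.
func (c *Client) queryWait() time.Duration {
	if c.rpcQueryWaitTime <= 0 {
		return structs.DefaultBlockingRPCQueryTime
	}
	return c.rpcQueryWaitTime
}
```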

The main downside I can see is that this is yet-another-field on the already enormous Client struct, and a test-only one at that. I can't think of a less invasive change we could make to exercise some of this logic though.

I think moving Client.RPC(...) and dependencies into a new struct that implements RPCer interface{ RPC(...) error } or similar might provide us with a reasonable hook to manipulate RPC behavior in tests, but (a) that's a lot of work, and (b) we could easily mock so much that we're not actually testing realistic runtime scenarios anymore.
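For illustration only, the extracted interface and a test double might look roughly like this; the method signature mirrors Client.RPC, and staleRPCer is a made-up name.

```go
// RPCer is the interface sketched above; its signature mirrors Client.RPC.
type RPCer interface {
	RPC(method string, args interface{}, reply interface{}) error
}

// staleRPCer is a hypothetical test double that forwards calls to a real
// RPCer; a test could mutate the reply afterwards (e.g. lower the
// returned index) to simulate a stale server.
type staleRPCer struct {
	inner RPCer
}

func (s *staleRPCer) RPC(method string, args interface{}, reply interface{}) error {
	return s.inner.RPC(method, args, reply)
}
```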

Another option would be to refactor the function being changed here into multiple functions and unit test them. This greatly increases the (b) not testing realistic scenarios risk above, but there may be opportunities to assert some useful things.

Long story short: I'd love someone to explore better testing options for Client RPCs, but I don't think we need to hold up this PR to wait for them.

client/client.go (review comment; outdated, resolved)
@stswidwinski
Contributor Author

stswidwinski commented Aug 25, 2023

I think this is a fine short-circuit to add, but I am struggling to see where it could fix misbehavior.

I think the basic tenet of the breakage is that right now we accept the answer of the Scheduler independent of whether it is new or stale. In the case of staleness we would like to avoid applying the changes that such an answer would imply (killing allocations which are "unexpected" and garbage collecting them).

Stale replies may exist if the server we are getting information from (we allow staleness, so this is not the leader) times out waiting for the raft state to catch up to an index which would allow it to process the query correctly. This happens more frequently than one would expect because of the retry logic within the client. The issue submitted (#18267) is quite verbose about this, but I'm happy to provide more detail if it's helpful.

The change, in a somewhat trivial way, discards the data returned by the scheduler and requires that the response correspond to at least the minimal index requested.

Use identifier-style keys for structured logs

Co-authored-by: Michael Schurter <michael.schurter@gmail.com>
@schmichael
Member

I think the basic tenet of the breakage is that right now we accept the answer of the Scheduler independent of whether it is new or stale.

Ah you're right! I was thinking this check was sufficient: https://github.com/hashicorp/nomad/pull/18269/files#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R2396

However that check is only performed for allocs that are observable by the server returning the response. If the server is too stale to see that the alloc exists, the Client won't even make it to that check and will think the alloc has been GC'd by the server and therefore should be GC'd locally.

Yikes. I'll add a changelog entry and get this merged. Thanks @stswidwinski!

@schmichael schmichael added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line labels Aug 25, 2023
@schmichael schmichael merged commit f25480c into hashicorp:main Aug 25, 2023
27 of 29 checks passed
schmichael added a commit that referenced this pull request Sep 27, 2023
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
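
Restated as a predicate (illustrative names only; not code from the commit):

```go
// isFresh restates the glossary above: a response is fresh only when it
// was modified at an index strictly greater than the query index.
func isFresh(respIndex, queryIndex uint64) bool {
	return respIndex > queryIndex
}

// Per the commit message, alloc deletions are only honored when the
// response is fresh; an alloc missing from a stale response must not be
// read as a deletion.
```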
schmichael added a commit that referenced this pull request Sep 28, 2023
When waiting on a previous alloc we must query against the leader before
switching to a stale query with index set.

Also check to ensure the response is fresh before using it like #18269