Ensure that the index processed by the client is at least as new as the last one processed. #18269

Conversation

stswidwinski
Contributor

@stswidwinski stswidwinski commented Aug 21, 2023

A fix for #18267, ensuring that "time travel" of the client state is not possible.

I could not find tests which may exercise this logic meaningfully. Please point me in the right direction and I'm happy to add them.

Please note that this isn't a proper fix for the underlying issue, but rather a stop-gap measure.
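For readers without the diff at hand, the stop-gap amounts to something like the rough sketch below; `lastIndexProcessed` and `shouldApply` are made-up names for illustration, not the identifiers used in the actual patch.

```go
// Minimal sketch of the guard described above (illustrative names only).
var lastIndexProcessed uint64

// shouldApply reports whether a server response at respIndex is at least
// as new as the last one the client processed, and records it if so.
func shouldApply(respIndex uint64) bool {
	if respIndex < lastIndexProcessed {
		// Stale response: ignore it so client state cannot "time travel".
		return false
	}
	lastIndexProcessed = respIndex
	return true
}
```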

…he last index processed so that stale data does not impact the running allocations.
@stswidwinski
Contributor Author

I don't think the UI tests are related here. May I ask someone to re-run the tests or help me understand the failure mode?

@lgfa29
Contributor

lgfa29 commented Aug 21, 2023

Thanks for the PR and detailed issue description @stswidwinski.

I could not find tests which may exercise this logic meaningfully. Please point me in the right direction and I'm happy to add them.

Yeah, this part of the code is very tricky to test reliably since there's a lot going on, causing tests to be flaky sometimes. TestClient_WatchAllocs may be a good place to start. I wonder if, by setting up a test like that and manually messing with the server raft index, we could reliably reproduce and test the problem.

I don't think the UI tests are related here. May I ask someone to re-run the tests or help me understand the failure mode?

I'm not sure what's going on with this failure. I suspect that, because this is an external PR, the Percy token may not be exposed. But we can safely ignore this error as that step only uploads UI visual diff testing results, which this PR wouldn't affect.

The changes look good to me, but I will just chat with the rest of the team to make sure I'm not missing anything.

Thanks again for the detailed work!

Testing for absence is a rather difficult task without hooks into the code to assert that given code paths have been hit. In this test we simply make sure that a transaction with a lower raft index does not make it to the client, relying on the timeout to ensure that this is the case. If the invariant is broken we expect the test to become either flaky or entirely broken.

I acknowledge there may be a better way to test here, but lacking context this seems to be a minimal attempt to ensure that a similar issue doesn't creep back in. Feedback welcome.
1. The error is ignored in WaitForResult.
2. Existing timeouts in blocking queries cannot be overridden and would make the tests always pass.
3. The scheduler seems to ignore raft transactions with lower IDs.

As such, we would need more infrastructure to write such tests. Do you have an idea whether there is a precedent I can use here, so as not to reinvent the wheel? In general we need:

1. To reduce the raft index of the scheduler (purge the state or reassign it): this can be done with a regular assignment into a newly created state.
2. To induce a timeout on the blocking query to ensure that the query result is passed to the client: I see no easy way of achieving this.
@stswidwinski
Contributor Author

stswidwinski commented Aug 22, 2023

Yeah, this part of the code is very tricky to test reliably since there's a lot going on, causing tests to be flaky sometimes. TestClient_WatchAllocs may be a good place to start. I wonder if, by setting up a test like that and manually messing with the server raft index, we could reliably reproduce and test the problem.

I'm beginning to think that these tests are not quite possible without broader changes to timeout settings. Please see the recent commit comments for some more context. Do you know if there is an easy way to override the timeouts used for blocking queries by Agent? What do you think?

@stswidwinski
Contributor Author

@lgfa29 please let me know how you'd like to proceed

@lgfa29
Contributor

lgfa29 commented Aug 23, 2023

Hi @stswidwinski 👋

Apologies for the delay, but I don't have a good answer yet 😅

I'm checking with the rest of the team for any additional ideas and I will let you know if something comes up. But since this is such a tricky behaviour to trigger in a test, we may need to do without one for now.

@stswidwinski
Contributor Author

Hey @lgfa29!

Thank you for the update and sorry to keep bugging you. Let me know if I can help in some way or if any part of the bug report (or PR) makes this more confusing. Otherwise, I'll sit tight :)

Member

@schmichael schmichael left a comment


I think this is a fine short-circuit to add, but I am struggling to see where it could fix misbehavior.

I'm beginning to think that these tests are not quite possible without broader changes to timeout settings.

I think you're correct here. One approach we could take is adding a rpcQueryWaitTime time.Duration to the Client struct and overriding it during tests to at least exercise the "blocking query timeout" case. At (normal) runtime the zero value should get converted to the default in Client.rpc(...), but it's probably best to explicitly set it to structs.DefaultBlockingRPCQueryTime in client.NewClient(...).
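A rough sketch of that suggestion, assuming the hypothetical rpcQueryWaitTime field; apart from structs.DefaultBlockingRPCQueryTime, which the comment above references, the names here are illustrative only.

```go
package client

import (
	"time"

	"github.com/hashicorp/nomad/nomad/structs"
)

// Client is shown with only the hypothetical field relevant here; the
// real struct carries many more fields.
type Client struct {
	// rpcQueryWaitTime bounds how long blocking queries wait on the
	// server. Tests can shrink it to reliably hit the timeout path.
	rpcQueryWaitTime time.Duration
}

// NewClient sets the field explicitly so the zero value never has to be
// interpreted later in the RPC path.
func NewClient() *Client {
	return &Client{rpcQueryWaitTime: structs.DefaultBlockingRPCQueryTime}
}

// queryWait is what the RPC path would consult, falling back to the
// default defensively.
func (c *Client) queryWait() time.Duration {
	if c.rpcQueryWaitTime <= 0 {
		return structs.DefaultBlockingRPCQueryTime
	}
	return c.rpcQueryWaitTime
}
```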

The main downside I can see is that this is yet-another-field on the already enormous Client struct, and a test-only one at that. I can't think of a less invasive change we could make to exercise some of this logic though.

I think moving Client.RPC(...) and dependencies into a new struct that implements RPCer interface{ RPC(...) error } or similar might provide us with a reasonable hook to manipulate RPC behavior in tests, but (a) that's a lot of work, and (b) we could easily mock so much that we're not actually testing realistic runtime scenarios anymore.
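For illustration only, the extracted interface and a test double might look roughly like this; the method signature mirrors Client.RPC, and staleRPCer is a made-up name.

```go
// RPCer is the interface sketched above; its signature mirrors Client.RPC.
type RPCer interface {
	RPC(method string, args interface{}, reply interface{}) error
}

// staleRPCer is a hypothetical test double that forwards calls to a real
// RPCer; a test could mutate the reply afterwards (e.g. lower the
// returned index) to simulate a stale server.
type staleRPCer struct {
	inner RPCer
}

func (s *staleRPCer) RPC(method string, args interface{}, reply interface{}) error {
	return s.inner.RPC(method, args, reply)
}
```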

Another option would be to refactor the function being changed here into multiple functions and unit test them. This greatly increases the (b) not testing realistic scenarios risk above, but there may be opportunities to assert some useful things.

Long story short: I'd love someone to explore better testing options for Client RPCs, but I don't think we need to hold up this PR to wait for them.

client/client.go (review comment; outdated, resolved)
@stswidwinski
Contributor Author

stswidwinski commented Aug 25, 2023

I think this is a fine short-circuit to add, but I am struggling to see where it could fix misbehavior.

I think the basic tenet of the breakage is that right now we accept the answer of the Scheduler independent of whether it is new or stale. In the case of staleness we would like to avoid applying the changes that such an answer would imply (killing allocations which are "unexpected" and garbage collecting them).

Stale replies may exist if the server we are getting information from (we allow staleness, so this is not the leader) times out waiting for the raft state to catch up to an index which would allow it to process the query correctly. This happens more frequently than one would expect because of the retry logic within the client. The issue submitted (#18267) is quite verbose about this, but I'm happy to provide more detail if it's helpful.

The change, in a somewhat trivial way, discards the data returned by the scheduler and requires that the response correspond to at least the minimal index requested.

Use identifier-style keys for structured logs

Co-authored-by: Michael Schurter <michael.schurter@gmail.com>
@schmichael
Member

I think the basic tenet of the breakage is that right now we accept the answer of the Scheduler independent of whether it is new or stale.

Ah you're right! I was thinking this check was sufficient: https://github.com/hashicorp/nomad/pull/18269/files#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R2396

However that check is only performed for allocs that are observable by the server returning the response. If the server is too stale to see that the alloc exists, the Client won't even make it to that check and will think the alloc has been GC'd by the server and therefore should be GC'd locally.

Yikes. I'll add a changelog entry and get this merged. Thanks @stswidwinski!

@schmichael schmichael added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line labels Aug 25, 2023
@schmichael schmichael merged commit f25480c into hashicorp:main Aug 25, 2023
27 of 29 checks passed
schmichael added a commit that referenced this pull request Sep 27, 2023
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
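
Restated as a predicate (illustrative names only; not code from the commit):

```go
// isFresh restates the glossary above: a response is fresh only when it
// was modified at an index strictly greater than the query index.
func isFresh(respIndex, queryIndex uint64) bool {
	return respIndex > queryIndex
}

// Per the commit message, alloc deletions are only honored when the
// response is fresh; an alloc missing from a stale response must not be
// read as a deletion.
```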
schmichael added a commit that referenced this pull request Sep 28, 2023
When waiting on a previous alloc we must query against the leader before
switching to a stale query with index set.

Also check to ensure the response is fresh before using it like #18269