
Update alloc after reconnect and enforce client heartbeat order #15068

Merged
merged 12 commits on Nov 4, 2022

Conversation

lgfa29
Contributor

@lgfa29 lgfa29 commented Oct 27, 2022

When a Nomad client with at least one alloc with max_client_disconnect misses a heartbeat, the leader updates its status to disconnected, which results in evals for all jobs that have an allocation in that client. Each of these evals is reconciled by the scheduler, which updates the alloc ClientStatus to unknown and appends a new state value to indicate the allocation is considered disconnected.

When the client reconnects again, its status is set back to ready since the heartbeat succeeds. The ClientStatus of the allocation is still unknown, as it may have failed while the client was disconnected. This status will only be updated once the client calls Node.UpdateAlloc against a server, which overwrites the unknown ClientStatus value set in the server with the correct status.

From the leader perspective this is what happens:

[diagram: node-disconnect.drawio]

The problem described in #14925 happens because any eval that runs after this disconnect/reconnect flow (such as a job update) is essentially indistinguishable from the last eval in the diagram above: the node is ready and the alloc ClientStatus is running, but these two scenarios need to be handled differently, so we need to store something in state to be able to tell them apart.

The current implementation uses the presence of a TaskClientReconnected task event to detect if an alloc needs to reconnect, but this event is still present even after the alloc reconnects, so the scheduler always considers the alloc as still reconnecting.

While testing the fix for #14925 I would often hit #12680, which prevented the issue from being triggered since the extra alloc would not be considered as "reconnecting", so I included a fix for #12680 in this PR as well since their root cause is similar.

Clients use three main RPC methods to communicate their state with servers:

  • Node.GetAllocs reads allocation data from the server and writes it to the client.
  • Node.UpdateAlloc reads allocation data from the client and writes it to the server.
  • Node.UpdateStatus writes the client status to the server and is used as the heartbeat mechanism.

These RPC methods are called periodically by the client, and independently of each other. The usual mental model is that clients heartbeat first with Node.UpdateStatus and then update their allocation data with Node.UpdateAlloc, but this is not always true. If these two operations are reversed, the diagram above looks like this:

[diagram: node-disconnect.drawio (2)]

The state at the second-to-last eval (Node updates allocs) looks just like the state when the node missed its heartbeat (Node misses heartbeat): Node.Status: disconnected and Alloc.ClientStatus: running.
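
To make the lack of ordering concrete, here is a minimal sketch in Go of two independent periodic loops, either of which can reach the servers first after a reconnect. This is not Nomad's actual client code; the intervals and structure are assumptions for illustration only.

package main

import "time"

func main() {
    // Each RPC runs on its own ticker in its own goroutine, so there is no
    // guarantee whether the heartbeat or the alloc update lands first.
    heartbeat := time.NewTicker(10 * time.Second)
    allocSync := time.NewTicker(5 * time.Second)
    defer heartbeat.Stop()
    defer allocSync.Stop()

    go func() {
        for range heartbeat.C {
            // Node.UpdateStatus: report the node status (the heartbeat).
        }
    }()
    go func() {
        for range allocSync.C {
            // Node.UpdateAlloc: push allocation client statuses.
        }
    }()

    select {} // block forever; both loops keep running independently
}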

This PR addresses these problems in the following ways:

  • Don't rely on task events since only a limited number of them are kept in state
  • Record when an allocation reconnects as an AllocState entry to keep it consistent with how disconnects are recorded
  • Check if the alloc needs to reconnect by verifying the last AllocState entry; if it's unknown, the alloc reconnect has not been processed yet
  • Enforce that nodes register and heartbeat successfully before being able to update their allocs. This makes the reconnect flow easier to reason about and more predictable.

The commits are split into different work chunks.

This is an alternative fix for #14948.

Closes #14925
Closes #12680

@lgfa29 lgfa29 changed the title from "wip" to "wip: update alloc after reconnect" on Oct 27, 2022
@lgfa29 lgfa29 changed the title from "wip: update alloc after reconnect" to "Update alloc after reconnect and enforce client heartbeat order" on Nov 2, 2022
When an allocation reconnects to a cluster the scheduler needs to run
special logic to handle the reconnection, check if a replacement was
created, and stop one of them.

If the allocation kept running while the node was disconnected, it will
be reconnected with `ClientStatus: running` and the node will have
`Status: ready`. This combination is the same as the normal steady state
of an allocation, where everything is running as expected.

In order to differentiate between the two states (an allocation that is
reconnecting and one that is just running) the scheduler needs an extra
piece of state.

The current implementation uses the presence of a
`TaskClientReconnected` task event to detect when the allocation has
reconnected and thus must go through the reconnection process. But this
event remains even after the allocation is reconnected, causing all
future evals to consider the allocation as still reconnecting.

This commit changes the reconnect logic to use an `AllocState` to
register when the allocation was reconnected. This provides the
following benefits:

  - Only a limited number of task states are kept, and they are used for
    many other events. It's possible that, upon reconnecting, several
    actions are triggered that could cause the `TaskClientReconnected`
    event to be dropped.
  - Task events are set by clients and so their timestamps are subject
    to time skew from servers. This prevents using time to determine if
    an allocation reconnected after a disconnect event.
  - Disconnect events are already stored as `AllocState` and so storing
    reconnects there as well makes it the only source of information
    required.

With this change, the reconnection logic is only triggered if the last
`AllocState` is a disconnect event, meaning that the allocation has not
been reconnected yet. After the reconnection is handled, the new
`ClientStatus` is stored in `AllocState`, allowing future evals to skip
the reconnection logic.
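
As a rough illustration of that check, here is a hedged sketch: the type and constant names mirror nomad/structs but are re-declared locally as stand-ins, so this is not the exact upstream code.

package main

import "fmt"

// Minimal local stand-ins for the nomad/structs types referenced above;
// the real definitions live in nomad/structs/structs.go.
type AllocStateField uint8

const AllocStateFieldClientStatus AllocStateField = 0

const AllocClientStatusUnknown = "unknown"

type AllocState struct {
    Field AllocStateField
    Value string
}

// needsToReconnect reports whether the most recent client-status entry is
// still "unknown", meaning a disconnect was recorded and no reconnect has
// been processed since. Entries are appended chronologically (in Nomad via
// Allocation.AppendState), so the scan walks backwards.
func needsToReconnect(states []*AllocState) bool {
    for i := len(states) - 1; i >= 0; i-- {
        s := states[i]
        if s.Field != AllocStateFieldClientStatus {
            continue
        }
        return s.Value == AllocClientStatusUnknown
    }
    return false
}

func main() {
    states := []*AllocState{
        {Field: AllocStateFieldClientStatus, Value: AllocClientStatusUnknown},
    }
    fmt.Println(needsToReconnect(states)) // true: the reconnect has not been processed yet
}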
When a client reconnects it makes two independent RPC calls:

  - `Node.UpdateStatus` to heartbeat and set its status as `ready`.
  - `Node.UpdateAlloc` to update the status of its allocations.

These two calls can happen in any order, and if the allocations are
updated before a heartbeat the state looks the same as a node being
disconnected: the node status will still be `disconnected` while the
allocation `ClientStatus` is set to `running`.

The current implementation did not handle this order of events properly,
and the scheduler would create an unnecessary placement since it
considered the allocation was being disconnected. This extra allocation
would then be quickly stopped by the heartbeat eval.

This commit adds a new code path to handle this order of events. If the
node is `disconnected` and the allocation `ClientStatus` is `running`
the scheduler will check if the allocation is actually reconnecting
using its `AllocState` events.
Clients interact with servers using three main RPC methods:

  - `Node.GetAllocs` reads allocation data from the server and writes it
    to the client.
  - `Node.UpdateAlloc` reads allocation data from the client and writes
    it to the server.
  - `Node.UpdateStatus` writes the client status to the server and is
    used as the heartbeat mechanism.

These three methods are called periodically by the clients, and
independently of each other, meaning that no assumptions can be made
about their ordering.

This can generate scenarios that are hard to reason about and to code
for. For example, when a client misses too many heartbeats it will be
considered `down` or `disconnected` and the allocations it was running
are set to `lost` or `unknown`.

When connectivity to the rest of the cluster is restored, the natural
mental model is to think that the client will heartbeat first and then
update the status of its allocations on the servers.

But since there's no inherent order in these calls, the reverse is just as
possible: the client updates the alloc status and then heartbeats. This
results in a state where allocs are, for example, `running` while the
client is still `disconnected`.

This commit adds a new verification to the `Node.UpdateAlloc` method to
reject updates from nodes that are not `ready`, forcing clients to
heartbeat first. Since this check is done server-side there is no need
to coordinate operations client-side: clients can continue sending these
requests independently, and the alloc update will succeed after the heartbeat
is done.
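
A minimal sketch of that server-side guard, with an illustrative helper name; the real check lives in Node.UpdateAlloc in nomad/node_endpoint.go and may be structured differently.

package main

import (
    "errors"
    "fmt"
)

const NodeStatusReady = "ready"

// Node is a minimal stand-in for structs.Node.
type Node struct {
    ID     string
    Status string
}

// checkNodeCanUpdateAllocs rejects alloc updates from nodes that have not
// registered and heartbeated into the "ready" status, forcing the heartbeat
// to be processed before any alloc update is accepted.
func checkNodeCanUpdateAllocs(node *Node) error {
    if node == nil {
        return errors.New("node not found")
    }
    if node.Status != NodeStatusReady {
        return fmt.Errorf("node %s has status %q, must be %q to update allocs",
            node.ID, node.Status, NodeStatusReady)
    }
    return nil
}

func main() {
    n := &Node{ID: "node-1", Status: "disconnected"}
    fmt.Println(checkNodeCanUpdateAllocs(n)) // rejected until the node heartbeats
}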
Comment on lines +10405 to -10410
for i := len(a.AllocStates) - 1; i >= 0; i-- {
s := a.AllocStates[i]
if s.Field != AllocStateFieldClientStatus {
continue
}
}
Contributor Author


This assumes that states are ordered chronologically, which I think is fine since entries are added using alloc.AppendState? Another option would be to traverse all values.

@lgfa29 lgfa29 added the backport/1.3.x (backport to 1.3.x release line) and backport/1.4.x (backport to 1.4.x release line) labels on Nov 2, 2022
@lgfa29 lgfa29 added this to the 1.4.x milestone Nov 2, 2022
When the client reconnects with the server it synchronizes the state of
its allocations by sending data using the `Node.UpdateAlloc` RPC and
fetching data using the `Node.GetClientAllocs` RPC.

If the data fetch happens before the data write, `unknown` allocations
will still be in this state and would trigger the
`allocRunner.Reconnect` flow.

But when the server `DesiredStatus` for the allocation is `stop` the
client should not reconnect the allocation.
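
A hedged sketch of that client-side rule; the function and type here are illustrative stand-ins, not the actual allocRunner code.

package main

import "fmt"

const (
    AllocDesiredStatusStop   = "stop"
    AllocClientStatusUnknown = "unknown"
)

// Allocation is a minimal stand-in for the server-provided alloc update
// fetched via Node.GetClientAllocs.
type Allocation struct {
    DesiredStatus string
    ClientStatus  string
}

// shouldReconnect only triggers the reconnect flow for an alloc that is
// still "unknown" and that the server still wants running; if the server's
// DesiredStatus is already "stop", the alloc should be shut down instead.
func shouldReconnect(update *Allocation) bool {
    if update.DesiredStatus == AllocDesiredStatusStop {
        return false
    }
    return update.ClientStatus == AllocClientStatusUnknown
}

func main() {
    stopped := &Allocation{DesiredStatus: "stop", ClientStatus: "unknown"}
    fmt.Println(shouldReconnect(stopped)) // false: the stop wins over the reconnect
}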
@lgfa29
Contributor Author

lgfa29 commented Nov 2, 2022

Oops, I found a regression while doing some extra testing: the alloc keeps running when the job is stopped while the client is still disconnected. I'm still investigating it.

More race condition fun: f5ce8a9 😄

This time it's between alloc reads and writes in the client. If the client reads the alloc state before writing it, it will trigger the Reconnect flow even if DesiredStatus: stop. While this was an existing bug, by delaying the alloc status write this PR just made it more likely to happen.

Member

@schmichael schmichael left a comment


In the description:

When a Nomad client misses a heartbeat the leader updates its status to disconnected

should probably read:

When a Nomad client with at least one alloc with max_client_disconnect misses a heartbeat...

Clients without an alloc with max_client_disconnect set will transition to down and allocs to lost. Minor pedantic point, but I suspect we may be linking to this PR in the future to explain this behavior so I think it's worth being precise. 😁

Don't rely on task events since only a limited number of them are kept in state

🎉 I'm really happy to be keeping task events for humans only.

Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely fantastic work from the repro to the description to the code! What a complex situation to debug and fixup.

An e2e test would be nice since there's so much emergent nondeterministic behavior at play. Your unit tests are great but come with some assumptions about the state of the data being tested that may not cover everything that can happen at runtime.

I forget if our Test{Agent,Server} make it possible to write an integration test which at least exercises these changes with a real client.Client.

The obvious problem with e2e testing is how to simulate a disconnected client. Do I have a terrible idea for you! 😅

Schedule count=2 nomad agents via raw exec with the "fake" nomad's datacenter = your-test-name, and then schedule a job with max_client_disconnect set that targets that datacenter.

Grab the node ID of whichever of the 2 your alloc got scheduled on, send it a kill signal, let it restart via the parent Nomad, and then assert the intended behavior.

I think the nested-nomad agent should work, and the unique datacenter trick should prevent any flakiness from other jobs colliding with your careful placement scheme.

Does not block merging this though. Probably best as a followup.

Comment on lines 265 to 268
// Handle the case where the client updates its allocs
// before sending a heartbeat. The alloc will be set to
// running, but the client is still considered to be
// disconnected.
Member


I thought this was prevented by the node readiness check added in Node.UpdateAlloc?

Contributor Author


Yup, the RPC update is enough for this particular issue, but I was wondering if there was another scenario that we don't know about that could trigger this condition (like some data corruption? or bad alloc in state?) and so I kept the reconciler fix as well as a kind of layered defense.

But I can remove this as well, less code to think about? 😅

Member


I got really confused by this while reading the comments for the test cases in TestAllocSet_filterByTainted and TestReconciler_Disconnected_Client, which suggest it's still possible to hit this case. Two years from now someone will read that and wonder if the test is stale. 😀

I wouldn't be opposed to removing this extra code and tests but if we do leave it, I'd recommend making it clear in the docstrings that this is a case that should never happen.

Contributor Author


That's what I meant by the layered approach. The RPC change fixes things externally, but without this change the reconciler can still in theory generate an invalid result given these specific conditions.

I would've liked to be able to fix everything in the scheduler, but I found one case for system jobs that I couldn't and only the RPC ordering fixed it, and I suspect there may be other edge cases like this, so I kept both approaches.

But I agree that a comment stating this should not happen would be good.

Contributor Author

@lgfa29 lgfa29 Nov 4, 2022


I'm sold now 😄

Adding code to the scheduler that we don't expect to hit just adds stuff to an already complex part of the code, so I removed this check and its equivalent for system scheduling. Thanks for the feedback!

I will leave this thread unresolved just in case we come back to this in the future.

Member

@tgross tgross left a comment


I haven't finished reviewing yet but wanted to surface my comments before I wrap up for the day 😀 . Will pick this back up first thing tomorrow.


@lgfa29
Contributor Author

lgfa29 commented Nov 3, 2022

I think I found a bug in the new NeedsToReconnect method that prevents an alloc from disconnecting twice. I will investigate further tomorrow.

Edit: fixed in cb51a28.

12f1ff5#diff-97e6d5511d191b7f15c08c580b602f02f7d22cfc2879f1a7d0a284c206b1c549 prevented the state store from being incorrectly mutated, so the updated alloc needs to be sent to the plan applier to persist the changes. I should've noticed this since I had to do the same thing in the system scheduler 😬

Reconnected allocs have a new AllocState entry that must be persisted by
the plan applier.
The AllocUpdateRequest struct is used in three disjoint use cases:

1. Stripped allocs from clients' Node.UpdateAlloc RPC, using the Allocs
   and WriteRequest fields
2. Raft log message using the Allocs, Evals, and WriteRequest fields
3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields

Adding a new field that would only be used in one of these cases (1) made
things more confusing and error prone. While in theory an
AllocUpdateRequest could send allocations from different nodes, in
practice this never actually happens since only clients call this method
with their own allocations.
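
For reference, an annotated sketch of the struct grouped by those three uses; the field names follow this commit message and the stand-in types exist only to make the sketch self-contained, so the authoritative definition in nomad/structs/structs.go may differ in detail.

package structs

// Stand-in types so the sketch compiles on its own; the real ones are
// defined elsewhere in nomad/structs.
type (
    Allocation     struct{}
    AllocationDiff struct{}
    Evaluation     struct{}
    Job            struct{}
    WriteRequest   struct{}
)

// AllocUpdateRequest, grouped by its three disjoint uses.
type AllocUpdateRequest struct {
    // (1) Stripped allocs sent by clients via Node.UpdateAlloc;
    // (2) also reused in Raft log messages.
    Allocs []*Allocation

    // (2) Raft log messages: evaluations created alongside the update.
    Evals []*Evaluation

    // (3) Plan updates from the plan applier.
    AllocsStopped []*AllocationDiff
    AllocsUpdated []*Allocation
    Job           *Job

    // Write metadata shared by the RPC and Raft cases (1 and 2).
    WriteRequest
}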
Member

@tgross tgross left a comment


LGTM! (Once the docstrings are fixed up as being discussed)

This condition could only be hit if, somehow, the allocation status was
set to "running" while the client was "unknown". This was addressed by
enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC
calls, so this scenario is not expected to happen.

Adding unnecessary code to the scheduler makes it harder to read and
reason about.
@github-actions

github-actions bot commented Mar 5, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2023