
core: enforce strict steps for clients reconnect #15808

Merged · 13 commits · Jan 25, 2023
Conversation

lgfa29
Contributor

@lgfa29 lgfa29 commented Jan 17, 2023

When a Nomad client that is running an allocation with `max_client_disconnect` set misses a heartbeat, the Nomad server will update its status to `disconnected`.

Upon reconnecting, the client will make three main RPC calls:

  • `Node.UpdateStatus` is used to set the client status to `ready`.
  • `Node.UpdateAlloc` is used to update the client-side information about
    allocations, such as their `ClientStatus`, task states, etc.
  • `Node.Register` is used to upsert the entire node information,
    including its status.

These calls are made concurrently and also run in parallel with the scheduler. Depending on the order in which they run, the scheduler may end up with incomplete data when reconciling allocations.

#15068 already required clients to heartbeat before updating their allocation data, but some scenarios can still produce wrong results.

For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources.

When this client comes back the order of events may be:

  1. Client calls `Node.UpdateStatus` and is now `ready`.
  2. Scheduler reconciles allocations and places the replacement alloc on
     the client. The client is now assigned two allocations: the original
     alloc that is still `unknown` and the replacement that is `pending`.
  3. Client calls `Node.UpdateAlloc` and updates the original alloc to
     `running`.
  4. Scheduler notices too many allocs and stops the replacement.

[Diagram: node-disconnect flow (drawio)]

This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled.

To avoid problems like this, clients must update _all_ of their relevant information before they can be considered `ready` and available for scheduling.

To achieve this goal, the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting:

  • `Node.Register` no longer sets the client status.
  • `Node.UpdateStatus` sets the reconnecting client to the `initializing`
    status until it successfully calls `Node.UpdateAlloc`.

[Diagram: node-disconnect flow with strict reconnect steps (drawio)]

These changes are done server-side to avoid the need for additional coordination between clients and servers. Clients are kept oblivious to these changes and will keep making these calls as they normally would.

The verification of whether allocations have been updated is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.

Closes #15483

@lgfa29 lgfa29 changed the title [wip] core: enforce strict steps for clients reconnect Jan 19, 2023
@lgfa29 lgfa29 added backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line labels Jan 19, 2023
@lgfa29 lgfa29 added this to the 1.4.4 milestone Jan 19, 2023
@lgfa29 lgfa29 marked this pull request as ready for review January 19, 2023 01:41
Comment on lines 2584 to 2746
require.ErrorContains(t, err, "not ready")
require.ErrorContains(t, err, "not allow to update allocs")
Comment on lines 2093 to 2095
// LastMissedHeartbeatIndex stores the Raft index when the node
// last missed a heartbeat.
LastMissedHeartbeatIndex uint64
Contributor Author

@lgfa29 lgfa29 Jan 19, 2023


I'm not sure how accurate this name is. It's more like the index when the node last transitioned to an unresponsive status, but I couldn't think of a good name for this 😅

Member


Yeah, this name really implies to me that it's going to keep getting updated, whereas it's the index at which the node became unresponsive. It might be nice if we could clear the value once we're certain the node is live again; that'd make things a little less confusing at the cost of a little extra logic to check for 0 in the UpdateStatus RPC.

Contributor Author

Ah, good point. Resetting it to zero may help indicate that this field is sort of edge-triggered. I pushed a commit to do just that. Thanks!
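The reset discussed above could look roughly like this (a sketch under the same assumptions as the PR description; `applyAllocUpdate` and the struct shape are illustrative, not the actual implementation):

```go
package main

import "fmt"

// node mirrors the two per-client Raft indexes discussed in this thread.
type node struct {
	LastMissedHeartbeatIndex uint64
	LastAllocUpdateIndex     uint64
}

// applyAllocUpdate records the Raft index of a Node.UpdateAlloc and, per
// the suggestion above, clears the missed-heartbeat marker once the node
// has caught up, so the field behaves as edge-triggered.
func applyAllocUpdate(n *node, raftIndex uint64) {
	n.LastAllocUpdateIndex = raftIndex
	if raftIndex >= n.LastMissedHeartbeatIndex {
		n.LastMissedHeartbeatIndex = 0 // node is live again
	}
}

func main() {
	n := node{LastMissedHeartbeatIndex: 100}
	applyAllocUpdate(&n, 150)
	fmt.Println(n.LastMissedHeartbeatIndex) // 0
}
```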

Member

@tgross tgross left a comment

This is looking really good @lgfa29. I think you boiled down the complex problem into a really nice targeted change here. I've mostly left suggestions but also one potentially nasty bug.

Review threads on nomad/node_endpoint.go, .changelog/15808.txt, and nomad/node_endpoint_test.go (resolved).
lgfa29 and others added 2 commits January 19, 2023 19:05
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Update allocs and the LastAllocUpdateIndex in the same Raft transaction
to avoid data inconsistency in case the UpdateAlloc request fails
midway.
Member

@tgross tgross left a comment

This looks great @lgfa29!

Review threads on nomad/node_endpoint.go and nomad/state/state_store.go (resolved).
Successfully merging this pull request may close these issues.

New behavior when max_client_disconnect is used and the worker node moves from disconnected to ready