
Update alloc after reconnect and enforce client heartbeat order #15068

Merged
merged 12 commits on Nov 4, 2022

Conversation

lgfa29
Contributor

@lgfa29 lgfa29 commented Oct 27, 2022

When a Nomad client with at least one alloc with max_client_disconnect misses a heartbeat, the leader updates its status to disconnected, which results in evals for all jobs that have an allocation in that client. Each of these evals is reconciled by the scheduler, which updates the alloc ClientStatus to unknown and appends a new state value to indicate the allocation is considered disconnected.

When the client reconnects again, its status is set back to ready since the heartbeat succeeds. The ClientStatus of the allocation is still unknown, as it may have failed while the client was disconnected. This status will only be updated once the client calls Node.UpdateAlloc against a server, which overwrites the unknown ClientStatus value set in the server with the correct status.

From the leader perspective this is what happens:

[diagram: node-disconnect.drawio]

The problem described in #14925 happens because any eval that runs after this disconnect/reconnect flow (such as a job update) is essentially indistinguishable from the last eval in the diagram above: the node is ready and the alloc ClientStatus is running, but these two scenarios need to be handled differently, so we need to store something in state to be able to tell them apart.

The current implementation uses the presence of a TaskClientReconnected task event to detect if an alloc needs to reconnect, but this event is still present even after the alloc reconnects, so the scheduler always considers the alloc as still reconnecting.

While testing the fix for #14925 I would often hit #12680, which prevented the issue from being triggered since the extra alloc would not be considered as "reconnecting", so I included a fix for #12680 in this PR as well since their root cause is similar.

Clients use three main RPC methods to communicate their state with servers:

  • Node.GetAllocs reads allocation data from the server and writes it to the client.
  • Node.UpdateAlloc reads allocation data from the client and writes it to the server.
  • Node.UpdateStatus writes the client status to the server and is used as the heartbeat mechanism.

These RPC methods are called periodically by the client, and independently of each other. The usual mental model is that clients heartbeat first with Node.UpdateStatus and then update their allocation data with Node.UpdateAlloc, but this is not always true. If these two operations are reversed, the diagram above looks like this:

[diagram: node-disconnect.drawio (2)]

The state at the second-to-last eval (Node updates allocs) looks just like the state when the node missed its heartbeat (Node misses heartbeat): Node.Status: disconnected and Alloc.ClientStatus: running.
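
To make the lack of ordering concrete, here is a minimal sketch in Go of two independent periodic loops, either of which can reach the servers first after a reconnect. This is not Nomad's actual client code; the intervals and structure are assumptions for illustration only.

package main

import "time"

func main() {
    // Each RPC runs on its own ticker in its own goroutine, so there is no
    // guarantee whether the heartbeat or the alloc update lands first.
    heartbeat := time.NewTicker(10 * time.Second)
    allocSync := time.NewTicker(5 * time.Second)
    defer heartbeat.Stop()
    defer allocSync.Stop()

    go func() {
        for range heartbeat.C {
            // Node.UpdateStatus: report the node status (the heartbeat).
        }
    }()
    go func() {
        for range allocSync.C {
            // Node.UpdateAlloc: push allocation client statuses.
        }
    }()

    select {} // block forever; both loops keep running independently
}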

This PR addresses these problems in the following ways:

  • Don't rely on task events since only a limited number of them are kept in state
  • Record when an allocation reconnects as an AllocState entry to keep it consistent with how disconnects are recorded
  • Check if the alloc needs to reconnect by verifying the last AllocState entry; if it's unknown, the alloc reconnect has not been processed yet
  • Enforce that nodes register and heartbeat successfully before being able to update their allocs. This makes the reconnect flow easier to reason about and more predictable.

The commits are split into different work chunks.

This is an alternative fix for #14948.

Closes #14925
Closes #12680

@lgfa29 lgfa29 changed the title from "wip" to "wip: update alloc after reconnect" on Oct 27, 2022
@lgfa29 lgfa29 changed the title from "wip: update alloc after reconnect" to "Update alloc after reconnect and enforce client heartbeat order" on Nov 2, 2022
When an allocation reconnects to a cluster the scheduler needs to run
special logic to handle the reconnection, check if a replacement was
created, and stop one of them.

If the allocation kept running while the node was disconnected, it will
be reconnected with `ClientStatus: running` and the node will have
`Status: ready`. This combination is the same as the normal steady state
of an allocation, where everything is running as expected.

In order to differentiate between the two states (an allocation that is
reconnecting and one that is just running) the scheduler needs an extra
piece of state.

The current implementation uses the presence of a
`TaskClientReconnected` task event to detect when the allocation has
reconnected and thus must go through the reconnection process. But this
event remains even after the allocation is reconnected, causing all
future evals to consider the allocation as still reconnecting.

This commit changes the reconnect logic to use an `AllocState` to
register when the allocation was reconnected. This provides the
following benefits:

  - Only a limited number of task states are kept, and they are used for
    many other events. It's possible that, upon reconnecting, several
    actions are triggered that could cause the `TaskClientReconnected`
    event to be dropped.
  - Task events are set by clients and so their timestamps are subject
    to time skew from servers. This prevents using time to determine if
    an allocation reconnected after a disconnect event.
  - Disconnect events are already stored as `AllocState` and so storing
    reconnects there as well makes it the only source of information
    required.

With this change, the reconnection logic is only triggered if the last
`AllocState` is a disconnect event, meaning that the allocation has not
been reconnected yet. After the reconnection is handled, the new
`ClientStatus` is stored in `AllocState`, allowing future evals to skip
the reconnection logic.
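
As a rough illustration of that check, here is a hedged sketch: the type and constant names mirror nomad/structs but are re-declared locally as stand-ins, so this is not the exact upstream code.

package main

import "fmt"

// Minimal local stand-ins for the nomad/structs types referenced above;
// the real definitions live in nomad/structs/structs.go.
type AllocStateField uint8

const AllocStateFieldClientStatus AllocStateField = 0

const AllocClientStatusUnknown = "unknown"

type AllocState struct {
    Field AllocStateField
    Value string
}

// needsToReconnect reports whether the most recent client-status entry is
// still "unknown", meaning a disconnect was recorded and no reconnect has
// been processed since. Entries are appended chronologically (in Nomad via
// Allocation.AppendState), so the scan walks backwards.
func needsToReconnect(states []*AllocState) bool {
    for i := len(states) - 1; i >= 0; i-- {
        s := states[i]
        if s.Field != AllocStateFieldClientStatus {
            continue
        }
        return s.Value == AllocClientStatusUnknown
    }
    return false
}

func main() {
    states := []*AllocState{
        {Field: AllocStateFieldClientStatus, Value: AllocClientStatusUnknown},
    }
    fmt.Println(needsToReconnect(states)) // true: the reconnect has not been processed yet
}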
When a client reconnects it makes two independent RPC calls:

  - `Node.UpdateStatus` to heartbeat and set its status as `ready`.
  - `Node.UpdateAlloc` to update the status of its allocations.

These two calls can happen in any order, and if the allocations are
updated before a heartbeat the state looks the same as a node being
disconnected: the node status will still be `disconnected` while the
allocation `ClientStatus` is set to `running`.

The current implementation did not handle this order of events properly,
and the scheduler would create an unnecessary placement since it
considered the allocation was being disconnected. This extra allocation
would then be quickly stopped by the heartbeat eval.

This commit adds a new code path to handle this order of events. If the
node is `disconnected` and the allocation `ClientStatus` is `running`
the scheduler will check if the allocation is actually reconnecting
using its `AllocState` events.
Clients interact with servers using three main RPC methods:

  - `Node.GetAllocs` reads allocation data from the server and writes it
    to the client.
  - `Node.UpdateAlloc` reads allocation data from the client and writes
    it to the server.
  - `Node.UpdateStatus` writes the client status to the server and is
    used as the heartbeat mechanism.

These three methods are called periodically by the clients, and
independently of each other, meaning that no assumptions can be made
about their ordering.

This can generate scenarios that are hard to reason about and to code
for. For example, when a client misses too many heartbeats it will be
considered `down` or `disconnected` and the allocations it was running
are set to `lost` or `unknown`.

When connectivity to the rest of the cluster is restored, the natural
mental model is to think that the client will heartbeat first and then
update the status of its allocations on the servers.

But since there's no inherent order in these calls, the reverse is just as
possible: the client updates the alloc status and then heartbeats. This
results in a state where allocs are, for example, `running` while the
client is still `disconnected`.

This commit adds a new verification to the `Node.UpdateAlloc` method to
reject updates from nodes that are not `ready`, forcing clients to
heartbeat first. Since this check is done server-side there is no need
to coordinate operations client-side: clients can continue sending these
requests independently, and the alloc update will succeed after the heartbeat
is done.
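
A minimal sketch of that server-side guard, with an illustrative helper name; the real check lives in Node.UpdateAlloc in nomad/node_endpoint.go and may be structured differently.

package main

import (
    "errors"
    "fmt"
)

const NodeStatusReady = "ready"

// Node is a minimal stand-in for structs.Node.
type Node struct {
    ID     string
    Status string
}

// checkNodeCanUpdateAllocs rejects alloc updates from nodes that have not
// registered and heartbeated into the "ready" status, forcing the heartbeat
// to be processed before any alloc update is accepted.
func checkNodeCanUpdateAllocs(node *Node) error {
    if node == nil {
        return errors.New("node not found")
    }
    if node.Status != NodeStatusReady {
        return fmt.Errorf("node %s has status %q, must be %q to update allocs",
            node.ID, node.Status, NodeStatusReady)
    }
    return nil
}

func main() {
    n := &Node{ID: "node-1", Status: "disconnected"}
    fmt.Println(checkNodeCanUpdateAllocs(n)) // rejected until the node heartbeats
}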
Comment on lines +10405 to -10410
for i := len(a.AllocStates) - 1; i >= 0; i-- {
s := a.AllocStates[i]
if s.Field != AllocStateFieldClientStatus {
continue
}
}
Contributor Author


This assumes that states are ordered chronologically, which I think is fine since entries are added using alloc.AppendState? Another option would be to traverse all values.

@lgfa29 lgfa29 added the backport/1.3.x (backport to 1.3.x release line) and backport/1.4.x (backport to 1.4.x release line) labels on Nov 2, 2022
@lgfa29 lgfa29 added this to the 1.4.x milestone Nov 2, 2022
When the client reconnects with the server it synchronizes the state of
its allocations by sending data using the `Node.UpdateAlloc` RPC and
fetching data using the `Node.GetClientAllocs` RPC.

If the data fetch happens before the data write, `unknown` allocations
will still be in this state and would trigger the
`allocRunner.Reconnect` flow.

But when the server `DesiredStatus` for the allocation is `stop` the
client should not reconnect the allocation.
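
A hedged sketch of that client-side rule; the function and type here are illustrative stand-ins, not the actual allocRunner code.

package main

import "fmt"

const (
    AllocDesiredStatusStop   = "stop"
    AllocClientStatusUnknown = "unknown"
)

// Allocation is a minimal stand-in for the server-provided alloc update
// fetched via Node.GetClientAllocs.
type Allocation struct {
    DesiredStatus string
    ClientStatus  string
}

// shouldReconnect only triggers the reconnect flow for an alloc that is
// still "unknown" and that the server still wants running; if the server's
// DesiredStatus is already "stop", the alloc should be shut down instead.
func shouldReconnect(update *Allocation) bool {
    if update.DesiredStatus == AllocDesiredStatusStop {
        return false
    }
    return update.ClientStatus == AllocClientStatusUnknown
}

func main() {
    stopped := &Allocation{DesiredStatus: "stop", ClientStatus: "unknown"}
    fmt.Println(shouldReconnect(stopped)) // false: the stop wins over the reconnect
}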
@lgfa29
Contributor Author

lgfa29 commented Nov 2, 2022

Oops, I found a regression while doing some extra testing: the alloc keeps running when the job is stopped while the client is still disconnected. I'm still investigating it.

More race condition fun: f5ce8a9 😄

This time it's between alloc reads and writes in the client. If the client reads the alloc state before writing it, it will trigger the Reconnect flow even if DesiredStatus: stop. While this was an existing bug, by delaying the alloc status write this PR just made it more likely to happen.

Member

@schmichael schmichael left a comment


In the description:

When a Nomad client misses a heartbeat the leader updates its status to disconnected

should probably read:

When a Nomad client with at least one alloc with max_client_disconnect misses a heartbeat...

Clients without an alloc with max_client_disconnect set will transition to down and allocs to lost. Minor pedantic point, but I suspect we may be linking to this PR in the future to explain this behavior so I think it's worth being precise. 😁

Don't rely on task events since only a limited number of them are kept in state

🎉 I'm really happy to be keeping task events for humans only.

Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely fantastic work from the repro to the description to the code! What a complex situation to debug and fixup.

An e2e test would be nice since there's so much emergent nondeterministic behavior at play. Your unit tests are great but come with some assumptions about the state of the data being tested that may not cover everything that can happen at runtime.

I forget if our Test{Agent,Server} make it possible to write an integration test which at least exercises these changes with a real client.Client.

The obvious problem with e2e testing is how to simulate a disconnected client. Do I have a terrible idea for you! 😅

Schedule count=2 nomad agents via raw exec with the "fake" nomad's datacenter = your-test-name, and then schedule a job with max_client_disconnect set that targets that datacenter.

Grab the node ID of whichever of the 2 your alloc got scheduled on, send it a kill signal, let it restart via the parent Nomad, and then assert the intended behavior.

I think the nested-nomad agent should work, and the unique datacenter trick should prevent any flakiness from other jobs colliding with your careful placement scheme.

Does not block merging this though. Probably best as a followup.

Comment on lines 265 to 268
// Handle the case where the client updates its allocs
// before sending a heartbeat. The alloc will be set to
// running, but the client is still considered to be
// disconnected.
Member


I thought this was prevented by the node readiness check added in Node.UpdateAlloc?

Contributor Author


Yup, the RPC update is enough for this particular issue, but I was wondering if there was another scenario that we don't know about that could trigger this condition (like some data corruption? or bad alloc in state?) and so I kept the reconciler fix as well as a kind of layered defense.

But I can remove this as well, less code to think about? 😅

Member


I got really confused by this while reading the comments for the test cases in TestAllocSet_filterByTainted and TestReconciler_Disconnected_Client, which suggest it's still possible to hit this case. Two years from now someone will read that and wonder if the test is stale. 😀

I wouldn't be opposed to removing this extra code and tests but if we do leave it, I'd recommend making it clear in the docstrings that this is a case that should never happen.

Contributor Author


That's what I meant by the layered approach. The RPC change fixes things externally, but without this change the reconciler can still in theory generate an invalid result given these specific conditions.

I would've liked to be able to fix everything in the scheduler, but I found one case for system jobs that I couldn't and only the RPC ordering fixed it, and I suspect there may be other edge cases like this, so I kept both approaches.

But I agree that a comment stating this should not happen would be good.

Contributor Author

@lgfa29 lgfa29 Nov 4, 2022


I'm sold now 😄

Adding code to the scheduler that we don't expect to hit just adds stuff to an already complex part of the code, so I removed this check and its equivalent for system scheduling. Thanks for the feedback!

I will leave this thread unresolved just in case we come back to this in the future.

Member

@tgross tgross left a comment


I haven't finished reviewing yet but wanted to surface my comments before I wrap up for the day 😀 . Will pick this back up first thing tomorrow.


@lgfa29
Contributor Author

lgfa29 commented Nov 3, 2022

I think I found a bug in the new NeedsToReconnect method that prevents an alloc from disconnecting twice. I will investigate further tomorrow.

Edit: fixed in cb51a28.

12f1ff5#diff-97e6d5511d191b7f15c08c580b602f02f7d22cfc2879f1a7d0a284c206b1c549 prevented the state store from being incorrectly mutated, so the updated alloc needs to be sent to the plan applier to persist the changes. I should've noticed this since I had to do the same thing in the system scheduler 😬

Reconnected allocs have a new AllocState entry that must be persisted by
the plan applier.
The AllocUpdateRequest struct is used in three disjoint use cases:

1. Stripped allocs from clients' Node.UpdateAlloc RPC, using the Allocs
   and WriteRequest fields
2. Raft log message using the Allocs, Evals, and WriteRequest fields
3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields

Adding a new field that would only be used in one of these cases (1) made
things more confusing and error prone. While in theory an
AllocUpdateRequest could send allocations from different nodes, in
practice this never actually happens since only clients call this method
with their own allocations.
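
For reference, an annotated sketch of the struct grouped by those three uses; the field names follow this commit message and the stand-in types exist only to make the sketch self-contained, so the authoritative definition in nomad/structs/structs.go may differ in detail.

package structs

// Stand-in types so the sketch compiles on its own; the real ones are
// defined elsewhere in nomad/structs.
type (
    Allocation     struct{}
    AllocationDiff struct{}
    Evaluation     struct{}
    Job            struct{}
    WriteRequest   struct{}
)

// AllocUpdateRequest, grouped by its three disjoint uses.
type AllocUpdateRequest struct {
    // (1) Stripped allocs sent by clients via Node.UpdateAlloc;
    // (2) also reused in Raft log messages.
    Allocs []*Allocation

    // (2) Raft log messages: evaluations created alongside the update.
    Evals []*Evaluation

    // (3) Plan updates from the plan applier.
    AllocsStopped []*AllocationDiff
    AllocsUpdated []*Allocation
    Job           *Job

    // Write metadata shared by the RPC and Raft cases (1 and 2).
    WriteRequest
}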
Member

@tgross tgross left a comment


LGTM! (Once the docstrings are fixed up as being discussed)

This condition could only be hit if, somehow, the allocation status was
set to "running" while the client was "unknown". This was addressed by
enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC
calls, so this scenario is not expected to happen.

Adding unnecessary code to the scheduler makes it harder to read and
reason about.
@github-actions

github-actions bot commented Mar 5, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2023