Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconciler: 2 phase reconnects and tests #12333

Conversation

DerekStrickland
Copy link
Contributor

@DerekStrickland DerekStrickland commented Mar 21, 2022

Not ready for review

This PR changes the generic scheduler reconcile logic to support zero-downtime reconnects.

The initial implementation for disconnected clients stopped replacement allocations before the reconnecting node synced its alloc state with the server. If the alloc failed while the node was disconnected, this could lead to a scenario where the replacement was stopped, but the reconnecting alloc was not healthy, and the cluster ends up underprovisioned.

This PR changes the logic so that allocs are identified as reconnecting after the reconnecting client sends its alloc updates to the server. To make this work, I made the following changes.

  • Node.UpdateAlloc
    • Looks up the server-side alloc state immediately instead of after checking if the incoming client alloc is terminal.
    • Previously, the incoming alloc update would be skipped if it was in a terminal state on the client.
    • Now it will only be skipped if it is in a terminal state at the client and not unknown at the server.
    • Any incoming alloc updates that are unknown at the sever will now trigger an eval with EvalTriggerReconnect.
  • I added anExpiredfunction to Allocation that uses AllocStates to determine if an unknown allocation has expired.
  • I updated the reconciler to append an AllocState entry when creating the unknown alloc update.
  • I added aReconnectedfunction to Allocation that uses AllocStates plus TaskStates to determine if an allocation has reconnected and whether it expired before reconnecting.
  • I updated the filterByTainted function to:
    • Filter allocs on disconnected nodes into disconnecting, lost, or ignore based on client status and expiration
      state.
    • Filter allocs on reconnecting nodes into reconnecting, lost, or ignore based on their alloc and task states.
  • I updated the filterByResheduable function to also be able to handle disconnecting allocations.
  • I updated the reconciler to:
    • Incorporate the ignore set now returned by filterByTainted into DesiredChanges
    • Consider job version/create_index in multiple areas.
  • I updated the allocWatcher to also trigger when a previous allocation transitions to unknown

@DerekStrickland DerekStrickland added this to the 1.3.0 milestone Mar 21, 2022
@DerekStrickland DerekStrickland self-assigned this Mar 21, 2022
@vercel vercel bot temporarily deployed to Preview – nomad March 22, 2022 09:18 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad March 22, 2022 23:06 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad March 24, 2022 22:27 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad March 25, 2022 22:21 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad March 27, 2022 11:52 Inactive
@DerekStrickland DerekStrickland force-pushed the f-2-phase-reconnect-disconnected-clients branch from 3249c78 to 371e2f9 Compare March 28, 2022 10:22
@vercel vercel bot temporarily deployed to Preview – nomad March 28, 2022 10:22 Inactive
@DerekStrickland DerekStrickland force-pushed the f-2-phase-reconnect-disconnected-clients branch from 371e2f9 to 479d73e Compare March 28, 2022 10:26
@vercel vercel bot temporarily deployed to Preview – nomad March 28, 2022 10:26 Inactive
@DerekStrickland DerekStrickland force-pushed the f-2-phase-reconnect-disconnected-clients branch from 479d73e to a92d25c Compare March 28, 2022 10:44
@vercel vercel bot temporarily deployed to Preview – nomad March 28, 2022 10:44 Inactive
@DerekStrickland DerekStrickland force-pushed the f-2-phase-reconnect-disconnected-clients branch from a92d25c to 08f4037 Compare March 28, 2022 10:48
@vercel vercel bot temporarily deployed to Preview – nomad March 28, 2022 10:48 Inactive
@DerekStrickland DerekStrickland force-pushed the f-2-phase-reconnect-disconnected-clients branch from 08f4037 to a4cbd80 Compare March 28, 2022 17:32
@vercel vercel bot temporarily deployed to Preview – nomad March 28, 2022 17:32 Inactive
@DerekStrickland DerekStrickland force-pushed the f-2-phase-reconnect-disconnected-clients branch from a4cbd80 to 277c36e Compare March 30, 2022 13:57
@vercel vercel bot temporarily deployed to Preview – nomad March 30, 2022 13:57 Inactive
@DerekStrickland DerekStrickland force-pushed the f-2-phase-reconnect-disconnected-clients branch from 277c36e to 568df7c Compare March 30, 2022 14:15
@vercel vercel bot temporarily deployed to Preview – nomad March 30, 2022 14:15 Inactive
@DerekStrickland DerekStrickland changed the title [WIP] reconciler: 2 phase reconnects and tests reconciler: 2 phase reconnects and tests Mar 30, 2022
@DerekStrickland DerekStrickland marked this pull request as ready for review March 30, 2022 14:27
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking good @DerekStrickland. I've left some suggestions on the tests, as making those legible will be important for maintenance. And then you've got a couple of open questions that we should resolve before it's ready to merge.

client/allocwatcher/alloc_watcher.go Outdated Show resolved Hide resolved
client/allocrunner/alloc_runner.go Outdated Show resolved Hide resolved
nomad/structs/structs.go Outdated Show resolved Hide resolved
nomad/structs/structs_test.go Show resolved Hide resolved
nomad/structs/structs_test.go Show resolved Hide resolved
scheduler/reconcile_test.go Outdated Show resolved Hide resolved
scheduler/reconcile_test.go Outdated Show resolved Hide resolved
scheduler/reconcile_test.go Outdated Show resolved Hide resolved
scheduler/reconcile.go Show resolved Hide resolved
scheduler/reconcile.go Show resolved Hide resolved
scheduler/reconcile_util_test.go Show resolved Hide resolved
nomad/structs/structs.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vercel vercel bot temporarily deployed to Preview – nomad March 31, 2022 15:29 Inactive
@DerekStrickland DerekStrickland merged commit a405bfe into f-disconnected-client-allocation-handling Mar 31, 2022
@DerekStrickland DerekStrickland deleted the f-2-phase-reconnect-disconnected-clients branch March 31, 2022 15:32
DerekStrickland added a commit that referenced this pull request Apr 5, 2022
* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.

* node_endpoint: Emit new eval for reconnecting unknown allocs.

* filterByTainted: handle 2 phase commit filtering rules.

* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.

* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.
DerekStrickland added a commit that referenced this pull request Apr 6, 2022
* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.

* node_endpoint: Emit new eval for reconnecting unknown allocs.

* filterByTainted: handle 2 phase commit filtering rules.

* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.

* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants