Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of services: fix data integrity errors for Nomad native services into release/1.6.x #20591

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #20590 to be assessed for backporting due to the inclusion of the label backport/1.6.x.

The below text is copied from the body of the original PR.


This changeset fixes three potential data integrity issues between allocations and their Nomad native service registrations.

  • When a node is marked down because it missed heartbeats, we remove Vault and Consul tokens (for the pre-Workload Identity workflows) after we've written the node update to Raft. This is unavoidably non-transactional because the Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But we've unnecessarily mirrored this same behavior to deregister Nomad services. This makes it possible for the leader to successfully write the node update to Raft without removing services.

    To address this, move the delete into the same Raft transaction. One minor caveat with this approach is the upgrade path: if the leader is upgraded first and a node is marked down during this window, older followers will have stale information until they are also upgraded. This is unavoidable without requiring the leader to unconditionally make an extra Raft write for every down node until 2 LTS versions after Nomad 1.8.0. This temporary reduction in data integrity for stale reads seems like a reasonable tradeoff.

  • When an allocation is marked client-terminal from the client in UpdateAllocsFromClient, we have an opportunity to ensure data integrity by deregistering services for that allocation.

  • When an allocation is deleted during eval garbage collection, we have an opportunity to ensure data integrity by deregistering services for that allocation. This is a cheap no-op if the allocation has been previously marked client-terminal.

This changeset does not address client-side retries for the originally reported issue, which will be done in a separate PR.

Ref: #16616


Overview of commits

@vercel vercel bot temporarily deployed to Preview – nomad May 15, 2024 15:57 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad-storybook-and-ui May 15, 2024 16:06 Inactive
@tgross tgross merged commit ad7446e into release/1.6.x May 15, 2024
24 of 26 checks passed
@tgross tgross deleted the backport/ensure-nomad-services-deregistered/slightly-content-magpie branch May 15, 2024 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants