Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of Consul: add preflight checks for Envoy bootstrap into release/1.8.x #23449

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #23381 to be assessed for backporting due to the inclusion of the label backport/1.8.x.

The below text is copied from the body of the original PR.


Nomad creates Consul ACL tokens and service registrations to support Consul service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a tokens or services that have not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token and that negative entry gets cached, preventing any retries from succeeding.

To workaround this, we'll use a method described by our friends over on consul-k8s where after creating the objects in Consul we try to read them from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error because it's possible that Consul cluster replication is unhealthy at the time we need it, but this should make Envoy bootstrap significantly more robust.

This changset adds preflight checks for the objects we create in Consul:

  • We add a preflight check for ACL tokens after we login via via Workload Identity and in the function we use to derive tokens in the legacy workflow. We do this check early because we also want to use this token for registering group services in the allocrunner hooks.
  • We add a preflight check for services right before we bootstrap Envoy in the taskrunner hook, so that we have time for our service client to batch updates to the local Consul agent in addition to the local agent sync.

We've added the timeouts to be configurable via node metadata rather than the usual static configuration because for most cases, users should not need to touch or even know these values are configurable; the configuration is mostly available for testing.

Fixes: #9307
Fixes: #10451
Fixes: #20516

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138


Notes for reviewers:

  • This is an unfortunately large PR as I had to touch a lot of test code and there are separate code paths for supporting the legacy workflow. To help make this a little more reviewable, I've split this into 2 commits with the code changes and their specific tests, a third commit with a test helper fix, and a final commit with docs and changelog.
  • I've got a follow-up described in NET-10138 to reduce the risk of this kind of bug in the future by expanding our Consul cluster in E2E to be a 3-node cluster.
  • The backport to 1.6.x+ent is not going to be clean, as all this code was heavily reworked for WI and multi-cluster support in 1.7.0. So I'm going to manually backport the relevant sections and open a new PR for that in the ENT repo.

Overview of commits

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-consul-token-read-self-on-login/adversely-awaited-cougar branch from f6a6027 to c6b7c60 Compare June 27, 2024 14:16
@tgross tgross merged commit 09f6a09 into release/1.8.x Jun 27, 2024
21 checks passed
@tgross tgross deleted the backport/b-consul-token-read-self-on-login/adversely-awaited-cougar branch June 27, 2024 14:45
Copy link

github-actions bot commented Jan 2, 2025

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 2, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants