Improvements to Template + Vault during Nomad Client restarts #13313

Open
chuckyz opened this issue Jun 9, 2022 · 4 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/template · type/enhancement

Comments

@chuckyz
Contributor

chuckyz commented Jun 9, 2022

First off, thank you so much for the template improvements in 1.2.4!!

We’ve implemented these in our testing environment and I’d like to make a further improvement proposal. Today, when our config management (Chef) runs, we hard-restart Nomad after each run. This has served us pretty well so far, but it has exposed a flaw in Nomad’s template system, especially when combined with these improvements.

I recently simulated a Vault failure (overrode DNS in /etc/resolv.conf to point at 127.0.0.1), and everything behaved exactly as expected, right up until the client daemon restarted.

Once the client daemon restarted, the following messages started appearing:

    Jun 02 11:59:59 foo-host nomad[3704847]:     2022-06-02T11:59:59.070-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:14.070605338 -0500 CDT m=+142.893296741"
    Jun 02 11:59:59 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:14.070605338 -0500 CDT m=+142.893296741"
    Jun 02 12:00:02 foo-host nomad[3704847]:     2022-06-02T12:00:02.482-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:17.482024359 -0500 CDT m=+146.304715757"
    Jun 02 12:00:02 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:17.482024359 -0500 CDT m=+146.304715757"
    Jun 02 12:00:17 foo-host nomad[3704847]:     2022-06-02T12:00:17.529-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:32.529657848 -0500 CDT m=+161.352349245"
    Jun 02 12:00:17 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:32.529657848 -0500 CDT m=+161.352349245"
    Jun 02 12:00:20 foo-host nomad[3704847]:     2022-06-02T12:00:20.937-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:35.937767731 -0500 CDT m=+164.760459129"
    Jun 02 12:00:20 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:35.937767731 -0500 CDT m=+164.760459129"

You can see from the timestamps here that vault_retry appears to be ignored. I believe this is acceptable (even desirable), since this is the renewal of a lease that happens outside the template section, purely within the client's Vault integration.

This did not cause the allocation to fail, but rather put it into a state I can’t really explain. The container was running happily and serving traffic, but from the control plane it was completely broken: the CPU stats were unreported, and it was as if the allocation existed but was ‘detached’, for lack of a better term.

When Nomad was restarted a second time with the allocation in this state, it marked the allocation as failed and removed it from the node. I don’t think this is wrong behavior, but it is undesirable for our use cases.

Proposal

This leads to the following asks:

  • Can we track Vault token state across daemon restarts?
  • Can we track Template state across daemon restarts, including current retry times?

Note: One extremely explicit call-out here: I do not expect things to survive host restarts or things like Docker daemon restarts. If a host restarts or all containers stop, then all bets are off.

Use-cases

The purpose of these asks is to allow allocations to ‘survive’ upstream problems and Nomad daemon restarts.

Attempted Solutions

Modifying all the *_retry settings.
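
For context, this is roughly the shape of the client agent configuration we were tuning. This is only a sketch, assuming the client-side template block added in Nomad 1.2.4; the values are illustrative, not a recommendation:

    # Nomad client agent configuration (sketch, illustrative values only).
    client {
      template {
        # Retry behavior for the template runner's Vault requests.
        vault_retry {
          attempts    = 12      # 0 means retry an unlimited number of times
          backoff     = "250ms"
          max_backoff = "1m"
        }

        # Retry behavior for the template runner's Consul requests.
        consul_retry {
          attempts    = 12
          backoff     = "250ms"
          max_backoff = "1m"
        }
      }
    }

As noted above, these settings govern the template runner's retries; the token-renewal errors in the logs appear to be governed separately.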

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 10, 2022
@tgross
Member

tgross commented Jun 10, 2022

Hi @chuckyz!

This did not cause the allocation to fail, but rather put it into a state I can’t really explain. The container was running happily and serving traffic, but from the control plane it was completely broken: the CPU stats were unreported, and it was as if the allocation existed but was ‘detached’, for lack of a better term.

I suspect that what we're seeing here is the task failing to restore: the client's task runner hasn't successfully reattached to the task. Did this state continue after Vault connectivity was restored?

As for the template runner persisting state, this is all great and aligned with some ideas we've been discussing.

The tricky thing with templating is that the template runner runs in-process with the Nomad client, and we're currently using consul-template as though it were a library. This was expedient to implement because we get all the CT features "for free", but it makes it architecturally challenging to avoid security issues (e.g. #9129) and problems around restarting clients like you've described here (e.g. #9636). So we're planning on moving the template rendering (and artifact fetching) out into its own containerized process. See #12301. This would let us entirely avoid worrying about templates when the client restarts.

@tgross tgross added hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Jun 10, 2022
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jun 10, 2022
@chuckyz
Contributor Author

chuckyz commented Jun 14, 2022

Did this state continue after Vault connectivity was restored?

Let me re-test. I think this might be the core of that particular angle.

Looking at #12301, this would run consul-template and go-getter the same way as Envoy, where it's run inside an allocation in a bridge-style mode, yes?

Thinking about that, I think we'd still have the issue of a valid Vault token but a failing Vault fingerprint, so I'd really like some kind of knob exposed that says 'I don't care that Vault is down and the fingerprint is failing; just keep retrying forever and leave the alloc in the state it's in now.' Ideally, if vault_retry has attempts=0, it could be short-circuited to that behavior.

@tgross
Member

tgross commented Jun 15, 2022

Looking at #12301, this would run consul-template and go-getter the same way as Envoy, where it's run inside an allocation in a bridge-style mode, yes?

Yes, although probably not in the same network namespace as the rest of the allocation. The nitty-gritty details still need to be worked out.

Thinking about that, I think we'd still have the issue of a valid Vault token but a failing Vault fingerprint, so I'd really like some kind of knob exposed that says 'I don't care that Vault is down and the fingerprint is failing; just keep retrying forever and leave the alloc in the state it's in now.' Ideally, if vault_retry has attempts=0, it could be short-circuited to that behavior.

Setting vault_retry attempts=0 already retries an unlimited number of times. Where the containerization would help is that, in order to run in its own process, the template/artifact container would need to have its own Vault/Consul API client. That API client can continue to run, retrying an unlimited number of times, unaffected when the client agent restarts.
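
For reference, a minimal sketch of expressing that in the client agent configuration today, again assuming the 1.2.4+ client-side template block (illustrative only):

    client {
      template {
        vault_retry {
          attempts = 0   # 0 = retry indefinitely rather than disabling retries
        }
      }
    }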

@chuckyz
Contributor Author

chuckyz commented Jun 22, 2022

That API client can continue to run, retrying unlimited times, and be unaffected when the client agent restarts.

perfect!

@DerekStrickland DerekStrickland self-assigned this Jun 30, 2022
@DerekStrickland DerekStrickland moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Aug 3, 2022
@DerekStrickland DerekStrickland moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage Aug 3, 2022
@DerekStrickland DerekStrickland removed their assignment Aug 3, 2022
@mikenomitch mikenomitch removed the hcc/cst Admin - internal label Jan 19, 2023