
Add config option to avoid task failure on template failure #13019

Closed
mikenomitch opened this issue May 13, 2022 · 2 comments · Fixed by #13907

Comments

mikenomitch (Contributor) commented May 13, 2022

Proposal

Currently, when a Nomad task uses a template stanza and that template fails to render because it cannot communicate with an external service, the Nomad agent retries several times but eventually fails the underlying task. This makes Nomad tasks brittle and couples task health to the health of external services such as Consul, Vault, or the Nomad server cluster.

For instance, if Vault goes down for 10 minutes while Nomad jobs are using secrets in the template stanza, and those templates only retry for 5 minutes, then all of the Nomad tasks for those jobs will fail. This is particularly bad because the Vault servers may be subject to a thundering herd of Nomad-driven requests once they come back up.

Another example is if there are network issues between Nomad clients and servers while Nomad is using native service discovery. If two "edge" clients temporarily lose connection to the Nomad servers, they won't be able to refresh their nomadService information in templates. This should be okay and the clients should continue to run with the data they had on hand. Instead, these edge workloads will be killed. This makes native service discovery very hard to use for edge workloads.

Instead of failing, the Nomad tasks should log warnings but continue running with stale data. This behavior should be controlled by a new configuration value in the template stanza for each task, and should also be configurable globally.
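As a rough illustration, a per-task option in the jobspec might look something like the sketch below. The option name `on_render_failure` and its values are purely hypothetical placeholders for whatever the final design chooses; the surrounding `template` fields (`source`, `destination`, `change_mode`) are existing Nomad parameters:

```hcl
task "web" {
  driver = "docker"

  template {
    source      = "local/app.conf.tpl"
    destination = "local/app.conf"
    change_mode = "restart"

    # Hypothetical option (not an actual Nomad parameter): on a
    # re-render failure (e.g. Vault, Consul, or the Nomad servers
    # being unreachable), log a warning and keep serving the last
    # successfully rendered data instead of failing the task.
    on_render_failure = "warn"
  }
}
```

A corresponding client-level setting could provide the same knob as a global default, with the per-task stanza overriding it.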

In the 1.3.X cycle, the new behavior should be opt-in, but we should strongly suggest enabling it in the docs & changelog. In 1.4.0, we should switch the default so that the new behavior is on and tasks opt out instead.

Use-cases

  • Nomad cluster resiliency with Vault
  • Nomad cluster resiliency with Consul
  • Nomad native service discovery use on flaky/edge connections
tgross (Member) commented May 16, 2022

The existing template runner separates the "first render" behavior from the "steady state" behavior. It might be worth leaving the first render behavior alone (except for #13020's changes) because there's not yet any stale data to use. Otherwise the newly placed task will just hang until the whole deployment gets marked failed. (The allocrunner doesn't currently have a way of handling that behavior gracefully.)

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 22, 2022