
Add config option to avoid task failure on template failure #13019

Closed
mikenomitch opened this issue May 13, 2022 · 2 comments · Fixed by #13907

Comments

mikenomitch (Contributor) commented May 13, 2022

Proposal

Currently, when a Nomad task uses a template stanza and that template fails to render because it cannot communicate with an external service, the Nomad agent retries several times but eventually fails the underlying task. This makes Nomad tasks brittle and couples task health to the health of external services such as Consul, Vault, or the Nomad server cluster.

For instance, if Vault goes down for 10 minutes while Nomad jobs are using secrets in the template stanza, and those templates only retry for 5 minutes, then all of the Nomad tasks for those jobs will fail. This is particularly bad because the Vault servers may be subject to a thundering herd of Nomad-driven requests once they come back up.

Another example is if there are network issues between Nomad clients and servers while Nomad is using native service discovery. If two "edge" clients temporarily lose connection to the Nomad servers, they won't be able to refresh their nomadService information in templates. This should be okay and the clients should continue to run with the data they had on hand. Instead, these edge workloads will be killed. This makes native service discovery very hard to use for edge workloads.

Instead of failing, the Nomad tasks should log warnings but continue running with stale data. This behavior should be controlled by a new configuration value in the template stanza for each task, and should also be configurable globally.
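As a rough illustration, a per-task option in the jobspec might look something like the sketch below. The option name `on_render_failure` and its values are purely hypothetical placeholders for whatever the final design chooses; the surrounding `template` fields (`source`, `destination`, `change_mode`) are existing Nomad parameters:

```hcl
task "web" {
  driver = "docker"

  template {
    source      = "local/app.conf.tpl"
    destination = "local/app.conf"
    change_mode = "restart"

    # Hypothetical option (not an actual Nomad parameter): on a
    # re-render failure (e.g. Vault, Consul, or the Nomad servers
    # being unreachable), log a warning and keep serving the last
    # successfully rendered data instead of failing the task.
    on_render_failure = "warn"
  }
}
```

A corresponding client-level setting could provide the same knob as a global default, with the per-task stanza overriding it.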

In the 1.3.X cycle, the new behavior should be opt-in, but we should strongly suggest enabling it in the docs & changelog. In 1.4.0, we should switch the default so that the new behavior is on and tasks opt out instead.

Use-cases

  • Nomad cluster resiliency with Vault
  • Nomad cluster resiliency with Consul
  • Nomad native service discovery use on flaky/edge connections
tgross (Member) commented May 16, 2022

The existing template runner separates the "first render" behavior from the "steady state" behavior. It might be worth leaving the first render behavior alone (except for #13020's changes) because there's not yet any stale data to use. Otherwise the newly placed task will just hang until the whole deployment gets marked failed. (The allocrunner doesn't currently have a way of handling that behavior gracefully.)

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 22, 2022