
[Feature] Expose retry behavior for template stanza #3866

Closed
SoMuchToGrok opened this issue Feb 13, 2018 · 4 comments · Fixed by #11606
SoMuchToGrok commented Feb 13, 2018

Nomad version

Nomad v0.7.1

Operating system and Environment details

Ubuntu 16.04.3 LTS

Issue

When a running job encounters a template rendering failure because Vault is inaccessible, the entire job is marked as "dead" and no rescheduling attempts are ever made. When this happens during a "net new" deploy it's not a major problem, since someone will be investigating the issue almost immediately. However, for successfully deployed long-lived jobs that renew their secrets every N hours or days, this behavior becomes especially important.

The desired behavior already exists in consul-template itself (for both consul communication and vault communication). See relevant config:

retry {
    # This enables retries. Retries are enabled by default, so this is
    # redundant.
    enabled = true

    # This specifies the number of attempts to make before giving up. Each
    # attempt adds the exponential backoff sleep time. Setting this to
    # zero will implement an unlimited number of retries.
    attempts = 12

    # This is the base amount of time to sleep between retry attempts. Each
    # retry sleeps for an exponent of 2 longer than this base. For 5 retries,
    # the sleep times would be: 250ms, 500ms, 1s, 2s, then 4s.
    backoff = "250ms"

    # This is the maximum amount of time to sleep between retry attempts.
    # When max_backoff is set to zero, there is no upper limit to the
    # exponential sleep between retry attempts.
    # If max_backoff is set to 10s and backoff is set to 1s, sleep times
    # would be: 1s, 2s, 4s, 8s, 10s, 10s, ...
    max_backoff = "1m"
  }

Ideally, job operators should be able to configure these within the Nomad template stanza. I realize not everyone will want their jobs retrying indefinitely, but some jobs are mission-critical, and requiring manual intervention after a Vault outage can be extremely costly.
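
As a sketch of what this could look like inline in a job file (the retry block inside template is hypothetical here, mirroring consul-template's option names; it is not valid Nomad syntax as of this writing):

template {
  destination = "${NOMAD_SECRETS_DIR}/server.bundle.pem"
  data        = "..."

  # Hypothetical retry block, mirroring consul-template's options.
  # attempts = 0 would mean retry indefinitely instead of failing
  # the allocation after a fixed number of attempts.
  retry {
    enabled     = true
    attempts    = 0
    backoff     = "250ms"
    max_backoff = "1m"
  }
}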

Reproduction steps

  1. Successfully deploy a job to Nomad that retrieves a short-lived secret from Vault via the "embedded consul-template" (example template below).
  2. Make Vault inaccessible.
  3. Wait for the secret-renewal attempt to fail; the allocation transitions to the "Failed" state.
  4. Make Vault accessible again.
  5. The job remains dead and requires manual intervention.

Example template

template {
  destination = "$${NOMAD_SECRETS_DIR}/server.bundle.pem"
  data        = <<EOH
{{ $private_ip := env "NOMAD_IP_https" }}
{{ $ip_sans := printf "ip_sans=%s" $private_ip }}
{{ with secret "pki/us-west/issue/app" "common_name=app.service.consul" "alt_names=go-app.service.dc.consul" $ip_sans "format=pem" }}
{{ .Data.certificate }}
{{ .Data.issuing_ca }}
{{ .Data.private_key }}{{ end }}
EOH
}
SoMuchToGrok changed the title from "[Feature] Expose retry behavior in template stanza when retrieving secrets from Vault" to "[Feature] Expose retry behavior for template stanza" on Feb 13, 2018
preetapan (Contributor) commented

@SoMuchToGrok thanks for reporting this. It sounds like the default retry behavior in consul-template (which gives you 12 attempts and about 6 minutes before the job transitions to failed) was not sufficient for your use case.
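
For reference, a rough worked calculation, assuming the values shown in the config snippet above are the defaults (backoff = 250ms, max_backoff = 1m, attempts = 12): the sleeps double from 250ms (0.25s, 0.5s, 1s, 2s, 4s, 8s, 16s, 32s) and are then capped at 60s for the remaining four attempts, totalling roughly 304s, which lines up with the "about 6 minutes" figure above.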

We could bring just this set of options from consul-template into the Nomad template stanza, but there will always be more knobs you can change in consul-template than Nomad exposes. We are moving towards a plugin architecture that lets us expand to fully support these external tools without having to keep updating our config options for each runtime tool and driver. We will be able to address this then.

dadgar (Contributor) commented Feb 13, 2018

Linking these as they are the same issue but one for Vault, one for Consul: #2623

eigengrau commented

Is this issue just about retrying an allocation after killing it due to an unreachable Vault? We would very much like it if we could configure Nomad to not kill an allocation when re-rendering the template fails (e.g. due to unreachable Vault servers). Is this behavior in scope for this issue? With dynamic secrets, even when the secret expires, the service might degrade gracefully (e.g. continue running without a database connection). Static secrets read from Vault would continue to be valid in most cases until the connection to Vault is re-established.
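
As context for how the fix referenced in the issue header (#11606) eventually surfaced these knobs: it added retry configuration at the client agent level rather than per-job. A minimal sketch, assuming the client-side template block documented in later Nomad releases (attribute names should be verified against current docs):

client {
  template {
    # Retries for Vault reads performed by the template runner.
    # attempts = 0 retries indefinitely rather than failing the
    # allocation after a fixed number of attempts.
    vault_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}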

github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 12, 2022