Nomad jobs should stay up when vault outage occurs #11209

Closed
mbrezovsky opened this issue Sep 20, 2021 · 18 comments · Fixed by #11606

Comments

@mbrezovsky

Proposal

This issue concerns the Vault integration. Running Nomad jobs should stay alive as long as possible during a Vault outage.

Use-cases

We run a Nomad cluster and Vault in HA. Although the clusters are scaled to tolerate minor server outages, some failures are unpredictable. After network issues on the cloud provider's side, our Vault cluster was out of service.
Nomad first received 503 responses from Vault, and shortly afterwards it restarted all jobs that use a template stanza. All of those jobs stayed in the pending state until Vault came back online. This happened regardless of the change_mode set in the template stanza.

Attempted Solutions

Nomad could switch to an emergency mode when a Vault outage is detected. In that mode, the usual processes (token TTL renewal, etc.) could be suspended and re-enabled once Vault becomes available again.

@DerekStrickland DerekStrickland self-assigned this Sep 21, 2021
@DerekStrickland DerekStrickland added this to Needs Triage in Nomad - Community Issues Triage via automation Sep 21, 2021
@DerekStrickland DerekStrickland moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Sep 21, 2021
@DerekStrickland
Contributor

Hi @mbrezovsky!

Thanks for using Nomad, and thank you for filing this issue.

I'll do some research and see if there is an existing solution for this use case. In the meantime, do you think you could provide both your nomad and vault configuration files for both servers and agents after removing secrets? It would be really helpful in reproducing your exact use case, and then troubleshooting from there.

@mbrezovsky
Author

Hi @DerekStrickland

Of course, here are the configuration files.
Nomad server

bind_addr = "{{ GetPrivateIP }}"

addresses = {
  http = "0.0.0.0"
}

datacenter = "dc"

data_dir = "/mnt/nomad"

log_level     = "INFO"
enable_syslog = false
enable_debug  = false

leave_on_terminate = true

server {
  enabled = true

  encrypt          = "my-key"
  bootstrap_expect = 3

  rejoin_after_leave = true
}

consul {
  address = "127.0.0.1:8500"

  auto_advertise       = true
  server_service_name  = "nomad"
  server_auto_join     = true
  checks_use_advertise = true
}

vault {
  enabled     = true
  address     = "http://10.0.0.62:8200"
  task_token_ttl = "1h"
  create_from_role = "nomad-cluster"
  token       = "my-token"
}

disable_anonymous_signature = true
disable_update_check        = true

Nomad agent

enable_debug  = false

leave_on_terminate = true

client {
  enabled = true

  servers = [
    "10.0.0.22","10.0.0.23","10.0.0.24",
  ]

  meta {
  }

  reserved {
    reserved_ports = "22,25,80,443,8080,8500-8600"
  }
}

consul {
  address = "127.0.0.1:8500"

  auto_advertise       = true
  checks_use_advertise = true
  client_service_name  = "nomad-client"
  client_auto_join     = true
}

disable_anonymous_signature = true
disable_update_check        = true

vault {
  enabled     = true
  address     = "http://10.0.0.62:8200"
}

Vault

cluster_name = "my-cluster"

disable_mlock = false

default_lease_ttl = "24h"
max_lease_ttl     = "720h"

storage "consul" {
  address = "127.0.0.1:8500"
  scheme  = "http"
  path    = "vault/"
  token   = "my-token"
}

listener "tcp" {
  address     = "10.0.0.52:8200"
  cluster_address = "10.0.0.52:8201"
  tls_disable = true
}

api_addr     = "http://10.0.0.52:8200"

ui = true

I use a separate Consul cluster for Nomad and Vault (as Vault's storage backend). I would really appreciate it if you could find a solution for this issue. If these configuration files aren't sufficient, I can provide the detailed setup for easier reproduction; it is based on a Terraform/Ansible setup tied to the Hetzner cloud provider.

@DerekStrickland DerekStrickland moved this from Triaging to Needs Triage in Nomad - Community Issues Triage Sep 24, 2021
@DerekStrickland DerekStrickland removed their assignment Sep 24, 2021
@RickyGrassmuck
Contributor

I too have been running into this problem recently.

We are currently in the process of migrating Nomad, Consul and Vault over to a new platform which has led to a couple of short Vault service outages. Each time that Vault has become unavailable to Nomad, all of the jobs that were configured to pull from Vault would wind up being rescheduled and unable to be allocated until Vault access was restored.

I do see value in this behavior (and personally think it should remain the default), but there are scenarios in which leaving the job running with the last obtained secrets still in use would be preferable.

My personal preference for addressing this would be an additional option in the job file's vault stanza that enables persisting the last values obtained for the secrets. I could also see value in letting the job specify a max_lifetime, i.e. a limit on how long stale secrets may be used without an update before the job triggers the selected change operation.

Example of a hypothetical persistence block in a job's vault stanza:

vault {
  policies      = ["nomad_jobs"]
  change_mode   = "restart"
  change_signal = "SIGUSR1"
  persistence {
    enabled = true
    max_lifetime = "8h"
  }
}

@eightseventhreethree

Unless I'm missing something, it looks like the token is still marked as updated even if the renewal fails:

According to the comment on this bool it should only be set to true if the token is updated:

// updatedToken lets us store state between loops. If true, a new token

@RickyGrassmuck
Contributor

Unless I'm missing something, it looks like the token is still marked as updated even if the renewal fails:

That looks to only happen when using the noop change mode. If you use the restart or signal modes, it marks the token as updated and triggers the restart.

It seems like it would make sense to separate actual change events from Vault communication errors. As I proposed in my previous comment, this would allow a job to continue running through a communications issue while still performing the necessary action when an actual change to the secrets is detected.

@eightseventhreethree

if h.vaultStanza.ChangeMode != structs.VaultChangeModeNoop {

If it doesn't equal noop

@RickyGrassmuck
Contributor

if h.vaultStanza.ChangeMode != structs.VaultChangeModeNoop {

If it doesn't equal noop

Right, meaning it sets updatedToken when you are using structs.VaultChangeModeRestart or structs.VaultChangeModeSignal.

I'm on mobile right now, so I may just not be following the code paths or may be missing a key part of it, but it looks like once updatedToken is set to true, the next iteration of the loop will result in the specified change mode being triggered.

@doubleshot

doubleshot commented Sep 28, 2021

@rigrassm The issue @eightseventhreethree is raising is that the token renewal failed, so it shouldn't trigger vaultChangeModeRestart or vaultChangeModeSignal: doing so pretty much guarantees the Nomad task will be restarted/signalled into a bad state, since Vault is unable to renew for some reason (network/connectivity to Vault, or another issue).

This section:

select {
case err := <-renewCh:
    // Clear the token
    token = ""
    h.logger.Error("failed to renew Vault token", "error", err)
    stopRenewal()

    // Check if we have to do anything
    if h.vaultStanza.ChangeMode != structs.VaultChangeModeNoop {
        updatedToken = true
    }

contradicts the comment here:

// updatedToken lets us store state between loops. If true, a new token
// has been retrieved and we need to apply the Vault change mode

The documentation also states that change_mode is only triggered after the Vault token has been successfully renewed or replaced:
If Nomad is unable to renew the Vault token (perhaps due to a Vault outage or network error), the client will attempt to retrieve a new Vault token. If successful, the contents of the secrets file are updated on disk, and action will be taken according to the value set in the change_mode parameter.
https://www.nomadproject.io/docs/job-specification/vault

@RickyGrassmuck
Contributor

Ahhh, ok, I see what I was missing yesterday. It's clearing the token, which should put it in an indefinite loop until the token is renewed, and only then trigger the change mode.

That is peculiar though, as I have observed this behavior multiple times (even recently). If I get some time this week I may try to set up a test environment to reproduce this, and also test whether the behavior occurs when using noop, to see if this is actually what's leading to the job restarts.

Beginning to think that it's possible something outside of this code path is leading to the job being restarted and this behavior is possibly just a side effect.

@mbrezovsky
Author

I've already mentioned that change_mode has no impact on this issue. I've tested it on jobs with mixed change_mode values (noop, restart); all jobs were restarted when the Vault outage occurred. That's the reason I created this issue: I don't know how to solve it with the available configuration options.

@RickyGrassmuck
Contributor

@mbrezovsky Quick question: do any of your jobs' templates use Consul lookups like {{ range service "postgres@dc1" }}{{.Address}}{{end}}?

I just realized, while working on one of my job specs, that I am using a Consul lookup in one of the job's templates, and a change to the service catalog entry it looks up resulted in the job being restarted (the default behavior of the template stanza).

I'm thinking it's possible these restarts were caused by a Consul catalog change which, if it occurred while Vault access was down, would result in the job not getting placed. The message shown in the alloc activity could then be misread as a change in the Vault template when, in reality, it was a Consul template that triggered the restart.

If this is in fact the reason for the job being restarted, it may be worth exploring options to avoid this when using both Vault and Consul look-ups in templates.

A couple possible solutions that could be implemented into Nomad off the top of my head for this:

  1. Have template stanzas inherit the Vault change_mode when the template does not explicitly define a change_mode of its own.
  2. Create a Job > Group and Job > Group > Task configuration option that could control the behaviors for both Consul and Vault template lookup behaviors. This could be overridden by the template stanza.
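
For reference, each template block can already set its own change_mode today, which overrides the default restart behavior for that one template. A minimal sketch of a Consul-lookup template pinned to noop (the service name and destination path are placeholders):

template {
  data = <<EOF
{{ range service "postgres" }}{{ .Address }}:{{ .Port }}{{ end }}
EOF
  destination = "local/postgres.addr"
  change_mode = "noop"
}

That only stops this one template from restarting the task, of course; it doesn't help templates that genuinely need to act on secret changes, which is why an inherited or group/task-level setting could still be useful.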

@mbrezovsky
Author

No, I actually have no jobs configured with Consul lookups, just templates with Vault secrets.

@baabgai

baabgai commented Oct 5, 2021

I ran into similar issues rendering Vault and Consul KV templates in Nomad jobs, and I would also desperately need a good fix for this problem. From my research so far, it seems that Nomad, or rather the integrated consul-template implementation, polls the templates' data from Consul and Vault on a regular basis. If for some reason the Consul KV store or Vault is not available, Nomad starts a number of retries with backoffs. The default number of retries is 12 (consul, vault); after the maximum number of retries is exceeded, the job often stops running, which makes my setup much less resilient than I would have expected. This appears to be exactly the same pending-restart scenario that @mbrezovsky mentioned.
If consul-template is used as a standalone service, the number of retries can be configured, and it seems possible to set it to 0, which corresponds to infinity. Unfortunately this option is not available in Nomad's template stanza. So I could imagine having something like:

template {
  data        = "..."
  destination = "..."

  retry {
    enabled     = true
    attempts    = 0
    backoff     = "250ms"
    max_backoff = "1m"
  }
}

Or maybe an even more general option would be to expose the full consul-template config and let the user add a custom override file to the Nomad agent configuration directory.
For the Vault use case I'm actually trying something out at the moment, though I'm not sure it will work: install the Vault agent on the client nodes and, instead of connecting the Nomad clients to the Vault servers, connect them to the local Vault agents to fetch secrets, similar to how Consul agents are used. The Vault agent's configuration has a retry section where I hope to set a custom number of attempts. Of course, this will only work if the Vault agent returns cached values while the retry period is ongoing. Maybe it is possible to circumvent the hard-coded retry limit used inside Nomad's template stanza this way.
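
Roughly, the agent configuration I have in mind looks like the sketch below (the listener port and AppRole file paths are placeholders, and the retry value is only an example; check the Vault agent docs for the exact retry semantics):

# vault-agent.hcl
vault {
  address = "http://10.0.0.62:8200"

  retry {
    num_retries = 120
  }
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault-agent/role-id"
      secret_id_file_path = "/etc/vault-agent/secret-id"
    }
  }
}

cache {
  use_auto_auth_token = true
}

listener "tcp" {
  address     = "127.0.0.1:8100"
  tls_disable = true
}

The Nomad client's vault stanza would then point at http://127.0.0.1:8100 instead of the Vault servers.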
A related issue seems to be #2623.

@RickyGrassmuck
Contributor

Looking through #2623, I came across an issue cross-referenced there that describes this problem perfectly:

#3866

@jcdyer

jcdyer commented Dec 18, 2021

We have managed to track down at least one (and hopefully the only) cause of this behavior: In the allocrunner/taskrunner code for template tasks, there is a rerenderTemplate function that loops forever selecting on channels. One of the channels reports errors from the task runner. If an error occurs, nomad marks all tasks as failed. The impact of this is that at a time when nomad is completely unable to successfully start allocs (because the template renderer cannot contact vault), it brings them all down, essentially (in our case) converting a template render error into a complete site outage.

The reason this doesn't occur when templates are configured with change_mode=noop is that there is a previous code segment that checks if all templates are noops, and if so, returns before starting the loop.

When I am back at my work computer, I will link the relevant lines of code.

In my opinion, the proper behavior when unable to render a template for any reason is to log the error, and stop the current attempt to update the alloc in question. I cannot think of a situation in which failing healthy running services when they cannot be restarted is a reasonable behavior, and if there are such situations, one should not be subjected to site outages as the default behavior.

@jcdyer

jcdyer commented Dec 18, 2021

The code block which marks the services as failed is at L378 here:

func (tm *TaskTemplateManager) handleTemplateRerenders(allRenderedTime time.Time) {
    // A lookup for the last time the template was handled
    handledRenders := make(map[string]time.Time, len(tm.config.Templates))

    for {
        select {
        case <-tm.shutdownCh:
            return
        case err, ok := <-tm.runner.ErrCh:
            if !ok {
                continue
            }

            tm.config.Lifecycle.Kill(context.Background(),
                structs.NewTaskEvent(structs.TaskKilling).
                    SetFailsTask().
                    SetDisplayMessage(fmt.Sprintf("Template failed: %v", err)))
        case <-tm.runner.TemplateRenderedCh():
            tm.onTemplateRendered(handledRenders, allRenderedTime)
        }
    }
}

The reason this doesn't happen when (all) templates have a change_mode of noop is this code:

if tm.allTemplatesNoop() {
    return
}

// handle all subsequent render events.
tm.handleTemplateRerenders(time.Now())

As a hacky workaround, we had hoped to extend the connection retries on the consul-template runner to multiple days, to give us time to get Vault healthy, but that parameter is not currently exposed in Nomad. It would be the github.com/hashicorp/consul-template/config.Config.Vault.Retry value on the conf value instantiated at:

conf := ctconf.DefaultConfig()

But those values will not be configurable until #11606 is deployed.

@DerekStrickland
Contributor

As @jcdyer mentioned, this PR should expose the Vault retry configuration parameters and allow for a more fault-tolerant approach to consul-template render failures for both Vault and Consul integrated templates. I've tried to read through this thread carefully looking for other failure scenarios and I will do my best to incorporate the feedback from this thread into the test plan. If anyone affected by this issue has the time to test out that branch, it would be really helpful to hear how it went for you.
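
For anyone planning to test, the client configuration should end up looking roughly like the sketch below; the exact block and key names may still change before the PR merges, so treat this as illustrative only:

client {
  enabled = true

  template {
    vault_retry {
      attempts    = 0        # per consul-template semantics, 0 means retry indefinitely
      backoff     = "250ms"
      max_backoff = "1m"
    }

    consul_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}

The intent is that, with unlimited retries, the template runner keeps waiting for Vault or Consul instead of surfacing an error that kills otherwise healthy tasks.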

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022