Nomad jobs should stay up when vault outage occurs #11209

Closed
mbrezovsky opened this issue Sep 20, 2021 · 18 comments · Fixed by #11606

Comments

@mbrezovsky

Proposal

This issue concerns the Vault integration. Running Nomad jobs should stay alive as long as possible during a Vault outage.

Use-cases

We run a Nomad cluster and Vault in HA. Although the clusters are scaled to tolerate minor server outages, some failures are unpredictable. After network issues on the cloud provider's side, our Vault cluster was out of service.
Nomad first received 503 responses from Vault, and shortly afterwards it restarted all jobs that use a template stanza. All of those jobs stayed in the pending state until Vault came back online. This happened regardless of the change_mode set in the template stanza.

Attempted Solutions

Nomad could switch to an emergency mode when a Vault outage is detected. In that mode, the usual processes (token TTL renewal, etc.) could be suspended and re-enabled once Vault becomes available again.

@DerekStrickland DerekStrickland self-assigned this Sep 21, 2021
@DerekStrickland DerekStrickland added this to Needs Triage in Nomad - Community Issues Triage via automation Sep 21, 2021
@DerekStrickland DerekStrickland moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Sep 21, 2021
@DerekStrickland
Contributor

Hi @mbrezovsky!

Thanks for using Nomad, and thank you for filing this issue.

I'll do some research and see if there is an existing solution for this use case. In the meantime, do you think you could provide both your nomad and vault configuration files for both servers and agents after removing secrets? It would be really helpful in reproducing your exact use case, and then troubleshooting from there.

@mbrezovsky
Author

Hi @DerekStrickland

Of course, here are the configuration files.
Nomad server

bind_addr = "{{ GetPrivateIP }}"

addresses = {
  http = "0.0.0.0"
}

datacenter = "dc"

data_dir = "/mnt/nomad"

log_level     = "INFO"
enable_syslog = false
enable_debug  = false

leave_on_terminate = true

server {
  enabled = true

  encrypt          = "my-key"
  bootstrap_expect = 3

  rejoin_after_leave = true
}

consul {
  address = "127.0.0.1:8500"

  auto_advertise       = true
  server_service_name  = "nomad"
  server_auto_join     = true
  checks_use_advertise = true
}

vault {
  enabled     = true
  address     = "http://10.0.0.62:8200"
  task_token_ttl = "1h"
  create_from_role = "nomad-cluster"
  token       = "my-token"
}

disable_anonymous_signature = true
disable_update_check        = true

Nomad agent

enable_debug  = false

leave_on_terminate = true

client {
  enabled = true

  servers = [
    "10.0.0.22","10.0.0.23","10.0.0.24",
  ]

  meta {
  }

  reserved {
    reserved_ports = "22,25,80,443,8080,8500-8600"
  }
}

consul {
  address = "127.0.0.1:8500"

  auto_advertise       = true
  checks_use_advertise = true
  client_service_name  = "nomad-client"
  client_auto_join     = true
}

disable_anonymous_signature = true
disable_update_check        = true

vault {
  enabled     = true
  address     = "http://10.0.0.62:8200"
}

Vault

cluster_name = "my-cluster"

disable_mlock = false

default_lease_ttl = "24h"
max_lease_ttl     = "720h"

storage "consul" {
  address = "127.0.0.1:8500"
  scheme  = "http"
  path    = "vault/"
  token   = "my-token"
}

listener "tcp" {
  address     = "10.0.0.52:8200"
  cluster_address = "10.0.0.52:8201"
  tls_disable = true
}

api_addr     = "http://10.0.0.52:8200"

ui = true

I use a separate Consul cluster for Nomad and Vault (as Vault's storage backend). I would really appreciate it if you could find a solution for this issue. If these configuration files aren't sufficient, I can provide the detailed setup for easier reproduction; it is based on a Terraform/Ansible setup tied to the Hetzner cloud provider.

@DerekStrickland DerekStrickland moved this from Triaging to Needs Triage in Nomad - Community Issues Triage Sep 24, 2021
@DerekStrickland DerekStrickland removed their assignment Sep 24, 2021
@RickyGrassmuck
Contributor

I too have been running into this problem recently.

We are currently in the process of migrating Nomad, Consul and Vault over to a new platform which has led to a couple of short Vault service outages. Each time that Vault has become unavailable to Nomad, all of the jobs that were configured to pull from Vault would wind up being rescheduled and unable to be allocated until Vault access was restored.

I do see value in this behavior (and personally think it should remain the default), but there are scenarios in which leaving the job running with the last obtained secrets still in use would be preferable.

My personal preference for addressing this would be an additional option in the job file's vault stanza that enables persisting the last values obtained for the secrets. I could also see value in letting the job specify a max_lifetime, i.e. a limit on how long stale secrets may be used without an update before the job triggers the selected change operation.

Example of a hypothetical persistence block in a job's vault stanza:

vault {
  policies      = ["nomad_jobs"]
  change_mode   = "restart"
  change_signal = "SIGUSR1"
  persistence {
    enabled = true
    max_lifetime = "8h"
  }
}

@eightseventhreethree

Unless I'm missing something, it looks like the token is still marked as updated even if the renewal fails:

According to the comment on this bool it should only be set to true if the token is updated:

// updatedToken lets us store state between loops. If true, a new token

@RickyGrassmuck
Contributor

Unless I'm missing something, it looks like the token is still marked as updated even if the renewal fails:

That looks to only happen when using the noop change mode. If you use the restart or signal modes, it marks the token as updated and triggers the restart.

It seems like it would make sense to separate actual change events from Vault communication errors. As I proposed in my previous comment, this would allow a job to continue running through a communications issue while still performing the necessary action when an actual change to the secrets is detected.

@eightseventhreethree

if h.vaultStanza.ChangeMode != structs.VaultChangeModeNoop {

If it doesn't equal noop

@RickyGrassmuck
Contributor

if h.vaultStanza.ChangeMode != structs.VaultChangeModeNoop {

If it doesn't equal noop

Right, meaning it sets updatedToken when you are using structs.VaultChangeModeRestart or structs.VaultChangeModeSignal.

I'm on mobile right now, so I may just not be following the code paths or may be missing a key part of it, but it looks like once updatedToken is set to true, the next iteration of the loop will result in the specified change mode being triggered.

@doubleshot

doubleshot commented Sep 28, 2021

@rigrassm The issue @eightseventhreethree is raising is that the token renewal failed, so it shouldn't trigger vaultChangeModeRestart or vaultChangeModeSignal: doing so pretty much guarantees the Nomad task will be restarted/signalled into a bad state, since Vault is unable to renew for some reason (network/connectivity to Vault, or another issue).

This section:

select {
case err := <-renewCh:
    // Clear the token
    token = ""
    h.logger.Error("failed to renew Vault token", "error", err)
    stopRenewal()

    // Check if we have to do anything
    if h.vaultStanza.ChangeMode != structs.VaultChangeModeNoop {
        updatedToken = true
    }

contradicts the comment here:

// updatedToken lets us store state between loops. If true, a new token
// has been retrieved and we need to apply the Vault change mode

The documentation also states that change_mode is only triggered after the Vault token has been successfully renewed or replaced:
If Nomad is unable to renew the Vault token (perhaps due to a Vault outage or network error), the client will attempt to retrieve a new Vault token. If successful, the contents of the secrets file are updated on disk, and action will be taken according to the value set in the change_mode parameter.
https://www.nomadproject.io/docs/job-specification/vault

@RickyGrassmuck
Contributor

Ahhh, ok, I see what I was missing yesterday. It's clearing the token, which should put it in an indefinite loop until the token is renewed, and only then trigger the change mode.

That is peculiar though, as I have observed this behavior multiple times (even recently). If I get some time this week I may try to set up a test environment to reproduce this, and also test whether the behavior occurs when using noop, to see if this is actually what's leading to the job restarts.

Beginning to think that it's possible something outside of this code path is leading to the job being restarted and this behavior is possibly just a side effect.

@mbrezovsky
Author

I've already mentioned that change_mode has no impact on this issue. I've tested it on jobs with mixed change_mode values (noop, restart); all jobs were restarted when the Vault outage occurred. That's the reason I created this issue: I don't know how to solve it with the available configuration options.

@RickyGrassmuck
Contributor

@mbrezovsky Quick question: do any of your jobs' templates use Consul lookups like {{ range service "postgres@dc1" }}{{.Address}}{{end}}?

I just realized, while working on one of my job specs, that I am using a Consul lookup in one of the job's templates, and a change to the service catalog entry it looks up resulted in the job being restarted (the default behavior of the template stanza).

I'm thinking it's possible these restarts were caused by a Consul catalog change which, if it occurred while Vault access was down, would result in the job not getting placed. The message shown in the alloc activity could then be misread as a change in the Vault template when, in reality, it was a Consul template that triggered the restart.

If this is in fact the reason for the job being restarted, it may be worth exploring options to avoid this when using both Vault and Consul look-ups in templates.

A couple possible solutions that could be implemented into Nomad off the top of my head for this:

  1. Have template stanzas inherit the Vault change_mode when the template does not explicitly define a change_mode of its own.
  2. Create a Job > Group and Job > Group > Task configuration option that could control the behaviors for both Consul and Vault template lookup behaviors. This could be overridden by the template stanza.
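
For reference, each template block can already set its own change_mode today, which overrides the default restart behavior for that one template. A minimal sketch of a Consul-lookup template pinned to noop (the service name and destination path are placeholders):

template {
  data = <<EOF
{{ range service "postgres" }}{{ .Address }}:{{ .Port }}{{ end }}
EOF
  destination = "local/postgres.addr"
  change_mode = "noop"
}

That only stops this one template from restarting the task, of course; it doesn't help templates that genuinely need to act on secret changes, which is why an inherited or group/task-level setting could still be useful.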

@mbrezovsky
Author

No, I actually have no jobs configured with Consul lookups, just templates with Vault secrets.

@baabgai

baabgai commented Oct 5, 2021

I ran into similar issues rendering Vault and Consul KV templates in Nomad jobs, and I would also desperately need a good fix for this problem. From my research so far, it seems that Nomad, or rather the integrated consul-template implementation, polls the templates' data from Consul and Vault on a regular basis. If for some reason the Consul KV store or Vault is not available, Nomad starts a number of retries with backoffs. The default number of retries is 12 (consul, vault); after the maximum number of retries is exceeded, the job often stops running, which makes my setup much less resilient than I would have expected. This appears to be exactly the same pending-restart scenario that @mbrezovsky mentioned.
If consul-template is used as a standalone service, the number of retries can be configured, and it seems possible to set it to 0, which corresponds to infinity. Unfortunately this option is not available in Nomad's template stanza. So I could imagine having something like:

template {
  data        = "..."
  destination = "..."

  retry {
    enabled     = true
    attempts    = 0
    backoff     = "250ms"
    max_backoff = "1m"
  }
}

Or maybe an even more general option would be to expose the full consul-template config and let the user add a custom override file to the Nomad agent configuration directory.
For the Vault use case I'm actually trying something out at the moment, though I'm not sure it will work: install the Vault agent on the client nodes and, instead of connecting the Nomad clients to the Vault servers, connect them to the local Vault agents to fetch secrets, similar to how Consul agents are used. The Vault agent's configuration has a retry section where I hope to set a custom number of attempts. Of course, this will only work if the Vault agent returns cached values while the retry period is ongoing. Maybe it is possible to circumvent the hard-coded retry limit used inside Nomad's template stanza this way.
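
Roughly, the agent configuration I have in mind looks like the sketch below (the listener port and AppRole file paths are placeholders, and the retry value is only an example; check the Vault agent docs for the exact retry semantics):

# vault-agent.hcl
vault {
  address = "http://10.0.0.62:8200"

  retry {
    num_retries = 120
  }
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault-agent/role-id"
      secret_id_file_path = "/etc/vault-agent/secret-id"
    }
  }
}

cache {
  use_auto_auth_token = true
}

listener "tcp" {
  address     = "127.0.0.1:8100"
  tls_disable = true
}

The Nomad client's vault stanza would then point at http://127.0.0.1:8100 instead of the Vault servers.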
A related issue seems to be #2623.

@RickyGrassmuck
Contributor

Looking through #2623, I came across an issue cross-referenced there that describes this problem perfectly:

#3866

@jcdyer

jcdyer commented Dec 18, 2021

We have managed to track down at least one (and hopefully the only) cause of this behavior: In the allocrunner/taskrunner code for template tasks, there is a rerenderTemplate function that loops forever selecting on channels. One of the channels reports errors from the task runner. If an error occurs, nomad marks all tasks as failed. The impact of this is that at a time when nomad is completely unable to successfully start allocs (because the template renderer cannot contact vault), it brings them all down, essentially (in our case) converting a template render error into a complete site outage.

The reason this doesn't occur when templates are configured with change_mode=noop is that there is a previous code segment that checks if all templates are noops, and if so, returns before starting the loop.

When I am back at my work computer, I will link the relevant lines of code.

In my opinion, the proper behavior when unable to render a template for any reason is to log the error, and stop the current attempt to update the alloc in question. I cannot think of a situation in which failing healthy running services when they cannot be restarted is a reasonable behavior, and if there are such situations, one should not be subjected to site outages as the default behavior.

@jcdyer

jcdyer commented Dec 18, 2021

The code block which marks the services as failed is at L378 here:

func (tm *TaskTemplateManager) handleTemplateRerenders(allRenderedTime time.Time) {
    // A lookup for the last time the template was handled
    handledRenders := make(map[string]time.Time, len(tm.config.Templates))

    for {
        select {
        case <-tm.shutdownCh:
            return
        case err, ok := <-tm.runner.ErrCh:
            if !ok {
                continue
            }

            tm.config.Lifecycle.Kill(context.Background(),
                structs.NewTaskEvent(structs.TaskKilling).
                    SetFailsTask().
                    SetDisplayMessage(fmt.Sprintf("Template failed: %v", err)))
        case <-tm.runner.TemplateRenderedCh():
            tm.onTemplateRendered(handledRenders, allRenderedTime)
        }
    }
}

The reason this doesn't happen when (all) templates have a change_mode of noop is this code:

if tm.allTemplatesNoop() {
    return
}

// handle all subsequent render events.
tm.handleTemplateRerenders(time.Now())

As a hacky workaround, we had hoped to extend the connection retries on the consul-template runner to multiple days, to give us time to get Vault healthy, but that parameter is not currently exposed in Nomad. It would be the github.com/hashicorp/consul-template/config.Config.Vault.Retry value on the conf value instantiated at:

conf := ctconf.DefaultConfig()

But those values will not be configurable until #11606 is deployed.

@DerekStrickland
Contributor

As @jcdyer mentioned, this PR should expose the Vault retry configuration parameters and allow for a more fault-tolerant approach to consul-template render failures for both Vault and Consul integrated templates. I've tried to read through this thread carefully looking for other failure scenarios and I will do my best to incorporate the feedback from this thread into the test plan. If anyone affected by this issue has the time to test out that branch, it would be really helpful to hear how it went for you.
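
For anyone planning to test, the client configuration should end up looking roughly like the sketch below; the exact block and key names may still change before the PR merges, so treat this as illustrative only:

client {
  enabled = true

  template {
    vault_retry {
      attempts    = 0        # per consul-template semantics, 0 means retry indefinitely
      backoff     = "250ms"
      max_backoff = "1m"
    }

    consul_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}

The intent is that, with unlimited retries, the template runner keeps waiting for Vault or Consul instead of surfacing an error that kills otherwise healthy tasks.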

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022