0.6.3 Issue with templates being re-rendered every few seconds. #3197

Closed
adamlc opened this issue Sep 12, 2017 · 39 comments

@adamlc

adamlc commented Sep 12, 2017

Nomad version

Nomad v0.6.3

Operating system and Environment details

centos-release-7-3.1611.el7.centos.x86_64

Issue

We're currently on Nomad 0.6.0 in production with no issues. After trying out 0.6.3, we're getting problems with the templates on our jobs. Every few seconds the template is re-rendered and a SIGHUP is sent to the containers. This happens continuously even though the data in Consul / Vault isn't changing. I think it's Vault related, because templates without Vault secrets seem OK.

Reproduction steps

I have a simple Node AMQP consumer that connects to RabbitMQ and our internal API. The only dynamic part of the template is the AMQP settings.

Nomad Server logs (if appropriate)

Nothing seems out of the ordinary here. Just the server joining the new cluster.

Sep 12 07:50:01 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:01 [INFO] serf: EventMemberJoin: nomad-server-eu-west-1c-i-05356c3336b356b13.global 10.0.2.166
Sep 12 07:50:01 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:01.540049 [INFO] nomad: adding server nomad-server-eu-west-1c-i-05356c3336b356b13.global (Addr: 10.0.2.166:4647) (DC: dc1)
Sep 12 07:50:01 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:01.549204 [INFO] nomad: Found expected number of peers (10.0.2.166:4647,10.0.1.24:4647,10.0.0.45:4647), attempting to bootstrap cluster...
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [WARN] raft: Heartbeat timeout from "" reached, starting election
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Node at 10.0.1.24:4647 [Candidate] entering Candidate state in term 2
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Election won. Tally: 2
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Node at 10.0.1.24:4647 [Leader] entering Leader state
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Added peer 10.0.2.166:4647, starting replication
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Added peer 10.0.0.45:4647, starting replication
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02.877179 [INFO] nomad: cluster leadership acquired
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [WARN] raft: AppendEntries to {Voter 10.0.0.45:4647 10.0.0.45:4647} rejected, sending older logs (next: 1)
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [WARN] raft: AppendEntries to {Voter 10.0.2.166:4647 10.0.2.166:4647} rejected, sending older logs (next: 1)
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: pipelining replication to peer {Voter 10.0.0.45:4647 10.0.0.45:4647}
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: pipelining replication to peer {Voter 10.0.2.166:4647 10.0.2.166:4647}

Nomad Client logs (if appropriate)

Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: ==> Nomad agent started! Log data will stream in below:
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.181939 [INFO] client: using state directory /var/lib/nomad/client
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.186149 [INFO] client: using alloc directory /var/lib/nomad/alloc
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.187271 [INFO] fingerprint.cgroups: cgroups are available
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.188386 [INFO] fingerprint.consul: consul agent is available
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.193260 [WARN] fingerprint.network: Unable to parse Speed in output of '/usr/sbin/ethtool eth0'
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.458078 [INFO] fingerprint.vault: Vault is available
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.488889 [INFO] client: Node ID "58bba806-9bd8-b76c-947b-ab34b7d755e4"
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.503844 [INFO] client.consul: discovered following Servers: 10.0.2.166:4647,10.0.1.24:4647,10.0.0.45:4647
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.527839 [INFO] client: node registration complete
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) creating new runner (dry: false, once: false)
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) creating watcher
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) starting
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) initiating run
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [INFO] (runner) initiating run
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [INFO] (runner) initiating run
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: Code: 403. Errors:
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: * permission denied
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/9d0e6f7a-c7cb-3e57-295e-621b8fe04a8f/server/local/config.json"
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: Code: 403. Errors:
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: * permission denied
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: Code: 403. Errors:
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: * permission denied
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:38 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:38 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:38 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:38 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/9d0e6f7a-c7cb-3e57-295e-621b8fe04a8f/server/local/config.json"
Sep 12 08:00:38 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:38 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.

The failure to renew the lease then just repeats in the log. It's odd that this happens, because the setup is exactly the same as the current production cluster.

I've checked my leases and related settings in Vault and everything seems fine there.

Job file (if appropriate)

job "apiv2-worker" {
    datacenters = ["dc1"]
    type = "service"
    update {
        stagger = "45s"
        max_parallel = 1
    }

    group "worker" {
        count = 2

        restart {
            attempts = 10
            interval = "5m"
            delay = "25s"
            mode = "delay"
        }

        task "server" {
            driver = "docker"

            env {
                NODE_ENV = "production"
                NODE_PATH = "/usr/src/app"
                CONFIG_LOCATION = "${NOMAD_TASK_DIR}/config.json"
            }

            config {
                image = "*****"

                dns_servers = ["172.17.0.1"]

                auth {
                    username = "jumplead-docker-deploy"
                    password = "86SYf189W-Mz4ioXLL2z"
                }

                port_map {
                    healthcheck = 10060
                }
            }

            resources {
                cpu = 100
                memory = 120

                network {
                    mbits = 10
                    port "healthcheck" {
                        static = 29196
                    }
                }
            }

            service {
                name = "apiv2-worker"
                port = "healthcheck"
                tags = ["healthcheck"]
                check {
                    type = "http"
                    name = "AMQP Connected Check"
                    path = "/"
                    interval = "10s"
                    timeout = "5s"
                }
            }

            template {
                data = <<EOH
{
    "AMQP_URIS": [{{with $auth := secret "rabbitmq/creds/readwriteconfigure"}}{{range $i, $s := service "rabbitmq"}}{{ if ne $i 0 }},{{ end }}"amqp://{{$auth.Data.username}}:{{$auth.Data.password}}@{{.Address}}"{{end}}{{end}}],

    "API_URL": "http://apiv2.service.consul:29191/v2",

    "SYSTEMEMAIL_THRIFT": {
        "HOST": "system-email.service.consul",
        "PORT": 29174
    },

    "PRODUCTION": true
}
                EOH

                destination = "local/config.json"
                change_mode = "signal"
                change_signal = "SIGHUP"
            }

            vault {
                policies = ["rabbitmq"]
                change_mode = "signal"
                change_signal = "SIGHUP"
            }
        }
    }
}

@adamlc
Author

adamlc commented Sep 12, 2017

Just a quick note: Nomad is currently using root tokens, to rule out a token permission issue.

@sethvargo
Contributor

Root tokens are not renewable, so you'll need to disable that or use a periodic root token. That's one of the errors you're seeing.

@dadgar
Contributor

dadgar commented Sep 12, 2017

@sethvargo The Nomad server has the root token; the embedded consul-template instance doesn't have root. It has a renewable periodic token with the rabbitmq policy.

@adamlc This is likely a consul-template bug. Can you reproduce this behavior by running an instance of consul-template with that template?

@dadgar
Contributor

dadgar commented Sep 14, 2017

@adamlc Any update? Can you also give the Consul and Vault versions?

@stevenscg

I found this issue in the google group and dadgar pointed me here.

For me, it looked like the old mysql database backend was all that was affected, but I think it is probably more than that.

I am not using a root token, but a token from the aws-ec2 auth backend.

I just upgraded my test cluster from 0.6.0 to 0.6.3 and hit this. The update rate on the renewer is so fast that I had to shut everything down.

@dadgar
Contributor

dadgar commented Sep 14, 2017

@stevenscg Can you attempt to run your config with just the newest consul-template? I am almost certain it is a bug in that codebase.

@stevenscg

@dadgar I might be able to do that in a dev environment. It won't be the exact same setup, but the template, the vault database backend, and the vault roles and policies will be. It might be tomorrow at the earliest. Are you looking for anything in particular?

I just finished rolling back the test environment to 0.6.0 and the issue appears to be gone.

@dadgar
Contributor

dadgar commented Sep 14, 2017

@stevenscg That would be awesome. Mainly looking for as simple a reproduction as possible with just consul-template. If you could grab the debug-level logs as well when you invoke it, that would be 👍

@stevenscg

Will do.

FWIW, this is fairly representative of what I was seeing:

Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): failed to renew: Error making API request.
Sep 14 17:47:21 ip-10-101-33-183 nomad: URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew
Sep 14 17:47:21 ip-10-101-33-183 nomad: Code: 403. Errors:
Sep 14 17:47:21 ip-10-101-33-183 nomad: * permission denied
Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): failed to renew: Error making API request.
Sep 14 17:47:21 ip-10-101-33-183 nomad: URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew
Sep 14 17:47:21 ip-10-101-33-183 nomad: Code: 403. Errors:
Sep 14 17:47:21 ip-10-101-33-183 nomad: * permission denied

Sep 14 17:48:47 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:48:48 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied
Sep 14 17:48:48 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:48:48 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied

That doesn't tell me much, but it might come in handy. It was fast too. The local syslog daemons couldn't keep up with it.

@dadgar
Contributor

dadgar commented Sep 14, 2017

@stevenscg What is the lease duration on secrets from database/creds/billing?

@stevenscg

@dadgar

lease_duration 	1h0m0s
lease_renewable	true

Max TTL is 24h. Defaults for the backend, IIRC.

@adamlc
Author

adamlc commented Sep 15, 2017

Sorry I haven't had time to look at this yet. I'll give it a go later if I have time :)

@stevenscg

stevenscg commented Sep 16, 2017

@dadgar I just tried rendering the same template from before using the consul-template 0.19.3 release:

2017/09/16 00:51:14.581357 [INFO] (runner) initiating run
2017/09/16 00:51:14.581365 [DEBUG] (runner) checking template 2d9350d20d9821111c000d5659375707
2017/09/16 00:51:14.581677 [DEBUG] (runner) rendering "/vagrant/billing-service/provision/app/app.ctmpl" => "/vagrant/billing-service/config/app.php"
2017/09/16 00:51:14.583573 [DEBUG] (runner) diffing and updating dependencies
.... kv.get, kv.block. vault.read, etc ....
2017/09/16 00:51:14.583632 [DEBUG] (runner) watching 7 dependencies
2017/09/16 00:51:14.583637 [DEBUG] (runner) all templates rendered
2017/09/16 00:52:58.767915 [DEBUG] (runner) receiving dependency health.service(cache-alpha|passing)
2017/09/16 00:52:58.767940 [INFO] (runner) initiating run

It seems to cycle through this every 1-3 minutes.

If I downgrade C/T to 0.19.2 and run the same setup in-place, the pattern is different:

2017/09/16 01:02:30.810225 [INFO] (runner) initiating run
2017/09/16 01:02:30.810237 [DEBUG] (runner) checking template 2d9350d20d9821111c000d5659375707
2017/09/16 01:02:30.810520 [DEBUG] (runner) rendering "/vagrant/billing-service/provision/app/app.ctmpl" => "/vagrant/billing-service/config/app.php"
2017/09/16 01:02:30.814312 [DEBUG] (runner) diffing and updating dependencies
.... kv.get, kv.block. vault.read, etc ....
2017/09/16 01:02:30.814441 [DEBUG] (runner) watching 7 dependencies
2017/09/16 01:04:25.171983 [DEBUG] (runner) receiving dependency health.service(cache-alpha|passing)
2017/09/16 01:04:25.172011 [INFO] (runner) initiating run

The [DEBUG] (runner) all templates rendered line only shows up once with C/T v0.19.2, versus at every 1-3 minute interval with C/T v0.19.3.

For reference, this particular template has 1 secret from the (deprecated) mysql vault backend, 3 keys from consul, 1 service from consul.

I am not seeing any of the Vault permission denied errors that were the most concerning issue from the other day.

Versions:

[vagrant@localhost ~]$ consul-template -version
consul-template v0.19.3 (ebf2d3d)
[vagrant@localhost ~]$ vault version
Vault v0.8.2 ('9afe7330e06e486ee326621624f2077d88bc9511')
[vagrant@localhost ~]$ consul version
Consul v0.9.3

@stevenscg

@dadgar Another data point: I went back to C/T v0.19.3, dropped the mysql backend TTL from 1h to 5m, and was able to trigger the renewer problem I saw in our deployed environment yesterday.

In fact, any TTL smaller than about 18 minutes on this particular system would trigger the issue; a 30 minute TTL was reliably stable. This system is very lightly loaded, FWIW.

The C/T output when the issue is triggered is roughly this:

....
2017/09/16 02:03:27.430761 [WARN] vault.read(mysql/creds/billing): renewer returned (maybe the lease expired)
2017/09/16 02:03:27.430781 [TRACE] vault.read(mysql/creds/billing): GET /v1/mysql/creds/billing?index=1505527407&stale=true&wait=1m0s
2017/09/16 02:03:27.454556 [TRACE] (view) vault.read(mysql/creds/billing) marking successful data response
2017/09/16 02:03:27.454613 [TRACE] (view) vault.read(mysql/creds/billing) no new data (index was the same)
2017/09/16 02:03:27.454623 [TRACE] vault.read(mysql/creds/billing): starting renewer
2017/09/16 02:03:27.454699 [TRACE] (view) vault.read(mysql/creds/billing) successful contact, resetting retries
2017/09/16 02:03:27.462674 [TRACE] vault.read(mysql/creds/billing): successfully renewed
2017/09/16 02:03:27.462700 [WARN] vault.read(mysql/creds/billing): renewer returned (maybe the lease expired)
2017/09/16 02:03:27.462715 [TRACE] vault.read(mysql/creds/billing): GET /v1/mysql/creds/billing?index=1505527407&stale=true&wait=1m0s
2017/09/16 02:03:27.479821 [TRACE] (view) vault.read(mysql/creds/billing) marking successful data response
2017/09/16 02:03:27.479846 [TRACE] (view) vault.read(mysql/creds/billing) no new data (index was the same)
2017/09/16 02:03:27.479853 [TRACE] vault.read(mysql/creds/billing): starting renewer
2017/09/16 02:03:27.479914 [TRACE] (view) vault.read(mysql/creds/billing) successful contact, resetting retries
2017/09/16 02:03:27.484584 [TRACE] vault.read(mysql/creds/billing): successfully renewed
2017/09/16 02:03:27.484607 [WARN] vault.read(mysql/creds/billing): renewer returned (maybe the lease expired)
2017/09/16 02:03:27.484623 [TRACE] vault.read(mysql/creds/billing): GET /v1/mysql/creds/billing?index=1505527407&stale=true&wait=1m0s
2017/09/16 02:03:27.544348 [TRACE] (view) vault.read(mysql/creds/billing) marking successful data response
2017/09/16 02:03:27.544382 [TRACE] (view) vault.read(mysql/creds/billing) no new data (index was the same)
2017/09/16 02:03:27.544390 [TRACE] vault.read(mysql/creds/billing): starting renewer
2017/09/16 02:03:27.544464 [TRACE] (view) vault.read(mysql/creds/billing) successful contact, resetting retries
2017/09/16 02:03:27.553385 [TRACE] vault.read(mysql/creds/billing): successfully renewed
....

@adamlc
Author

adamlc commented Sep 19, 2017

@dadgar I've just tried using consul-template directly and it seems to work as expected; it doesn't constantly cycle the credentials like Nomad does:

$ consul-template -v
consul-template v0.19.3 (e08c904)

Using the same template as in my Nomad job file, and also using a root token here. (Just thought I'd mention I've tried using a root token in Nomad and one supplied by ec2-auth; both give the same results.)

$ consul-template -consul-addr "54.229.122.21:8500" -vault-renew-token=false -template "template.tpl:template.json"

@holtwilkins
Contributor

holtwilkins commented Sep 25, 2017

I can confirm that I'm hitting this issue as well on Nomad 0.6.3, though I never ran Nomad 0.6.0 for comparison. These error messages cause enormous raft churn and consul servers go crazy trying to keep up with creating and reaping snapshots.

It seems like this is affecting the consul and database secret backends, but does not seem to affect the aws backend. Not sure if this is helpful or not, but figured I'd post in case it is.

Don't think we're doing anything special with regard to the TTLs of these mount points, but I'm not sure of all the places to check:

[devtools-staging] hwilkins ~ $ vault mounts
Path                                           Type        Accessor             Plugin  Default TTL  Max TTL    Force No Cache  Replication Behavior  Description
aws/                                           aws         aws_69fee16d         n/a     14400        14400      false           replicated
consul/                                        consul      consul_49f9719c      n/a     14400        14400      false           replicated
cubbyhole/                                     cubbyhole   cubbyhole_650dc4a0   n/a     n/a          n/a        false           local                 per-token private secret storage
database/                                      database    database_137ef1c0    n/a     14400        14400      false           replicated            generic database backend
secret/                                        generic     generic_13ee50f4     n/a     system       system     false           replicated            generic secret storage
ssh/                                           ssh         ssh_db590344         n/a     600          600        false           replicated
sys/                                           system      system_d6587c9c      n/a     n/a          n/a        false           replicated            system endpoints used for control, policy and debugging

@dadgar
Contributor

dadgar commented Sep 25, 2017

Hey all, we will be looking into this before cutting 0.7, so the next release should resolve this.

@dadgar
Contributor

dadgar commented Oct 11, 2017

Hey all, to mitigate this until 0.7 is out, you can set vault_grace to be much smaller than the lease of your secret: something like 5-15 seconds.
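
For illustration, here is a sketch of the template stanza from the job file above with vault_grace added; the 15s value is only an example and must stay well below the secret's lease duration:

template {
    data          = "..."    # template body unchanged from the job file above
    destination   = "local/config.json"
    change_mode   = "signal"
    change_signal = "SIGHUP"

    # vault_grace must stay well below the secret's lease; if the grace
    # is greater than or equal to the lease, the secret always looks
    # about to expire and the template re-renders immediately.
    vault_grace   = "15s"
}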

@adamlc
Author

adamlc commented Nov 13, 2017

@dadgar unfortunately I'm still getting this issue on 0.7.0, even after setting a low vault_grace period :(

@dadgar
Contributor

dadgar commented Nov 13, 2017

@adamlc What is the lease duration of your secret? Your vault_grace should be set to less than your secret's lease duration. If it isn't, the template will be re-rendered immediately, since the secret will expire within the grace period.

https://www.nomadproject.io/docs/job-specification/template.html#vault_grace

@adamlc
Author

adamlc commented Nov 14, 2017

@dadgar I set the grace to 10s and all the leases are >24 hours. I'll have to dive into the logs again.

I'll see if I can come up with something reproducible for you to work with :)

@adamlc
Author

adamlc commented Nov 20, 2017

I have a suspicion this has something to do with the old mysql backend. I have 14k+ leases for mysql, which probably got created while the tasks were restarting. I'm currently decommissioning the backend and revoking the leases. Hopefully this is what's causing it!

@stevenscg

@adamlc @dadgar I'm seeing this on 0.7.0 as well and also with the old mysql backend. I am going to try adding vault_grace to my template stanzas, but the docs read like the default is 15s. Is that true?

@adamlc
Author

adamlc commented Nov 30, 2017

@stevenscg it's interesting that you also use the old mysql backend. I've not actually had time to take a look at this in more detail yet.

I'm not sure what the default grace is, but if you do a nomad plan beforehand you should be able to see what the default / current value is.

@stevenscg

@adamlc Ah. Good point about nomad plan. I'll try that. Did you just move to the new database secret backend instead of messing with it further?

@stevenscg

I'll add that my database secret leases are 4h / 24h max:

Key      	Value
---      	-----
lease    	4h0m0s
lease_max	24h0m0s

@adamlc
Author

adamlc commented Nov 30, 2017

I'm still rocking 0.6.3 at the moment. I have migrated to the database backend but haven't had time to test the setup in 0.7.0 yet.

@stevenscg

Got it. FYI, the default vault_grace does seem to be 15s as documented. I'm testing with an arbitrary 10s, but don't know why it would be any better.

@stevenscg

@dadgar I'm still seeing this. Any thoughts before I roll back to 0.6.0?

Here's where I'm at:

  • Nomad 0.7.0
  • Vault 0.9.0
  • Now using the vault database backend instead of the older mysql.
  • If it matters functionally, I'm using plugin_name=mysql-aurora-database-plugin for mysql 5.6 / aurora compatibility
  • database role default_ttl = 12h, max_ttl=48h
  • vault_grace = 30s on my test job

@stevenscg

I'll add that the re-render rate with this new database backend seems higher than with the legacy mysql backend: roughly 4-5 seconds between renders versus 20 or 30s.

@jippi
Contributor

jippi commented Nov 30, 2017

@stevenscg I've got a similar setup, with the exception of Vault, which is 0.8.3. No problems on 0.7.

@stevenscg

@jippi That's the crazy part to me. I'd think that my setup is really a very common one.

My vault was at 0.8.3 earlier today and the issue was happening both before and after the upgrade to 0.9.0.

@stevenscg

@jippi How do you upgrade your nomad server and worker nodes?

I typically upgrade them in place carefully, one at a time, servers then workers.

@stevenscg

I've posted some additional details in the mailing list in case anybody else is experiencing this problem, though I'm worried that it is something very specific with my environment.

@stevenscg

I ran the templates through consul-template versions 0.19.3 and 0.19.4. Both gave the same sys/leases/renew error.

It looks like this is going to be a vault default policy issue in my environment.

The default policy for my deployed environments has been around since Vault 0.6 or something.

On our development systems, which were not experiencing this issue, Vault is loaded fresh and thus the default policy is also fresh.

I vaguely remembered a comment or a discussion about the Vault default policy changing and the task token needing access to sys/leases/renew in Vault.

The piece I needed is here:
https://github.com/hashicorp/nomad/blob/master/website/source/docs/vault-integration/index.html.md#permission-denied-errors

It's not on the Nomad website just yet, though.
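
For context, the capability the task token ends up needing is roughly the stanza below from the default policy shipped with newer Vault releases (a sketch; check your own Vault version's default policy for the exact wording):

# Allow a token to renew the leases it holds. Default policies created
# by older Vault releases predate the sys/leases/* endpoints and only
# cover the legacy sys/renew path, which is why fresh installs work.
path "sys/leases/renew" {
  capabilities = ["update"]
}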

@adamlc
Author

adamlc commented Jan 16, 2018

I'm just looking at this issue again and I'm still having issues with 0.7.1.

At first I thought it might be my database backend: I didn't have TTLs set for the database roles, so I thought that might be the issue, but unfortunately not. I'm going to go through each secret referenced to make sure they have TTLs.

Other than the above I'm stumped! 😢

I'll post here if anything comes to light!

@adamlc
Author

adamlc commented Jan 16, 2018

Turns out that my Vault default policy was completely different from the one in newer versions; once I updated my default policy with one from a newer version it worked perfectly! Finally glad to have this sorted!

Thanks @stevenscg for pointing me in the right direction!

For anyone else with the same issue: simply grab the latest default Vault policy, overwrite your current default policy in Vault, and you should be good to go! (See the sketch below.)
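
A sketch of how to do that with the Vault CLI, assuming you've saved the refreshed default policy from a newer Vault to a local file (default.hcl is just an illustrative name):

# Back up the existing default policy, then overwrite it with the
# refreshed copy that includes the sys/leases/renew stanza.
vault policy read default > default-old.hcl
vault policy write default default.hcl
# On older Vault CLIs the equivalents are "vault policies default"
# and "vault policy-write default default.hcl".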

@stevenscg

@adamlc Glad it helped you! So much grey hair over that one.

@github-actions

github-actions bot commented Dec 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2022