
templates: rewrite despite no content change if owner / group configured #14768

Closed

harningt opened this issue Oct 1, 2022 · 6 comments · Fixed by #15045
Assignees: tgross
Labels: stage/accepted, theme/dependencies, theme/template, type/bug
Milestone: 1.4.x

Comments


harningt commented Oct 1, 2022

Nomad version

Nomad v1.3.5 (1359c25)

Operating system and Environment details

Fedora Linux clients and servers.
A user-namespace-configured Docker instance is the primary use case.

Issue

Configuring the ownership / group of templates (so they are owned by the namespaced user rather than "nobody") results in the template being re-rendered and its change trigger firing every 3-5 minutes, despite no content changes.

Reproduction steps

  • Set up a user-namespaced client (I assume this is a requirement, but maybe not).
  • Create a job whose template has a numeric uid/gid configured in the namespace (e.g. 5550000, because the offset I configured for the namespaced user was 550000).

Expected Result

  • Files are written out with the prescribed uid/gid.
  • Files are re-written only when the contents change.
  • The change_mode action runs only when the contents change.

Actual Result

  • Files are written out with the prescribed uid/gid.
  • Files are re-written every 3-5 minutes (observed via timestamp updates) with no content changes (sha1sum checked for sanity); see the snippet after this list.
  • The change_mode action runs every 3-5 minutes.
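
For reference, a rough sketch of how the symptom can be observed from the client host, using the rendered path that appears in the client logs below (substitute your own alloc ID and task name):

$ # path taken from the client logs below; adjust for your alloc ID and task name
$ FILE=/var/nomad/alloc/f2ab3f86-c0fb-c1e0-82cd-836984c1798b/template-task/secrets/ca.pem
$ # mtime keeps advancing every few minutes even though the sha1sum stays the same
$ watch -n 60 "stat -c '%y' $FILE; sha1sum $FILE"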

Job file (if appropriate)

Reduced to bare minimum

job "test-template-issue" {
  type = "service"
  datacenters = ["dc1"]

  group "template-group" {
    task "template-task" {
      driver = "docker"

      resources {
        memory = 64
        cpu = 500
      }

      vault {
        policies = ["default"]
      }

      template {
        destination = "${NOMAD_SECRETS_DIR}/ca.pem"
        uid = "55500000"
        data = <<EOL
{{ with secret "pki-int/cert/ca_chain" }}{{ .Data.ca_chain }}{{ end }}
EOL
      }
      config {
        image   = "alpine:3.16"
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Sep 30 22:09:04 client-host-1 nomad[3190]:     2022-09-30T22:09:04.881-0400 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd->
Sep 30 22:09:04 client-host-1 nomad[3190]:  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd-836984c1798b task=template-task path=/va>
Sep 30 22:09:04 client-host-1 nomad[3190]:     2022-09-30T22:09:04.882-0400 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd->
Sep 30 22:09:04 client-host-1 nomad[3190]:  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd-836984c1798b task=template-task @module=>
Sep 30 22:09:05 client-host-1 nomad[3190]:     2022-09-30T22:09:05.763-0400 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 30 22:09:05 client-host-1 nomad[3190]:  agent: (runner) creating new runner (dry: false, once: false)
Sep 30 22:09:05 client-host-1 nomad[3190]:     2022-09-30T22:09:05.763-0400 [INFO]  agent: (runner) creating watcher
Sep 30 22:09:05 client-host-1 nomad[3190]:     2022-09-30T22:09:05.763-0400 [INFO]  agent: (runner) starting
Sep 30 22:09:05 client-host-1 nomad[3190]:  agent: (runner) creating watcher
Sep 30 22:09:05 client-host-1 nomad[3190]:  agent: (runner) starting
Sep 30 22:09:06 client-host-1 nomad[3190]:     2022-09-30T22:09:06.035-0400 [WARN]  agent: vault.read(pki-int/cert/ca_chain): failed to check if pki-int/cert/ca_chain is KVv2, assume not:>
Sep 30 22:09:06 client-host-1 nomad[3190]: URL: GET https://vault.service.consul:8200/v1/sys/internal/ui/mounts/pki-int/cert/ca_chain
Sep 30 22:09:06 client-host-1 nomad[3190]: Code: 403. Errors:
Sep 30 22:09:06 client-host-1 nomad[3190]: * preflight capability check returned 403, please ensure client's policies grant access to path "pki-int/cert/ca_chain/"
Sep 30 22:09:06 client-host-1 nomad[3190]:  agent: vault.read(pki-int/cert/ca_chain): failed to check if pki-int/cert/ca_chain is KVv2, assume not: Error making API request.
                                            
                                            URL: GET https://vault.service.consul:8200/v1/sys/internal/ui/mounts/pki-int/cert/ca_chain
                                            Code: 403. Errors:
                                            
                                            * preflight capability check returned 403, please ensure client's policies grant access to path "pki-int/cert/ca_chain/"
Sep 30 22:09:06 client-host-1 nomad[3190]:     2022-09-30T22:09:06.103-0400 [INFO]  agent: (runner) rendered "(dynamic)" => "/var/nomad/alloc/f2ab3f86-c0fb-c1e0-82cd-836984c1798b/template>
Sep 30 22:09:06 client-host-1 nomad[3190]:  agent: (runner) rendered "(dynamic)" => "/var/nomad/alloc/f2ab3f86-c0fb-c1e0-82cd-836984c1798b/template-task/secrets/ca.pem"
Sep 30 22:09:06 client-host-1 nomad[3190]:     2022-09-30T22:09:06.141-0400 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=b361afce7e8ae57579a2022c8d73cf8>
Sep 30 22:09:06 client-host-1 nomad[3190]:  client.driver_mgr.docker: created container: driver=docker container_id=b361afce7e8ae57579a2022c8d73cf846c4c32a1525983947d80f9383249c1c2
Sep 30 22:09:06 client-host-1 nomad[3190]:     2022-09-30T22:09:06.509-0400 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=b361afce7e8ae57579a2022c8d73cf8>
Sep 30 22:09:06 client-host-1 nomad[3190]:  client.driver_mgr.docker: started container: driver=docker container_id=b361afce7e8ae57579a2022c8d73cf846c4c32a1525983947d80f9383249c1c2
Sep 30 22:13:46 client-host-1 nomad[3190]:     2022-09-30T22:13:46.780-0400 [INFO]  agent: (runner) rendered "(dynamic)" => "/var/nomad/alloc/f2ab3f86-c0fb-c1e0-82cd-836984c1798b/template>
Sep 30 22:13:46 client-host-1 nomad[3190]:  agent: (runner) rendered "(dynamic)" => "/var/nomad/alloc/f2ab3f86-c0fb-c1e0-82cd-836984c1798b/template-task/secrets/ca.pem"
Sep 30 22:13:53 client-host-1 nomad[3190]:     2022-09-30T22:13:53.018-0400 [INFO]  client.driver_mgr.docker: stopped container: container_id=b361afce7e8ae57579a2022c8d73cf846c4c32a152598>
Sep 30 22:13:53 client-host-1 nomad[3190]:  client.driver_mgr.docker: stopped container: container_id=b361afce7e8ae57579a2022c8d73cf846c4c32a1525983947d80f9383249c1c2 driver=docker
Sep 30 22:13:53 client-host-1 nomad[3190]:     2022-09-30T22:13:53.066-0400 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=f2ab3f86-c0fb-c1e0-82cd-836984c1798b task=te>
Sep 30 22:13:53 client-host-1 nomad[3190]:  client.alloc_runner.task_runner: restarting task: alloc_id=f2ab3f86-c0fb-c1e0-82cd-836984c1798b task=template-task reason="" delay=0s
Sep 30 22:13:53 client-host-1 nomad[3190]:     2022-09-30T22:13:53.071-0400 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd->
Sep 30 22:13:53 client-host-1 nomad[3190]:  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd-836984c1798b task=template-task path=/va>
Sep 30 22:13:53 client-host-1 nomad[3190]:  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd-836984c1798b task=template-task @module=>
Sep 30 22:13:53 client-host-1 nomad[3190]:     2022-09-30T22:13:53.071-0400 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f2ab3f86-c0fb-c1e0-82cd->
Sep 30 22:13:53 client-host-1 nomad[3190]:     2022-09-30T22:13:53.105-0400 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=fb74e3122f8ef512742138c9a69fad2>
Sep 30 22:13:53 client-host-1 nomad[3190]:  client.driver_mgr.docker: created container: driver=docker container_id=fb74e3122f8ef512742138c9a69fad25fa0f964b9b0486f42751c6fc4992263b
Sep 30 22:13:53 client-host-1 nomad[3190]:     2022-09-30T22:13:53.400-0400 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=fb74e3122f8ef512742138c9a69fad2>
Sep 30 22:13:53 client-host-1 nomad[3190]:  client.driver_mgr.docker: started container: driver=docker container_id=fb74e3122f8ef512742138c9a69fad25fa0f964b9b0486f42751c6fc4992263b

harningt commented Oct 1, 2022

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 3, 2022
@tgross tgross self-assigned this Oct 3, 2022
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Oct 3, 2022

tgross commented Oct 3, 2022

Hi @harningt! I was able to reproduce this on the current HEAD of the main branch as well. tl;dr this is a consul-template bug and I've opened an issue over there reproducing the bug with consul-template alone: hashicorp/consul-template#1651

I'm going to keep this issue open here in Nomad so that we have a place to point folks who get caught by this and so that we can track getting the fix from CT merged into Nomad as well. Sorry I don't have an immediate answer for you @harningt but I'll keep this issue updated with progress.


Here's what I did to reproduce on Nomad.

Run Vault in dev mode. Configure the CLI with VAULT_TOKEN and VAULT_ADDR and then add wide read capabilities to the default policy and write a secret:

$ vault policy read default > vault-policy.hcl
$ echo 'path "secret/*" { capabilities = ["read"] }' >> ./vault-policy.hcl
$ vault policy write default ./vault-policy.hcl
Success! Uploaded policy: default

$ vault kv put -mount=secret example content='<html>hello, world</html>'
=== Secret Path ===
secret/data/example
...

Use the following for Nomad's vault block and run Nomad in dev mode:

vault {
  enabled = true
  address = "http://127.0.0.1:8200"
  token = "REDACTED"
  allow_unauthenticated = true
}
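
For example, assuming the block above is saved to an agent config file such as vault.hcl (filename chosen here for illustration), the dev agent can be started with:

$ nomad agent -dev -config=vault.hcl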

Run the following job:

jobspec
job "example" {
  datacenters = ["dc1"]

  group "group" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "task" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      vault {
        policies = ["default"]
      }

      template {
        destination = "${NOMAD_TASK_DIR}/index.html"
        uid         = "55500000"
        data        = <<EOL
{{ with secret "secret/data/example" }}{{ .Data.content }}{{ end }}
EOL
      }

      resources {
        cpu    = 128
        memory = 128
      }

    }
  }
}
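
Assuming the jobspec above is saved as example.nomad.hcl (filename chosen here for illustration), submit it with:

$ nomad job run ./example.nomad.hcl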

The template renders, but a few minutes later the task restarts and I get the following in the logs:

    2022-10-03T15:38:37.589Z [TRACE] agent: vault.read(secret/data/example): GET /v1/secret/data/example
    2022-10-03T15:38:37.592Z [TRACE] agent: vault.read(secret/data/example): non-renewable secret, set sleep for 4m25.158462478s
    2022-10-03T15:38:37.592Z [TRACE] agent: (view) vault.read(secret/data/example) marking successful data response
    2022-10-03T15:38:41.894Z [TRACE] agent: vault.read(secret/data/example): GET /v1/secret/data/example
    2022-10-03T15:38:41.896Z [TRACE] agent: vault.read(secret/data/example): non-renewable secret, set sleep for 4m16.657934249s
    2022-10-03T15:38:41.897Z [TRACE] agent: (view) vault.read(secret/data/example) marking successful data response
    2022-10-03T15:38:41.897Z [TRACE] agent: (view) vault.read(secret/data/example) successful contact, resetting retries
    2022-10-03T15:38:41.897Z [TRACE] agent: (view) vault.read(secret/data/example) received data
    2022-10-03T15:38:41.898Z [TRACE] agent: (view) vault.read(secret/data/example) starting fetch
    2022-10-03T15:38:41.898Z [DEBUG] agent: (runner) receiving dependency vault.read(secret/data/example)
    2022-10-03T15:38:41.898Z [DEBUG] agent: (runner) initiating run
    2022-10-03T15:38:41.899Z [DEBUG] agent: (runner) checking template a13a5b09928fab8feb0193763059c344
    2022-10-03T15:38:41.899Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/tmp/NomadClient3031433125/54f2316f-ff34-3be3-cb39-8f031c9957bb/task/local/index.html"
    2022-10-03T15:38:41.900Z [INFO]  agent: (runner) rendered "(dynamic)" => "/tmp/NomadClient3031433125/54f2316f-ff34-3be3-cb39-8f031c9957bb/task/local/index.html"
    2022-10-03T15:38:41.900Z [DEBUG] agent: (runner) diffing and updating dependencies
    2022-10-03T15:38:41.901Z [DEBUG] agent: (runner) vault.read(secret/data/example) is still needed
    2022-10-03T15:38:41.901Z [DEBUG] agent: (runner) watching 1 dependencies
    2022-10-03T15:38:41.901Z [DEBUG] agent: (runner) all templates rendered
    2022-10-03T15:38:46.532Z [TRACE] client.alloc_runner.task_runner: Restart requested: alloc_id=54f2316f-ff34-3be3-cb39-8f031c9957bb task=task failure=false event="1664811526532378748 - Restart Signaled"

I ran this a couple of times and the window of time varies, but it always seems to be on the order of minutes and less than 5 minutes, because that's the CT sleep timeout.

Next I ran the same job but with the uid field commented out. After <5 minutes or so I get the same logs from the template runner, but without a restart:

    2022-10-03T15:45:19.018Z [TRACE] agent: vault.read(secret/data/example): GET /v1/secret/data/example
    2022-10-03T15:45:19.020Z [TRACE] agent: vault.read(secret/data/example): non-renewable secret, set sleep for 4m23.635532634s
    2022-10-03T15:45:19.020Z [TRACE] agent: (view) vault.read(secret/data/example) marking successful data response
    2022-10-03T15:45:19.021Z [TRACE] agent: (view) vault.read(secret/data/example) successful contact, resetting retries
    2022-10-03T15:45:19.021Z [TRACE] agent: (view) vault.read(secret/data/example) received data
    2022-10-03T15:45:19.021Z [TRACE] agent: (view) vault.read(secret/data/example) starting fetch
    2022-10-03T15:45:19.021Z [DEBUG] agent: (runner) receiving dependency vault.read(secret/data/example)
    2022-10-03T15:45:19.022Z [DEBUG] agent: (runner) initiating run
    2022-10-03T15:45:19.022Z [DEBUG] agent: (runner) checking template a13a5b09928fab8feb0193763059c344
    2022-10-03T15:45:19.022Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/tmp/NomadClient1024084621/7eeea810-42ca-db86-a478-d432571700b6/task/local/index.html"
    2022-10-03T15:45:19.023Z [DEBUG] agent: (runner) diffing and updating dependencies
    2022-10-03T15:45:19.023Z [DEBUG] agent: (runner) vault.read(secret/data/example) is still needed
    2022-10-03T15:45:19.023Z [DEBUG] agent: (runner) watching 1 dependencies
    2022-10-03T15:45:19.023Z [DEBUG] agent: (runner) all templates rendered

I did some "printf debugging" and it looks like the event we're getting from consul-template is the one we normally use to detect a change, so as far as I can tell Nomad is behaving correctly based on that event. I then tried to reproduce with consul-template alone and was able to reproduce it there.

@tgross tgross added the stage/accepted and stage/waiting-on-upstream labels Oct 3, 2022
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Oct 3, 2022

tgross commented Oct 3, 2022

@harningt in the meantime, see the workaround described here: hashicorp/consul-template#1651 (comment)

@tgross tgross added the theme/dependencies label and removed the stage/waiting-on-upstream label Oct 4, 2022

tgross commented Oct 4, 2022

Consul-Template 0.29.5 was just released with the fix. We're in a very short merge-freeze window for Nomad 1.4.0 GA, but once that's shipped we'll be able to get our CT dependency updated, and I'd expect the fix to go out in the next regular version of Nomad (which will probably be 1.4.2).

@tgross tgross added this to the 1.4.x milestone Oct 4, 2022
@harningt

Can confirm the workaround works :) Wish I had thought of that workaround when I was digging in the code.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 24, 2023