Job with Vault template re-rendered unexpectedly just after task started #15307

Closed · maxramqvist opened this issue Nov 18, 2022 · 9 comments

maxramqvist commented Nov 18, 2022

Nomad version

  • Nomad 1.3.5 + Vault 1.10.1
  • Nomad 1.4.2 + Vault 1.12.1
  • Nomad 1.4.3 + Vault 1.12.1

Operating system and Environment details

Ubuntu 20.04, x86_64

Issue

Nomad jobs with Vault templates using the database secrets engine unexpectedly get re-templated once, roughly 15-45 seconds after the task has started.

It happens with both the MongoDB and the Postgres Vault database integrations.

At first I thought #15057 might be related, but it's not really the same issue, is it? Although fetching credentials twice from Vault does sound like it could be similar.

The issue happens both with and without Connect for the job.
It is reproducible every time in our environment, across different images and different Nomad job configurations: Connect / no Connect, different types of health checks. I've basically changed everything I can think of that could plausibly affect the code path that sets up the templates, with no luck. We still get the extra templating.

This could of course be a Vault bug. If you think that's the case, I'm happy to open an issue there.
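
For reference, the HCL equivalent of the template setup in the attached JSON job looks roughly like the sketch below. Note that change_mode is not set anywhere, so it falls back to Nomad's default of "restart", which is why the re-render shows up as a "Restart Signaled" event. This is only a readability sketch; the attached JSON job further down is the source of truth, and the data field is shortened here.

vault {
  policies = ["crm-postgres-crm_app-role"]
}

template {
  destination = "secrets/crm-postgres.env"
  env         = true
  # change_mode defaults to "restart": any re-render of this template
  # restarts the task, and a fresh database credential is issued.
  data        = <<EOF
{{ with secret "database/creds/crm-postgres-crm_app-role" }}
CRM_POSTGRES_CRM_CRM_APP_USERNAME="{{ .Data.username }}"
CRM_POSTGRES_CRM_CRM_APP_PASSWORD="{{ .Data.password }}"
{{ end }}
EOF
}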

Reproduction steps

Post the attached job to Nomad. Wait for deployment to finish. A couple of seconds later a re-render of the template will happen.

Expected Result

No extra templating after deployment.

Actual Result

An extra templating happens after deployment. After the re-render there are new credentials in the environment variables.

Nov 18, '22 09:55:47 +0100 | Started | Task started by client
Nov 18, '22 09:55:45 +0100 | Restarting | Task restarting in 0s
Nov 18, '22 09:55:44 +0100 | Terminated | Exit Code: 0
Nov 18, '22 09:55:44 +0100 | Restart Signaled | Template with change_mode restart re-rendered    <---- This is unexpected.
Nov 18, '22 09:55:09 +0100 | Started | Task started by client
Nov 18, '22 09:55:05 +0100 | Task Setup | Building Task Directory
Nov 18, '22 09:55:05 +0100 | Received | Task received by client

Job file (if appropriate)

{
  "ID": "cats",
  "Name": "cats",
  "Type": "service",
  "Datacenters": [
    "dc1"
  ],
  "Update": {
    "Stagger": 10000000000,
    "MaxParallel": 2,
    "HealthCheck": "checks",
    "MinHealthyTime": 25000000000,
    "HealthyDeadline": 120000000000,
    "AutoRevert": true,
    "AutoPromote": true,
    "Canary": 2
  },
  "TaskGroups": [
    {
      "Name": "cats",
      "Count": 2,
      "Tasks": [
        {
          "Name": "cats",
          "Vault": {
            "Policies": [
              "crm-postgres-crm_app-role"
            ]
          },
          "Driver": "docker",
          "Config": {
            "network_mode": "fan-250",
            "image": "imageregistry/cats:5ded6dc",
            "volumes": [
              "/usr/local/share/ca-certificates/:/certs/"
            ]
          },
          "Services": [
            {
              "Name": "cats",
              "Port": 80,
              "PortLabel": "80",
              "AddressMode": "driver",
              "Checks": [
                {
                  "Type": "http",
                  "Protocol": "http",
                  "Port": 80,
                  "Path": "/health",
                  "AddressMode": "driver",
                  "Interval": 10000000000,
                  "Timeout": 2000000000,
                  "CheckRestart": {
                    "Limit": 2,
                    "Grace": 30000000000,
                    "IgnoreWarnings": true
                  }
                }
              ]
            }
          ],
          "Env": {
            "SERVICE_PORT": "80"
          },
          "Templates": [
            {
              "DestPath": "secrets/crm-postgres.env",
              "EmbeddedTmpl": "{{ with secret \"database/creds/crm-postgres-crm_app-role\" }}\n  CRM_POSTGRES_CRM_CRM_APP=\"postgres://{{ .Data.username }}:{{ .Data.password }}@crm-postgres-dev.postgres.database.azure.com:5432/crm?sslmode=require\"\n  CRM_POSTGRES_CRM_CRM_APP_USERNAME=\"{{ .Data.username }}\"\n  CRM_POSTGRES_CRM_CRM_APP_PASSWORD=\"{{ .Data.password }}\"\n  CRM_POSTGRES_CRM_CRM_APP_HOST=\"database.com\"\n  CRM_POSTGRES_CRM_CRM_APP_PORT=\"5432\"\n  CRM_POSTGRES_CRM_CRM_APP_DATABASE=\"crm\"\n  CRM_POSTGRES_CRM_CRM_APP_SSL_MODE=\"require\"\n{{ end }}",
              "Envvars": true
            }
          ],
          "Resources": {
            "CPU": 100,
            "MemoryMB": 32
          },
          "RestartPolicy": {
            "Attempts": 10,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          }
        }
      ]
    }
  ]
}

Logs

I've seen nothing in the logs indicating a reason for the extra templating.

tgross (Member) commented Nov 23, 2022

Hi @maxramqvist! I see you've said in #15057:

"I've verified that there is an actual change in the templated credentials (they get rotated), but I haven't looked into TTLs for the credentials or the Vault token."

So this looks pretty strongly to me like a duplicate of #15057, unless you can reproduce with template.env = false. I'm going to close this as a dupe, but if you can provide trace-level client logs in that issue, that might give us a clue as to why consul-template is getting 2 commands to get the secrets.
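
For reference, trace-level client logs are controlled by the Nomad agent's log_level setting. A minimal agent-config sketch (the log_file path is only an example):

log_level = "TRACE"
log_file  = "/var/log/nomad/nomad.log"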

tgross closed this as not planned (duplicate) Nov 23, 2022
Nomad - Community Issues Triage automation moved this from Needs Triage to Done Nov 23, 2022
maxramqvist (Author) commented Nov 24, 2022

Thanks for the feedback @tgross!

I just set this config in the Nomad job and got the same result: an extra, unexpected re-render just after the deployment finished. So this should probably be reopened, right?

          "Templates": [
            {
              "DestPath": "/local/crm-postgres.vars",
              "EmbeddedTmpl": "{{ with secret \"database/creds/crm-postgres-crm_app-role\" }}\n  CRM_POSTGRES_CRM_CRM_APP=\"postgres://{{ .Data.username }}:{{ .Data.password }}@crm-postgres-dev.postgres.database.azure.com:5432/crm?sslmode=require\"\n  CRM_POSTGRES_CRM_CRM_APP_USERNAME=\"{{ .Data.username }}\"\n  CRM_POSTGRES_CRM_CRM_APP_PASSWORD=\"{{ .Data.password }}\"\n  CRM_POSTGRES_CRM_CRM_APP_HOST=\"database.com\"\n  CRM_POSTGRES_CRM_CRM_APP_PORT=\"5432\"\n  CRM_POSTGRES_CRM_CRM_APP_DATABASE=\"crm\"\n  CRM_POSTGRES_CRM_CRM_APP_SSL_MODE=\"require\"\n{{ end }}",
              "Envvars": false
            }
          ],

Update:
Nomad trace logs and full job definition sent to nomad-oss-debug@hashicorp.com.

tgross (Member) commented Nov 28, 2022

Thanks @maxramqvist! Reopening. I'll circle back here once I get a chance to dig into those logs.

tgross reopened this Nov 28, 2022
Nomad - Community Issues Triage automation moved this from Done to Needs Triage Nov 28, 2022
tgross self-assigned this Nov 28, 2022
tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Nov 28, 2022
tgross (Member) commented Nov 28, 2022

Ok @maxramqvist, I took a look at those logs and I've extracted the relevant bits for allocation ID 1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0:

At 09:21:24.294Z we see the allocation marked as healthy, and that state is broadcast inside the client to make sure that all the allocation runner components know it:

2022-11-24T09:21:24.294Z [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: health set: alloc_id=1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 healthy=true
2022-11-24T09:21:24.305Z [TRACE] client.alloc_runner: sending updated alloc: alloc_id=1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 client_status=running desired_status=""

Several heartbeat intervals pass, and during this time we get updates that include updating the state of other allocations:

2022-11-24T09:21:24.427Z [TRACE] client: next heartbeat: period=18.077635942s
2022-11-24T09:21:24.507Z [TRACE] client: next heartbeat: period=17.666527231s
2022-11-24T09:21:24.510Z [DEBUG] client: updated allocations: index=15302579 total=135 pulled=1 filtered=134
2022-11-24T09:21:24.510Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=134
2022-11-24T09:21:24.513Z [TRACE] client: next heartbeat: period=16.731584611s
2022-11-24T09:21:24.523Z [TRACE] client.alloc_runner: AllocRunner has terminated, skipping alloc update: alloc_id=2b904124-c608-c711-08b5-8174180bfeb1 modify_index=15299005
2022-11-24T09:21:24.523Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=134 errors=0

Then we get a new update from the server that tells us that allocation ID 1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 needs to be updated, which triggers the update hooks:

2022-11-24T09:21:25.136Z [DEBUG] client: updated allocations: index=15302581 total=135 pulled=2 filtered=133
2022-11-24T09:21:25.136Z [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=133
2022-11-24T09:21:25.141Z [TRACE] client.alloc_runner: running update hooks: alloc_id=1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 start="2022-11-24 09:21:25.141341269 +0000 UTC m=+4074.513664349"

Update hooks are mostly idempotent, but they are not in the case of Vault dynamic secrets. So the underlying problem isn't that the Vault template is getting re-rendered, but that the server is telling the client to update these allocations!

One thing that jumps out at me is that desired_status="" looks weird. I wonder if there's a component that's wiping out the desired state (maybe the max_client_disconnect handling); then, when that's compared against the desired state from the server, it's seen as an update. Otherwise I don't see anything unusual in your job spec that would cause the server to need to update the clients. Let me see if I can reproduce this behavior with a similar job (ideally cutting out Vault as a contributing factor to reduce the complexity here). I'll report back once I've done so.
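
As an aside not from the thread: the restart itself comes from the template's change_mode, so while the root cause is being tracked down, a spurious re-render can be made less disruptive by switching the template to change_mode = "signal" (or "noop"), at the cost of the task having to pick up rotated credentials on its own. A sketch, with an illustrative variable name:

template {
  destination   = "local/crm-postgres.vars"
  # On re-render, send SIGHUP to the task instead of restarting it;
  # the task must re-read the rendered file when it receives the signal.
  # change_mode = "noop" would ignore re-renders entirely.
  change_mode   = "signal"
  change_signal = "SIGHUP"
  data          = "{{ with secret \"database/creds/crm-postgres-crm_app-role\" }}PGPASSWORD={{ .Data.password }}{{ end }}"
}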

MikeN123 (Contributor) commented Nov 30, 2022

I'm still investigating, but I'm seeing the same behaviour with Nomad vars and a block like this:

template {
  data = <<EOF
{{- range nomadVarList }}
  {{- with nomadVar .Path }}
    {{- range .Tuples }}
{{ .K }}={{ .V | toJSON }}
    {{- end }}
  {{- end }}
{{- end }}
EOF
  destination = "secrets/nomad.env"
  env         = true
}

[edit]
This may be unrelated: it seems that after a minute or so it does not find any vars in Nomad, and then suddenly finds them again (Missing: nomad.var.list(nomad/jobs/xxx@default.global)).

ahjohannessen commented:

@MikeN123 I have the same issue, see #15433. I have also experienced something similar with Vault usage in templates, with the token not being renewed (permission denied), and switched to Nomad Variables in the hope of getting something more stable.

tgross (Member) commented Nov 30, 2022

Hi @MikeN123 and @ahjohannessen, the issue in #15433 is definitely unrelated to what we're seeing in the logs on this issue, which is that the template has rendered just fine but the server is updating the alloc out of the blue. I'd definitely encourage you to take those reports over to that issue, though, so we can keep debugging there.

maxramqvist (Author) commented:

Last night we restarted a whole datacenter (DR testing): Consul, Vault, Nomad servers and clients, and so on. The issue hasn't appeared since... but it was consistently happening during every deployment before that.
It wasn't a rolling restart; everything was down at the same time.
Both Nomad and Vault (and Consul) had been upgraded several times before, but none of the clusters had been taken offline before this.

tgross (Member) commented May 18, 2023

I've revisited this and found that we independently fixed the problem with the client updates in #15915, which shipped in Nomad 1.5.0 with backports to Nomad 1.4.5 and 1.3.10. It should be safe to close this now.

tgross closed this as completed May 18, 2023
Nomad - Community Issues Triage automation moved this from Triaging to Done May 18, 2023
tgross added this to the 1.5.0 milestone May 18, 2023