Job with Vault template re-rendered unexpectedly just after task started #15307

Closed · maxramqvist opened this issue Nov 18, 2022 · 9 comments

maxramqvist commented Nov 18, 2022

Nomad version

  • Nomad 1.3.5 + Vault 1.10.1
  • Nomad 1.4.2 + Vault 1.12.1
  • Nomad 1.4.3 + Vault 1.12.1

Operating system and Environment details

Ubuntu 20.04, x86_64

Issue

Nomad jobs with Vault templates using the database secrets engine unexpectedly get re-templated once, roughly 15-45 seconds after the task has started.

It happens with both the MongoDB and the Postgres Vault database integrations.

At first I thought #15057 might be related, but it's not really the same issue, is it? Although fetching credentials twice from Vault does sound like it could be similar.

The issue happens both with and without Connect for the job.
It is reproducible every time in our environment, across different images and different Nomad job configurations: Connect / no Connect, different types of health checks. I've basically changed everything I can think of that could plausibly affect the code path that sets up the templates, with no luck. We still get the extra templating.

This could of course be a Vault bug. If you think that's the case, I'm happy to open an issue there.
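
For reference, the HCL equivalent of the template setup in the attached JSON job looks roughly like the sketch below. Note that change_mode is not set anywhere, so it falls back to Nomad's default of "restart", which is why the re-render shows up as a "Restart Signaled" event. This is only a readability sketch; the attached JSON job further down is the source of truth, and the data field is shortened here.

vault {
  policies = ["crm-postgres-crm_app-role"]
}

template {
  destination = "secrets/crm-postgres.env"
  env         = true
  # change_mode defaults to "restart": any re-render of this template
  # restarts the task, and a fresh database credential is issued.
  data        = <<EOF
{{ with secret "database/creds/crm-postgres-crm_app-role" }}
CRM_POSTGRES_CRM_CRM_APP_USERNAME="{{ .Data.username }}"
CRM_POSTGRES_CRM_CRM_APP_PASSWORD="{{ .Data.password }}"
{{ end }}
EOF
}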

Reproduction steps

Post the attached job to Nomad. Wait for deployment to finish. A couple of seconds later a re-render of the template will happen.

Expected Result

No extra templating after deployment.

Actual Result

An extra templating happens after deployment. After the re-render there are new credentials in the environment variables.

Nov 18, '22 09:55:47 +0100 | Started | Task started by client
Nov 18, '22 09:55:45 +0100 | Restarting | Task restarting in 0s
Nov 18, '22 09:55:44 +0100 | Terminated | Exit Code: 0
Nov 18, '22 09:55:44 +0100 | Restart Signaled | Template with change_mode restart re-rendered    <---- This is unexpected.
Nov 18, '22 09:55:09 +0100 | Started | Task started by client
Nov 18, '22 09:55:05 +0100 | Task Setup | Building Task Directory
Nov 18, '22 09:55:05 +0100 | Received | Task received by client

Job file (if appropriate)

{
  "ID": "cats",
  "Name": "cats",
  "Type": "service",
  "Datacenters": [
    "dc1"
  ],
  "Update": {
    "Stagger": 10000000000,
    "MaxParallel": 2,
    "HealthCheck": "checks",
    "MinHealthyTime": 25000000000,
    "HealthyDeadline": 120000000000,
    "AutoRevert": true,
    "AutoPromote": true,
    "Canary": 2
  },
  "TaskGroups": [
    {
      "Name": "cats",
      "Count": 2,
      "Tasks": [
        {
          "Name": "cats",
          "Vault": {
            "Policies": [
              "crm-postgres-crm_app-role"
            ]
          },
          "Driver": "docker",
          "Config": {
            "network_mode": "fan-250",
            "image": "imageregistry/cats:5ded6dc",
            "volumes": [
              "/usr/local/share/ca-certificates/:/certs/"
            ]
          },
          "Services": [
            {
              "Name": "cats",
              "Port": 80,
              "PortLabel": "80",
              "AddressMode": "driver",
              "Checks": [
                {
                  "Type": "http",
                  "Protocol": "http",
                  "Port": 80,
                  "Path": "/health",
                  "AddressMode": "driver",
                  "Interval": 10000000000,
                  "Timeout": 2000000000,
                  "CheckRestart": {
                    "Limit": 2,
                    "Grace": 30000000000,
                    "IgnoreWarnings": true
                  }
                }
              ]
            }
          ],
          "Env": {
            "SERVICE_PORT": "80"
          },
          "Templates": [
            {
              "DestPath": "secrets/crm-postgres.env",
              "EmbeddedTmpl": "{{ with secret \"database/creds/crm-postgres-crm_app-role\" }}\n  CRM_POSTGRES_CRM_CRM_APP=\"postgres://{{ .Data.username }}:{{ .Data.password }}@crm-postgres-dev.postgres.database.azure.com:5432/crm?sslmode=require\"\n  CRM_POSTGRES_CRM_CRM_APP_USERNAME=\"{{ .Data.username }}\"\n  CRM_POSTGRES_CRM_CRM_APP_PASSWORD=\"{{ .Data.password }}\"\n  CRM_POSTGRES_CRM_CRM_APP_HOST=\"database.com\"\n  CRM_POSTGRES_CRM_CRM_APP_PORT=\"5432\"\n  CRM_POSTGRES_CRM_CRM_APP_DATABASE=\"crm\"\n  CRM_POSTGRES_CRM_CRM_APP_SSL_MODE=\"require\"\n{{ end }}",
              "Envvars": true
            }
          ],
          "Resources": {
            "CPU": 100,
            "MemoryMB": 32
          },
          "RestartPolicy": {
            "Attempts": 10,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          }
        }
      ]
    }
  ]
}

Logs

I've seen nothing in the logs indicating a reason for the extra templating.

tgross (Member) commented Nov 23, 2022

Hi @maxramqvist! I see you've said in #15057:

"I've verified that there is an actual change in the templated credentials (they get rotated), but I haven't looked into TTLs for the credentials or the Vault token."

So this looks pretty strongly to me like a duplicate of #15057, unless you can reproduce with template.env = false. I'm going to close this as a dupe, but if you can provide trace-level client logs in that issue, that might give us a clue as to why consul-template is getting 2 commands to get the secrets.
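
For reference, trace-level client logs are controlled by the Nomad agent's log_level setting. A minimal agent-config sketch (the log_file path is only an example):

log_level = "TRACE"
log_file  = "/var/log/nomad/nomad.log"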

tgross closed this as not planned (duplicate) Nov 23, 2022
Nomad - Community Issues Triage automation moved this from Needs Triage to Done Nov 23, 2022
maxramqvist (Author) commented Nov 24, 2022

Thanks for the feedback @tgross!

I just set this config in the Nomad job and got the same result: an extra, unexpected re-render just after the deployment finished. So this should probably be reopened, right?

          "Templates": [
            {
              "DestPath": "/local/crm-postgres.vars",
              "EmbeddedTmpl": "{{ with secret \"database/creds/crm-postgres-crm_app-role\" }}\n  CRM_POSTGRES_CRM_CRM_APP=\"postgres://{{ .Data.username }}:{{ .Data.password }}@crm-postgres-dev.postgres.database.azure.com:5432/crm?sslmode=require\"\n  CRM_POSTGRES_CRM_CRM_APP_USERNAME=\"{{ .Data.username }}\"\n  CRM_POSTGRES_CRM_CRM_APP_PASSWORD=\"{{ .Data.password }}\"\n  CRM_POSTGRES_CRM_CRM_APP_HOST=\"database.com\"\n  CRM_POSTGRES_CRM_CRM_APP_PORT=\"5432\"\n  CRM_POSTGRES_CRM_CRM_APP_DATABASE=\"crm\"\n  CRM_POSTGRES_CRM_CRM_APP_SSL_MODE=\"require\"\n{{ end }}",
              "Envvars": false
            }
          ],

Update:
Nomad trace logs and full job definition sent to nomad-oss-debug@hashicorp.com.

tgross (Member) commented Nov 28, 2022

Thanks @maxramqvist! Reopening. I'll circle back here once I get a chance to dig into those logs.

tgross reopened this Nov 28, 2022
Nomad - Community Issues Triage automation moved this from Done to Needs Triage Nov 28, 2022
tgross self-assigned this Nov 28, 2022
tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Nov 28, 2022
tgross (Member) commented Nov 28, 2022

Ok @maxramqvist, I took a look at those logs and I've extracted the relevant bits for allocation ID 1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0:

At 09:21:24.294Z we see the allocation marked as healthy, and that state is broadcast inside the client to make sure that all the allocation runner components know it:

2022-11-24T09:21:24.294Z [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: health set: alloc_id=1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 healthy=true
2022-11-24T09:21:24.305Z [TRACE] client.alloc_runner: sending updated alloc: alloc_id=1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 client_status=running desired_status=""

Several heartbeat intervals pass, and during this time we get updates that include updating the state of other allocations:

2022-11-24T09:21:24.427Z [TRACE] client: next heartbeat: period=18.077635942s
2022-11-24T09:21:24.507Z [TRACE] client: next heartbeat: period=17.666527231s
2022-11-24T09:21:24.510Z [DEBUG] client: updated allocations: index=15302579 total=135 pulled=1 filtered=134
2022-11-24T09:21:24.510Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=134
2022-11-24T09:21:24.513Z [TRACE] client: next heartbeat: period=16.731584611s
2022-11-24T09:21:24.523Z [TRACE] client.alloc_runner: AllocRunner has terminated, skipping alloc update: alloc_id=2b904124-c608-c711-08b5-8174180bfeb1 modify_index=15299005
2022-11-24T09:21:24.523Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=134 errors=0

Then we get a new update from the server that tells us that allocation ID 1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 needs to be updated, which triggers the update hooks:

2022-11-24T09:21:25.136Z [DEBUG] client: updated allocations: index=15302581 total=135 pulled=2 filtered=133
2022-11-24T09:21:25.136Z [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=133
2022-11-24T09:21:25.141Z [TRACE] client.alloc_runner: running update hooks: alloc_id=1ad60fca-8c6c-d5ea-3aef-30c8c8f233d0 start="2022-11-24 09:21:25.141341269 +0000 UTC m=+4074.513664349"

Update hooks are mostly idempotent, but they are not in the case of Vault dynamic secrets. So the underlying problem isn't that the Vault template is getting re-rendered, but that the server is telling the client to update these allocations!

One thing that jumps out at me is that desired_status="" looks weird. I wonder if there's a component that's wiping out the desired state (maybe the max_client_disconnect handling); then, when that's compared against the desired state from the server, it's seen as an update. Otherwise I don't see anything unusual in your job spec that would cause the server to need to update the clients. Let me see if I can reproduce this behavior with a similar job (ideally cutting out Vault as a contributing factor to reduce the complexity here). I'll report back once I've done so.
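
As an aside not from the thread: the restart itself comes from the template's change_mode, so while the root cause is being tracked down, a spurious re-render can be made less disruptive by switching the template to change_mode = "signal" (or "noop"), at the cost of the task having to pick up rotated credentials on its own. A sketch, with an illustrative variable name:

template {
  destination   = "local/crm-postgres.vars"
  # On re-render, send SIGHUP to the task instead of restarting it;
  # the task must re-read the rendered file when it receives the signal.
  # change_mode = "noop" would ignore re-renders entirely.
  change_mode   = "signal"
  change_signal = "SIGHUP"
  data          = "{{ with secret \"database/creds/crm-postgres-crm_app-role\" }}PGPASSWORD={{ .Data.password }}{{ end }}"
}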

MikeN123 (Contributor) commented Nov 30, 2022

I'm still investigating, but I'm seeing the same behaviour with Nomad vars and a block like this:

template {
  data = <<EOF
{{- range nomadVarList }}
  {{- with nomadVar .Path }}
    {{- range .Tuples }}
{{ .K }}={{ .V | toJSON }}
    {{- end }}
  {{- end }}
{{- end }}
EOF
  destination = "secrets/nomad.env"
  env         = true
}

[edit]
This may be unrelated: it seems that after a minute or so it does not find any vars in Nomad, and then suddenly finds them again (Missing: nomad.var.list(nomad/jobs/xxx@default.global)).

ahjohannessen commented:

@MikeN123 I have the same issue, see #15433. I have also experienced something similar with Vault usage in templates, with the token not being renewed (permission denied), and switched to Nomad Variables in the hope of getting something more stable.

tgross (Member) commented Nov 30, 2022

Hi @MikeN123 and @ahjohannessen, the issue in #15433 is definitely unrelated to what we're seeing in the logs on this issue, which is that the template has rendered just fine but the server is updating the alloc out of the blue. I'd definitely encourage you to take those reports over to that issue, though, so we can keep debugging there.

maxramqvist (Author) commented:

Last night we restarted a whole datacenter (DR testing): Consul, Vault, Nomad servers and clients, and so on. The issue hasn't appeared since... but it was consistently happening during every deployment before that.
It wasn't a rolling restart; everything was down at the same time.
Both Nomad and Vault (and Consul) had been upgraded several times before, but none of the clusters had been taken offline before this.

tgross (Member) commented May 18, 2023

I've revisited this and found that we independently fixed the problem with the client updates in #15915, which shipped in Nomad 1.5.0 with backports to Nomad 1.4.5 and 1.3.10. It should be safe to close this now.

tgross closed this as completed May 18, 2023
Nomad - Community Issues Triage automation moved this from Triaging to Done May 18, 2023
tgross added this to the 1.5.0 milestone May 18, 2023