Jobs with Nomad Variables are always broken by canary deployments #16259

Closed
aofei opened this issue Feb 25, 2023 · 3 comments · Fixed by #16266

Comments

@aofei
Contributor

aofei commented Feb 25, 2023

Nomad version

Nomad v1.4.4 (7f29429)

Operating system and Environment details

Linux nomad-bug-16259 #36-Ubuntu SMP Mon Jan 23 21:04:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue

Screen.Recording.2023-02-25.at.22.29.40.mp4

Reproduction steps

  1. Create file nomad-server.hcl:
data_dir = "/tmp/nomad-server"
name = "server"
server {
        enabled = true
        bootstrap_expect = 1
}
acl {
        enabled = true
}
  2. Create file nomad-client.hcl:
data_dir = "/tmp/nomad-client"
name = "client"
ports {
        http = "14646"
}
client {
        enabled = true
        servers = ["127.0.0.1"]
}
acl {
        enabled = true
}
  3. Create file foobar.nomad:
job "foobar" {
        datacenters = ["dc1"]
        group "foobar" {
                update {
                        canary = 1
                }
                task "foobar" {
                        driver = "exec"
                        config {
                                command = "tail"
                                args = ["-f", "/dev/null"]
                        }
                        template {
                                destination = "local/foobar"
                                data = <<EOF
job_version=0
{{with $jobVar := nomadVar "nomad/jobs/foobar"}}
foo={{$jobVar.foo}}
{{end}}
EOF
                        }
                }
        }
}
  4. Open terminal tab 1 and execute:
$ sudo nomad agent -config nomad-server.hcl
  5. Open terminal tab 2 and execute:
$ sudo nomad agent -config nomad-client.hcl
  6. Open terminal tab 3 and execute:
$ nomad acl bootstrap
$ export NOMAD_TOKEN=<The Bootstrap Token>
$ nomad var put nomad/jobs/foobar foo=bar
$ nomad run foobar.nomad
$ # Wait until the job deploys successfully.
$ sed -i 's/job_version=0/job_version=1/g' foobar.nomad
$ nomad run foobar.nomad
$ # Wait until the job deploys successfully.
  7. Open terminal tab 4 and execute:
$ export NOMAD_TOKEN=<The Bootstrap Token>
$ nomad job promote -group foobar foobar
  8. Now check terminal tab 2, where the client agent is running.
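
Step 6 edits the template so that a second nomad run registers a new job version. To repeat the reproduction more than once, the sed edit can be generalized. The following is a minimal sketch: bump_job_version is a hypothetical helper name (not part of Nomad or of the original report), and it assumes the job_version=N marker sits on its own line, as in foobar.nomad above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: increment the job_version=N marker in a job file
# so that the next `nomad run` registers a new job version.
# Prints the new version number; returns 1 if no marker is found.
bump_job_version() {
    local file="$1"
    local current next tmp

    # Read the first job_version=N line (leading whitespace allowed).
    current="$(sed -n 's/^[[:space:]]*job_version=\([0-9][0-9]*\)$/\1/p' "$file" | head -n 1)"
    if [ -z "$current" ]; then
        echo "no job_version marker in $file" >&2
        return 1
    fi
    next=$((current + 1))

    # Rewrite through a temp file instead of `sed -i`, whose syntax
    # differs between GNU and BSD sed.
    tmp="$(mktemp)"
    sed "s/^\([[:space:]]*job_version=\)${current}\$/\1${next}/" "$file" > "$tmp" || {
        rm -f "$tmp"
        return 1
    }
    cat "$tmp" > "$file"
    rm -f "$tmp"
    echo "$next"
}
```

For example, running bump_job_version foobar.nomad followed by nomad run foobar.nomad should start another canary deployment that then waits for promotion.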

Other things to say

Key requirements for this bug to reproduce:

  1. ACL must be enabled.
  2. Job's update.canary must be greater than zero.
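
Both requirements can be checked against the files from the reproduction steps before trying to reproduce. The sketch below is a rough text heuristic, not an HCL parser: canary_count and acl_enabled are hypothetical helper names, and the checks assume the simple layout used in foobar.nomad and nomad-client.hcl above. ACLs enabled through a command-line flag rather than a config file are not detected.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the first `canary = N` value found in a
# job file, or 0 if the file sets none.
canary_count() {
    local value
    value="$(sed -n 's/^[[:space:]]*canary[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | head -n 1)"
    echo "${value:-0}"
}

# Hypothetical helper: succeed if the agent config contains
# `enabled = true` inside an `acl { ... }` block.
acl_enabled() {
    awk '
        /^[[:space:]]*acl[[:space:]]*[{]/ { in_acl = 1 }
        in_acl && /enabled[[:space:]]*=[[:space:]]*true/ { found = 1 }
        in_acl && /[}]/ { in_acl = 0 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

A quick check might look like: acl_enabled nomad-client.hcl && [ "$(canary_count foobar.nomad)" -gt 0 ] && echo "bug preconditions met".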
aofei added the type/bug label Feb 25, 2023
@tgross
Member

tgross commented Feb 27, 2023

Hi @aofei! Thanks for opening this issue.

I was able to confirm this on the current HEAD of main with sudo nomad agent -dev -acl-enabled. Using the new identity block feature we're shipping in Nomad 1.5.0, I was also able to confirm that the workload identity isn't somehow being invalidated (which I didn't think was possible, but I wanted to make sure):

$ nomad alloc exec c96 env | grep NOMAD_TOKEN
NOMAD_TOKEN=<redacted>

$ NOMAD_TOKEN=<redacted> nomad var get nomad/jobs/foobar
Namespace   = default
Path        = nomad/jobs/foobar
Create Time = 2023-02-27T09:41:49-05:00
Check Index = 21

Items
foo = bar

I've double-checked this by running a job that, instead of tail -f, runs the following bash script:

#!/usr/bin/env bash
set -e

while :
do
    curl -v --fail \
         -H "X-Nomad-Token: ${NOMAD_TOKEN}" \
         --unix-socket "${NOMAD_SECRETS_DIR}/api.sock" \
         "http://localhost/v1/var/nomad/jobs/foobar" | jq '.Items.foo'
    sleep 2
done

If I tail the logs of the new allocation, it's still getting the variable even after promotion. That strongly suggests the problem is happening somewhere in the Nomad client, maybe in the template hook. I'm going to chat with some of my colleagues about where to look next and one of us will report back here.
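
A bounded variant of the polling loop above can make it easier to see whether access to the variable is ever lost around promotion, since it records each failed attempt instead of relying on the loop simply continuing. This is only a sketch under stated assumptions: poll_variable and fetch_foobar are hypothetical names, and the socket path, header, and URL are copied from the script above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run FETCH_CMD up to ATTEMPTS times, DELAY seconds
# apart, and log the outcome of each attempt.
# Returns 0 only if every attempt succeeded.
poll_variable() {
    local fetch="$1" attempts="$2" delay="$3"
    local i=1 failures=0

    while [ "$i" -le "$attempts" ]; do
        if "$fetch" >/dev/null 2>&1; then
            echo "attempt $i: ok"
        else
            echo "attempt $i: failed"
            failures=$((failures + 1))
        fi
        i=$((i + 1))
        # Sleep between attempts, but not after the last one.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done

    [ "$failures" -eq 0 ]
}

# Fetch the variable through the task API socket, as in the script above.
# Only meaningful inside an allocation where NOMAD_TOKEN and
# NOMAD_SECRETS_DIR are set.
fetch_foobar() {
    curl --silent --fail \
         -H "X-Nomad-Token: ${NOMAD_TOKEN}" \
         --unix-socket "${NOMAD_SECRETS_DIR}/api.sock" \
         "http://localhost/v1/var/nomad/jobs/foobar"
}
```

Inside the task, poll_variable fetch_foobar 30 2 would poll for roughly a minute, which leaves time to promote the deployment from another terminal and then look for any "failed" lines in the task logs.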

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Feb 27, 2023
tgross self-assigned this Feb 27, 2023
tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Feb 27, 2023
tgross added this to the 1.5.0 milestone Feb 27, 2023
tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Feb 27, 2023
@tgross
Member

tgross commented Feb 27, 2023

Fixed in #16266, which will ship in the upcoming Nomad 1.5.0-rc1 and will get backported to Nomad 1.4.x once 1.5 is GA.

@aofei
Contributor Author

aofei commented Feb 27, 2023

Hi @tgross! Thanks for the great analysis! Glad it's fixed.
