
change_mode = "script" sometimes fails with Docker #15851

Closed
dani opened this issue Jan 23, 2023 · 21 comments · Fixed by #15915 or #23663
Labels: hcc/cst (Admin - internal), hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/driver/docker, theme/template, type/bug

dani commented Jan 23, 2023

Nomad version

Nomad v1.4.3 (f464aca)

Operating system and Environment details

AlmaLinux 8.7
Nomad installed manually from the pre-built binary. Using Docker CE 20.10.22

Issue

For some of my tasks, I trigger a script when a template changes. Typically, I use this to trigger a custom reload action when a Vault-generated certificate gets renewed. Most of the time this works great, but sometimes the execution fails with this error:

Template failed to run script /local/bin/rotate-cert.sh with arguments [] because task driver doesn't support the exec operation

Reproduction steps

A job with a templated cert, a reload script and change_mode = "script", for example with:

      template {
        data        = <<-EOF
          {{ with pkiCert "pki/db/issue/mongo" "common_name=mongo.service.consul" "ttl=5m" }}
          {{ .Cert }}
          {{ .Key }}
          {{ end }}
        EOF
        destination = "secrets/mongo.bundle.pem"
        change_mode = "script"
        change_script {
          command = "/local/bin/rotate-cert.sh"
        }
      }

Using a short TTL like in this example makes the problem more visible, but it's still quite random.

Expected Result

The /local/bin/rotate-cert.sh script is fired every time the cert is renewed.

Actual Result

Most of the time it works, but sometimes (I couldn't find any common pattern) the script fails with:

Template failed to run script /local/bin/rotate-cert.sh with arguments [] because task driver doesn't support the exec operation

As a workaround, I've added the fail_on_error parameter:

        change_script {
          command = "/local/bin/rotate-cert.sh"
          # If cert rotation fails, kill the task and restart it
          fail_on_error = true
        }

So when this happens, the task is killed and restarted (so it gets its new cert).
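As a side note, the exec path itself can be exercised by hand with nomad alloc exec, which relies on the same exec support in the task driver that the change script needs (the alloc ID and task name here are just placeholders taken from the log excerpt below):

$ nomad alloc exec -task tracking-mongod 9b688daf /local/bin/rotate-cert.sh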

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Nothing interesting; we can see the template being rendered and the task being stopped (because of the fail_on_error):

janv. 23 10:14:54 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:54.138+0100 [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9b688daf-6269-eef6-1752-4f1db4413b3f/tracking-mongod/secrets/mongo.bundle.pem"
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.342+0100 [INFO]  client.driver_mgr.docker: stopped container: container_id=28a3bfc55b0d869f3abe3d2e67fa0ac2b26f1984133ee93da03d053fc110782d driver=docker
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.378+0100 [INFO]  agent: (runner) stopping
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.378+0100 [INFO]  agent: (runner) received finish
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.577+0100 [INFO]  client.driver_mgr.docker: stopped container: container_id=639b3b2291c13487a9c6e09e0dad620403938c9f1201cf33cc1f463f851e82af driver=docker
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.579+0100 [INFO]  client.driver_mgr.docker: stopped container: container_id=6103d61436006b17d96a1dc4a9f7f767e59eea9e6f782e0b8da5bbda2b344a81 driver=docker
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.613+0100 [INFO]  agent: (runner) stopping
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.614+0100 [INFO]  agent: (runner) received finish
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.628+0100 [INFO]  agent: (runner) stopping
janv. 23 10:14:59 ct-wn-3 nomad[3202532]:     2023-01-23T10:14:59.629+0100 [INFO]  agent: (runner) received finish
dani added the type/bug label Jan 23, 2023
lgfa29 self-assigned this Jan 23, 2023
lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 23, 2023
lgfa29 (Contributor) commented Jan 23, 2023

Hi @dani 👋

Thanks for the bug report. Do you have any other template in this task that could cause the container to restart? Or maybe something in the rotate-cert.sh script could be causing something like that?

lgfa29 moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 23, 2023
dani (Author) commented Jan 23, 2023

Nope, nothing specific in the script, which looks like:

#!/bin/sh

# the heredoc is fed to mongosh on stdin
mongosh --tlsAllowInvalidCertificates <<_EOF
  db.adminCommand( { rotateCertificates: 1, message: "Rotating certificates" } );
_EOF

And nothing else that could be triggered at the same time. The error message is quite surprising. If the script were simply failing, the message would be different (I don't remember the exact error message when the script fails, but it clearly states its exit code). In this case, it looks like the script isn't even fired ("because task driver doesn't support the exec operation").
The task restarting is because of fail_on_error = true. The default behavior is for the change_script to fail (not be fired) and the task to stay active, with a non-working service because its certificate has expired.
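For completeness, that "task stays active with an expired cert" failure mode is easy to spot from outside. Something like this shows the expiry of the certificate the service is actually serving (hostname taken from the template above, MongoDB default port assumed):

$ echo | openssl s_client -connect mongo.service.consul:27017 2>/dev/null | openssl x509 -noout -enddate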

lgfa29 (Contributor) commented Jan 26, 2023

I was able to reproduce this with this job:

job "example" {
  datacenters = ["dc1"]

  group "client" {
    task "curl" {
      driver = "docker"

      config {
        image   = "curlimages/curl:7.87.0"
        command = "/bin/ash"
        args    = ["local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env ash

while true; do
{{range nomadService "server"}}
  curl http://{{.Address}}:{{.Port}}/
{{end}}
  sleep 1
done
EOF
        destination = "local/script.sh"
        change_mode = "script"
        change_script {
          command = "/bin/ash"
          args    = ["local/change.sh"]
        }
      }

      template {
        data        = <<EOF
#!/usr/bin/env ash

date
echo "change"
EOF
        destination = "local/change.sh"

      }

      resources {
        cpu    = 10
        memory = 50
      }
    }
  }

  group "server" {
    network {
      port "http" {}
    }

    service {
      name     = "server"
      provider = "nomad"
      port     = "http"
    }

    task "http" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "${NOMAD_PORT_http}", "-h", "/local"]
        ports   = ["http"]
      }

      template {
        data        = <<EOF
hello world
EOF
        destination = "local/index.html"
      }

      resources {
        cpu    = 10
        memory = 50
      }
    }
  }
}

Changing the server->http task and re-running the job will trigger the error.

The problem is that if the Nomad or Vault token changes, the template hook will shut down the existing template manager and create a new one.

The new template manager will not have access to the task driver handle that was set in the Poststart hook, since these are in-place updates.

#15915 restores the driver handler when the template manager is recreated.

@dani do you have more than one task in your group or are you using Vault by any chance?

lgfa29 added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) and removed stage/needs-investigation Jan 26, 2023
dani (Author) commented Jan 27, 2023

@dani do you have more than one task in your group or are you using Vault by any chance?

Yes to both: there are 3 tasks plus one prestart, and I'm using Vault (to get certificates).

lgfa29 (Contributor) commented Jan 27, 2023

Ah cool, so yeah I think the reproduction I had was the main cause behind the issue you're having. The next release of Nomad will have a fix for it.

kkornienko-aparavi commented Jun 20, 2023

Still experiencing this error in Nomad 1.5.6.

change_script runs every day. It can run successfully several times, but sooner or later I'm getting:

Template failed to run script /local/mariadb_ssl_reload.sh with arguments [] because task driver doesn't support the exec operation

Any suggestions?

lgfa29 (Contributor) commented Jun 22, 2023

Hi @kkornienko-aparavi 👋

Do you have any logs around when the problem happens? And would you be able to provide a sample of your job?

kkornienko-aparavi commented Jun 26, 2023

Hello @lgfa29, thanks for responding.
Looks like I found the root cause of the issue. It happens right after the Nomad client is restarted: the new process tries to re-render the templates and fails.
I don't know whether that's even supposed to work or not.


Posting the important parts of the job below.

It's a Grafana deployment job, which consists of 2 tasks: one is MariaDB and the second is the Grafana app.
We use Vault with short-lived certificates; we issue certificates and apply them to MariaDB.
MariaDB requires a special command to be invoked to reload the certificates, which is why we use a change script.

job "grafana" {

  dynamic "group" {
    for_each = var.db_enabled ? [1] : []
    labels   = ["db"]
    content {
      count = 1

      network {
        port "mariadb" {
          to = 3306
        }
      }

      volume "nfsgrafana" {
        type            = "csi"
        source          = "nfsgrafana"
        access_mode     = "multi-node-multi-writer"
        attachment_mode = "file-system"
      }

      task "db" {
        driver = "docker"
        vault {
          env      = false
          policies = [...]
        }
        volume_mount {
          volume      = "nfsgrafana"
          destination = "/bitnami/mariadb"
        }
        config {
          image = "bitnami/mariadb"
          ports = ["mariadb"]
        }
        service {
          name = "grafana-db"
          port = "mariadb"
          check {
            name     = "tcp_validate"
            type     = "tcp"
            port     = "mariadb"
            interval = "15s"
            timeout  = "5s"
          }
        }
        env {
          MARIADB_DATABASE     = "grafana"
          MARIADB_SKIP_TEST_DB = "yes"
          MARIADB_EXTRA_FLAGS = join(" ", [
            "--require_secure_transport=ON",
            "--ssl_ca=/secrets/ca.crt",
            "--ssl_cert=/secrets/cert.crt",
            "--ssl_key=/secrets/cert.key",
            "--tls_version=TLSv1.3",
          ])
        }
        template {
          destination = "secrets/ca.crt"
          change_mode = "restart"
          data        = <<EOH
{{ with secret "pki/cert/ca" }}{{.Data.certificate}}{{ end }}
  EOH
        }
        template {
          destination = "secrets/cert.crt"
          change_mode = "noop"
          data        = <<EOH
{{- with $secret := secret "pki/issue/..." "common_name=..." "ttl=3d"}}
{{.Data.certificate}}
{{- end}}
  EOH
        }
        template {
          destination = "secrets/cert.key"
          change_mode = "script"
          change_script {
            command = "/local/mariadb_ssl_reload.sh"
          }
          data = <<EOH
{{- with $secret := secret "pki/issue/..." "common_name=..." "ttl=3d"}}
{{.Data.private_key}}
{{- end}}
  EOH
        }
        template {
          data        = file("grafana/mariadb_ssl_reload.sh")
          destination = "local/mariadb_ssl_reload.sh"
          perms       = "750"
        }
      }
    }
  }

  group "app" {
    count = 1

    network {
      port "grafana-http" {
        to = 3000
      }
    }

    task "app" {
      driver = "docker"
    }
    
    ...
    grafana consumes the DB, getting its address via Consul.
    ...

  }
}

mariadb_ssl_reload.sh
#!/usr/bin/env bash
mariadb --batch -u $MARIADB_ROOT_USER -p$MARIADB_ROOT_PASSWORD -e "flush ssl" --ssl && \
openssl x509 -in /secrets/cert.crt -dates -subject -issuer -noout > /local/ssl_reload_out.txt

Logs from Nomad client:
Nomad UI:
Task hook failed | Template failed to run script /local/mariadb_ssl_reload.sh with arguments [] because task driver doesn't support the exec operation

lgfa29 (Contributor) commented Jul 5, 2023

Thanks for the extra info @kkornienko-aparavi. Unfortunately I have not been able to reproduce this yet, even after restarting the Nomad agent.

It happens right after nomad client is restarted

Just to clarify this part, do you mean the Nomad client process is restarted or is it the entire machine, like a server reboot?

I'm reopening this issue just in case until we find a better answer.

lgfa29 reopened this Jul 5, 2023
lgfa29 moved this from Done to Needs Roadmapping in Nomad - Community Issues Triage Jul 5, 2023
the-nando (Contributor) commented Dec 11, 2023

hey @lgfa29 😄 I've run into the same issue on 1.6.3 with a script used to update the Java keystore on changes to a certificate rendered from Vault's PKI via Consul template:

template {
        destination = "secrets/kafka.pem"
        change_mode = "script"
        change_script {
          command       = "/local/hot-reload-keystore.sh"
          timeout       = "20s"
          fail_on_error = false
        }
        splay = "0ms"
        data  = <<-EOF
        {{- with secret "pki/issue/kafka-brokers" "common_name=xxx" "ttl=10m" "alt_names=*.yyyy" "private_key_format=pkcs8" -}}
        {{ .Data.private_key }}

        {{ .Data.certificate }}

        {{ range .Data.ca_chain }}
        {{ . }}

        {{ end }}
        {{- end -}}
        EOF
}

While investigating the issue I was able to reproduce it with a loop updating a key in Consul K/V and restarting the Nomad client while the loop is running:

1. Start the job:
job "test" {
    region      = "dc"
    datacenters = ["dca"]

    constraint {
        attribute = node.unique.name
        value     = "xxx"
    }

    group "test" {
        count = 1

        task "test" {
            driver = "docker"
            config {
                image = "alpine:3.18.2"
                args  = ["tail", "-f", "/dev/null"]
            }

            template {
        data        = <<EOF
{{ range ls "services/my-ns/test" }}
{{.Key}}={{.Value}}
{{ end }}
EOF
        destination = "local/test.tmpl"
        change_mode = "script"
        change_script {
          command = "/local/change.sh"
        }
      }

      template {
        data        = <<EOF
#!/bin/sh
date >> local/run.log
EOF
        destination = "local/change.sh"
        perms       = "555"

      }
            resources {
                cores  = 4
                memory = 64
            }
        }
    }
}
2. Start the loop:
while : ; do consul kv put services/my-ns/test/foo a=$RANDOM ; sleep 5 ; done 
3. Restart the Nomad client:
2023-12-11T14:08:34.829Z [DEBUG] agent: (runner) receiving dependency kv.list(services/my-ns/test)
2023-12-11T14:08:34.829Z [DEBUG] agent: (runner) initiating run
2023-12-11T14:08:34.829Z [DEBUG] agent: (runner) checking template 06ec856a7a6a01f081141dc0b6392a33
2023-12-11T14:08:34.830Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/nomad/data/alloc/409a1017-ed4c-608f-c9a6-d24366f47dc4/test/local/test.tmpl"
2023-12-11T14:08:34.833Z [INFO]  agent: (runner) rendered "(dynamic)" => "/nomad/data/alloc/409a1017-ed4c-608f-c9a6-d24366f47dc4/test/local/test.tmpl"
2023-12-11T14:08:34.833Z [DEBUG] agent: (runner) checking template a55d0446878b1ccea6af24e1d973c69e
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/nomad/data/alloc/409a1017-ed4c-608f-c9a6-d24366f47dc4/test/local/change.sh"
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) diffing and updating dependencies
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) kv.list(services/my-ns/test) is still needed
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) watching 1 dependencies
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) all templates rendered
2023-12-11T14:08:37.219Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=409a1017-ed4c-608f-c9a6-d24366f47dc4 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false

Following runs succeed:

2023-12-11T14:08:43.021Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=409a1017-ed4c-608f-c9a6-d24366f47dc4 task=test type="Task hook message" msg="Template successfully ran script /local/change.sh with arguments: []. Exit code: 0" failed=false

Can you please check if you are able to reproduce it as well?
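If it helps, a simple way to watch for the failing event while the loop runs, assuming the client is managed by systemd as the nomad unit:

$ journalctl -u nomad -f | grep "Template failed to run script"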

ron-savoia (Contributor) commented:

I've reproduced this in my lab as well, using the jobspecs provided by the-nando and lgfa29 above.
In each case the issue was noted only after a client restart.
The logs (nomad, journal & docker) and bundle attached were captured while testing with @the-nando's jobspec and steps.

Nomad version used

1.6.2+ent

Client restart timestamps / nomad log grep

root@nomad-client-1:/etc/docker# date; systemctl restart nomad; grep "Template failed to run script" /opt/nomad/nomad.log
Tue 12 Dec 2023 03:26:07 PM UTC
2023-12-12T15:26:04.381Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false

root@nomad-client-1:/etc/docker# date; systemctl restart nomad; grep "Template failed to run script" /opt/nomad/nomad.log
Tue 12 Dec 2023 03:26:34 PM UTC
2023-12-12T15:26:04.381Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
2023-12-12T15:26:20.033Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false

root@nomad-client-1:/etc/docker# date; systemctl restart nomad; grep "Template failed to run script" /opt/nomad/nomad.log
Tue 12 Dec 2023 03:27:20 PM UTC
2023-12-12T15:26:04.381Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
2023-12-12T15:26:20.033Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
2023-12-12T15:26:45.017Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false

client_docker.log.zip
journal_nomad_client.log.zip
journal_nomad_server.log.zip
nomad_client.log.zip
nomad_server.log.zip
nomad-debug-2023-12-12-152541Z.tar.gz

davemay99 added the hcc/cst (Admin - internal) label Dec 15, 2023
aaronkvanmeerten commented:

Anecdotally (I don't have any of the above dumps currently), this behavior continues in 1.7.5, and it seems to continue to be related to a nomad client restart.

tgross self-assigned this Jul 18, 2024
tgross added this to the 1.8.x milestone Jul 18, 2024
tgross (Member) commented Jul 18, 2024

Hi folks, I'm just getting started digging into this. At first glance it's pretty bizarre:

  • The error we're getting comes from the Poststart method of the template hook (ref template_hook.go#L203-L212) whenever the poststart request's DriverExec field is nil.
  • That request object is created in the taskrunner (ref task_runner_hooks.go#L350-L375) and we can see it's populated by a call to the function NewLazyHandle.
  • NewLazyHandle always returns an object, so there's no way for the field to be set to nil!

While I could easily imagine, given that the bug shows up around client restarts, that we've got some kind of race condition where the handle is not yet valid, that doesn't appear to be what's happening. My suspicion is that when the lazy handle's refreshHandleLocked method times out, we're setting the resulting nil object as the handle and effectively "cross-contaminating" between hooks. This will take a good bit more debugging, but I'll update once I know more.

tgross (Member) commented Jul 19, 2024

Oh, looks like I was looking at the wrong error message. That's embarrassing. 😊 The "Template failed to run script %v with arguments %v because task driver doesn't support the exec operation" is from the task template manager, not the hook (ref template.go#L588). That's much less puzzling.

In that case, the error isn't happening in the poststart hook but in the prestart hook. But the handle we need to fire the change script isn't set until the poststart hook. So the template renders for the "first" time (according to the task runner lifecycle), immediately detects a change to the file, and that triggers the change script even though we don't have a handle to the task yet.

The very helpful client logs that @ron-savoia posted show the order of events pretty clearly for alloc 05607ec2 (with irrelevant bits elided):

2023-12-12T15:26:03.008Z [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test name=template start="2023-12-12 15:26:03.008869715 +0000 UTC m=+10.148452814"
2023-12-12T15:26:03.008Z [INFO] agent: (runner) creating new runner (dry: false, once: false)
...
(yadda yadda consul-template stuff)
...
2023-12-12T15:26:03.009Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/opt/nomad/data/alloc/05607ec2-43e9-283c-e652-7b5ed2c4c412/test/local/change.sh"
2023-12-12T15:26:03.009Z [DEBUG] agent: (runner) diffing and updating dependencies
2023-12-12T15:26:03.009Z [DEBUG] agent: (runner) watching 1 dependencies
...
2023-12-12T15:26:03.012Z [DEBUG] agent: (runner) all templates rendered
...
2023-12-12T15:26:04.381Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
...
2023-12-12T15:26:04.389Z [TRACE] client.alloc_runner.task_runner: finished prestart hooks: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test end="2023-12-12 15:26:04.389997341 +0000 UTC m=+11.529580481" duration=1.389301764s
...
2023-12-12T15:26:04.402Z [TRACE] client.alloc_runner.task_runner: running poststart hook: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test name=template start="2023-12-12 15:26:04.402213108 +0000 UTC m=+11.541796249"

We're never going to have a valid task handle at the point of the prestart, so we simply can't run the script at that point.

What I'll do next is write a unit test demonstrating the exact behavior we're seeing, and then see if I can figure out a reasonable way to "queue up" the script execution so that if we restore a task handle we can run it afterwards, without getting stuck or trying to execute if the task handle fails to be restored (e.g. the task gets killed between the time we render and the time we reach the poststart hook), of course.

(Edit: I thought this sounded familiar, and as it turns out we didn't trigger change mode on templates during restore until I fixed that way back in #9636. So either this is a regression or a bug in that original implementation years ago.)

tgross (Member) commented Jul 19, 2024

At this point reproduction is easy and deterministic when running at least one Nomad server and a separate Nomad client.

Create a variable:

$ nomad var put nomad/jobs/example name=Tim
Created variable "nomad/jobs/example" with modify index 428

Run the following job.

jobspec
job "example" {

  group "group" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      template {
        data = <<EOT
<html>
<h1>hello, {{ with nomadVar "nomad/jobs/example" -}}{{.name}}{{- end -}}</h1>
</html>
        EOT

        destination = "local/index.html"

        change_mode = "script"
        change_script {
          command       = "/bin/sh"
          args          = ["-c", "echo ok"]
          fail_on_error = true
        }
      }

      resources {
        cpu    = 50
        memory = 50
      }

    }
  }
}

Stop the client.

$ sudo systemctl stop nomad

Update the variable.

$ nomad var put -force nomad/jobs/example name=Ron
Created variable "nomad/jobs/example" with modify index 440

Restart the client.

$ sudo systemctl start nomad

See the allocation fail:

$ nomad alloc status
...
Recent Events:
Time                       Type              Description
2024-07-19T16:15:46-04:00  Terminated        Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2024-07-19T16:15:41-04:00  Killing           Template script failed, task is being killed
2024-07-19T16:15:41-04:00  Task hook failed  Template failed to run script /bin/sh with arguments [-c echo ok] because task driver doesn't support the exec operation
2024-07-19T16:15:08-04:00  Started           Task started by client
2024-07-19T16:15:08-04:00  Task Setup        Building Task Directory
2024-07-19T16:15:08-04:00  Received          Task received by client

tgross added a commit that referenced this issue Jul 22, 2024
For templates with `change_mode = "script"`, we set a driver handle in the
poststart method, so the template runner can execute the script inside the
task. But when the client is restarted and the template contents change during
that window, we trigger a change_mode in the prestart method. In that case, the
hook will not have the handle and so returns an error trying to run the change
mode.

We restore the driver handle before we call any prestart hooks, so we can pass
that handle in the constructor whenever it's available. In the normal task start
case the handle will be empty but also won't be called.

The error messages are also misleading, as there's no capabilities check
happening here. Update the error messages to match.

Fixes: #15851
Ref: https://hashicorp.atlassian.net/browse/NET-9338
tgross (Member) commented Jul 22, 2024

I've got a fix up here: #23663. The task driver handle was actually already available, but we simply weren't providing it to the template hook in time for this use.

aisvarya2 commented Jul 31, 2024

We are using Nomad version v1.8.2+ent. The template block looks as follows:

template {
        destination = "secrets/secret.txt"
        change_mode = "script"
        change_script {
          command = "/bin/sh"
          args    = ["sleep", "180", "&&", "touch", "/local/test"]
          timeout = "90s"
        }
        data        = <<EOT
{{- with secret "dynamic/test-template" }}{{ .Data.data.foo }}{{- end }}
EOT
      }

However, when the secret in Vault is updated, the post hook fails with "Template ran script /bin/sh with arguments [sleep 5 && touch /local/test] on change but it exited with code code: 2".
There are no client restarts prior to this activity.

tgross (Member) commented Jul 31, 2024

@aisvarya2 that's a different issue. In your case, the change script is running but you're getting an error. That's because you're passing invalid arguments to /bin/sh -- you can't pass what you'd type in the shell as arguments to /bin/sh. I believe you want something like:

change_script {
  command = "/bin/sh"
  args    = ["-c", "sleep 180 && touch /local/test"]
  timeout = "90s"
}

Also, your timeout is less than the sleep window, but I suspect you've got some redaction going on here.
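For what it's worth, the difference is easy to see in a plain shell outside Nomad: each element of args becomes its own argv entry for /bin/sh, so the && is never parsed by a shell unless the whole command line is passed as a single -c string.

# each args element is a separate argv entry; sh treats "sleep" as a script file to open and fails
$ /bin/sh sleep 180 '&&' touch /local/test

# a single -c string is parsed by the shell itself, so && behaves as intended
$ /bin/sh -c 'sleep 180 && touch /local/test'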

aisvarya2 commented:

@tgross we tried using the change_script as you suggested, with an appropriate timeout, but it does not work as expected when the command is passed via args.
It works as expected when we use command to run the shell script directly, but not when it is passed via args:

template {
        destination = "secrets/secret.txt"
        change_mode = "script"
        change_script {
          command = "local/change.sh"
          timeout = "90s"
        }
        data        = <<EOT
{{- with secret "dynamic/test-template" }}{{ .Data.data.foo }}{{- end }}
EOT
      }

template {
        data        = <<EOF
#!/bin/sh
sleep 60
touch /local/test
EOF
        destination = "local/change.sh"
        perms       = "555"

      }

tgross (Member) commented Aug 1, 2024

@aisvarya2 you'll want to look at some trace-level logging to see what's going on with the script then. Getting exit code 2 means there's something wrong with the script itself, not the change_mode block. If you're unable to debug that, please open a new issue as what you're seeing is not at all related to this one.

EugenKon commented Aug 1, 2024

@aisvarya2 I suppose it could be something wrong with permissions: 555 is different from 0555. Also, use = <<-EOF so you can indent your code properly. You can try running an empty script to test change_mode first:

#!/bin/sh
echo "Ok"

Or

#!/bin/bash
echo "Ok"

Add more permissions to the script (do not forget to remove them after debugging): 0777.
