
1.4.2: Jobs killed due to nomadService templating failures #15115

Closed
iSchluff opened this issue Nov 2, 2022 · 16 comments

Comments

@iSchluff

iSchluff commented Nov 2, 2022

Nomad version

Output from nomad version
Nomad v1.4.2 (039d70e)

Operating system and Environment details

Linux nomad1 5.4.0-131-generic 147-Ubuntu

Issue

I have a setup where a job with multiple groups has various template cross-references using nomadService to dynamically fill in address/port details.
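For context, a minimal sketch of what one of these cross-reference templates looks like; the service name, destination path, and rendered contents here are illustrative placeholders rather than the actual job:

# Illustrative only: a template stanza in one group that looks up a service
# registered by another group via nomadService.
template {
  destination = "local/targets.yml"
  change_mode = "restart"

  data = <<EOH
targets:
{{- range nomadService "alertmanager" }}
  - "{{ .Address }}:{{ .Port }}"
{{- end }}
EOH
}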

With 1.4.x I see jobs getting killed because of template failures like these, even though the referenced service is running fine:
alertmanager service

Nov 02, '22 20:56:09 +0100 | Killed | Task successfully killed
Nov 02, '22 20:56:09 +0100 | Terminated | Exit Code: 0
Nov 02, '22 20:56:09 +0100 | Killing | Template failed: nomad.service(promteams): Unexpected response code: 403 (Permission denied)
Nov 02, '22 20:51:08 +0100 | Template | Missing: nomad.service(promteams)
Nov 02, '22 20:13:35 +0100 | Started | Task started by client
Nov 02, '22 20:13:35 +0100 | Task Setup | Building Task Directory
Nov 02, '22 20:13:35 +0100 | Received | Task received by client

promteams service

Nov 02, '22 17:39:44 +0100 | Started | Task started by client
Nov 02, '22 17:39:43 +0100 | Driver | Downloading image
Nov 02, '22 17:39:43 +0100 | Task Setup | Building Task Directory
Nov 02, '22 17:39:42 +0100 | Received | Task received by client

Additionally sometimes a template will not update even though the nomad service changed.

Reproduction steps

Expected Result

Tasks should not be randomly killed, and templates should be properly updated when the service changes.

Actual Result

@lgfa29
Contributor

lgfa29 commented Nov 2, 2022

Thanks for the report @iSchluff.

I haven't been able to reproduce it yet. Would you happen to have any client and server logs around the time this happened?

@lgfa29
Contributor

lgfa29 commented Nov 3, 2022

@iSchluff the team found the root cause for the 403 errors and #15121 has a fix for that.

But now I'm curious about the job failing due to the template render error. By default, Nomad should retry rendering forever, which is configured by the client.template.nomad config. Did you happen to change this value?
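For reference, an explicit override of that retry behavior would look roughly like the sketch below; this assumes the nomad_retry parameters documented under the 1.4 client template block, and the values are illustrative rather than the defaults under discussion:

client {
  enabled = true

  template {
    # Sketch only: retry settings for Nomad API lookups made by templates.
    nomad_retry {
      attempts    = 0       # 0 should remove the attempt limit, per the retry docs
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}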

@iSchluff
Author

iSchluff commented Nov 3, 2022

Hi @lgfa29, no, the client config is completely default:

client {
  enabled = true
  servers = ["{{ nomad_address }}"]
  network_interface = "{{ nomad_default_interface }}"
}

The only thing that might be special is that it's a single server+client instance.
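For reference, "single server+client" here means both roles enabled in one agent config, roughly like this (values are placeholders):

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled           = true
  servers           = ["127.0.0.1"]
  network_interface = "eth0"
}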

@lgfa29
Contributor

lgfa29 commented Nov 4, 2022

Ah interesting. Do you have any logs from the client around the time of failure?

@iSchluff
Author

iSchluff commented Nov 5, 2022

I think the relevant portion is this, which was logged when the allocation was terminated.

This is a single instance with server+client, so the logs should all be interleaved.

    2022-11-02T19:56:09.063Z [ERROR] agent: (view) nomad.service(alertmanager): Unexpected response code: 403 (Permission denied) (exceeded maximum retries)
    2022-11-02T19:56:09.127Z [ERROR] agent: (view) nomad.service(promteams): Unexpected response code: 403 (Permission denied) (exceeded maximum retries)
    2022-11-02T19:56:09.127Z [ERROR] agent: (runner) watcher reported error: nomad.service(promteams): Unexpected response code: 403 (Permission denied)
    2022-11-02T19:56:09.217Z [INFO]  client.driver_mgr.docker: stopped container: container_id=f89a54d96766666318ef64c1c6b70deab979dcd313a1924d12f32b20079e2a8e driver=docker
    2022-11-02T19:56:09.257Z [INFO]  agent: (runner) stopping
    2022-11-02T19:56:09.257Z [INFO]  client.gc: marking allocation for GC: alloc_id=7f92cda3-2d4d-d0a9-139a-6ba95f042d7d
    2022-11-02T19:56:10.032Z [WARN]  nomad.rpc: failed TLS handshake: remote_addr=172.20.xx.xx:52738 error="remote error: tls: bad certificate"
    2022-11-02T19:56:10.034Z [INFO]  client.driver_mgr.docker: stopped container: container_id=2e88dc4bfe6d7e55ea494dda5f5fc77a340ee64a93302d0c3fc362c583b9333e driver=docker
    2022-11-02T19:56:10.061Z [INFO]  agent: (runner) stopping
    2022-11-02T19:56:10.061Z [INFO]  client.gc: marking allocation for GC: alloc_id=a2c806ea-69c4-4e01-4001-7a30436d9499
    2022-11-02T19:56:10.514Z [INFO]  client.driver_mgr.docker: stopped container: container_id=c7ff62929c182ade793053234a7f53bfcf539828edc575f30967fb9debd956a3 driver=docker
    2022-11-02T19:56:10.540Z [INFO]  agent: (runner) stopping
    2022-11-02T19:56:10.540Z [INFO]  client.gc: marking allocation for GC: alloc_id=7be59997-98e6-5e2a-fc65-b5a6f988780a

full log: https://p.rrbone.net/?be507af6af7b677d#5nZiq24EAXZeUAMPrWcKz2wvoD2Jr6k6JRqgAGAZXx6k

I have since started over on this instance with a new data directory, as it was basically unusable in that state. A fresh 1.4.2 instance pretty much works as expected, so this is clearly related to the upgrade procedure.
If the old data directory would help you, I can upload it somewhere.

@lgfa29
Contributor

lgfa29 commented Nov 7, 2022

Thanks for the logs, I will dig deeper throughout the week.

The problem with the upgrade was fixed in #15121 and will be released when 1.4.3 comes out. I'm now trying to understand why the template caused the task to fail. I expected it to retry forever.

@iSchluff
Author

iSchluff commented Nov 8, 2022

Maybe related: I am also having the problem that dependent templates are not updated after nomad services successfully restart, leaving me with broken configs.

On service stop the template is correctly updated, but not after it comes back.

Also, for whatever reason, the permission errors are back, even though this was a fresh setup:

Nov 08, '22 21:30:17 +0100 | Killing | Sent interrupt. Waiting 5s before force killing
Nov 08, '22 21:30:17 +0100 | Killed | Task successfully killed
Nov 08, '22 21:30:17 +0100 | Terminated | Exit Code: 0
Nov 08, '22 21:30:16 +0100 | Killing | Template failed: nomad.service(grafana): Unexpected response code: 403 (Permission denied)
Nov 08, '22 21:25:16 +0100 | Template | Missing: nomad.service(grafana), nomad.service(loki), nomad.service(telegraf), and 1 more
Nov 05, '22 18:52:21 +0100 | Started | Task started by client
Nov 05, '22 18:52:21 +0100 | Task Setup | Building Task Directory
Nov 05, '22 18:52:21 +0100 | Received | Task received by client

@veloc1

veloc1 commented Nov 23, 2022

+1

Got a template failure on every job update on 1.4.3. Sample Nomad job spec:


job "nomad-template-error" {
  type = "service"

  region        = "global"
  datacenters   = ["service_ds"]

  update {
    max_parallel = 1
    min_healthy_time = "15s"
    healthy_deadline = "3m"
    auto_revert = false
    canary = 0
  }

  group "postgres" {
    count = 1

    network {
      port "db" { to = 5432}
    }

    task "postgres" {
      driver = "podman"

      leader = true

      config {
        image = "docker.io/library/postgres:14.5"
        ports = ["db"]

      }

      service {
        name         = "postgres"
        provider     = "nomad"
        port         = "db"
        address_mode = "host"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      env {
        POSTGRES_PASSWORD = "123"
        POSTGRES_USER = "user"
        POSTGRES_DB = "db"
        PGDATA = "/var/lib/postgresql/data/pgdata"
      }

      resources {
        cpu    = 500 # MHz
        memory = 512 # MB
      }
    }

    task "migrator" {
      driver = "podman"

      config {
        image = "docker://hello-world"
      }

      lifecycle {
        hook = "poststart"
        sidecar = false
      }

      restart {
        interval = "3m"
        attempts = 3
        delay    = "15s"
        mode     = "fail"
      }

      template {
        destination=".env"
        env = true

        data = <<EOH
        {{ range nomadService "postgres" }}
        DATABASE_URL="postgresql://user:123@{{ .Address }}:{{ .Port }}/db?schema=public"
        {{ end }}
        EOH
      }
    }
  }

  group "additional-group" {
    task "some-task" {
      driver = "podman"

      config {
        image = "docker://redis"
      }

      env {
        A="b"
      }
    }
  }
}

nomad alloc status output after updating an env variable in the second group:

ID                   = 37fa4b07-3239-88dd-89d8-1da29a4352c3
Eval ID              = f47974ab
Name                 = nomad-template-error.postgres[0]
Node ID              = 9161c33b
Node Name            = production
Job ID               = nomad-template-error
Job Version          = 3
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 9m38s ago
Modified             = 56s ago
Deployment ID        = c660b5bd
Deployment Health    = unhealthy
Replacement Alloc ID = a1f32824

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.25:30461 -> 5432

Task "migrator" (poststart) is "dead"
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB  

Task Events:
Started At     = 2022-11-23T09:15:28Z
Finished At    = 2022-11-23T09:15:29Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type             Description
2022-11-23T12:22:10+03:00  Alloc Unhealthy  Unhealthy because of failed task
2022-11-23T12:22:10+03:00  Killing          Sent interrupt. Waiting 5s before force killing
2022-11-23T12:22:05+03:00  Template         Missing: nomad.service(postgres)
2022-11-23T12:22:01+03:00  Killing          Template failed: nomad.service(postgres): Unexpected response code: 403 (Permission denied)
2022-11-23T12:15:29+03:00  Terminated       Exit Code: 0
2022-11-23T12:15:28+03:00  Started          Task started by client
2022-11-23T12:15:25+03:00  Driver           Pulling image docker.io/library/hello-world:latest
2022-11-23T12:15:25+03:00  Task Setup       Building Task Directory
2022-11-23T12:15:25+03:00  Received         Task received by client

Task "postgres" is "dead"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  3.9 MiB/512 MiB  300 MiB  

Task Events:
Started At     = 2022-11-23T09:15:25Z
Finished At    = 2022-11-23T09:22:06Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                 Description
2022-11-23T12:22:10+03:00  Alloc Unhealthy      Unhealthy because of dead task
2022-11-23T12:22:06+03:00  Killed               Task successfully killed
2022-11-23T12:22:06+03:00  Terminated           Exit Code: 0
2022-11-23T12:22:01+03:00  Killing              Sent interrupt. Waiting 5s before force killing
2022-11-23T12:22:01+03:00  Sibling Task Failed  Task's sibling "migrator" failed
2022-11-23T12:15:25+03:00  Started              Task started by client
2022-11-23T12:15:25+03:00  Task Setup           Building Task Directory
2022-11-23T12:15:25+03:00  Received             Task received by client

@lgfa29
Contributor

lgfa29 commented Nov 23, 2022

Thanks for the extra info @veloc1 and @iSchluff.

Could you check the allocation details as returned from the GET /v1/allocation/:id API endpoint?

There should be a field called SigningKeyID. Could you then check if the ID is listed in the output of the nomad operator root keyring list command?
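A sketch of that check from the CLI, where the allocation ID, address, and token are placeholders:

# Fetch the allocation and pull out the signing key ID.
curl -s -H "X-Nomad-Token: $NOMAD_TOKEN" \
  "http://127.0.0.1:4646/v1/allocation/<alloc-id>" | jq -r '.SigningKeyID'

# The ID printed above should appear in the keyring listing:
nomad operator root keyring list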

Thanks!

@veloc1

veloc1 commented Nov 24, 2022

@lgfa29 Yep, the SigningKeyID is the same.

Also, attaching logs from client:


2022-11-24T12:47:45.040+0300 [DEBUG] client: updated allocations: index=118181 total=10 pulled=6 filtered=4
2022-11-24T12:47:45.040+0300 [DEBUG] client: allocation updates: added=1 removed=0 updated=5 ignored=4
2022-11-24T12:47:45.041+0300 [DEBUG] http: request complete: method=GET path=/v1/node/6ff14f6b-aab8-d5d3-3dac-fdb89b501fc3/allocations?index=118175 duration=25.320547306s
2022-11-24T12:47:45.045+0300 [INFO]  client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman signal="" taskID=4281c919-9fe6-ab79-fc37-6c678a11ea76/some-task/ffa6d759 @module=podman timestamp="2022-11-24T12:47:45.045+0300"
2022-11-24T12:47:45.046+0300 [INFO]  agent: (runner) stopping
2022-11-24T12:47:45.046+0300 [DEBUG] agent: (runner) stopping watcher
2022-11-24T12:47:45.046+0300 [DEBUG] agent: (watcher) stopping all views
2022-11-24T12:47:45.046+0300 [INFO]  agent: (runner) received finish
2022-11-24T12:47:45.046+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2022-11-24T12:47:45.047+0300 [INFO]  agent: (runner) creating watcher
2022-11-24T12:47:45.047+0300 [DEBUG] client: allocation updates applied: added=1 removed=0 updated=5 ignored=4 errors=0
2022-11-24T12:47:45.048+0300 [INFO]  agent: (runner) starting
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) running initial templates
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) initiating run
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) checking template 8b0d0af27e8b6a0428e869b969771f72
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) missing data for 1 dependencies
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) missing dependency: nomad.service(postgres)
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) add used dependency nomad.service(postgres) to missing since isLeader but do not have a watcher
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) was not watching 1 dependencies
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (watcher) adding nomad.service(postgres)
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) diffing and updating dependencies
2022-11-24T12:47:45.048+0300 [DEBUG] agent: (runner) watching 1 dependencies
2022-11-24T12:47:45.049+0300 [DEBUG] http: request failed: method=GET path="/v1/service/postgres?namespace=default&stale=&wait=60000ms" error="Permission denied" code=403
2022-11-24T12:47:45.049+0300 [DEBUG] http: request complete: method=GET path="/v1/service/postgres?namespace=default&stale=&wait=60000ms" duration="91.989µs"
2022-11-24T12:47:45.049+0300 [WARN]  agent: (view) nomad.service(postgres): Unexpected response code: 403 (Permission denied) (retry attempt 1 after "250ms")
2022-11-24T12:47:45.055+0300 [DEBUG] client.alloc_migrator: waiting for previous alloc to terminate: alloc_id=f0c29325-7d53-39b4-8d82-e2a18dc97b63 previous_alloc=4281c919-9fe6-ab79-fc37-6c678a11ea76
2022-11-24T12:47:45.181+0300 [DEBUG] http: request complete: method=GET path=/v1/client/allocation/4281c919-9fe6-ab79-fc37-6c678a11ea76/stats duration="196.999µs"
2022-11-24T12:47:45.181+0300 [DEBUG] http: request complete: method=GET path=/v1/client/allocation/f4f2e0e0-d3a0-fee7-cc76-7b3cd8f9372d/stats duration="278.279µs"
2022-11-24T12:47:45.300+0300 [DEBUG] http: request failed: method=GET path="/v1/service/postgres?namespace=default&stale=&wait=60000ms" error="Permission denied" code=403
2022-11-24T12:47:45.300+0300 [DEBUG] http: request complete: method=GET path="/v1/service/postgres?namespace=default&stale=&wait=60000ms" duration="104.235µs"
2022-11-24T12:47:45.300+0300 [WARN]  agent: (view) nomad.service(postgres): Unexpected response code: 403 (Permission denied) (retry attempt 2 after "500ms")
2022-11-24T12:47:45.340+0300 [DEBUG] client: updated allocations: index=118184 total=10 pulled=3 filtered=7
2022-11-24T12:47:45.340+0300 [DEBUG] client: allocation updates: added=0 removed=0 updated=3 ignored=7
2022-11-24T12:47:45.343+0300 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=3 ignored=7 errors=0

It looks like, for some reason, the dependency watcher is missing the auth token for the first few requests?

@mr-karan
Contributor

mr-karan commented Dec 7, 2022

@lgfa29 Faced this in production today as well. I have a close-to-default client config (client.template.nomad is left untouched), but the template still failed to render, which caused the job to fail.

@Thunderbottom

@lgfa29 any updates on this? We have been facing the same issue for a while now. Recently we also migrated to Nomad variables, which also show up alongside Missing: nomad.service in the UI events. Although, from what we can see, the variables and the service get rendered just fine in the template.

From what we've figured out, this happens when you deploy a new version over an existing deployment. It seems like a race condition with the events channel in template.go: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/template/template.go#L350-L371

Let me know if I could be of any help.

@lgfa29
Contributor

lgfa29 commented Jan 14, 2023

Hi all, thanks for the extra info. I haven't been able to reproduce this yet, so no updates so far.

I will try the deployment update @Thunderbottom described next time I get a chance. Thanks for the tip!

@zhiguangwang
Contributor

I have been facing the same issue with Nomad 1.4.3 when using job templates with the nomadVar function.

Example job spec:

job "whoami" {
  region      = "global"
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "service"

  update {
    max_parallel     = 1
    min_healthy_time = "0s"
  }

  group "whoami" {

    scaling {
      min = 1
      max = 4
    }

    network {
      port "http" {}
    }

    service {
      name         = "whoami"
      tags         = []
      port         = "http"
      address_mode = "host"
      check {
        name     = "whoami HTTP check"
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "whoami" {
      driver       = "docker"
      user         = "65534" # nobody on Ubuntu
      kill_signal  = "SIGTERM"
      kill_timeout = "15s"

      template {
        destination = "local/config.yaml"
        data        = <<EOF
{{ with nomadVar "nomad/jobs/whoami" }}
mysql_username = "{{ .mysql_username }}"
mysql_password = "{{ .mysql_password }}"
{{ end }}
EOF
      }

      config {
        image = "traefik/whoami:v1.8.7"
        args = [
          "--name=${NOMAD_ALLOC_NAME}",
          "--port=${NOMAD_PORT_http}",
        ]
        network_mode = "host"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

Note that I'm using the Workload Identity feature by putting Nomad variables at the path nomad/jobs/whoami.

The issue can always be reproduced by scaling the task group count up or down in the Nomad UI: events with type Template and description Missing: nomad.var.block(nomad/jobs/whoami@default.global) will show up in the task's "Recent Events".
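For what it's worth, the same reproduction sketched from the CLI instead of the UI (the job file name and variable values are placeholders, and the scale count is arbitrary):

# Create the variable read by the template, run the job, then scale the group.
nomad var put nomad/jobs/whoami mysql_username=user mysql_password=secret
nomad job run whoami.nomad.hcl
# Scaling the group up or down is what surfaces the Template "Missing" events described above.
nomad job scale whoami whoami 2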

@tgross
Member

tgross commented Feb 27, 2023

Hi folks! 👋 I was just fixing up #16259 and was looking around for similar issues and found this one... I suspect this issue will be fixed by #16266. Note that although #16259 talks about canaries, as far as I can tell this will happen for any in-place update, which looks like it could describe the behaviors we're seeing in this issue (unless I've missed something).

@tgross
Member

tgross commented Jun 24, 2024

As noted above, I'm going to mark this as closed by #16266. Enough has changed here that, if anyone hits this again and is reading this, I'd ask that they open a new issue with the debug-level logs described above.

@tgross tgross closed this as completed Jun 24, 2024