
Prevent kill_timeout greater than progress_deadline #8487

Closed

threemachines opened this issue Jul 21, 2020 · 4 comments
Labels
good first issue, help-wanted, stage/accepted, theme/jobspec, type/bug

Comments


threemachines commented Jul 21, 2020

Nomad version

Nomad v0.12.0 (8f7fbc8)

Operating system and Environment details

Ubuntu 18.04 on AWS

Issue

If a task's kill_timeout is greater than the job's progress_deadline, old allocations may keep running (after receiving the initial kill signal) long enough that the deployment fails. However, the new allocations scheduled for that deployment will still be pending after it fails, and will be placed and started once the previous allocations finally exit.

I'm honestly split on whether this is a bug or a feature request. I can't say for certain that the current behavior is wrong, but we found it surprising and unpleasant. We hit it through an accidental misconfiguration, and I'm hard-pressed to imagine a situation where you'd really want your job to work this way, so why not add a guard-rail?

The alternative fix I can imagine is for the progress_deadline timer to not start until the new allocations have been placed, but that would have other problems.

Reproduction steps

  1. Deploy the job file below. Should be fine.
  2. Change the environment variable to force a redeploy and deploy the job again. The deploy will fail on timeout.
  3. Do it again, just for fun. (Change the environment variable again.)
  4. Wait until five minutes after the second deploy.

You now have fifteen allocations from deploy 0, and five allocations from deploy 2, which all started several minutes after deploy 2 failed. (The five allocations from deploy 1 went directly from pending to completed.)

Job file

job "example" {
  datacenters = ["sandbox"]
  type = "service"
  update {
    max_parallel = 5
    min_healthy_time = "10s"
    healthy_deadline = "30s"
    progress_deadline = "1m"
    auto_revert = false
    canary = 0
  }
  migrate {
    max_parallel = 5
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }
  group "cache" {
    count = 20
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }
    task "redis" {
      driver = "docker"
      ## send redis a sighup so it doesn't actually exit
      ## this lets us play with the kill_timeouts
      kill_signal = "SIGHUP"
      kill_timeout = "5m"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      env {
        asdf = "asdf"
      }
      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "db" {}
        }
      }
      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

One obvious modification is to set auto_revert = true. Although that does eventually result in a better final state (all 20 allocations from the same job and the same version), the way it gets there is very alarming, and you end up with some weird time-traveling jobs (version 1 finished a minute after version 2, and so on). If you stay in this jobs-pending state for more than a few minutes (we, uh, had a 24h kill_timeout...), your allocations can become nightmarish. (I think there are additional failure modes when you stack enough of those up, but I haven't tried to reproduce them yet.)


tgross commented Dec 16, 2020

Hi @threemachines, sorry no one got back to you on this one. I agree that this is always going to be a misconfiguration, so we should put some validation of the jobspec in place for it. That sort of makes this issue a "bug-hancement", but I'll mark it as accepted so it gets on our roadmap.
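
For illustration, the guard-rail being discussed boils down to comparing two durations from the jobspec. The sketch below is not Nomad's actual validation code; UpdateStrategy, Task, and validateKillTimeout are simplified, hypothetical stand-ins for the real jobspec types.

package main

import (
	"fmt"
	"time"
)

// Simplified, hypothetical stand-ins for the relevant jobspec fields; the
// real Nomad structs carry far more than these two durations.
type UpdateStrategy struct {
	ProgressDeadline time.Duration // update { progress_deadline = ... }
}

type Task struct {
	Name        string
	KillTimeout time.Duration // task { kill_timeout = ... }
}

// validateKillTimeout flags the misconfiguration from this issue: a task
// whose kill_timeout exceeds the group's progress_deadline can keep old
// allocations alive long enough to fail the deployment.
func validateKillTimeout(u *UpdateStrategy, tasks []*Task) error {
	for _, t := range tasks {
		if t.KillTimeout > u.ProgressDeadline {
			return fmt.Errorf("task %q has kill_timeout (%s) longer than progress_deadline (%s)",
				t.Name, t.KillTimeout, u.ProgressDeadline)
		}
	}
	return nil
}

func main() {
	// The values from the job file above: progress_deadline = "1m", kill_timeout = "5m".
	update := &UpdateStrategy{ProgressDeadline: time.Minute}
	tasks := []*Task{{Name: "redis", KillTimeout: 5 * time.Minute}}
	if err := validateKillTimeout(update, tasks); err != nil {
		fmt.Println("validation error:", err)
	}
}

Run against the job file above, a check like this would reject the job at submission time instead of letting the deployment fail minutes later.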

@pavanrangain

I wonder why this got backported in a patch release. We were waiting for the fix for #17079, but this backport is preventing us from moving to 1.5.6, since many jobs may now need a change.
Isn't it considered a breaking change to have this backported to a patch release?

Also, if I check the changes in https://github.com/hashicorp/nomad/pull/17207/files, does it mean we can no longer set progress_deadline to 0 for tasks that take longer to become healthy, and always have to set some higher value?

https://developer.hashicorp.com/nomad/docs/job-specification/update#progress_deadline - this does not explicitly say so, but if I read it right, 0 is treated as infinite, i.e. the deployment only fails once an alloc becomes unhealthy.


Juanadelacuesta commented May 24, 2023

Hello pavanrangain, I'm sorry this issue is delaying your progress. This change was not intended as a breaking one; it is just meant to flag a misconfiguration that can cause your cluster to act in unexpected ways: when updating a task, the "old" allocations might take too long to exit gracefully and won't allow the new allocations to succeed, causing a failed deployment.

Addressing your concern about not being able to set progress_deadline to 0 for slow tasks: that might be an unintended consequence of the bug fix, and we will be looking into it.


lgfa29 commented May 29, 2023

Closed by #16761.

The incorrect validation when progress_deadline is 0 is fixed in #17342.
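
For illustration, relative to the hypothetical sketch shown earlier in this thread (and not the actual change in #17342), the follow-up fix amounts to treating a zero progress_deadline as "no deadline" and skipping the comparison entirely:

// Revised version of the earlier hypothetical sketch: a progress_deadline of 0
// means "no deadline" (the deployment only fails once an alloc becomes
// unhealthy), so it must not be compared against every kill_timeout.
func validateKillTimeout(u *UpdateStrategy, tasks []*Task) error {
	if u == nil || u.ProgressDeadline == 0 {
		return nil // unlimited deadline: nothing to enforce
	}
	for _, t := range tasks {
		if t.KillTimeout > u.ProgressDeadline {
			return fmt.Errorf("task %q has kill_timeout (%s) longer than progress_deadline (%s)",
				t.Name, t.KillTimeout, u.ProgressDeadline)
		}
	}
	return nil
}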

lgfa29 closed this as completed May 29, 2023