Failed canary allocations are rescheduled as non canary allocations #6936

Closed · drewbailey opened this issue Jan 13, 2020 · 4 comments · Fixed by #6975

@drewbailey (Contributor)

Nomad version

0.10.3-dev

Operating system and Environment details

ubuntu 19.10

Issue

The Rescheduling during deployments documentation claims that:

The update stanza controls rolling updates and canary deployments. A task group's reschedule stanza does not take effect during a deployment.

However, with the reproduction steps & files listed below, a failed set of canary allocations gets rescheduled according to the reschedule stanza, but the replacement allocations do not register as canaries. This causes the previous deployment's healthy allocations to be marked as stop during reconciliation.
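For context, neither repro.hcl nor repro-fail.hcl below defines a reschedule stanza, so the group falls back to the default reschedule policy for service jobs, which is roughly equivalent to the following (values taken from the Nomad documentation, not from the reproduction itself):

reschedule {
  # illustrative defaults for service-type jobs
  delay          = "30s"
  delay_function = "exponential"
  max_delay      = "1h"
  unlimited      = true
}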

Reproduction steps

Using a local cluster with 3 server nodes and 2 client nodes:

# run the first job
nomad job run repro.hcl

# wait for the deployment to become healthy

# run the second job version, whose canaries fail
nomad job run repro-fail.hcl

# observe the rescheduled eval eventually kill off the healthy allocations
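The job and deployment state below was captured with commands along these lines (the exact invocations are my assumption, not part of the original report):

# job status, including the allocation list and latest deployment summary
nomad job status failing-nomad-test01

# deployment detail, using the ID from the job status output
nomad deployment status 86d8d83f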

Failed canaries, with a future reschedule attempt occurring shortly:

ID            = failing-nomad-test01
Name          = failing-nomad-test01
Submit Date   = 2020-01-13T15:35:51-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         4        4       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
group       d8ddc12d  2s from now

Latest Deployment
ID          = 86d8d83f
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
group       false     4        4         4       0        4          2020-01-13T15:45:51-05:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
78bc2bb6  303bd631  group       1        run      failed   30s ago  26s ago
164b81e5  303bd631  group       1        run      failed   30s ago  26s ago
203a01c6  303bd631  group       1        run      failed   30s ago  26s ago
497ae378  303bd631  group       1        run      failed   30s ago  26s ago
18df0793  27ce5ad0  group       0        run      running  55s ago  40s ago
2949e33b  27ce5ad0  group       0        run      running  55s ago  36s ago
0e191e2a  27ce5ad0  group       0        run      running  55s ago  38s ago
e5f6b54c  27ce5ad0  group       0        run      running  55s ago  37s ago

The newly placed allocations cause the original healthy allocations to be marked as stop:

ID            = failing-nomad-test01
Name          = failing-nomad-test01
Submit Date   = 2020-01-13T15:35:51-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         1        8       3         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
group       289de16b  58s from now

Latest Deployment
ID          = 86d8d83f
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
group       false     4        4         8       0        8          2020-01-13T15:45:51-05:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created   Modified
5824276a  303bd631  group       1        run      failed    4s ago    0s ago
19cb409a  303bd631  group       1        run      failed    4s ago    0s ago
8c70b8cc  303bd631  group       1        run      failed    4s ago    0s ago
8c5e464e  27ce5ad0  group       1        run      failed    4s ago    0s ago
497ae378  303bd631  group       1        stop     failed    36s ago   4s ago
78bc2bb6  303bd631  group       1        stop     failed    36s ago   4s ago
164b81e5  303bd631  group       1        stop     failed    36s ago   4s ago
203a01c6  303bd631  group       1        stop     failed    36s ago   4s ago
2949e33b  27ce5ad0  group       0        stop     running   1m1s ago  0s ago
0e191e2a  27ce5ad0  group       0        stop     complete  1m1s ago  0s ago
18df0793  27ce5ad0  group       0        stop     complete  1m1s ago  0s ago
e5f6b54c  27ce5ad0  group       0        stop     complete  1m1s ago  0s ago
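One way to confirm that the rescheduled allocations are not registered as canaries is to inspect an allocation's deployment status, for example (the -json flag and jq pipeline here are an assumption for illustration, not from the original report):

# prints false for the rescheduled allocations even though they replaced failed canaries
nomad alloc status -json 5824276a | jq '.DeploymentStatus.Canary'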

Job files

repro.hcl

job "failing-nomad-test01" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    health_check      = "checks"
    max_parallel      = 4
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = false
    canary            = 4
  }

  migrate {
    max_parallel     = 3
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "group" {
    count = 4

    restart {
      # no local restart attempts; a task failure fails the allocation outright
      attempts = 0
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "rails" {
      driver = "docker"

      env {
        RAILS_ENV                = "production"
        RAILS_SERVE_STATIC_FILES = "1"
        RAILS_LOG_TO_STDOUT      = "1"
        TEST                     = "2020-01-09 01:23:18 +0000"
      }

      config {
        image = "kaspergrubbe/diceapp:0.0.6"

        command = "bundle"
        args    = ["exec", "unicorn", "-c", "/app/config/unicorn.rb"]

        port_map {
          web = 8080
        }
      }

      resources {
        cpu    = 750
        memory = 250

        network {
          mbits = 50
          port  "web" {}
        }
      }

      service {
        name = "failing-nomad-test00"
        tags = []
        port = "web"

        check {
          name     = "failing-nomad-test00 healthcheck"
          type     = "http"
          protocol = "http"
          path     = "/"
          interval = "5s"
          timeout  = "3s"
        }
      }
    }
  }
}

repro-fail.hcl

job "failing-nomad-test01" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    health_check      = "checks"
    max_parallel      = 4
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = false
    canary            = 4
  }

  migrate {
    max_parallel     = 3
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "group" {
    count = 4

    restart {
      attempts = 0
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "rails" {
      driver = "docker"

      env {
        RAILS_ENV                = "production"
        RAILS_SERVE_STATIC_FILES = "1"
        RAILS_LOG_TO_STDOUT      = "1"
        TEST                     = "2020-01-09 01:23:18 +0000"
      }

      config {
        image = "kaspergrubbe/diceapp:0.0.6"

        command = "false"

        port_map {
          web = 8080
        }

        dns_servers = ["172.17.0.1"]
      }

      resources {
        cpu    = 750
        memory = 250

        network {
          mbits = 50
          port  "web" {}
        }
      }

      service {
        name = "failing-nomad-test00"
        tags = []
        port = "web"

        check {
          name     = "failing-nomad-test00 healthcheck"
          type     = "http"
          protocol = "http"
          path     = "/"
          interval = "5s"
          timeout  = "3s"
        }
      }
    }
  }
}

Originally reported in issue #6864.

@kaspergrubbe

@drewbailey Which release is this live in? I can't see it mentioned in the changelog.

@drewbailey (Contributor, Author)

@kaspergrubbe Sorry about that; for some reason it's not in the changelog, I'll update that. It was released in 0.10.4.

@kaspergrubbe

Thank you @drewbailey

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2022