Failed canary allocations are rescheduled as non canary allocations #6936

Closed · drewbailey opened this issue Jan 13, 2020 · 4 comments · Fixed by #6975

@drewbailey (Contributor)

Nomad version

0.10.3-dev

Operating system and Environment details

ubuntu 19.10

Issue

The Rescheduling during deployments documentation claims that:

The update stanza controls rolling updates and canary deployments. A task group's reschedule stanza does not take effect during a deployment.

However, with the reproduction steps & files listed below, a failed set of canary allocations gets rescheduled according to the reschedule stanza, but the replacement allocations do not register as canaries. This causes the previous deployment's healthy allocations to be marked as stop during reconciliation.
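For context, neither repro.hcl nor repro-fail.hcl below defines a reschedule stanza, so the group falls back to the default reschedule policy for service jobs, which is roughly equivalent to the following (values taken from the Nomad documentation, not from the reproduction itself):

reschedule {
  # illustrative defaults for service-type jobs
  delay          = "30s"
  delay_function = "exponential"
  max_delay      = "1h"
  unlimited      = true
}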

Reproduction steps

Using a local cluster with 3 server nodes and 2 client nodes:

# run the first job
nomad job run repro.hcl

# wait for the deployment to become healthy

# run the second job version, whose canaries fail
nomad job run repro-fail.hcl

# observe the rescheduled eval eventually kill off the healthy allocations
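The job and deployment state below was captured with commands along these lines (the exact invocations are my assumption, not part of the original report):

# job status, including the allocation list and latest deployment summary
nomad job status failing-nomad-test01

# deployment detail, using the ID from the job status output
nomad deployment status 86d8d83f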

Failed canaries, with a future reschedule attempt occurring shortly:

ID            = failing-nomad-test01
Name          = failing-nomad-test01
Submit Date   = 2020-01-13T15:35:51-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         4        4       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
group       d8ddc12d  2s from now

Latest Deployment
ID          = 86d8d83f
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
group       false     4        4         4       0        4          2020-01-13T15:45:51-05:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
78bc2bb6  303bd631  group       1        run      failed   30s ago  26s ago
164b81e5  303bd631  group       1        run      failed   30s ago  26s ago
203a01c6  303bd631  group       1        run      failed   30s ago  26s ago
497ae378  303bd631  group       1        run      failed   30s ago  26s ago
18df0793  27ce5ad0  group       0        run      running  55s ago  40s ago
2949e33b  27ce5ad0  group       0        run      running  55s ago  36s ago
0e191e2a  27ce5ad0  group       0        run      running  55s ago  38s ago
e5f6b54c  27ce5ad0  group       0        run      running  55s ago  37s ago

The newly placed allocations cause the original healthy allocations to be marked as stop:

ID            = failing-nomad-test01
Name          = failing-nomad-test01
Submit Date   = 2020-01-13T15:35:51-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         1        8       3         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
group       289de16b  58s from now

Latest Deployment
ID          = 86d8d83f
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
group       false     4        4         8       0        8          2020-01-13T15:45:51-05:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created   Modified
5824276a  303bd631  group       1        run      failed    4s ago    0s ago
19cb409a  303bd631  group       1        run      failed    4s ago    0s ago
8c70b8cc  303bd631  group       1        run      failed    4s ago    0s ago
8c5e464e  27ce5ad0  group       1        run      failed    4s ago    0s ago
497ae378  303bd631  group       1        stop     failed    36s ago   4s ago
78bc2bb6  303bd631  group       1        stop     failed    36s ago   4s ago
164b81e5  303bd631  group       1        stop     failed    36s ago   4s ago
203a01c6  303bd631  group       1        stop     failed    36s ago   4s ago
2949e33b  27ce5ad0  group       0        stop     running   1m1s ago  0s ago
0e191e2a  27ce5ad0  group       0        stop     complete  1m1s ago  0s ago
18df0793  27ce5ad0  group       0        stop     complete  1m1s ago  0s ago
e5f6b54c  27ce5ad0  group       0        stop     complete  1m1s ago  0s ago
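One way to confirm that the rescheduled allocations are not registered as canaries is to inspect an allocation's deployment status, for example (the -json flag and jq pipeline here are an assumption for illustration, not from the original report):

# prints false for the rescheduled allocations even though they replaced failed canaries
nomad alloc status -json 5824276a | jq '.DeploymentStatus.Canary'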

Job files

repro.hcl

job "failing-nomad-test01" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    health_check      = "checks"
    max_parallel      = 4
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = false
    canary            = 4
  }

  migrate {
    max_parallel     = 3
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "group" {
    count = 4

    restart {
      # no local restart attempts; a task failure fails the allocation outright
      attempts = 0
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "rails" {
      driver = "docker"

      env {
        RAILS_ENV                = "production"
        RAILS_SERVE_STATIC_FILES = "1"
        RAILS_LOG_TO_STDOUT      = "1"
        TEST                     = "2020-01-09 01:23:18 +0000"
      }

      config {
        image = "kaspergrubbe/diceapp:0.0.6"

        command = "bundle"
        args    = ["exec", "unicorn", "-c", "/app/config/unicorn.rb"]

        port_map {
          web = 8080
        }
      }

      resources {
        cpu    = 750
        memory = 250

        network {
          mbits = 50
          port  "web" {}
        }
      }

      service {
        name = "failing-nomad-test00"
        tags = []
        port = "web"

        check {
          name     = "failing-nomad-test00 healthcheck"
          type     = "http"
          protocol = "http"
          path     = "/"
          interval = "5s"
          timeout  = "3s"
        }
      }
    }
  }
}

repro-fail.hcl

job "failing-nomad-test01" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    health_check      = "checks"
    max_parallel      = 4
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = false
    canary            = 4
  }

  migrate {
    max_parallel     = 3
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "group" {
    count = 4

    restart {
      attempts = 0
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "rails" {
      driver = "docker"

      env {
        RAILS_ENV                = "production"
        RAILS_SERVE_STATIC_FILES = "1"
        RAILS_LOG_TO_STDOUT      = "1"
        TEST                     = "2020-01-09 01:23:18 +0000"
      }

      config {
        image = "kaspergrubbe/diceapp:0.0.6"

        command = "false"

        port_map {
          web = 8080
        }

        dns_servers = ["172.17.0.1"]
      }

      resources {
        cpu    = 750
        memory = 250

        network {
          mbits = 50
          port  "web" {}
        }
      }

      service {
        name = "failing-nomad-test00"
        tags = []
        port = "web"

        check {
          name     = "failing-nomad-test00 healthcheck"
          type     = "http"
          protocol = "http"
          path     = "/"
          interval = "5s"
          timeout  = "3s"
        }
      }
    }
  }
}

Originally reported in issue #6864.

@kaspergrubbe

@drewbailey Which release is this live in? I can't see it mentioned in the changelog.

@drewbailey (Contributor, Author)

@kaspergrubbe Sorry about that; for some reason it's not in the changelog, I'll update that. It was released in 0.10.4.

@kaspergrubbe

Thank you @drewbailey

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2022