
Autopromote fails when an allocation goes unhealthy but is properly replaced #8150

Closed
djenriquez opened this issue Jun 10, 2020 · 11 comments · Fixed by #14001

Comments

@djenriquez

Nomad version

nomad -v
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Operating system and Environment details

Amazon Linux 2

Issue

A job set to autopromote ended up requiring manual promotion to complete the deployment. The only thing different in this case was that one of the allocations was unhealthy, so Nomad replaced that allocation. The new allocation became healthy and the healthy threshold was met. However, rather than autopromoting, the deployment stalled.

The deployment status reported "Deployment is running pending automatic promotion" when queried, but the deployment only completed successfully once we clicked "Promote" in the UI.

[Screenshot: deployment status, 2020-06-10 at 1:16:49 PM]

In this deployment, I clicked "Promote" at 1:07 PM PT, which is when the deployment completed. Notice that the progress_deadline values were about half an hour before this. This tells me the canary allocations satisfied the healthy threshold in time (otherwise the deployment would have failed), but for some reason still required a manual promotion.

Unfortunately, I'm not sure how to replicate this, as we only see this issue a small percentage of the time; but given how many deployments we run, it shows up a few times a week.

@djenriquez djenriquez changed the title Autopromote fails when an allocation goes unhealthy, but is properly replaced Autopromote fails when an allocation goes unhealthy but is properly replaced Jun 10, 2020
@djenriquez
Author

FYI, I just saw this happen a few times today. It seems pretty consistent: when an allocation fails and is replaced, and the replacement goes healthy, the "automatic promotion" still needs a manual promotion.

Before clicking "Promote":
[Screenshot: deployment status before promotion, 2020-06-11 at 4:00:13 PM]

After clicking "Promote":

[Screenshot: deployment status after promotion, 2020-06-11 at 4:02:03 PM]

@djenriquez
Author

djenriquez commented Jan 12, 2021

Hi, any thoughts on this bug? We still see it today with v0.12.8.

@tgross
Member

tgross commented Feb 8, 2021

Currently investigating this along with #7058, which appears to be at least somewhat related.

@tgross tgross self-assigned this Feb 8, 2021
@tgross tgross added this to In Progress in Nomad - Community Issues Triage Feb 12, 2021
@tgross
Member

tgross commented Feb 26, 2021

Update on that #7058 investigation: that fix has shipped, but it specifically had to do with the progress deadline not being set, so I doubt it's related at this point. I'm going to try to put together a minimal reproduction before putting this on the roadmap for development, but if you have one I'd be happy to validate it.

@djenriquez
Author

Thanks for looking into this, tgross. We definitely do see this issue still and have just been kicking the deploy by changing the count on a taskgroup when it happens.

I'll refresh my original investigation and see if there was anything I did to try to reproduce it or if I had just submitted evidence when the situation occurred.

@tgross tgross removed their assignment May 20, 2021
@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage May 20, 2021
@scottherlihy

Any update on this? We are still seeing it.

@Hamitamaru

I'm seeing something similar fairly frequently.

I start a deployment, and then:

First alloc

Recent Events:
Time                       Type             Description
2022-01-31T22:09:23-08:00  Killing          Sent interrupt. Waiting 5s before force killing
2022-01-31T22:09:21-08:00  Alloc Unhealthy  Unhealthy because of failed task
2022-01-31T22:09:21-08:00  Not Restarting   Error was unrecoverable
2022-01-31T22:09:21-08:00  Driver Failure   failed to create container: container already exists
2022-01-31T22:01:16-08:00  Driver           Downloading image
2022-01-31T22:01:16-08:00  Task Setup       Building Task Directory
2022-01-31T22:01:16-08:00  Received         Task received by client

After the first alloc fails, a second alloc attempt is made.

Replacement alloc

Recent Events:
Time                       Type        Description
2022-01-31T22:09:55-08:00  Started     Task started by client
2022-01-31T22:09:51-08:00  Driver      Downloading image
2022-01-31T22:09:51-08:00  Task Setup  Building Task Directory
2022-01-31T22:09:51-08:00  Received    Task received by client

The second alloc attempt succeeds, and yet the deployment is stuck:

Latest Deployment
ID          = 10f74cb9
Status      = running
Description = Deployment is running pending automatic promotion

Deployed
Task Group     Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
some-task-group  true         false     1        1         2       1        1          2022-02-01T06:20:48Z

I need to manually promote:

nomad deployment promote 10f74cb9-7a82-3138-8355-dd9294430bd0     
==> 2022-01-31T23:01:06-08:00: Monitoring evaluation "49a23b4e"
    2022-01-31T23:01:06-08:00: Evaluation triggered by job "some-app"
    2022-01-31T23:01:06-08:00: Evaluation within deployment: "10f74cb9"
==> 2022-01-31T23:01:07-08:00: Monitoring evaluation "49a23b4e"
    2022-01-31T23:01:07-08:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-31T23:01:07-08:00: Evaluation "49a23b4e" finished with status "complete"
==> 2022-01-31T23:01:07-08:00: Monitoring deployment "10f74cb9"
  ✓ Deployment "10f74cb9" successful

    2022-01-31T23:01:07-08:00
    ID          = 10f74cb9
    Job ID      = some-app
    Job Version = 5
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group       Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
    some-task-group  true         true      1        1         2       1        1          2022-01-31T23:11:06-08:00

@chuckyz
Contributor

chuckyz commented Aug 3, 2022

Looking at this:

if !dstate.AutoPromote || dstate.DesiredCanaries != len(dstate.PlacedCanaries) {

If my understanding is correct, PlacedCanaries holds the IDs of all canaries placed at any point in time, including those that are now unhealthy/replaced. This would mean that whenever more canaries than desired have been placed, this check always returns nil.

Using that, I'd propose the following change:

if !dstate.AutoPromote || dstate.DesiredCanaries < len(dstate.PlacedCanaries) {
    return nil
}

healthyCanaries := 0
// Find the health status of each canary
for _, c := range dstate.PlacedCanaries {
    for _, a := range allocs {
        if c == a.ID && a.DeploymentStatus.IsHealthy() {
            healthyCanaries += 1
        }
    }
}

if healthyCanaries != dstate.DesiredCanaries {
    return nil
}

Let me know if this makes sense!

@lgfa29
Contributor

lgfa29 commented Aug 3, 2022

Hi @chuckyz 👋

Great investigation! This looks good to me. I think the only change is that you would want to check for the inverse condition: len(dstate.PlacedCanaries) < dstate.DesiredCanaries (return early if we haven't placed enough canaries yet).
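
Concretely, something like this (a minimal sketch that only flips the guard in your snippet, assuming the same dstate and allocs variables; not tested against the real scheduler code):

if !dstate.AutoPromote || len(dstate.PlacedCanaries) < dstate.DesiredCanaries {
    // Auto-promote is off, or not all desired canaries have been
    // placed yet, so don't try to promote on this pass.
    return nil
}

healthyCanaries := 0
// PlacedCanaries may include replaced canaries, so count how many
// are currently healthy rather than comparing slice lengths.
for _, c := range dstate.PlacedCanaries {
    for _, a := range allocs {
        if c == a.ID && a.DeploymentStatus.IsHealthy() {
            healthyCanaries += 1
        }
    }
}

if healthyCanaries != dstate.DesiredCanaries {
    return nil
}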

I don't know if you have an easy repro, but just for the record, this is the job I used:

job "canary" {
  datacenters = ["dc1"]

  meta {
    uuid = uuidv4()
  }

  group "canary" {
    count = 3

    restart {
      attempts = 1
    }

    update {
      max_parallel     = 3
      canary           = 3
      auto_promote     = true
      min_healthy_time = "2s"
    }

    task "canary" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

if [[ $NOMAD_ALLOC_ID =~ ^[a-fA-F] ]]; then
  echo "alloc ID starts with letter, bye"
  exit 1
fi

echo "alloc ID doesn't start with letter"
while true; do
  sleep 5
done
EOF

        destination = "local/script.sh"
      }
    }
  }
}

It takes a bit of luck to trigger a failure, but you can just run the job multiple times; the meta block makes sure each run is a unique version.

Feel free to open a PR with this patch 🙂

@chuckyz
Contributor

chuckyz commented Aug 3, 2022

@lgfa29 opened! I've opened it ahead of testing it locally so that I can get some eyeballs on the test, because that exact test is... complicated and I'd be lying if I said I truly understood what I just read/wrote.

edit:
bug: fixed ✅


@github-actions

github-actions bot commented Dec 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2022