
Autopromote fails when an allocation goes unhealthy but is properly replaced #8150

Closed
djenriquez opened this issue Jun 10, 2020 · 11 comments · Fixed by #14001

Comments

@djenriquez

Nomad version

nomad -v
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Operating system and Environment details

Amazon Linux 2

Issue

A job set to autopromote ended up requiring manual promotion to complete the deployment. The only thing different in this case was that one of the allocations was unhealthy, so Nomad replaced that allocation. The new allocation became healthy and the healthy threshold was met. However, rather than autopromoting, the deployment stalled.

The deployment status reported "Deployment is running pending automatic promotion" when queried, but the deployment only completed successfully once we clicked "Promote" in the UI.

[Screenshot: deployment status, 2020-06-10 at 1:16:49 PM]

In this deployment, I clicked "Promote" at 1:07 PM PT, which is when the deployment completed. Notice that the progress_deadline values were about half an hour before this. This tells me the canary allocations satisfied the healthy threshold in time (otherwise the deployment would have failed), but for some reason still required a manual promotion.

Unfortunately, I'm not sure how to replicate this, as we only see this issue a small percentage of the time; but given how many deployments we run, it shows up a few times a week.

@djenriquez djenriquez changed the title Autopromote fails when an allocation goes unhealthy, but is properly replaced Autopromote fails when an allocation goes unhealthy but is properly replaced Jun 10, 2020
@djenriquez
Author

FYI, I just saw this happen a few times today. It seems pretty consistent: when an allocation fails and is replaced, and the replacement goes healthy, the "automatic promotion" still needs a manual promotion.

Before clicking "Promote":
[Screenshot: deployment status before promotion, 2020-06-11 at 4:00:13 PM]

After clicking "Promote":

[Screenshot: deployment status after promotion, 2020-06-11 at 4:02:03 PM]

@djenriquez
Author

djenriquez commented Jan 12, 2021

Hi, any thoughts on this bug? We still see it today with v0.12.8.

@tgross
Member

tgross commented Feb 8, 2021

Currently investigating this along with #7058, which appears to be at least somewhat related.

@tgross tgross self-assigned this Feb 8, 2021
@tgross tgross added this to In Progress in Nomad - Community Issues Triage Feb 12, 2021
@tgross
Member

tgross commented Feb 26, 2021

Update on that #7058 investigation: that fix has shipped, but it specifically had to do with the progress deadline not being set, so I doubt it's related at this point. I'm going to try to put together a minimal reproduction before putting this on the roadmap for development, but if you have one I'd be happy to validate it.

@djenriquez
Author

Thanks for looking into this, tgross. We definitely do see this issue still and have just been kicking the deploy by changing the count on a taskgroup when it happens.

I'll refresh my original investigation and see if there was anything I did to try to reproduce it or if I had just submitted evidence when the situation occurred.

@tgross tgross removed their assignment May 20, 2021
@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage May 20, 2021
@scottherlihy

Any update on this? We are still seeing it.

@Hamitamaru

I'm seeing something similar fairly frequently.

I start a deployment, and then:

First alloc

Recent Events:
Time                       Type             Description
2022-01-31T22:09:23-08:00  Killing          Sent interrupt. Waiting 5s before force killing
2022-01-31T22:09:21-08:00  Alloc Unhealthy  Unhealthy because of failed task
2022-01-31T22:09:21-08:00  Not Restarting   Error was unrecoverable
2022-01-31T22:09:21-08:00  Driver Failure   failed to create container: container already exists
2022-01-31T22:01:16-08:00  Driver           Downloading image
2022-01-31T22:01:16-08:00  Task Setup       Building Task Directory
2022-01-31T22:01:16-08:00  Received         Task received by client

After the first alloc fails, a second alloc attempt is made.

Replacement alloc

Recent Events:
Time                       Type        Description
2022-01-31T22:09:55-08:00  Started     Task started by client
2022-01-31T22:09:51-08:00  Driver      Downloading image
2022-01-31T22:09:51-08:00  Task Setup  Building Task Directory
2022-01-31T22:09:51-08:00  Received    Task received by client

The second alloc attempt succeeds, and yet the deployment is stuck:

Latest Deployment
ID          = 10f74cb9
Status      = running
Description = Deployment is running pending automatic promotion

Deployed
Task Group     Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
some-task-group  true         false     1        1         2       1        1          2022-02-01T06:20:48Z

I need to manually promote:

nomad deployment promote 10f74cb9-7a82-3138-8355-dd9294430bd0     
==> 2022-01-31T23:01:06-08:00: Monitoring evaluation "49a23b4e"
    2022-01-31T23:01:06-08:00: Evaluation triggered by job "some-app"
    2022-01-31T23:01:06-08:00: Evaluation within deployment: "10f74cb9"
==> 2022-01-31T23:01:07-08:00: Monitoring evaluation "49a23b4e"
    2022-01-31T23:01:07-08:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-31T23:01:07-08:00: Evaluation "49a23b4e" finished with status "complete"
==> 2022-01-31T23:01:07-08:00: Monitoring deployment "10f74cb9"
  ✓ Deployment "10f74cb9" successful

    2022-01-31T23:01:07-08:00
    ID          = 10f74cb9
    Job ID      = some-app
    Job Version = 5
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group       Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
    some-task-group  true         true      1        1         2       1        1          2022-01-31T23:11:06-08:00

@chuckyz
Contributor

chuckyz commented Aug 3, 2022

Looking at this:

if !dstate.AutoPromote || dstate.DesiredCanaries != len(dstate.PlacedCanaries) {

If my understanding is correct, PlacedCanaries holds the IDs of all canaries placed at any point in time, including those that are now unhealthy/replaced. This would mean that whenever more canaries than desired have been placed, this check always returns nil.

Using that, I'd propose the following change:

if !dstate.AutoPromote || dstate.DesiredCanaries < len(dstate.PlacedCanaries) {
    return nil
}

healthyCanaries := 0
// Find the health status of each canary
for _, c := range dstate.PlacedCanaries {
    for _, a := range allocs {
        if c == a.ID && a.DeploymentStatus.IsHealthy() {
            healthyCanaries += 1
        }
    }
}

if healthyCanaries != dstate.DesiredCanaries {
    return nil
}

Let me know if this makes sense!

@lgfa29
Contributor

lgfa29 commented Aug 3, 2022

Hi @chuckyz 👋

Great investigation! This looks good to me. I think the only change is that you would want to check for the inverse condition: len(dstate.PlacedCanaries) < dstate.DesiredCanaries (return early if we haven't placed enough canaries yet).
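
Concretely, something like this (a minimal sketch that only flips the guard in your snippet, assuming the same dstate and allocs variables; not tested against the real scheduler code):

if !dstate.AutoPromote || len(dstate.PlacedCanaries) < dstate.DesiredCanaries {
    // Auto-promote is off, or not all desired canaries have been
    // placed yet, so don't try to promote on this pass.
    return nil
}

healthyCanaries := 0
// PlacedCanaries may include replaced canaries, so count how many
// are currently healthy rather than comparing slice lengths.
for _, c := range dstate.PlacedCanaries {
    for _, a := range allocs {
        if c == a.ID && a.DeploymentStatus.IsHealthy() {
            healthyCanaries += 1
        }
    }
}

if healthyCanaries != dstate.DesiredCanaries {
    return nil
}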

I don't know if you have an easy repro, but just for the record, this is the job I used:

job "canary" {
  datacenters = ["dc1"]

  meta {
    uuid = uuidv4()
  }

  group "canary" {
    count = 3

    restart {
      attempts = 1
    }

    update {
      max_parallel     = 3
      canary           = 3
      auto_promote     = true
      min_healthy_time = "2s"
    }

    task "canary" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

if [[ $NOMAD_ALLOC_ID =~ ^[a-fA-F] ]]; then
  echo "alloc ID starts with letter, bye"
  exit 1
fi

echo "alloc ID doesn't start with letter"
while true; do
  sleep 5
done
EOF

        destination = "local/script.sh"
      }
    }
  }
}

It takes a bit of luck to trigger a failure, but you can just run the job multiple times; the meta block makes sure each run is a unique version.

Feel free to open a PR with this patch 🙂

@chuckyz
Contributor

chuckyz commented Aug 3, 2022

@lgfa29 opened! I've opened it ahead of testing it locally so that I can get some eyeballs on the test, because that exact test is... complicated and I'd be lying if I said I truly understood what I just read/wrote.

edit:
bug: fixed ✅


@github-actions

github-actions bot commented Dec 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2022