
keep placed canaries aligned in raft store #6975

Merged · 3 commits into master from b-update-placed-canaries · Feb 3, 2020

Conversation

@drewbailey (Contributor) commented Jan 22, 2020

The Nomad state store must be modified through Raft. DeploymentState.PlacedCanaries is currently modified in place, which masks the bug in single-node clusters but leaves evaluations on other nodes unaware of the true state of canaries during a deployment.

This change reconciles the state of PlacedCanaries when committing deployment/alloc updates to Raft, instead of modifying the local state of a specific scheduler (see the sketch below).

fixes #6936
fixes #6864
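
To illustrate the direction of the change (a minimal sketch, not the actual diff; the reconcilePlacedCanaries helper and the trimmed-down types are assumptions for illustration), the idea is that the state store itself records a canary alloc when the update is applied, so every server that applies the Raft log converges on the same list:

package main

import "fmt"

// Simplified stand-ins for the Nomad structs; only the fields needed for the
// illustration are included.
type DeploymentState struct {
	PlacedCanaries []string
}

type DeploymentStatus struct {
	Canary bool
}

type Allocation struct {
	ID               string
	DeploymentStatus *DeploymentStatus
}

// reconcilePlacedCanaries appends the alloc ID only if the alloc is a canary
// and is not already tracked, so applying the same update twice is harmless.
func reconcilePlacedCanaries(state *DeploymentState, alloc *Allocation) {
	if alloc.DeploymentStatus == nil || !alloc.DeploymentStatus.Canary {
		return
	}
	for _, id := range state.PlacedCanaries {
		if id == alloc.ID {
			return // already recorded
		}
	}
	state.PlacedCanaries = append(state.PlacedCanaries, alloc.ID)
}

func main() {
	state := &DeploymentState{}
	canary := &Allocation{ID: "alloc-1", DeploymentStatus: &DeploymentStatus{Canary: true}}
	reconcilePlacedCanaries(state, canary)
	reconcilePlacedCanaries(state, canary) // idempotent: no duplicate entry
	fmt.Println(state.PlacedCanaries)      // [alloc-1]
}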

@drewbailey changed the title from "keep placed canaries aligned with alloc status" to "keep placed canaries aligned in raft store" on Jan 22, 2020
@notnoop (Contributor) left a comment


Approach and fix lgtm - good catch!

Can we add a test here? Maybe a unit test in state_store_test.go? An integration test to highlight the problem and the fix might be too complicated, but I'm not sure. I would suggest adding a test.

nomad/state/state_store.go: review thread (outdated, resolved)
nomad/state/state_store.go: review thread (outdated, resolved)
@tgross (Member) left a comment


LGTM.

But it looks like the test expectations for TestServiceSched_JobModify_Canaries should be changed now; probably a good idea to make sure we're capturing that same test's behavior for the state store instead, as @notnoop said.

@notnoop (Contributor) left a comment


LGTM - a couple of nitpick comments for the tests :) but looks good otherwise.

scheduler/generic_sched_test.go: review thread (outdated, resolved)
nomad/state/state_store_test.go: review thread (resolved)
@@ -3442,6 +3442,20 @@ func (s *StateStore) updateDeploymentWithAlloc(index uint64, alloc, existing *st
state.HealthyAllocs += healthy
state.UnhealthyAllocs += unhealthy

// Ensure PlacedCanaries accurately reflects the alloc canary status
if alloc.DeploymentStatus != nil && alloc.DeploymentStatus.Canary {
@notnoop (Contributor) commented:

Actually, do we need to update UpdateDeploymentPromotion so that canaries that get promoted are removed from PlacedCanaries?

@drewbailey (Contributor Author) commented:

I don't think so. Since UpdateDeploymentPromotion deals with promoting the deployment/canaries, I think the values for Placed shouldn't ever be decremented. If you want to see the status of a previous deployment, the value for Placed should be the total number of canaries that were placed.

@notnoop (Contributor) commented:

Umm... I'm a bit concerned about the reconciler logic. Currently, handleGroupCanaries returns canaries by checking PlacedCanaries exclusively. If a canary got promoted, how do we ensure that the scheduler no longer treats it as one?

@drewbailey (Contributor Author) commented:

I believe (and can follow up to confirm) that canary promotion only happens when a deployment is promoted. If there is an old deployment with non-promoted canaries, or the deployment has failed, the reconciler stops the PlacedCanaries that haven't been promoted:

	// Cancel any non-promoted canaries from the older deployment
	if a.oldDeployment != nil {
		for _, s := range a.oldDeployment.TaskGroups {
			if !s.Promoted {
				stop = append(stop, s.PlacedCanaries...)
			}
		}
	}

	// Cancel any non-promoted canaries from a failed deployment
	if a.deployment != nil && a.deployment.Status == structs.DeploymentStatusFailed {
		for _, s := range a.deployment.TaskGroups {
			if !s.Promoted {
				stop = append(stop, s.PlacedCanaries...)
			}
		}
	}

And then after the deployment is promoted, the promotion eval includes the promoted deployment canaries as canaries, but further down canaryState will be false:

	canaryState := dstate != nil && dstate.DesiredCanaries != 0 && !dstate.Promoted

This is used in computeStop to stop the previous version's non-canary allocs. I think after all of this the deployment is marked as successful, and even though PlacedCanaries stays populated, we check the deployment status and no new canaries are ever placed until a new deployment.
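
To make that gating concrete, here's a toy example (simplified type, not the scheduler's actual code) showing that once Promoted is true the quoted expression evaluates to false even though PlacedCanaries stays populated:

package main

import "fmt"

// Minimal stand-in for structs.DeploymentState with only the fields the
// quoted expression uses.
type DeploymentState struct {
	DesiredCanaries int
	Promoted        bool
	PlacedCanaries  []string
}

func main() {
	dstate := &DeploymentState{
		DesiredCanaries: 4,
		Promoted:        true, // deployment has been promoted
		PlacedCanaries:  []string{"a", "b", "c", "d"},
	}
	// Same expression as quoted above.
	canaryState := dstate != nil && dstate.DesiredCanaries != 0 && !dstate.Promoted
	fmt.Println(canaryState) // false: the scheduler places no further canaries
}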

@notnoop (Contributor) left a comment


lgtm. I'm still confused about PlacedCanaries and promoted canaries, but this PR doesn't change that logic - we don't have to address the question in this PR.

@@ -529,10 +529,6 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
// If we are placing a canary and we found a match, add the canary
// to the deployment state object and mark it as a canary.
if missing.Canary() && s.deployment != nil {
if state, ok := s.deployment.TaskGroups[tg.Name]; ok {
state.PlacedCanaries = append(state.PlacedCanaries, alloc.ID)
}
@tgross (Member) commented:

I'm unclear why an in-place update is inappropriate or the source of a bug here. It's hard to follow this flow, but s.deployment is set by the reconciler, which creates a copy of the deployment:

deployment: deployment.Copy(),

This seems like an intentional decision to make deployments safe for in-place updates, and it appears s.deployment == s.Plan.Deployment, which means it will get submitted along with the plan.

If that's not the case I'd love to see a little more cleanup or commenting in the code to help future developers. Maybe an explicit comment on the s.deployment field about its mutability, and removing the deployment.Copy() if it is immutable.

@drewbailey (Contributor Author) commented:

So there is definitely some misdirection going on with all these deployments. I'm going to outline my understanding of them and then see about commenting or cleaning up.

but s.deployment is set by the reconciler which creates a copy of the deployment

The scheduler deployment is actually not set by the reconciler; the reconciler copies the scheduler deployment into its own deployment, uses it as a read-only reference to the current state of the deployment, and returns a results.deployment and results.deploymentUpdates:

https://github.com/hashicorp/nomad/blob/master/scheduler/generic_sched.go#L355-L357

Currently the reconciler will only return a results.deployment if the deployment is newly created; otherwise it returns only deploymentUpdates.

So the following logic is mutating state that never gets written to the Raft log, because only changes on s.plan are ultimately submitted. In a single-node cluster the state remains locally modified and shows the correct number of canaries, but in multi-node clusters a different worker won't have these changes:

if state, ok := s.deployment.TaskGroups[tg.Name]; ok {
	state.PlacedCanaries = append(state.PlacedCanaries, alloc.ID)
}
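
To restate the failure mode as a toy model (simplified types and hypothetical function names, not Nomad's real plan-apply path): the append above only touches the scheduler's local copy, and since only the plan goes through Raft, other servers never see it unless the list is derived while applying the plan in the state store.

package main

import "fmt"

type DeploymentState struct{ PlacedCanaries []string }
type Deployment struct{ TaskGroups map[string]*DeploymentState }

// Plan is the only thing replicated through Raft in this toy model; the
// scheduler's local deployment copy is not part of it (unless the deployment
// is newly created).
type Plan struct{ CanaryAllocIDs []string }

// Buggy direction: append to the scheduler's local copy; nothing in the plan
// records the canary.
func placeCanaryLocally(local *Deployment, tg, allocID string) {
	st := local.TaskGroups[tg]
	st.PlacedCanaries = append(st.PlacedCanaries, allocID)
}

// Fixed direction: the state store derives PlacedCanaries while applying the
// plan, so every server that applies the same Raft log ends up with the same list.
func applyPlan(serverState *Deployment, tg string, p *Plan) {
	st := serverState.TaskGroups[tg]
	st.PlacedCanaries = append(st.PlacedCanaries, p.CanaryAllocIDs...)
}

func main() {
	local := &Deployment{TaskGroups: map[string]*DeploymentState{"group": {}}}
	other := &Deployment{TaskGroups: map[string]*DeploymentState{"group": {}}} // another server's view

	placeCanaryLocally(local, "group", "alloc-1")
	fmt.Println(local.TaskGroups["group"].PlacedCanaries) // [alloc-1], but only on this server
	fmt.Println(other.TaskGroups["group"].PlacedCanaries) // [], the mutation was never replicated

	plan := &Plan{CanaryAllocIDs: []string{"alloc-1"}}
	applyPlan(other, "group", plan)
	fmt.Println(other.TaskGroups["group"].PlacedCanaries) // [alloc-1], consistent after the Raft apply
}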

@drewbailey added this to the 0.10.4 milestone on Feb 3, 2020
@drewbailey (Contributor Author) commented Feb 3, 2020

Leaving reproduction steps for testing purposes

Set up and run a local dev cluster (https://github.com/hashicorp/nomad/blob/master/dev/cluster/cluster.sh).

Run the initial job, repro.hcl:
job "failing-nomad-test01" {
  datacenters = ["dc1"]
  type        = "service"

  constraint {
    attribute = "${meta.tag}"
    value     = "foo"
  }

  update {
    health_check      = "checks"
    max_parallel      = 4
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = false
    canary            = 4
  }

  migrate {
    max_parallel     = 3
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "group" {
    count = 4

    restart {
      attempts = 0
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "rails" {
      driver = "docker"

      env {
        RAILS_ENV                = "production"
        RAILS_SERVE_STATIC_FILES = "1"
        RAILS_LOG_TO_STDOUT      = "1"
        TEST                     = "2020-01-09 01:23:18 +0000"
        version                  = "129"
      }

      config {
        image = "kaspergrubbe/diceapp:0.0.6"

        command = "bundle"
        args    = ["exec", "unicorn", "-c", "/app/config/unicorn.rb"]

        port_map {
          web = 8080
        }
      }

      resources {
        cpu    = 750
        memory = 250

        network {
          mbits = 50
          port  "web" {}
        }
      }

      service {
        name = "failing-nomad-test00"
        tags = []
        port = "web"

        check {
          name     = "failing-nomad-test00 healthcheck"
          type     = "http"
          protocol = "http"
          path     = "/"
          interval = "5s"
          timeout  = "3s"
        }
      }
    }
  }
}

Wait for the initial deployment to become healthy: nomad job status -verbose failing-nomad-test01

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
group       4        4       4        0          2020-02-03T13:41:42-05:00

Allocations
ID                                    Eval ID                               Node ID                               Node Name  Task Group  Version  Desired  Status   Created                    Modified
3c7eb6cb-aea9-055f-96e6-760eb3526d4a  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:42-05:00
3f34931c-0e25-bd86-7ca1-1535d945a180  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:41-05:00
5099bcaf-336b-99ad-40c7-dcb20d39fd47  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:39-05:00
d48180bc-021e-93c0-0569-9812329a9d21  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:39-05:00
Start a deployment with the failing job, nomad job run repro-fail.hcl:
job "failing-nomad-test01" {
  datacenters = ["dc1"]
  type        = "service"

  constraint {
    attribute = "${meta.tag}"
    value     = "foo"
  }

  update {
    health_check      = "checks"
    max_parallel      = 4
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = false
    canary            = 4
  }

  migrate {
    max_parallel     = 3
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "group" {
    count = 4

    restart {
      attempts = 0
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "rails" {
      driver = "docker"

      env {
        RAILS_ENV                = "production"
        RAILS_SERVE_STATIC_FILES = "1"
        RAILS_LOG_TO_STDOUT      = "1"
        TEST                     = "2020-01-09 01:23:18 +0000"
      }

      config {
        image = "kaspergrubbe/diceapp:0.0.6"

        command = "false"

        port_map {
          web = 8080
        }
      }

      resources {
        cpu    = 750
        memory = 250

        network {
          mbits = 50
          port  "web" {}
        }
      }

      service {
        name = "failing-nomad-test00"
        tags = []
        port = "web"

        check {
          name     = "failing-nomad-test00 healthcheck"
          type     = "http"
          protocol = "http"
          path     = "/"
          interval = "5s"
          timeout  = "3s"
        }
      }
    }
  }
}

Observe 4 failed canaries and a delayed rescheduling attempt:

Future Rescheduling Attempts
Task Group  Reschedule Policy                                     Eval ID                               Eval Time
group       unlimited with exponential delay, max_delay = 1h0m0s  1f3bc3c6-5a29-2e7f-ddda-2ba93a51b8e6  25s from now

Latest Deployment
ID          = 533e4c07-acfc-444a-01ee-ecd93829e4c4
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
group       false     4        4         4       0        4          2020-02-03T13:42:56-05:00

Allocations
ID                                    Eval ID                               Node ID                               Node Name  Task Group  Version  Desired  Status   Created                    Modified
3a19b38c-e587-3a5c-9bac-c7dc9f80da70  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed   2020-02-03T13:32:56-05:00  2020-02-03T13:32:59-05:00
e3396874-555d-5651-81e0-99a4ef74826d  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed   2020-02-03T13:32:56-05:00  2020-02-03T13:32:59-05:00
9f00f827-63ba-da46-8156-c9c5413bf1f4  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed   2020-02-03T13:32:56-05:00  2020-02-03T13:32:59-05:00
b1a8d187-5edd-688f-e3e2-19b19f5d5e0c  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed   2020-02-03T13:32:56-05:00  2020-02-03T13:32:59-05:00
3f34931c-0e25-bd86-7ca1-1535d945a180  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:41-05:00
5099bcaf-336b-99ad-40c7-dcb20d39fd47  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:39-05:00
d48180bc-021e-93c0-0569-9812329a9d21  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:39-05:00
3c7eb6cb-aea9-055f-96e6-760eb3526d4a  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        run      running  2020-02-03T13:31:22-05:00  2020-02-03T13:31:42-05:00

Once the rescheduled eval runs (before the patch), it kills off the 4 healthy previous allocs instead of placing 4 new canaries:

Latest Deployment
ID          = 533e4c07-acfc-444a-01ee-ecd93829e4c4
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
group       false     4        4         8       0        8          2020-02-03T13:42:56-05:00

Allocations
ID                                    Eval ID                               Node ID                               Node Name  Task Group  Version  Desired  Status    Created                    Modified
9f2a9488-0d81-ebb2-be52-c0e00d541ae9  1f3bc3c6-5a29-2e7f-ddda-2ba93a51b8e6  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed    2020-02-03T13:33:27-05:00  2020-02-03T13:33:30-05:00
b9b4bd00-510a-9846-9cac-60f34eeb02c1  1f3bc3c6-5a29-2e7f-ddda-2ba93a51b8e6  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed    2020-02-03T13:33:27-05:00  2020-02-03T13:33:30-05:00
60ed109c-3989-9ab6-4922-edbffd600efe  1f3bc3c6-5a29-2e7f-ddda-2ba93a51b8e6  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed    2020-02-03T13:33:27-05:00  2020-02-03T13:33:30-05:00
6566472b-433d-3d37-f610-0492a0ef187f  1f3bc3c6-5a29-2e7f-ddda-2ba93a51b8e6  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        run      failed    2020-02-03T13:33:27-05:00  2020-02-03T13:33:30-05:00
3a19b38c-e587-3a5c-9bac-c7dc9f80da70  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        stop     failed    2020-02-03T13:32:56-05:00  2020-02-03T13:33:27-05:00
b1a8d187-5edd-688f-e3e2-19b19f5d5e0c  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        stop     failed    2020-02-03T13:32:56-05:00  2020-02-03T13:33:27-05:00
9f00f827-63ba-da46-8156-c9c5413bf1f4  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        stop     failed    2020-02-03T13:32:56-05:00  2020-02-03T13:33:27-05:00
e3396874-555d-5651-81e0-99a4ef74826d  7a78c791-e553-3d4a-f4f1-9426e2b5c3dd  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       1        stop     failed    2020-02-03T13:32:56-05:00  2020-02-03T13:33:27-05:00
3c7eb6cb-aea9-055f-96e6-760eb3526d4a  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        stop     complete  2020-02-03T13:31:22-05:00  2020-02-03T13:33:31-05:00
3f34931c-0e25-bd86-7ca1-1535d945a180  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        stop     complete  2020-02-03T13:31:22-05:00  2020-02-03T13:33:31-05:00
5099bcaf-336b-99ad-40c7-dcb20d39fd47  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        stop     complete  2020-02-03T13:31:22-05:00  2020-02-03T13:33:31-05:00
d48180bc-021e-93c0-0569-9812329a9d21  42de1870-4824-eb91-81c1-5ee7dd22d57d  d6b606f2-7e0d-2a28-14a1-e0e2fe21caec  client1    group       0        stop     complete  2020-02-03T13:31:22-05:00  2020-02-03T13:33:31-05:00

After the patch, once the failed deployment's reschedule eval runs, the state should look similar to:

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
group       false     4        4         8       0        8          2020-02-03T13:47:52-05:00

Allocations
ID                                    Eval ID                               Node ID                               Node Name  Task Group  Version  Desired  Status   Created                    Modified
19d083df-aaab-2d0d-ac92-a903b2668daf  2f952062-236a-19b1-dde0-b87c812b24f4  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        run      failed   2020-02-03T13:38:23-05:00  2020-02-03T13:38:26-05:00
3961fd9a-1afb-729f-fe64-7304800f97a3  2f952062-236a-19b1-dde0-b87c812b24f4  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        run      failed   2020-02-03T13:38:23-05:00  2020-02-03T13:38:26-05:00
72188056-c001-90df-ef9f-4c69a19401c4  2f952062-236a-19b1-dde0-b87c812b24f4  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        run      failed   2020-02-03T13:38:23-05:00  2020-02-03T13:38:26-05:00
fae75b88-6afa-1983-3542-3537b5771a2e  2f952062-236a-19b1-dde0-b87c812b24f4  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        run      failed   2020-02-03T13:38:23-05:00  2020-02-03T13:38:26-05:00
de5a77cf-611a-c5a2-2f2a-0826b2301070  b7a73cde-e991-3761-6acd-a063e91c5b73  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        stop     failed   2020-02-03T13:37:52-05:00  2020-02-03T13:38:23-05:00
5ee32ced-1dbd-c73d-34ec-34779f2a5769  b7a73cde-e991-3761-6acd-a063e91c5b73  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        stop     failed   2020-02-03T13:37:52-05:00  2020-02-03T13:38:23-05:00
7cbee00a-685f-873e-f049-ee3e961082ed  b7a73cde-e991-3761-6acd-a063e91c5b73  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        stop     failed   2020-02-03T13:37:52-05:00  2020-02-03T13:38:23-05:00
c12fb7e1-8983-52bd-c1cd-a6e3d4558f73  b7a73cde-e991-3761-6acd-a063e91c5b73  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       1        stop     failed   2020-02-03T13:37:52-05:00  2020-02-03T13:38:23-05:00
38de6ad5-5454-de4d-5a59-0925324d3343  558d9046-f009-1413-5f8c-a72d070eac94  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       0        run      running  2020-02-03T13:37:22-05:00  2020-02-03T13:37:42-05:00
95cbecb4-55f4-4238-e441-33e051369c9f  558d9046-f009-1413-5f8c-a72d070eac94  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       0        run      running  2020-02-03T13:37:22-05:00  2020-02-03T13:37:37-05:00
5aa2911e-bdb3-c8fa-5aba-0670aeae451f  558d9046-f009-1413-5f8c-a72d070eac94  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       0        run      running  2020-02-03T13:37:22-05:00  2020-02-03T13:37:38-05:00
e51c5311-6c70-7c54-5b19-47597494b31c  558d9046-f009-1413-5f8c-a72d070eac94  352beb8d-6906-2997-1cf6-0777e2a37b69  client1    group       0        run      running  2020-02-03T13:37:22-05:00  2020-02-03T13:37:42-05:00

@drewbailey merged commit c038ee0 into master on Feb 3, 2020
@drewbailey deleted the b-update-placed-canaries branch on February 3, 2020 19:24
@github-actions (bot) commented:
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Jan 18, 2023