Multi task-groups jobs hit canary progress deadline in promote #7058

Closed · luca-eb opened this issue Feb 3, 2020 · 8 comments · Fixed by #10042
Labels: hcc/cst, hcc/sme, stage/accepted, theme/deployments

luca-eb commented Feb 3, 2020

Nomad version

Nomad v0.9.5

Operating system and Environment details

Ubuntu 16.04

Issue

Jobs with multiple task groups are failing due to the progress deadline.

Reproduction steps

Create a job with two task groups, using the default progress deadline of 10m and a canary deployment for both task groups. Start the deployment. After the canaries are running and healthy, wait until you hit the deadline (which should no longer apply at this point, since it refers to healthy instances). Promote the deployment. The deployment fails with Failed due to progress deadline.

I am not able to reproduce this on every attempt, which hints at a possible secondary root cause that I haven't identified just yet.

We initially thought it was due to #4738, but we have now upgraded to a version that includes that fix and we are still seeing this. Furthermore, I was able to reproduce this for deployments where no canary instances had failed.

Last but not least, I haven't been able to reproduce this at all for jobs with only one task group.

Job file

job "test_service" {
  datacenters = ["aws-us-east-1"]
  region = "us-east"

  constraint {
    attribute = "${attr.unique.hostname}"
    operator  = "regexp"
    value     = "soa-svccpu"
  }

  constraint {
    attribute = "${attr.os.name}"
    value     = "ubuntu"
  }

  constraint {
    attribute = "${attr.os.version}"
    value     = "16.04"
  }
  
  migrate {
    max_parallel = 3
  }

  spread {
    "attribute" = "${meta.iman_location}"
  }

  group "first_group" {
    count = 90

    update {
      canary = 3
      stagger = "15s"
      max_parallel = 6
      min_healthy_time = "2s"
    }

    […]
  }

  group "second_group" {
    count = 12

    update {
      canary = 3
      stagger = "15s"
      max_parallel = 3
    }

    […]
  }
}

Solutions discussion

I've already verified that promoting prior to the deadline and setting the deadline to 0 (so the job fails whenever a new instance fails) both work as expected.
I am going to try and provide you all with some DEBUG logs.

@luca-eb luca-eb changed the title Nomad Multi task-groups jobs hit canary progress deadline in promote Feb 3, 2020
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Feb 4, 2020
@BijanJohn (Contributor) commented

I am encountering the same issue. We have a particular job with 20 allocations of a single task group and another task group with a single task/allocation. We typically like to deploy a canary (with a single alloc of each task group) and then try to promote the next day. The behavior we see is that the deployment fails immediately with the same error message, Failed due to progress deadline; however, the next allocation in the deployment does get deployed and does go healthy. We have seen this behavior in Nomad 0.9.5, but I will be testing the same behavior in 0.10.5.


stale bot commented May 6, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!


stale bot commented Jun 5, 2020

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed Jun 5, 2020
Nomad - Community Issues Triage automation moved this from Needs Triage to Done Jun 5, 2020
@tgross tgross added the hcc/cst, hcc/sme, and stage/needs-verification labels and removed the stage/needs-investigation and stage/waiting-reply labels Feb 5, 2021

tgross commented Feb 5, 2021

Stalebot hit this and shouldn't have. We need to do some follow-up here.


tgross commented Feb 5, 2021

Potentially related issues: #8139 (version 0.11.2) #8150 (versions 0.11.2 and 0.12.8). Private reports for this issue include versions 0.9.5, 0.10.4, and 0.11.1.


tgross commented Feb 12, 2021

Ok, so I've spent a good bit of the week trying to figure this one out, and finally, with some clues from my colleague @rc407's initial investigation, I think I've got it. So I want to brain-dump what I know at this point.

An important detail to note here is that Nomad tracks the progress deadline for each task group separately even though there's only a single deployment for a given version of a job. You can configure each task group with its own update block, each with its own progress_deadline.

In the easy case without canaries, once we pass the progress deadline the deployment watcher's shouldFail function will return true and we'll just exit the watch.

When canaries are in play, shouldFail only returns true if the canaries aren't healthy. You can wait an arbitrary amount of time after that to promote the deployment. When you promote the deployment, that makes an update to the deployment object. Each time an allocation becomes healthy the client will update the Allocation and also make an update to the Deployment object (in a transaction), which includes updating the task group's progress deadline.
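
To keep the rest of this concrete, here's a minimal sketch of that per-group bookkeeping. GroupState and markAllocHealthy are invented names for illustration only (the real per-group state is structs.DeploymentState in the state store), but they capture the behavior described above: each healthy allocation pushes the group's RequireProgressBy out by a full progress_deadline window.

package deploysketch

import "time"

// GroupState is an invented stand-in for Nomad's real per-task-group
// deployment state (structs.DeploymentState); it keeps only the fields this
// sketch needs.
type GroupState struct {
	Done              bool          // the group has finished its part of the deployment
	ProgressDeadline  time.Duration // the group's configured progress_deadline
	RequireProgressBy time.Time     // the absolute deadline derived from it
}

// markAllocHealthy models the bookkeeping described above: an allocation that
// becomes healthy at healthyAt pushes the group's deadline out by one full
// progress window.
func markAllocHealthy(g *GroupState, healthyAt time.Time) {
	g.RequireProgressBy = healthyAt.Add(g.ProgressDeadline)
}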

Any update to a deployment triggers the deployment status update channel in our watch function. This trigger calls getDeploymentProgressCutoff, which returns the lowest RequireProgressBy value for any group that hasn't yet finished. That last bit proves to be the important part. If the returned value has changed from the previously-stored deadline, the watcher will reset the deadline.
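
Sticking with the invented GroupState type from the sketch above, the cutoff calculation described here looks roughly like this; it's a paraphrase of the behavior, not the actual getDeploymentProgressCutoff source:

// progressCutoff returns the earliest RequireProgressBy among groups that have
// not yet finished. Finished groups contribute nothing, which turns out to be
// the important detail below.
func progressCutoff(groups map[string]*GroupState) time.Time {
	var next time.Time
	for _, g := range groups {
		if g.Done {
			continue
		}
		if next.IsZero() || g.RequireProgressBy.Before(next) {
			next = g.RequireProgressBy
		}
	}
	return next
}

// The watch loop then (roughly) does: if the recomputed cutoff differs from
// the stored deadline, reset the deadline timer to the new value.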

The problem is that when the group that set the deadline finishes, the next getDeploymentProgressCutoff call can return a value in the past. It looks like this is just barely possible in any deployment where updates are spread out close to the progress deadline, but it's much more likely to trigger when you have canaries which are manually promoted after the progress deadline.

Consider a job with two task groups:

job "example" {

  group "group1" {
    count = 1

    update {
      auto_revert      = false
      canary           = 1
      progress_deadline = "10m"
    }
  }

  group "group2" {
    count = 10

    update {
      auto_revert      = false
      canary           = 1
      progress_deadline = "10m"
    }
  }

}

The first group's canary becomes healthy at T01:00 and sets the group's RequireProgressBy to T11:00. The deployment watcher will see this new deadline and set its deadline to T11:00.

The second group's canary becomes healthy a minute later and sets the group's RequireProgressBy to T12:00. The deployment watcher compares the two and keeps the deadline at T11:00.

The progress deadline expires, and some time later at T15:00 we promote the deployment.

The promotion triggers the deployment update channel, but note that we don't yet have any new allocations. When we compare the groups, we query the first group and find that it's done, so it doesn't contribute a value. That means we only use the second group to set the deadline, and this is where we hit the bug: that value is a change from what we had, so we update the deadline and set it into the past! This immediately triggers our timeout and the deployment fails.
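
To see why a cutoff in the past fails the deployment instantly, here's a small standalone Go program (illustrative only, not Nomad code) showing that arming a timer against a deadline that has already passed fires it immediately:

package main

import (
	"fmt"
	"time"
)

func main() {
	// At "T15:00" the only unfinished group's RequireProgressBy is still
	// "T12:00", so the recomputed cutoff is three minutes in the past.
	staleCutoff := time.Now().Add(-3 * time.Minute)

	timer := time.NewTimer(time.Until(staleCutoff)) // non-positive duration
	select {
	case <-timer.C:
		fmt.Println("deadline hit: Failed due to progress deadline")
	case <-time.After(time.Second):
		fmt.Println("still waiting") // never reached; the timer fires immediately
	}
}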

So what do we do about it?

I think we can fix this by having manual promotion immediately reset the RequireProgressBy values for all incomplete task groups, rather than waiting for them to be updated by the allocation updates. This ensures the update is made atomically across all the groups, before we reset the timer in deploymentwatcher. Technically there's a behavior change here, in that it slightly cuts down the actual time we give for progress by the amount it takes to make a placement, but that's typically on the scale of <100ms.
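
Sketched against the same invented GroupState type as earlier (so a rough illustration of the idea, not the real state-store change), that first option amounts to something like:

// resetDeadlinesOnPromote gives every unfinished group a fresh progress window
// starting at the moment of promotion, in the same transaction that records
// the promotion, so a later recomputation of the cutoff can never land in the
// past.
func resetDeadlinesOnPromote(groups map[string]*GroupState, promotedAt time.Time) {
	for _, g := range groups {
		if g.Done {
			continue
		}
		g.RequireProgressBy = promotedAt.Add(g.ProgressDeadline)
	}
}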

Another option would be to change the if !next.Equal(currentDeadline) conditional to be if next.After(currentDeadline), but I need to spend a little time reasoning thru that to make sure it doesn't break something.

I've now finally got a failing test here which proves the problem out. I'm at the end of my week here but I'll be able to drop a patch sometime next week; it looks like it should be a pretty small intervention. My working branch is b-7058-progress-deadline.

test patch
diff --git a/nomad/deploymentwatcher/deployments_watcher_test.go b/nomad/deploymentwatcher/deployments_watcher_test.go
index 8777a398f..8fa635577 100644
--- a/nomad/deploymentwatcher/deployments_watcher_test.go
+++ b/nomad/deploymentwatcher/deployments_watcher_test.go
@@ -1334,6 +1334,216 @@ func TestDeploymentWatcher_PromotedCanary_UpdatedAllocs(t *testing.T) {
 	})
 }

+func TestDeploymentWatcher_ProgressDeadline_LatePromote(t *testing.T) {
+	t.Parallel()
+	require := require.New(t)
+	mtype := structs.MsgTypeTestSetup
+
+	w, m := defaultTestDeploymentWatcher(t)
+	w.SetEnabled(true, m.state)
+
+	m.On("UpdateDeploymentStatus", mocker.MatchedBy(func(args *structs.DeploymentStatusUpdateRequest) bool {
+		return true
+	})).Return(nil).Maybe()
+
+	progressTimeout := time.Millisecond * 10000
+	j := mock.Job()
+	j.TaskGroups[0].Name = "group1"
+	j.TaskGroups[0].Update = structs.DefaultUpdateStrategy.Copy()
+	j.TaskGroups[0].Update.MaxParallel = 2
+	j.TaskGroups[0].Update.AutoRevert = false
+	j.TaskGroups[0].Update.ProgressDeadline = progressTimeout
+	j.TaskGroups = append(j.TaskGroups, j.TaskGroups[0].Copy())
+	j.TaskGroups[0].Name = "group2"
+
+	d := mock.Deployment()
+	d.JobID = j.ID
+	d.TaskGroups = map[string]*structs.DeploymentState{
+		"group1": {
+			ProgressDeadline: progressTimeout,
+			Promoted:         false,
+			PlacedCanaries:   []string{},
+			DesiredCanaries:  1,
+			DesiredTotal:     3,
+			PlacedAllocs:     0,
+			HealthyAllocs:    0,
+			UnhealthyAllocs:  0,
+		},
+		"group2": {
+			ProgressDeadline: progressTimeout,
+			Promoted:         false,
+			PlacedCanaries:   []string{},
+			DesiredCanaries:  1,
+			DesiredTotal:     1,
+			PlacedAllocs:     0,
+			HealthyAllocs:    0,
+			UnhealthyAllocs:  0,
+		},
+	}
+
+	require.NoError(m.state.UpsertJob(mtype, m.nextIndex(), j))
+	require.NoError(m.state.UpsertDeployment(m.nextIndex(), d))
+
+	// require that we get a call to UpsertDeploymentPromotion
+	matchConfig := &matchDeploymentPromoteRequestConfig{
+		Promotion: &structs.DeploymentPromoteRequest{
+			DeploymentID: d.ID,
+			All:          true,
+		},
+		Eval: true,
+	}
+	matcher := matchDeploymentPromoteRequest(matchConfig)
+	m.On("UpdateDeploymentPromotion", mocker.MatchedBy(matcher)).Return(nil).Run(func(args mocker.Arguments) {
+		fmt.Println(d.ID)
+		//		spew.Dump(d)
+	})
+
+	m1 := matchUpdateAllocDesiredTransitions([]string{d.ID})
+	m.On("UpdateAllocDesiredTransition", mocker.MatchedBy(m1)).Return(nil)
+
+	// create canaries
+
+	now := time.Now()
+
+	canary1 := mock.Alloc()
+	canary1.Job = j
+	canary1.DeploymentID = d.ID
+	canary1.TaskGroup = "group1"
+	canary1.DesiredStatus = structs.AllocDesiredStatusRun
+	canary1.ModifyTime = now.UnixNano()
+
+	canary2 := mock.Alloc()
+	canary2.Job = j
+	canary2.DeploymentID = d.ID
+	canary2.TaskGroup = "group2"
+	canary2.DesiredStatus = structs.AllocDesiredStatusRun
+	canary2.ModifyTime = now.UnixNano()
+
+	allocs := []*structs.Allocation{canary1, canary2}
+	err := m.state.UpsertAllocs(mtype, m.nextIndex(), allocs)
+	require.NoError(err)
+
+	// 2nd group's canary becomes healthy
+
+	now = time.Now()
+
+	canary2 = canary2.Copy()
+	canary2.ModifyTime = now.UnixNano()
+	canary2.DeploymentStatus = &structs.AllocDeploymentStatus{
+		Canary:    true,
+		Healthy:   helper.BoolToPtr(true),
+		Timestamp: now,
+	}
+
+	allocs = []*structs.Allocation{canary2}
+	err = m.state.UpdateAllocsFromClient(mtype, m.nextIndex(), allocs)
+	require.NoError(err)
+
+	// wait for long enough to ensure we read deployment update channel
+	// this sleep creates the race condition associated with #7058
+	time.Sleep(50 * time.Millisecond)
+
+	// 1st group's canary becomes healthy
+	now = time.Now()
+
+	canary1 = canary1.Copy()
+	canary1.ModifyTime = now.UnixNano()
+	canary1.DeploymentStatus = &structs.AllocDeploymentStatus{
+		Canary:    true,
+		Healthy:   helper.BoolToPtr(true),
+		Timestamp: now,
+	}
+
+	allocs = []*structs.Allocation{canary1}
+	err = m.state.UpdateAllocsFromClient(mtype, m.nextIndex(), allocs)
+	require.NoError(err)
+
+	// ensure progress_deadline has definitely expired
+	time.Sleep(progressTimeout)
+
+	// promote the deployment
+
+	req := &structs.DeploymentPromoteRequest{
+		DeploymentID: d.ID,
+		All:          true,
+	}
+	err = w.PromoteDeployment(req, &structs.DeploymentUpdateResponse{})
+	require.NoError(err)
+
+	// wait for long enough to ensure we read deployment update channel
+	time.Sleep(50 * time.Millisecond)
+
+	// create new allocs for promoted deployment
+	// (these come from plan_apply, not a client update)
+	now = time.Now()
+
+	alloc1a := mock.Alloc()
+	alloc1a.Job = j
+	alloc1a.DeploymentID = d.ID
+	alloc1a.TaskGroup = "group1"
+	alloc1a.ClientStatus = structs.AllocClientStatusPending
+	alloc1a.DesiredStatus = structs.AllocDesiredStatusRun
+	alloc1a.ModifyTime = now.UnixNano()
+
+	alloc1b := mock.Alloc()
+	alloc1b.Job = j
+	alloc1b.DeploymentID = d.ID
+	alloc1b.TaskGroup = "group1"
+	alloc1b.ClientStatus = structs.AllocClientStatusPending
+	alloc1b.DesiredStatus = structs.AllocDesiredStatusRun
+	alloc1b.ModifyTime = now.UnixNano()
+
+	allocs = []*structs.Allocation{alloc1a, alloc1b}
+	err = m.state.UpsertAllocs(mtype, m.nextIndex(), allocs)
+	require.NoError(err)
+
+	// allocs become healthy
+
+	now = time.Now()
+
+	alloc1a = alloc1a.Copy()
+	alloc1a.ClientStatus = structs.AllocClientStatusRunning
+	alloc1a.ModifyTime = now.UnixNano()
+	alloc1a.DeploymentStatus = &structs.AllocDeploymentStatus{
+		Canary:    false,
+		Healthy:   helper.BoolToPtr(true),
+		Timestamp: now,
+	}
+
+	alloc1b = alloc1b.Copy()
+	alloc1b.ClientStatus = structs.AllocClientStatusRunning
+	alloc1b.ModifyTime = now.UnixNano()
+	alloc1b.DeploymentStatus = &structs.AllocDeploymentStatus{
+		Canary:    false,
+		Healthy:   helper.BoolToPtr(true),
+		Timestamp: now,
+	}
+
+	allocs = []*structs.Allocation{alloc1a, alloc1b}
+	err = m.state.UpdateAllocsFromClient(mtype, m.nextIndex(), allocs)
+	require.NoError(err)
+
+	// ensure any progress deadline has expired
+	time.Sleep(progressTimeout * 1)
+
+	// without a scheduler running we'll never mark the deployment as
+	// successful, so test that healthy == desired and that we haven't failed
+	deployment, err := m.state.DeploymentByID(nil, d.ID)
+	require.NoError(err)
+	require.Equal(structs.DeploymentStatusRunning, deployment.Status)
+
+	group1 := deployment.TaskGroups["group1"]
+
+	require.Equal(group1.DesiredTotal, group1.HealthyAllocs, "not enough healthy")
+	require.Equal(group1.DesiredTotal, group1.PlacedAllocs, "not enough placed")
+	require.Equal(0, group1.UnhealthyAllocs)
+
+	group2 := deployment.TaskGroups["group2"]
+	require.Equal(group2.DesiredTotal, group2.HealthyAllocs, "not enough healthy")
+	require.Equal(group2.DesiredTotal, group2.PlacedAllocs, "not enough placed")
+	require.Equal(0, group2.UnhealthyAllocs)
+}
+
 // Test scenario where deployment initially has no progress deadline
 // After the deployment is updated, a failed alloc's DesiredTransition should be set
 func TestDeploymentWatcher_Watch_StartWithoutProgressDeadline(t *testing.T) {

Failing results for that test, demonstrating the bug:

$ go test -v ./nomad/deploymentwatcher -run LatePromote -count=1
=== RUN   TestDeploymentWatcher_ProgressDeadline_LatePromote
=== PAUSE TestDeploymentWatcher_ProgressDeadline_LatePromote
=== CONT  TestDeploymentWatcher_ProgressDeadline_LatePromote
2021-02-12T14:28:59.124-0500 [TRACE] deploymentwatcher/deployment_watcher.go:481: deployments_watcher: resetting deadline: deployment_id=0230df49-a966-ec7a-082c-b509b6a6566f job="<ns: "default", id: "mock-service-bc698b95-0e14-5431-837a-5c3b46d1005f">"
2021-02-12T14:28:59.175-0500 [TRACE] deploymentwatcher/deployment_watcher.go:481: deployments_watcher: resetting deadline: deployment_id=0230df49-a966-ec7a-082c-b509b6a6566f job="<ns: "default", id: "mock-service-bc698b95-0e14-5431-837a-5c3b46d1005f">"
2021-02-12T14:29:09.126-0500 [DEBUG] deploymentwatcher/deployment_watcher.go:443: deployments_watcher: skipping deadline: deployment_id=0230df49-a966-ec7a-082c-b509b6a6566f job="<ns: "default", id: "mock-service-bc698b95-0e14-5431-837a-5c3b46d1005f">"
0230df49-a966-ec7a-082c-b509b6a6566f
2021-02-12T14:29:09.179-0500 [TRACE] deploymentwatcher/deployment_watcher.go:481: deployments_watcher: resetting deadline: deployment_id=0230df49-a966-ec7a-082c-b509b6a6566f job="<ns: "default", id: "mock-service-bc698b95-0e14-5431-837a-5c3b46d1005f">"
2021-02-12T14:29:09.179-0500 [DEBUG] deploymentwatcher/deployment_watcher.go:447: deployments_watcher: deadline hit: deployment_id=0230df49-a966-ec7a-082c-b509b6a6566f job="<ns: "default", id: "mock-service-bc698b95-0e14-5431-837a-5c3b46d1005f">" rollback=false
    deployments_watcher_test.go:1533:
                Error Trace:    deployments_watcher_test.go:1533
                Error:          Not equal:
                                expected: "running"
                                actual  : "failed"

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1 +1 @@
                                -running
                                +failed
                Test:           TestDeploymentWatcher_ProgressDeadline_LatePromote
--- FAIL: TestDeploymentWatcher_ProgressDeadline_LatePromote (20.11s)
FAIL
FAIL    github.com/hashicorp/nomad/nomad/deploymentwatcher      20.395s
FAIL

@tgross tgross added the stage/accepted label and removed the stage/needs-investigation and stage/needs-verification labels Feb 12, 2021
@BijanJohn (Contributor) commented

Fantastic troubleshooting here @tgross, thanks so much for digging into this. This bug has plagued us when we deploy one of our largest services with 2 task groups and canaries, where we want to wait a day before promoting the canary to allow more time to determine whether we want to promote our new version.

@tgross tgross added this to the 1.0.4 milestone Feb 17, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Feb 17, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Feb 17, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Feb 22, 2021
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022