keep placed canaries aligned in raft store #6975

Merged: 3 commits, Feb 3, 2020
14 changes: 14 additions & 0 deletions nomad/state/state_store.go
@@ -3578,6 +3578,20 @@ func (s *StateStore) updateDeploymentWithAlloc(index uint64, alloc, existing *st
	state.HealthyAllocs += healthy
	state.UnhealthyAllocs += unhealthy

	// Ensure PlacedCanaries accurately reflects the alloc canary status
	if alloc.DeploymentStatus != nil && alloc.DeploymentStatus.Canary {
Contributor:

Actually, do we need to update UpdateDeploymentPromotion so that canaries that get promoted are removed from PlacedCanaries?

Contributor Author:

I don't think so. Since UpdateDeploymentPromotion deals with promoting the deployment/canaries, the values for Placed shouldn't ever be decremented. If you want to see the deployment status of a previous deployment, the value for Placed should be the total that were placed.
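
For reference, a minimal sketch of the structs.DeploymentState fields being discussed (abridged to the fields referenced in this PR; not the full struct definition):

	// Abridged sketch of structs.DeploymentState, limited to the fields that
	// appear in this PR. The Desired* fields describe the target, the
	// Placed/Healthy/Unhealthy fields accumulate over the life of the
	// deployment, and Promoted flips once the group's canaries are promoted.
	type DeploymentState struct {
		Promoted        bool     // set when the task group's canaries are promoted
		DesiredCanaries int      // number of canaries the group wants placed
		DesiredTotal    int      // total allocations desired for the group
		PlacedCanaries  []string // IDs of every canary placed for this deployment
		HealthyAllocs   int      // cumulative healthy allocation count
		UnhealthyAllocs int      // cumulative unhealthy allocation count
	}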

Contributor:

Umm.. I'm a bit concerned about the reconciler logic. Currently, handleGroupCanaries returns canaries by checking PlacedCanaries exclusively. If a canary got promoted, how do we ensure that the scheduler no longer treats it as one?

Contributor Author:

I believe (and can follow up to confirm) that canary promotion only happens when a deployment is promoted. If there is an old deployment with non-promoted canaries, or the deployment has failed, the reconciler uses the PlacedCanaries that haven't been promoted as the set of allocs to stop.

	// Cancel any non-promoted canaries from the older deployment
	if a.oldDeployment != nil {
		for _, s := range a.oldDeployment.TaskGroups {
			if !s.Promoted {
				stop = append(stop, s.PlacedCanaries...)
			}
		}
	}

	// Cancel any non-promoted canaries from a failed deployment
	if a.deployment != nil && a.deployment.Status == structs.DeploymentStatusFailed {
		for _, s := range a.deployment.TaskGroups {
			if !s.Promoted {
				stop = append(stop, s.PlacedCanaries...)
			}
		}
	}

And then after the deployment is promoted, the promotion eval includes the promoted deployment canaries as canaries, but further down canaryState will be false:

	canaryState := dstate != nil && dstate.DesiredCanaries != 0 && !dstate.Promoted

That canaryState is used in computeStop to stop the previous version's non-canary allocs. After all of this, the deploy is marked as successful, and even though PlacedCanaries stays populated, we check the deployment status and no new canaries are ever placed until a new deployment.
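
Editor's note: the pieces quoted above can be read together as the following sketch (a hypothetical consolidation, not actual Nomad code; the canariesToStop name and signature are invented). Once a group is promoted, both loops skip its PlacedCanaries and canaryState becomes false, so the scheduler stops treating those allocs as canaries even though PlacedCanaries stays populated.

	// canariesToStop is a hypothetical consolidation of the reconciler logic
	// quoted above. It returns the canaries to stop plus the group's canary
	// state, showing why promoted canaries are never stopped as canaries.
	func canariesToStop(old, current *structs.Deployment, group string) (stop []string, canaryState bool) {
		// Cancel any non-promoted canaries from the older deployment.
		if old != nil {
			for _, s := range old.TaskGroups {
				if !s.Promoted {
					stop = append(stop, s.PlacedCanaries...)
				}
			}
		}

		// Cancel any non-promoted canaries from a failed deployment.
		if current != nil && current.Status == structs.DeploymentStatusFailed {
			for _, s := range current.TaskGroups {
				if !s.Promoted {
					stop = append(stop, s.PlacedCanaries...)
				}
			}
		}

		// Mirrors the canaryState expression quoted above: only true while
		// canaries are desired and the group has not yet been promoted.
		var dstate *structs.DeploymentState
		if current != nil {
			dstate = current.TaskGroups[group]
		}
		canaryState = dstate != nil && dstate.DesiredCanaries != 0 && !dstate.Promoted
		return stop, canaryState
	}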

		found := false
		for _, canary := range state.PlacedCanaries {
			if alloc.ID == canary {
				found = true
				break
			}
		}
		if !found {
			state.PlacedCanaries = append(state.PlacedCanaries, alloc.ID)
		}
	}

	// Update the progress deadline
	if pd := state.ProgressDeadline; pd != 0 {
		// If we are the first placed allocation for the deployment start the progress deadline.
103 changes: 103 additions & 0 deletions nomad/state/state_store_test.go
@@ -6382,6 -6382,109 @@ func TestStateStore_UpsertDeploymentAllocHealth_BadAlloc_Nonexistent(t *testing.
	}
}

// Test that a deployment's PlacedCanaries is properly updated
func TestStateStore_UpsertDeploymentAlloc_Canaries(t *testing.T) {
	t.Parallel()

	state := testStateStore(t)

	// Create a deployment
	d1 := mock.Deployment()
	require.NoError(t, state.UpsertDeployment(2, d1))

	// Create a Job
	job := mock.Job()
	require.NoError(t, state.UpsertJob(3, job))

	// Create alloc with canary status
	a := mock.Alloc()
	a.JobID = job.ID
	a.DeploymentID = d1.ID
	a.DeploymentStatus = &structs.AllocDeploymentStatus{
		Healthy: helper.BoolToPtr(false),
		Canary:  true,
	}
	require.NoError(t, state.UpsertAllocs(4, []*structs.Allocation{a}))

	// Pull the deployment from state
	ws := memdb.NewWatchSet()
	deploy, err := state.DeploymentByID(ws, d1.ID)
	require.NoError(t, err)

	// Ensure that PlacedCanaries is accurate
	require.Equal(t, 1, len(deploy.TaskGroups[job.TaskGroups[0].Name].PlacedCanaries))

	// Create alloc without canary status
	b := mock.Alloc()
	b.JobID = job.ID
	b.DeploymentID = d1.ID
	b.DeploymentStatus = &structs.AllocDeploymentStatus{
		Healthy: helper.BoolToPtr(false),
		Canary:  false,
	}
	require.NoError(t, state.UpsertAllocs(4, []*structs.Allocation{b}))

	// Pull the deployment from state
	ws = memdb.NewWatchSet()
	deploy, err = state.DeploymentByID(ws, d1.ID)
	require.NoError(t, err)

	// Ensure that PlacedCanaries is accurate
	require.Equal(t, 1, len(deploy.TaskGroups[job.TaskGroups[0].Name].PlacedCanaries))

	// Create a second deployment
	d2 := mock.Deployment()
	require.NoError(t, state.UpsertDeployment(5, d2))

	c := mock.Alloc()
	c.JobID = job.ID
	c.DeploymentID = d2.ID
	c.DeploymentStatus = &structs.AllocDeploymentStatus{
		Healthy: helper.BoolToPtr(false),
		Canary:  true,
	}
	require.NoError(t, state.UpsertAllocs(6, []*structs.Allocation{c}))

	ws = memdb.NewWatchSet()
	deploy2, err := state.DeploymentByID(ws, d2.ID)
	require.NoError(t, err)

	// Ensure that PlacedCanaries is accurate
	require.Equal(t, 1, len(deploy2.TaskGroups[job.TaskGroups[0].Name].PlacedCanaries))
}

func TestStateStore_UpsertDeploymentAlloc_NoCanaries(t *testing.T) {
	t.Parallel()

	state := testStateStore(t)

	// Create a deployment
	d1 := mock.Deployment()
	require.NoError(t, state.UpsertDeployment(2, d1))

	// Create a Job
	job := mock.Job()
	require.NoError(t, state.UpsertJob(3, job))

	// Create alloc without canary status
	a := mock.Alloc()
	a.JobID = job.ID
	a.DeploymentID = d1.ID
	a.DeploymentStatus = &structs.AllocDeploymentStatus{
		Healthy: helper.BoolToPtr(true),
		Canary:  false,
	}
	require.NoError(t, state.UpsertAllocs(4, []*structs.Allocation{a}))

	// Pull the deployment from state
	ws := memdb.NewWatchSet()
	deploy, err := state.DeploymentByID(ws, d1.ID)
	require.NoError(t, err)

	// Ensure that PlacedCanaries is accurate
	require.Equal(t, 0, len(deploy.TaskGroups[job.TaskGroups[0].Name].PlacedCanaries))
}

// Test that allocation health can't be set for an alloc with mismatched
// deployment ids
func TestStateStore_UpsertDeploymentAllocHealth_BadAlloc_MismatchDeployment(t *testing.T) {
4 changes: 0 additions & 4 deletions scheduler/generic_sched.go
@@ -529,10 +529,6 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
			// If we are placing a canary and we found a match, add the canary
			// to the deployment state object and mark it as a canary.
			if missing.Canary() && s.deployment != nil {
				if state, ok := s.deployment.TaskGroups[tg.Name]; ok {
					state.PlacedCanaries = append(state.PlacedCanaries, alloc.ID)
				}
Member:

I'm unclear why an in-place update is inappropriate or the source of a bug here. It's hard to follow this flow, but s.deployment is set by the reconciler which creates a copy of the deployment:

	deployment: deployment.Copy(),

This seems like an intentional decision to make deployments safe for in-place updates, and it appears s.deployment == s.Plan.Deployment, which means it will get submitted along with the plan.

If that's not the case I'd love to see a little more cleanup or commenting in the code to help future developers. Maybe an explicit comment on the s.deployment field about its mutability, and removing the deployment.Copy() if it is immutable.

Contributor Author:

So there is definitely some misdirection going on with all these deployments. I'm going to outline my understanding of them and then see about commenting or cleaning up.

> but s.deployment is set by the reconciler which creates a copy of the deployment

The scheduler deployment is actually not set by the reconciler; the reconciler copies the scheduler deployment to its own deployment, uses it as a read-only reference to the current state of the deployment, and returns a results.deployment and results.deploymentUpdates:

https://github.com/hashicorp/nomad/blob/master/scheduler/generic_sched.go#L355-L357

Currently the reconciler will only return a results.deployment if the deployment is newly created; otherwise it will just be deploymentUpdates.

So the following logic mutates state that never gets written to the raft log, because only changes on s.plan are ultimately submitted. In a single-node cluster the state will remain locally modified and show the correct number of canaries, but in multi-node clusters a different worker won't have these changes:

if state, ok := s.deployment.TaskGroups[tg.Name]; ok {
	state.PlacedCanaries = append(state.PlacedCanaries, alloc.ID)
}
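
Editor's note: to make the resulting flow concrete, here is a simplified sketch (not a verbatim excerpt; the duplicate check and surrounding plumbing are elided) of how canary tracking works after this change: the scheduler only marks the allocation, and the state store derives PlacedCanaries when the plan is applied on every server.

	// In the scheduler (scheduler/generic_sched.go): only the allocation,
	// which travels with the submitted plan, is marked as a canary.
	alloc.DeploymentStatus = &structs.AllocDeploymentStatus{
		Canary: true,
	}

	// In the state store (nomad/state/state_store.go), applied when the plan
	// commits on every server: updateDeploymentWithAlloc appends the alloc to
	// the task group's PlacedCanaries (duplicate check elided here).
	if alloc.DeploymentStatus != nil && alloc.DeploymentStatus.Canary {
		state.PlacedCanaries = append(state.PlacedCanaries, alloc.ID)
	}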


				alloc.DeploymentStatus = &structs.AllocDeploymentStatus{
					Canary: true,
				}
27 changes: 19 additions & 8 deletions scheduler/generic_sched_test.go
@@ -2029,17 +2029,28 @@ func TestServiceSched_JobModify_Canaries(t *testing.T) {
	if plan.Deployment == nil {
		t.Fatalf("bad: %#v", plan)
	}
	state, ok := plan.Deployment.TaskGroups[job.TaskGroups[0].Name]
	if !ok {
		t.Fatalf("bad: %#v", plan)
	}
	if state.DesiredTotal != 10 && state.DesiredCanaries != desiredUpdates {
		t.Fatalf("bad: %#v", state)
	}

	// Ensure local state was not altered in scheduler
	staleState, ok := plan.Deployment.TaskGroups[job.TaskGroups[0].Name]
	require.True(t, ok)

	require.Equal(t, 0, len(staleState.PlacedCanaries))

	ws := memdb.NewWatchSet()

	// Grab the latest state
	deploy, err := h.State.DeploymentByID(ws, plan.Deployment.ID)
	require.NoError(t, err)

	state, ok := deploy.TaskGroups[job.TaskGroups[0].Name]
	require.True(t, ok)

	require.Equal(t, 10, state.DesiredTotal)
	require.Equal(t, state.DesiredCanaries, desiredUpdates)

	// Assert the canaries were added to the placed list
	if len(state.PlacedCanaries) != desiredUpdates {
		t.Fatalf("bad: %#v", state)
		assert.Fail(t, "expected PlacedCanaries to equal desiredUpdates", state)
	}
}
