Respect alloc job version for lost/failed allocs #8691

notnoop · 2020-08-19T14:17:31Z

This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest version, even if that version hasn't been
promoted yet.

Now, when generating a plan for lost/failed allocations, the scheduler
first checks if the current deployment is in the Canary stage, and if so, it
ensures that any lost/failed allocation is replaced by one with the latest
promoted version instead.

Implementation High Level

At a high level, the fix makes the following changes:

First, when rescheduling (or migrating) non-canary allocations, the reconciler marks the resulting allocPlaceResult struct with a downgrade indicator along with the minimum version.

The minimum version is used to ensure that no allocation is accidentally rolled back when rescheduling.

Second, when making the placement and computing resources, the generic scheduler finds the latest non-canary deployment associated with the task group and uses it for scheduling purposes.

Third, when inserting the resulting alloc in the plan, we ensure that the job info is attached to the alloc.

Typically, we don't populate the Allocation.Job field in the plan, as it's usually a duplicate, and the Nomad FSM ensures that the alloc.Job field is populated from the plan. Here, we want the alloc's job to differ from the plan's job.

Fourth, when applying the plan results to the FSM, Nomad will respect an alloc.Job field that is already set and avoid populating it from the plan.

This is already the current behavior, so there are no changes there, but the test ensures that's the case.
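The last step can be illustrated with a minimal sketch. The types and the `populateJobs` helper below are simplified stand-ins written for this illustration, not the actual Nomad FSM code; they only show the rule that an alloc which already carries a job keeps it, while an alloc without one inherits the plan's job.

```go
package main

import "fmt"

// Job and Allocation are pared-down stand-ins for the Nomad structs.
type Job struct {
	ID      string
	Version uint64
}

type Allocation struct {
	Name string
	Job  *Job // nil when the plan omitted it to avoid duplication
}

// populateJobs is a hypothetical helper imitating the apply step: an alloc
// that already carries a Job keeps it, and only allocs without one inherit
// the plan's job.
func populateJobs(planJob *Job, allocs []*Allocation) {
	for _, a := range allocs {
		if a.Job == nil {
			a.Job = planJob
		}
	}
}

func main() {
	v0 := &Job{ID: "example", Version: 0}
	v1 := &Job{ID: "example", Version: 1}

	allocs := []*Allocation{
		{Name: "example.a[0]", Job: v0}, // replacement pinned to the promoted version
		{Name: "example.a[1]"},          // ordinary placement, job left to the plan
	}
	populateJobs(v1, allocs)

	for _, a := range allocs {
		fmt.Printf("%s uses job version %d\n", a.Name, a.Job.Version)
	}
}
```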

FAQ

Why use the latest promoted or non-canary version? Why not the latest Stable job version?

The promotion semantics are somewhat group specific. Consider a job that has two TaskGroups, A and B, in Version 0, where Version 1 then updates A significantly (requiring a Canary) but only updates B's count and metadata. B allocations should always use Version 1, and non-Canary A allocations should use Version 0. Using the latest Stable job would not be appropriate here.

In this approach, does the scheduler respect the count and resources found in the latest promoted deployment?

The scheduler will use the resources of the appropriate job version for the alloc, if the resources changed. However, the latest job Count field is the canonical value, and this PR doesn't change that. If a job has TG count=3 in Version 0, and Version 1 changes the count to 2 and requests larger resources, a Version 0 allocation will be stopped immediately, so we'd have 2 Version 0 allocations along with some canaries.

Can this PR handle cases where we reschedule both Canary and non-Canary failed allocations?

Yes, when migrating allocations, the canary will be replaced by another canary, and non-canaries will be replaced by non-canary instances, each with the expected versions for them.

Fixes #8439

notnoop · 2020-08-19T14:19:10Z

scheduler/generic_sched.go

+				}
+
+				// Defensive check - if there is no appropriate deployment for this job, use the latest
+				if job != nil && job.Version >= missing.MinJobVersion() && job.LookupTaskGroup(tg.Name) != nil {


I made a few defensive checks here: if we see unexpected state (e.g. jobs without the expected TaskGroup, no non-promoted version), we fall back to using the latest version. This seems better than a panic, but I'm not sure if we should simplify this.

is this unexpected? for jobs without an update stanza, there won't be deployments, so downgradedJobForPlacement will return nil. (in that case, the latest job is exactly what we want.)

Yes, it's unexpected. For such jobs, missing.DowngradeNonCanary() should always be false.

notnoop · 2020-08-19T14:22:02Z

scheduler/reconcile.go

 	strategy := tg.Update
 	canariesPromoted := dstate != nil && dstate.Promoted
-	requireCanary := numDestructive != 0 && strategy != nil && len(canaries) < strategy.Canary && !canariesPromoted
+	requireCanary := (len(destructive) != 0 || (len(untainted) == 0 && len(migrate)+len(lost) != 0)) &&
+		strategy != nil && len(canaries) < strategy.Canary && !canariesPromoted


This is a semi-related band-aid that we'll probably need to investigate further. The code here determines if canaries are needed by checking if we have any destructive updates. However, if all allocations are dead (because the nodes are lost), len(destructive) will be 0. I changed the condition to account for such a scenario.

it might be nice to break this conditional up a bit, and capture some of what's going on here.
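One way the conditional could be split into named parts is sketched below. The `groupState` struct is a hypothetical container invented for this example (the reconciler works with alloc sets and the update strategy directly); only the boolean logic is meant to match the diff above.

```go
package main

import "fmt"

// groupState is a hypothetical bundle of the values the diff reads;
// the counts correspond to len(destructive), len(untainted), and so on.
type groupState struct {
	destructive, untainted, migrate, lost, canaries int

	desiredCanaries  int  // strategy.Canary
	hasStrategy      bool // strategy != nil
	canariesPromoted bool
}

func requireCanary(g groupState) bool {
	// A destructive update is pending for at least one allocation.
	hasDestructive := g.destructive != 0

	// Nothing untainted is left, yet allocations still need replacing,
	// for example because every node running the group was lost.
	allGoneNeedReplacement := g.untainted == 0 && g.migrate+g.lost != 0

	needsNewVersion := hasDestructive || allGoneNeedReplacement

	// The update strategy asks for more canaries than currently exist and
	// the existing ones have not been promoted.
	canariesOutstanding := g.hasStrategy && g.canaries < g.desiredCanaries && !g.canariesPromoted

	return needsNewVersion && canariesOutstanding
}

func main() {
	// All three allocations were lost together with their nodes.
	lostEverything := groupState{lost: 3, desiredCanaries: 1, hasStrategy: true}
	fmt.Println("canary required:", requireCanary(lostEverything))
}
```

Naming the intermediate conditions also gives each one a place for a comment explaining why it exists.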

notnoop · 2020-08-19T14:22:40Z

scheduler/reconcile.go

@@ -533,9 +533,12 @@ func (a *allocReconciler) computeGroup(group string, all allocSet) bool {
 		})
 		a.result.place = append(a.result.place, allocPlaceResult{
 			name:          alloc.Name,
-			canary:        false,
+			canary:        alloc.DeploymentStatus.IsCanary(),


The code here assumed that all alloc migrations are non-canary. An odd assumption.
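The new line calls IsCanary() directly on alloc.DeploymentStatus, which is a pointer. The sketch below assumes that IsCanary guards against a nil receiver, which is consistent with the equivalence noted in the later "simplify canary check" commit; the types here are simplified stand-ins, not the full Nomad structs.

```go
package main

import "fmt"

// AllocDeploymentStatus is a simplified stand-in for the Nomad struct.
type AllocDeploymentStatus struct {
	Canary bool
}

// IsCanary tolerates a nil receiver (assumed here), so callers do not need
// their own nil check before asking whether an alloc is a canary.
func (a *AllocDeploymentStatus) IsCanary() bool {
	if a == nil {
		return false
	}
	return a.Canary
}

type Allocation struct {
	Name             string
	DeploymentStatus *AllocDeploymentStatus
}

func main() {
	allocs := []Allocation{
		{Name: "no-status"}, // DeploymentStatus is nil
		{Name: "regular", DeploymentStatus: &AllocDeploymentStatus{}},
		{Name: "canary", DeploymentStatus: &AllocDeploymentStatus{Canary: true}},
	}
	for _, alloc := range allocs {
		fmt.Printf("%s canary=%t\n", alloc.Name, alloc.DeploymentStatus.IsCanary())
	}
}
```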

schmichael · 2020-08-21T18:41:06Z

scheduler/generic_sched.go

+		//
+		// Zero dstate.DesiredCanaries indicates that the TaskGroup allocates were updated in-place without using canaries.
+		if dstate := d.TaskGroups[tgName]; dstate != nil && (dstate.Promoted || dstate.DesiredCanaries == 0) {
+			job, err := s.state.JobByIDAndVersion(nil, ns, jobID, d.JobVersion)


Should we first compare d.JobVersion against s.job.Version and if they're equal: return nil since they're equivalent?

That's reasonable but also seems like a micro-optimization - I may consider it when addressing reviews.
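The lookup under discussion can be sketched roughly as follows. This is an illustration with stand-in types, and `latestNonCanaryVersion` is a hypothetical name; it assumes the deployments are already ordered newest first and returns only the version rather than fetching the job from state. The example data reuses the two task groups from the FAQ: group "a" has an unpromoted canary in version 1, while group "b" was updated without canaries.

```go
package main

import "fmt"

// DeploymentState and Deployment keep only the fields the diff reads.
type DeploymentState struct {
	Promoted        bool
	DesiredCanaries int
}

type Deployment struct {
	JobVersion uint64
	TaskGroups map[string]*DeploymentState
}

// latestNonCanaryVersion walks deployments (assumed sorted newest first) and
// returns the job version of the first one in which tgName was either
// promoted or updated without canaries.
func latestNonCanaryVersion(deployments []*Deployment, tgName string) (uint64, bool) {
	for _, d := range deployments {
		dstate := d.TaskGroups[tgName]
		if dstate != nil && (dstate.Promoted || dstate.DesiredCanaries == 0) {
			return d.JobVersion, true
		}
	}
	return 0, false
}

func main() {
	deployments := []*Deployment{
		{JobVersion: 1, TaskGroups: map[string]*DeploymentState{
			"a": {DesiredCanaries: 1}, // canary not promoted yet
			"b": {},                   // updated without canaries
		}},
		{JobVersion: 0, TaskGroups: map[string]*DeploymentState{
			"a": {},
			"b": {},
		}},
	}
	for _, tg := range []string{"a", "b"} {
		version, ok := latestNonCanaryVersion(deployments, tg)
		fmt.Printf("group %s: version=%d found=%t\n", tg, version, ok)
	}
}
```

A caller would fall back to the latest job when nothing is found, which is the defensive path discussed in the earlier thread.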

schmichael

Great work. I'm tempted to ask for refactoring the inner loop of computePlacements to make local variables and SetJob(...) state easier to follow, but I'm not sure it'd help readability.

Does this fix #8439? If so can we make the steps in that issue (or similar) into an e2e test? We didn't have the e2e infrastructure around when deployments were written, so it would be nice to backfill.

schmichael · 2020-08-21T18:43:03Z

scheduler/generic_sched.go

+					if job != nil {
+						jobVersion = int(job.Version)
+					}
+					s.logger.Warn("failed to find appropriate job; using the latest", "expected_version", missing.MinJobVersion, "found_version", jobVersion)


It is notoriously difficult for operators to determine how to react to our server logs, so I'm wondering if there's something else we can do here:

Can we log differently if job is nil instead of using a sentinel value? I think that would improve clarity of that case considerably.

Can we lower the log level to debug? I'm unsure what use this log line is outside of development. If an invariant has failed perhaps we need to be more aggressive in our wording?

If there's anything an operator can and should do to remediate this, let's explicitly call it out.

I'll downgrade to debug.

schmichael · 2020-08-21T18:44:04Z

scheduler/generic_sched.go

+					s.logger.Warn("failed to find appropriate job; using the latest", "expected_version", missing.MinJobVersion, "found_version", jobVersion)
+				}
+			}
+
 			// Check if this task group has already failed
 			if metric, ok := s.failedTGAllocs[tg.Name]; ok {
 				metric.CoalescedFailures += 1


Do we need to restore the stack's original Job here?

It needs to happen below, after a placement is made - particularly after s.selectNextOption is called. Will update the comment.

schmichael · 2020-08-21T18:50:43Z

scheduler/generic_sched.go

@@ -489,6 +541,11 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
 			// Compute top K scoring node metadata
 			s.ctx.Metrics().PopulateScoreMetaData()

+			// restore stack to use the latest job version again


Why are we doing this here? Might be worth including the reasoning in the comment as the actual code behavior is fairly obvious and trivial.

schmichael · 2020-08-21T19:04:22Z

nomad/structs/structs.go

+// To save space, it clears the Job field so it can be derived from the plan Job.
+// If keepJob is true, the normalizatin skipped to accommodate cases where a plan
+// needs to support multiple versions of the same job.
+func (p *Plan) AppendAlloc(alloc *Allocation, keepJob bool) {


Perhaps this should accept the job as an argument to simplify the one place it gets set to a non-nil value? It would simplify a couple of checks and be as readable, if not a tiny bit more.

Just an idea.
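A sketch of what the suggested signature might look like, using stand-in types with only the fields needed here. The later commit "Have Plan.AppendAlloc accept the job" indicates the idea was adopted, but the body below is an illustration under that assumption and not a copy of the merged code.

```go
package main

import "fmt"

type Job struct{ Version uint64 }

type Allocation struct {
	Name   string
	NodeID string
	Job    *Job
}

type Plan struct {
	Job            *Job
	NodeAllocation map[string][]*Allocation
}

// AppendAlloc records alloc under its node. Passing a nil job clears
// alloc.Job so it can be derived from the plan's job later; passing a job
// pins the alloc to that version.
func (p *Plan) AppendAlloc(alloc *Allocation, job *Job) {
	alloc.Job = job
	p.NodeAllocation[alloc.NodeID] = append(p.NodeAllocation[alloc.NodeID], alloc)
}

func main() {
	v0, v1 := &Job{Version: 0}, &Job{Version: 1}
	plan := &Plan{Job: v1, NodeAllocation: map[string][]*Allocation{}}

	plan.AppendAlloc(&Allocation{Name: "downgraded", NodeID: "n1", Job: v1}, v0)
	plan.AppendAlloc(&Allocation{Name: "latest", NodeID: "n1", Job: v1}, nil)

	for _, a := range plan.NodeAllocation["n1"] {
		fmt.Printf("%s carries its own job: %t\n", a.Name, a.Job != nil)
	}
}
```

Compared with a keepJob flag, the caller states which job the alloc should carry instead of setting alloc.Job beforehand and asking AppendAlloc to leave it alone.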

notnoop · 2020-08-21T22:49:28Z

Does this fix #8439? If so can we make the steps in that issue (or similar) into an e2e test? We didn't have the e2e infrastructure around when deployments were written, so it would be nice to backfill.

Yes! Will add an e2e in a follow up PR.

cgbaker

some comments, but i didn't see any problems with this fix.

i did not perform any manual testing.

cgbaker · 2020-08-21T21:06:38Z

scheduler/reconcile.go

 	strategy := tg.Update
 	canariesPromoted := dstate != nil && dstate.Promoted
-	requireCanary := numDestructive != 0 && strategy != nil && len(canaries) < strategy.Canary && !canariesPromoted
+	requireCanary := (len(destructive) != 0 || (len(untainted) == 0 && len(migrate)+len(lost) != 0)) &&
+		strategy != nil && len(canaries) < strategy.Canary && !canariesPromoted


it might be nice to break this conditional up a bit, and capture some of what's going on here.

cgbaker · 2020-08-21T23:44:07Z

scheduler/generic_sched.go

+				}
+
+				// Defensive check - if there is no appropriate deployment for this job, use the latest
+				if job != nil && job.Version >= missing.MinJobVersion() && job.LookupTaskGroup(tg.Name) != nil {


is this unexpected? for jobs without an update stanza, there won't be deployments, so downgradedJobForPlacement will return nil. (in that case, the latest job is exactly what we want.)

cgbaker · 2020-08-21T23:45:07Z

scheduler/generic_sched.go

+					if job != nil {
+						jobVersion = int(job.Version)
+					}
+					s.logger.Warn("failed to find appropriate job; using the latest", "expected_version", missing.MinJobVersion, "found_version", jobVersion)


not sure if this deserves a warning when job.Update.MaxParallel == 0 (i.e., no deployments)

if MaxParallel == 0, it will be canonicalized to job.Update = nil, and downgrading will not be relevant, and this path isn't executed.

cgbaker · 2020-08-21T23:50:06Z

scheduler/generic_sched.go

@@ -489,6 +541,11 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
 			// Compute top K scoring node metadata
 			s.ctx.Metrics().PopulateScoreMetaData()

+			// restore stack to use the latest job version again


this makes me a little uncomfortable. maybe it feels a little fragile, to swap out the job and then have to un-swap it later?

i'm not sure i have a constructive criticism here; maybe, if it's not too expensive, we should drop the conditional and always restore.
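One possible shape for the "always restore" idea is to wrap the swap in a small helper so that the set and the restore cannot drift apart. The `stack` type and the `selectFor` helper below are hypothetical and written only for this illustration; the real scheduler stack does far more than remember a job. The restore runs after the selection callback returns, matching the earlier point that the original job can only be put back once a node has been selected.

```go
package main

import "fmt"

type Job struct{ Version uint64 }

// stack is a hypothetical stand-in for the scheduler stack; it only remembers
// which job it currently ranks nodes for.
type stack struct{ job *Job }

func (s *stack) SetJob(j *Job) { s.job = j }

// selectFor points the stack at placementJob for one selection and always
// restores latest afterwards, whether or not the two differ.
func selectFor(s *stack, placementJob, latest *Job, selectNode func() string) string {
	s.SetJob(placementJob)
	defer s.SetJob(latest)
	return selectNode()
}

func main() {
	v0, v1 := &Job{Version: 0}, &Job{Version: 1}
	s := &stack{job: v1}

	result := selectFor(s, v0, v1, func() string {
		return fmt.Sprintf("node chosen with job version %d", s.job.Version)
	})
	fmt.Println(result)
	fmt.Println("stack restored to version", s.job.Version)
}
```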

cgbaker · 2020-08-21T23:53:31Z

scheduler/generic_sched.go

@@ -489,6 +541,11 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
 			// Compute top K scoring node metadata
 			s.ctx.Metrics().PopulateScoreMetaData()

+			// restore stack to use the latest job version again
+			if downgradedJob != nil {


Suggested change

-			if downgradedJob != nil {
+			if *s.stack.jobVersion != s.job.Version {

maybe this is better?

cgbaker · 2020-08-21T23:57:22Z

Great work. I'm tempted to ask for refactoring the inner loop of computePlacements to make local variables and SetJob(...) state easier to follow, but I'm not sure it'd help readability.

i didn't see @schmichael's review before finishing mine, i had some of the same concerns, which i address in my review.

github-actions · 2022-12-20T02:16:26Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

notnoop requested review from schmichael and cgbaker August 19, 2020 14:17

schmichael reviewed Aug 21, 2020

schmichael approved these changes Aug 21, 2020

cgbaker approved these changes Aug 21, 2020

Mahmood Ali and others added 4 commits August 25, 2020 17:22

Have Plan.AppendAlloc accept the job

cb038b1

tweak stack job manipulation

92bb372

To address review comments

simplify canary check

3a28b85

`(alloc.DeploymentStatus == nil || !alloc.DeploymentStatus.IsCanary())` and `!alloc.DeploymentStatus.IsCanary()` are equivalent.

Update scheduler/reconcile.go

f075bcc

Co-authored-by: Chris Baker <1675087+cgbaker@users.noreply.github.com>

notnoop force-pushed the b-reschedule-job-versions branch from 26dc560 to f075bcc Compare August 25, 2020 21:37

notnoop merged commit 1afd415 into master Aug 25, 2020

notnoop deleted the b-reschedule-job-versions branch August 25, 2020 22:02

github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022
