# Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099)
Changes from all commits: ec96413, 321a2cc, 1c9c768, bc0cb1b, 9082e16, eaec537
```diff
@@ -87,6 +87,8 @@ type GenericScheduler struct {
 	ctx   *EvalContext
 	stack *GenericStack

+	// followUpEvals are evals with WaitUntil set, which are delayed until that time
+	// before being rescheduled
 	followUpEvals []*structs.Evaluation

 	deployment *structs.Deployment
```
```diff
@@ -258,11 +260,13 @@ func (s *GenericScheduler) process() (bool, error) {

 	// If there are failed allocations, we need to create a blocked evaluation
 	// to place the failed allocations when resources become available. If the
-	// current evaluation is already a blocked eval, we reuse it by submitting
-	// a new eval to the planner in createBlockedEval. If the current eval is
-	// pending with WaitUntil set, it's delayed rather than blocked.
+	// current evaluation is already a blocked eval, we reuse it. If not, submit
+	// a new eval to the planner in createBlockedEval. If rescheduling should
+	// be delayed, do that instead.
```
**Review comment on lines +264 to +265:** I assume the delay clause is only relevant for the new evals? If the current evaluation is reused, its delay value will not change. Is that correct?
```diff
+	delayInstead := len(s.followUpEvals) > 0 && s.eval.WaitUntil.IsZero()
+
 	if s.eval.Status != structs.EvalStatusBlocked && len(s.failedTGAllocs) != 0 && s.blocked == nil &&
-		s.eval.WaitUntil.IsZero() {
+		!delayInstead {
 		if err := s.createBlockedEval(false); err != nil {
 			s.logger.Error("failed to make blocked eval", "error", err)
 			return false, err
```
```diff
@@ -276,8 +280,9 @@ func (s *GenericScheduler) process() (bool, error) {
 		return true, nil
 	}

-	// Create follow up evals for any delayed reschedule eligible allocations
-	if len(s.followUpEvals) > 0 {
+	// Create follow up evals for any delayed reschedule eligible allocations, except in
+	// the case that this evaluation was already delayed.
+	if delayInstead {
```
**Review comment:** I'm rusty here - do we ever have a case where delayed reschedule eligible evals result in more follow-ups? Like a delayed reschedule eval is created, but then on its processing attempt the cluster is full, and one more blocking eval is created. In such a case, would we factor in whether `s.eval.WaitUntil` has passed, not just that it's zero?
```diff
 		for _, eval := range s.followUpEvals {
 			eval.PreviousEval = s.eval.ID
 			// TODO(preetha) this should be batching evals before inserting them
```
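The reviewer's question above suggests distinguishing a delay that hasn't elapsed yet from one that has, rather than only testing whether `WaitUntil` is zero. A minimal standalone sketch of that idea — `alreadyDelayed` is a hypothetical helper, and `eval` is a stand-in for the one relevant field of `structs.Evaluation`, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// eval stands in for the fields of structs.Evaluation that matter here.
type eval struct {
	WaitUntil time.Time
}

// alreadyDelayed is a hypothetical variant of the PR's delayInstead check:
// instead of only asking whether WaitUntil is zero, it also asks whether the
// delay has actually elapsed, so a re-processed follow-up eval whose
// WaitUntil is in the past would no longer count as delayed.
func alreadyDelayed(e *eval, now time.Time) bool {
	return !e.WaitUntil.IsZero() && now.Before(e.WaitUntil)
}

func main() {
	now := time.Now()
	pending := &eval{WaitUntil: now.Add(30 * time.Second)}
	due := &eval{WaitUntil: now.Add(-30 * time.Second)}
	fresh := &eval{} // zero WaitUntil: never delayed

	fmt.Println(alreadyDelayed(pending, now)) // true: still waiting
	fmt.Println(alreadyDelayed(due, now))     // false: delay has elapsed
	fmt.Println(alreadyDelayed(fresh, now))   // false: no delay set
}
```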
```diff
@@ -355,18 +355,12 @@ func (a *allocReconciler) computeGroup(group string, all allocSet) bool {

 	// Find delays for any lost allocs that have stop_after_client_disconnect
 	lostLater := lost.delayByStopAfterClientDisconnect()
-	rescheduleLater = append(rescheduleLater, lostLater...)
+	a.handleDelayedLost(lostLater, all, tg.Name)

 	// Create batched follow up evaluations for allocations that are
 	// reschedulable later and mark the allocations for in place updating
 	a.handleDelayedReschedules(rescheduleLater, all, tg.Name)

-	// Allocs that are lost and delayed have an attributeUpdate that correctly links to
-	// the eval, but incorrectly has the current (running) status
-	for _, d := range lostLater {
-		a.result.attributeUpdates[d.allocID].SetStop(structs.AllocClientStatusLost, structs.AllocClientStatusLost)
-	}
-
```
**Review comment on lines -364 to -368:** Context for why this is needed? Should this logic be moved to …?

**Reply:** Yes, status is handled sort of twice in the planner: the alloc needs to be marked with the correct status but also sent to the planner in the …

I've been using this graph to follow the code: https://github.com/langmartin/nomad-dev/blob/master/doc/delayed-reschedules.svg. That's a hand-drawn graph, so it may have errors. Red is control flow and green is data flow.
```diff
 	// Create a structure for choosing names. Seed with the taken names which is
 	// the union of untainted and migrating nodes (includes canaries)
 	nameIndex := newAllocNameIndex(a.jobID, group, tg.Count, untainted.union(migrate, rescheduleNow))
```
```diff
@@ -423,7 +417,7 @@ func (a *allocReconciler) computeGroup(group string, all allocSet) bool {
 	// * The deployment is not paused or failed
 	// * Not placing any canaries
 	// * If there are any canaries that they have been promoted
-	// * There is no delayed stop_after_client_disconnect alloc
+	// * There is no delayed stop_after_client_disconnect alloc, which delays scheduling for the whole group
 	var place []allocPlaceResult
 	if len(lostLater) == 0 {
 		place = a.computePlacements(tg, nameIndex, untainted, migrate, rescheduleNow)
```
```diff
@@ -845,6 +839,17 @@ func (a *allocReconciler) computeUpdates(group *structs.TaskGroup, untainted all
 // handleDelayedReschedules creates batched followup evaluations with the WaitUntil field set
 // for allocations that are eligible to be rescheduled later
 func (a *allocReconciler) handleDelayedReschedules(rescheduleLater []*delayedRescheduleInfo, all allocSet, tgName string) {
+	a.handleDelayedReschedulesImpl(rescheduleLater, all, tgName, true)
+}
+
+// handleDelayedLost creates batched followup evaluations with the WaitUntil field set for lost allocations
+func (a *allocReconciler) handleDelayedLost(rescheduleLater []*delayedRescheduleInfo, all allocSet, tgName string) {
+	a.handleDelayedReschedulesImpl(rescheduleLater, all, tgName, false)
+}
+
+// handleDelayedReschedulesImpl creates batched followup evaluations with the WaitUntil field set
+func (a *allocReconciler) handleDelayedReschedulesImpl(rescheduleLater []*delayedRescheduleInfo, all allocSet, tgName string,
+	createUpdates bool) {
 	if len(rescheduleLater) == 0 {
 		return
 	}
```
```diff
@@ -905,10 +910,12 @@ func (a *allocReconciler) handleDelayedReschedules(rescheduleLater []*delayedRes
 	}

 	// Create in-place updates for every alloc ID that needs to be updated with its follow up eval ID
-	for allocID, evalID := range allocIDToFollowupEvalID {
-		existingAlloc := all[allocID]
-		updatedAlloc := existingAlloc.Copy()
-		updatedAlloc.FollowupEvalID = evalID
-		a.result.attributeUpdates[updatedAlloc.ID] = updatedAlloc
+	if createUpdates {
+		for allocID, evalID := range allocIDToFollowupEvalID {
+			existingAlloc := all[allocID]
+			updatedAlloc := existingAlloc.Copy()
+			updatedAlloc.FollowupEvalID = evalID
+			a.result.attributeUpdates[updatedAlloc.ID] = updatedAlloc
+		}
 	}
 }
```
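Putting the two hunks above together, a condensed sketch of the wrapper/impl split the diff introduces. All types here are simplified stand-ins (the real functions also batch the follow-up evals themselves); the point is only that the `createUpdates` flag lets lost allocs skip the in-place update step, since the PR removes the status patch-up loop that used to follow:

```go
package main

import "fmt"

// rescheduleInfo is a stand-in for delayedRescheduleInfo.
type rescheduleInfo struct{ allocID, evalID string }

type reconciler struct {
	attributeUpdates map[string]string // allocID -> followup eval ID, simplified
}

// handleDelayedReschedules: reschedule-later allocs also get in-place updates.
func (r *reconciler) handleDelayedReschedules(later []rescheduleInfo) {
	r.handleDelayedReschedulesImpl(later, true)
}

// handleDelayedLost: lost allocs get follow-up evals but no in-place updates.
func (r *reconciler) handleDelayedLost(later []rescheduleInfo) {
	r.handleDelayedReschedulesImpl(later, false)
}

func (r *reconciler) handleDelayedReschedulesImpl(later []rescheduleInfo, createUpdates bool) {
	if len(later) == 0 {
		return
	}
	// ... the batched follow-up evals with WaitUntil would be created here ...
	if createUpdates {
		for _, info := range later {
			r.attributeUpdates[info.allocID] = info.evalID
		}
	}
}

func main() {
	r := &reconciler{attributeUpdates: map[string]string{}}
	r.handleDelayedReschedules([]rescheduleInfo{{"alloc-1", "eval-1"}})
	r.handleDelayedLost([]rescheduleInfo{{"alloc-2", "eval-2"}})
	fmt.Println(r.attributeUpdates) // only alloc-1 gets an in-place update
}
```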
**Review comment:** `rescheduled` sounds unclear to me - I believe it means the scheduler re-processes in this context, not necessarily that these evals are for scheduled allocations due to client loss/drain/etc.
**Reply:** They're submitted to the worker via RPC, which goes through the `eval_endpoint`, raft, the fsm, state_store, and then `evalbroker.processEnqueue`, where it gets `delayHeap.Push`ed. `evalbroker.runDelayedEvalsWatcher` checks the head of the delay heap, and waits until the first eval is due to add it to the regular eval queue. `worker.run` gets it from the channel and creates a new scheduler to process it then.

The `followupEvals` are only used for these delays, which hold up all of the reschedule processing. After they're due, they may become blocked or otherwise stopped if the job is changed.

Does that make sense? The context isn't saved; they go all the way around the eval system. On client loss or drain, the node drain eval creates the plan that changes all the affected allocs to `lost`. If the `reschedule` rules don't prevent it, replacement allocs will also be in that plan request. If `reschedule` or `stop_after_client_disconnect` prevents creating an immediate replacement alloc, it's only in that case that you get a `followupEval`.
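A minimal sketch of the delay-heap flow described in that reply — a heap ordered by `WaitUntil`, a watcher that sleeps until the head is due, and a channel standing in for the regular eval queue. This is illustrative only, not Nomad's actual broker code; the real `runDelayedEvalsWatcher` also wakes on new pushes and shutdown rather than plain sleeping:

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// delayedEval is a stand-in for an evaluation with WaitUntil set.
type delayedEval struct {
	ID        string
	WaitUntil time.Time
}

// delayHeap orders evals by WaitUntil, soonest first.
type delayHeap []*delayedEval

func (h delayHeap) Len() int            { return len(h) }
func (h delayHeap) Less(i, j int) bool  { return h[i].WaitUntil.Before(h[j].WaitUntil) }
func (h delayHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *delayHeap) Push(x interface{}) { *h = append(*h, x.(*delayedEval)) }
func (h *delayHeap) Pop() interface{} {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// runDelayedWatcher waits until the head of the heap is due, then hands the
// eval to the regular queue, mirroring the runDelayedEvalsWatcher role
// described above.
func runDelayedWatcher(h *delayHeap, ready chan<- *delayedEval) {
	for h.Len() > 0 {
		head := (*h)[0]
		if wait := time.Until(head.WaitUntil); wait > 0 {
			time.Sleep(wait)
		}
		ready <- heap.Pop(h).(*delayedEval)
	}
	close(ready)
}

func main() {
	h := &delayHeap{}
	heap.Init(h)
	now := time.Now()
	heap.Push(h, &delayedEval{ID: "b", WaitUntil: now.Add(40 * time.Millisecond)})
	heap.Push(h, &delayedEval{ID: "a", WaitUntil: now.Add(10 * time.Millisecond)})

	ready := make(chan *delayedEval)
	go runDelayedWatcher(h, ready)
	for ev := range ready {
		fmt.Println("due:", ev.ID) // a worker would build a scheduler to process it here
	}
}
```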