Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection #8099

Merged 6 commits on Jun 3, 2020
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -19,6 +19,7 @@ IMPROVEMENTS:
BUG FIXES:

* api: Fixed a bug where setting connect sidecar task resources could fail [[GH-7993](https://github.com/hashicorp/nomad/issues/7993)]
+ * core: Fixed a bug where stop_after_client_disconnect could cause the server to become unresponsive [[GH-8098](https://github.com/hashicorp/nomad/issues/8098)]
* client: Fixed a bug where artifact downloads failed on redirects [[GH-7854](https://github.com/hashicorp/nomad/issues/7854)]
* csi: Validate empty volume arguments in API. [[GH-8027](https://github.com/hashicorp/nomad/issues/8027)]

2 changes: 1 addition & 1 deletion client/heartbeatstop.go
@@ -60,7 +60,7 @@ func (h *heartbeatStop) shouldStop(alloc *structs.Allocation) bool {
func (h *heartbeatStop) shouldStopAfter(now time.Time, interval time.Duration) bool {
lastOk := h.getLastOk()
if lastOk.IsZero() {
- return h.startupGrace.After(now)
+ return now.After(h.startupGrace)
}
return now.After(lastOk.Add(interval))
}
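The change above flips an inverted time comparison. Below is a minimal runnable sketch of the intended behavior; it is a free-function stand-in, not the real heartbeatStop method, and it assumes startupGrace is an absolute deadline computed from the client start time plus the grace duration.

```go
// Illustrative only: a free-function stand-in for heartbeatStop.shouldStopAfter.
// startupGrace is treated as an absolute deadline (client start time + grace duration).
package main

import (
	"fmt"
	"time"
)

func shouldStopAfter(now, lastOk, startupGrace time.Time, interval time.Duration) bool {
	if lastOk.IsZero() {
		// No heartbeat recorded yet: stop only once the grace deadline has passed.
		// The removed line asked the inverted question (startupGrace.After(now)),
		// which reported "stop" while the client was still inside its grace window.
		return now.After(startupGrace)
	}
	// Otherwise stop once the last successful heartbeat is older than the interval.
	return now.After(lastOk.Add(interval))
}

func main() {
	start := time.Now()
	grace := start.Add(30 * time.Second)
	fmt.Println(shouldStopAfter(start.Add(10*time.Second), time.Time{}, grace, time.Minute)) // false: still within grace
	fmt.Println(shouldStopAfter(start.Add(40*time.Second), time.Time{}, grace, time.Minute)) // true: grace expired, no heartbeat seen
}
```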
2 changes: 1 addition & 1 deletion nomad/plan_apply.go
@@ -474,7 +474,7 @@ func evaluatePlanPlacements(pool *EvaluatePool, snap *state.StateSnapshot, plan
if !fit {
// Log the reason why the node's allocations could not be made
if reason != "" {
logger.Debug("plan for node rejected", "node_id", nodeID, "reason", reason)
logger.Debug("plan for node rejected", "node_id", nodeID, "reason", reason, "eval_id", plan.EvalID)
}
// Set that this is a partial commit
partialCommit = true
17 changes: 11 additions & 6 deletions scheduler/generic_sched.go
@@ -87,6 +87,8 @@ type GenericScheduler struct {
ctx *EvalContext
stack *GenericStack

+ // followUpEvals are evals with WaitUntil set, which are delayed until that time
+ // before being rescheduled
Review comment (Contributor):
rescheduled sounds unclear to me - I believe it means the scheduler re-processes in this context, not necessarily that these evals are for scheduled allocations due to client loss/drain/etc.

Reply (Contributor Author):
They're submitted to the worker via RPC, which goes through the eval_endpoint, raft, the fsm, state_store, and then evalbroker.processEnqueue, where it gets delayHeap.Pushed. evalbroker.runDelayedEvalsWatcher checks the head of the delay heap, and waits until the first eval is due to add it to the regular eval queue. worker.run gets it from the channel and creates a new scheduler to process it then.

The followupEvals are only used for these delays, which hold up all of the reschedule processing. After they're due, they may become blocked or otherwise stopped if the job is changed.

Does that make sense? The context isn't saved, they go all the way around the eval system. On client loss or drain, the node drain eval creates the plan that changes all the affected allocs to lost. If the reschedule rules don't prevent it, replacement allocs will also be in that plan request. If reschedule or stop_after_client_disconnect prevent creating an immediate replacement alloc, it's only in that case you get a followupEval.
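As a rough illustration of that path (not the actual evalbroker code; the types, channel, and function names below are made up for this example), the essence of the delayed-eval watcher is "wait until the earliest WaitUntil is due, then hand the eval back to the ready queue":

```go
// Illustrative only: a toy delayed-eval watcher. Evals carry a WaitUntil time
// and are handed to the ready queue only once that time has passed. The real
// broker uses a heap and wakes on new pushes; this just sorts a small slice.
package main

import (
	"fmt"
	"sort"
	"time"
)

type delayedEval struct {
	ID        string
	WaitUntil time.Time
}

// runDelayedWatcher repeatedly waits for the earliest WaitUntil, then sends
// that eval on the ready channel, mirroring the "head of the delay heap" flow.
func runDelayedWatcher(pending []*delayedEval, ready chan<- *delayedEval) {
	for len(pending) > 0 {
		sort.Slice(pending, func(i, j int) bool {
			return pending[i].WaitUntil.Before(pending[j].WaitUntil)
		})
		next := pending[0]
		time.Sleep(time.Until(next.WaitUntil)) // block until the head eval is due
		ready <- next
		pending = pending[1:]
	}
	close(ready)
}

func main() {
	now := time.Now()
	pending := []*delayedEval{
		{ID: "eval-b", WaitUntil: now.Add(200 * time.Millisecond)},
		{ID: "eval-a", WaitUntil: now.Add(100 * time.Millisecond)},
	}
	ready := make(chan *delayedEval)
	go runDelayedWatcher(pending, ready)
	for e := range ready {
		fmt.Println("due:", e.ID) // prints eval-a, then eval-b
	}
}
```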

followUpEvals []*structs.Evaluation

deployment *structs.Deployment
@@ -258,11 +260,13 @@ func (s *GenericScheduler) process() (bool, error) {

// If there are failed allocations, we need to create a blocked evaluation
// to place the failed allocations when resources become available. If the
- // current evaluation is already a blocked eval, we reuse it by submitting
- // a new eval to the planner in createBlockedEval. If the current eval is
- // pending with WaitUntil set, it's delayed rather than blocked.
+ // current evaluation is already a blocked eval, we reuse it. If not, submit
+ // a new eval to the planner in createBlockedEval. If rescheduling should
+ // be delayed, do that instead.
Review comment (Contributor) on lines +264 to +265:
I assume the delay clause is only relevant for the new evals? If the current evaluation is reused, its delay value will not change. Is that correct?

+ delayInstead := len(s.followUpEvals) > 0 && s.eval.WaitUntil.IsZero()

if s.eval.Status != structs.EvalStatusBlocked && len(s.failedTGAllocs) != 0 && s.blocked == nil &&
- s.eval.WaitUntil.IsZero() {
+ !delayInstead {
if err := s.createBlockedEval(false); err != nil {
s.logger.Error("failed to make blocked eval", "error", err)
return false, err
@@ -276,8 +280,9 @@ func (s *GenericScheduler) process() (bool, error) {
return true, nil
}

- // Create follow up evals for any delayed reschedule eligible allocations
- if len(s.followUpEvals) > 0 {
+ // Create follow up evals for any delayed reschedule eligible allocations, except in
+ // the case that this evaluation was already delayed.
+ if delayInstead {
Review comment (Contributor):
I'm rusty here - do we ever have a case where delayed reschedule eligible evals result in more follow ups? Like a delayed reschedule eval is created, but then on its processing attempt the cluster is full, and one more blocking eval is created. In such a case would we factor in whether .eval.WaitUntil has passed, not just that it's zero?

for _, eval := range s.followUpEvals {
eval.PreviousEval = s.eval.ID
// TODO(preetha) this should be batching evals before inserting them
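Restating the new control flow in generic_sched.go as a self-contained sketch (simplified stand-ins, not the scheduler's real types or fields): a blocked eval is created only when placements failed and nothing is being delayed, and delayed follow-ups are emitted only when the current eval is not itself a delayed one, which is what prevents the extra follow-up evaluations this PR targets.

```go
// Illustrative only: minimal stand-ins for the scheduler state in process().
package main

import (
	"fmt"
	"time"
)

type eval struct {
	ID        string
	Status    string
	WaitUntil time.Time
}

// decide reports which path process() takes: "blocked" when a blocked eval
// should be created, "delay" when delayed follow-up evals should be emitted
// instead, and "none" otherwise.
func decide(current eval, failedAllocs, followUpEvals int, alreadyBlocked bool) string {
	delayInstead := followUpEvals > 0 && current.WaitUntil.IsZero()

	if current.Status != "blocked" && failedAllocs != 0 && !alreadyBlocked && !delayInstead {
		return "blocked"
	}
	if delayInstead {
		// Only an eval that is not itself delayed fans out follow-ups; a delayed
		// eval re-entering the scheduler must not create more, which is the
		// source of the extra evaluations this PR removes.
		return "delay"
	}
	return "none"
}

func main() {
	fmt.Println(decide(eval{ID: "e1"}, 2, 0, false)) // "blocked": placement failures, nothing delayed
	fmt.Println(decide(eval{ID: "e2"}, 0, 1, false)) // "delay": emit the delayed follow-up
	fmt.Println(decide(eval{ID: "e3", WaitUntil: time.Now().Add(time.Minute)}, 0, 1, false)) // "none": already a delayed eval
}
```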
4 changes: 2 additions & 2 deletions scheduler/generic_sched_test.go
@@ -2846,8 +2846,8 @@ func TestServiceSched_StopAfterClientDisconnect(t *testing.T) {
require.Equal(t, h.Evals[0].Status, structs.EvalStatusComplete)
require.Len(t, h.Plans, 1, "plan")

- // Followup eval created
- require.True(t, len(h.CreateEvals) > 0)
+ // One followup eval created, either delayed or blocked
+ require.Len(t, h.CreateEvals, 1)
e := h.CreateEvals[0]
require.Equal(t, eval.ID, e.PreviousEval)

33 changes: 20 additions & 13 deletions scheduler/reconcile.go
@@ -355,18 +355,12 @@ func (a *allocReconciler) computeGroup(group string, all allocSet) bool {

// Find delays for any lost allocs that have stop_after_client_disconnect
lostLater := lost.delayByStopAfterClientDisconnect()
- rescheduleLater = append(rescheduleLater, lostLater...)
+ a.handleDelayedLost(lostLater, all, tg.Name)

// Create batched follow up evaluations for allocations that are
// reschedulable later and mark the allocations for in place updating
a.handleDelayedReschedules(rescheduleLater, all, tg.Name)

- // Allocs that are lost and delayed have an attributeUpdate that correctly links to
- // the eval, but incorrectly has the current (running) status
- for _, d := range lostLater {
- a.result.attributeUpdates[d.allocID].SetStop(structs.AllocClientStatusLost, structs.AllocClientStatusLost)
- }
Review comment (Contributor) on lines -364 to -368:
Context for why this is needed? Should this logic be moved to handleDelayedLost?

Reply (Contributor Author):
Yes, status is handled sort of twice in the planner: the alloc needs to be marked with the correct status, but also sent to the planner in the NodeUpdate collection, not the NodeAllocation collection. attributeUpdates all get added to the NodeAllocation part of the planner, which is wrong for our purposes. There's some followup to this behavior in #8105, which clarifies how the status gets applied.

I've been using this graph to follow the code: https://github.com/langmartin/nomad-dev/blob/master/doc/delayed-reschedules.svg. That's a hand-drawn graph, so it may have errors. Red is control flow and green is data flow.
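As a rough sketch of the distinction described above (simplified types, not the actual nomad/structs definitions): allocations being stopped or marked lost belong in the plan's NodeUpdate collection with a stopped status, while attributeUpdates feed the NodeAllocation side for allocations that keep running, so a lost-but-delayed alloc left only in attributeUpdates would reach the planner with the wrong (running) status.

```go
// Illustrative only: simplified stand-ins for the plan's two collections.
package main

import "fmt"

type Allocation struct {
	ID            string
	NodeID        string
	ClientStatus  string // e.g. "running", "lost"
	DesiredStatus string // e.g. "run", "stop"
}

type Plan struct {
	NodeUpdate     map[string][]*Allocation // allocs being stopped or marked lost
	NodeAllocation map[string][]*Allocation // allocs being placed or updated in place
}

// markLost routes a lost, delayed-reschedule alloc into NodeUpdate with a
// stopped status, rather than letting it ride along as an in-place update.
func (p *Plan) markLost(alloc *Allocation) {
	stopped := *alloc
	stopped.ClientStatus = "lost"
	stopped.DesiredStatus = "stop"
	p.NodeUpdate[alloc.NodeID] = append(p.NodeUpdate[alloc.NodeID], &stopped)
}

func main() {
	p := &Plan{
		NodeUpdate:     map[string][]*Allocation{},
		NodeAllocation: map[string][]*Allocation{},
	}
	p.markLost(&Allocation{ID: "alloc-1", NodeID: "node-1", ClientStatus: "running", DesiredStatus: "run"})
	fmt.Printf("%+v\n", *p.NodeUpdate["node-1"][0])
}
```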


// Create a structure for choosing names. Seed with the taken names which is
// the union of untainted and migrating nodes (includes canaries)
nameIndex := newAllocNameIndex(a.jobID, group, tg.Count, untainted.union(migrate, rescheduleNow))
@@ -423,7 +417,7 @@ func (a *allocReconciler) computeGroup(group string, all allocSet) bool {
// * The deployment is not paused or failed
// * Not placing any canaries
// * If there are any canaries that they have been promoted
- // * There is no delayed stop_after_client_disconnect alloc
+ // * There is no delayed stop_after_client_disconnect alloc, which delays scheduling for the whole group
var place []allocPlaceResult
if len(lostLater) == 0 {
place = a.computePlacements(tg, nameIndex, untainted, migrate, rescheduleNow)
@@ -845,6 +839,17 @@ func (a *allocReconciler) computeUpdates(group *structs.TaskGroup, untainted all
// handleDelayedReschedules creates batched followup evaluations with the WaitUntil field set
// for allocations that are eligible to be rescheduled later
func (a *allocReconciler) handleDelayedReschedules(rescheduleLater []*delayedRescheduleInfo, all allocSet, tgName string) {
+ a.handleDelayedReschedulesImpl(rescheduleLater, all, tgName, true)
+ }
+
+ // handleDelayedLost creates batched followup evaluations with the WaitUntil field set for lost allocations
+ func (a *allocReconciler) handleDelayedLost(rescheduleLater []*delayedRescheduleInfo, all allocSet, tgName string) {
+ a.handleDelayedReschedulesImpl(rescheduleLater, all, tgName, false)
+ }
+
+ // handleDelayedReschedulesImpl creates batched followup evaluations with the WaitUntil field set
+ func (a *allocReconciler) handleDelayedReschedulesImpl(rescheduleLater []*delayedRescheduleInfo, all allocSet, tgName string,
+ createUpdates bool) {
if len(rescheduleLater) == 0 {
return
}
@@ -905,10 +910,12 @@ func (a *allocReconciler) handleDelayedReschedules(rescheduleLater []*delayedRes
}

// Create in-place updates for every alloc ID that needs to be updated with its follow up eval ID
- for allocID, evalID := range allocIDToFollowupEvalID {
- existingAlloc := all[allocID]
- updatedAlloc := existingAlloc.Copy()
- updatedAlloc.FollowupEvalID = evalID
- a.result.attributeUpdates[updatedAlloc.ID] = updatedAlloc
+ if createUpdates {
+ for allocID, evalID := range allocIDToFollowupEvalID {
+ existingAlloc := all[allocID]
+ updatedAlloc := existingAlloc.Copy()
+ updatedAlloc.FollowupEvalID = evalID
+ a.result.attributeUpdates[updatedAlloc.ID] = updatedAlloc
+ }
}
}