Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection #8099

Merged: 6 commits into master from b-heartyeet-evals, Jun 3, 2020

Conversation

@langmartin (Contributor) commented Jun 2, 2020

A couple of very small changes, plus the changes in generic_sched and the reconciler. The risk should be limited to the extra `delayInstead` condition check, which is also applied to evals that are delayed for the `reschedule` block.

Manual testing (e2e version coming): https://github.com/langmartin/nomad-dev/tree/master/cfg/heartyeet

Fixes #8098
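
A minimal, self-contained sketch of the `delayInstead` idea (hypothetical Go types, not the actual scheduler code): delayed follow-up evals are only created when the current eval was not itself a delayed one, so a delayed eval can't immediately spawn another delayed eval.

```go
package main

import (
	"fmt"
	"time"
)

// Eval is a stand-in for a Nomad evaluation; a non-zero WaitUntil means the
// eval itself was delayed and re-enqueued once that time passed.
type Eval struct {
	ID        string
	WaitUntil time.Time
}

// shouldDelayInstead captures the guard: only create delayed follow-up evals
// when the current eval was not itself a delayed one, so a delayed eval
// cannot immediately spawn another delayed eval.
func shouldDelayInstead(current *Eval, followUpEvals []*Eval) bool {
	return len(followUpEvals) > 0 && current.WaitUntil.IsZero()
}

func main() {
	now := time.Now()
	followUps := []*Eval{{ID: "follow-up", WaitUntil: now.Add(30 * time.Second)}}

	fresh := &Eval{ID: "fresh"}                     // a normal eval
	delayed := &Eval{ID: "delayed", WaitUntil: now} // an eval that was already delayed

	fmt.Println(shouldDelayInstead(fresh, followUps))   // true: submit the delayed follow-ups
	fmt.Println(shouldDelayInstead(delayed, followUps)) // false: avoid the delay loop
}
```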

@langmartin langmartin marked this pull request as ready for review June 2, 2020 19:58
@schmichael schmichael requested a review from notnoop June 2, 2020 21:46
@schmichael (Member) left a comment


Don't forget the changelog entry.

@langmartin langmartin merged commit 422493f into master Jun 3, 2020
@langmartin langmartin deleted the b-heartyeet-evals branch June 3, 2020 13:48
langmartin added a commit that referenced this pull request Jun 3, 2020
Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection (#8099)

* client/heartbeatstop: reversed time condition for startup grace

* scheduler/generic_sched: use `delayInstead` to avoid a loop

Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.

* scheduler/reconcile: lostLater are different than delayedReschedules

Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
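
A rough sketch of what "batched evaluations" means in the `lostLater` commit above (the types and the grouping key are illustrative assumptions, not the reconciler's actual code): group the lost-later allocations by the time their delayed handling becomes due and emit one follow-up eval per distinct time, rather than one per allocation.

```go
package main

import (
	"fmt"
	"time"
)

// delayedAlloc is an invented stand-in for a lost allocation whose handling
// is deferred until rescheduleTime.
type delayedAlloc struct {
	allocID        string
	rescheduleTime time.Time
}

// followUpEval is an invented stand-in for a delayed evaluation covering a
// batch of allocations that become due at the same time.
type followUpEval struct {
	WaitUntil time.Time
	AllocIDs  []string
}

// batchByTime emits one follow-up eval per distinct wait time rather than one
// eval per allocation.
func batchByTime(allocs []delayedAlloc) []*followUpEval {
	byTime := map[time.Time]*followUpEval{}
	var out []*followUpEval
	for _, a := range allocs {
		ev, ok := byTime[a.rescheduleTime]
		if !ok {
			ev = &followUpEval{WaitUntil: a.rescheduleTime}
			byTime[a.rescheduleTime] = ev
			out = append(out, ev)
		}
		ev.AllocIDs = append(ev.AllocIDs, a.allocID)
	}
	return out
}

func main() {
	due := time.Now().Add(time.Minute)
	evals := batchByTime([]delayedAlloc{
		{"web-1", due},
		{"web-2", due},
		{"web-3", due.Add(time.Minute)},
	})
	fmt.Println(len(evals)) // 2: web-1 and web-2 share one batched eval
}
```
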
@notnoop (Contributor) left a comment


The change makes sense - I have a couple of questions here. I'll dig into it a bit more and take it for a test drive later today to understand the implications.

Comment on lines +264 to +265
// a new eval to the planner in createBlockedEval. If rescheduling should
// be delayed, do that instead.
Contributor


I assume the delay clause is only relevant for the new evals? If the current evaluation is reused, its delay value will not change. Is that correct?

Comment on lines -364 to -368
// Allocs that are lost and delayed have an attributeUpdate that correctly links to
// the eval, but incorrectly has the current (running) status
for _, d := range lostLater {
a.result.attributeUpdates[d.allocID].SetStop(structs.AllocClientStatusLost, structs.AllocClientStatusLost)
}
Contributor


Context for why this is needed? Should this logic be moved to `handleDelayedLost`?

@langmartin (Contributor, Author)


Yes, status is handled in two places in the planner: the alloc needs to be marked with the correct status, but it also needs to be sent to the planner in the NodeUpdate collection, not the NodeAllocation collection. attributeUpdates all get added to the NodeAllocation part of the planner, which is wrong for our purposes. There's some follow-up to this behavior in #8105, which clarifies how the status gets applied.
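
A toy illustration of those two collections (invented types; Nomad's real plan structs differ): an alloc stopped through the attribute-update path travels in NodeAllocation, where the planner still counts it as placed, while the fix wants lost allocs to travel in NodeUpdate.

```go
package main

import "fmt"

// alloc and plan are toy types; Nomad's real plan structs differ.
type alloc struct {
	ID           string
	ClientStatus string
}

type plan struct {
	// NodeUpdate holds allocations being stopped or marked lost on a node.
	NodeUpdate map[string][]*alloc
	// NodeAllocation holds allocations being placed or updated in place; the
	// planner counts these as running capacity on the node.
	NodeAllocation map[string][]*alloc
}

func (p *plan) appendStopped(node string, a *alloc) {
	a.ClientStatus = "lost"
	p.NodeUpdate[node] = append(p.NodeUpdate[node], a)
}

func (p *plan) appendInPlaceUpdate(node string, a *alloc) {
	p.NodeAllocation[node] = append(p.NodeAllocation[node], a)
}

func main() {
	p := &plan{NodeUpdate: map[string][]*alloc{}, NodeAllocation: map[string][]*alloc{}}

	// Roughly what the removed attributeUpdates path amounted to: the alloc
	// carries a lost status but still travels in the NodeAllocation collection,
	// so the planner treats it as placed.
	p.appendInPlaceUpdate("node1", &alloc{ID: "a1", ClientStatus: "lost"})

	// What the fix wants instead: the lost alloc goes through NodeUpdate.
	p.appendStopped("node1", &alloc{ID: "a2"})

	fmt.Println(len(p.NodeAllocation["node1"]), len(p.NodeUpdate["node1"])) // 1 1
}
```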

I've been using this graph to follow the code: https://github.com/langmartin/nomad-dev/blob/master/doc/delayed-reschedules.svg. That's a hand-drawn graph, so it may have errors; red is control flow and green is data flow.

if len(s.followUpEvals) > 0 {
// Create follow up evals for any delayed reschedule eligible allocations, except in
// the case that this evaluation was already delayed.
if delayInstead {
Contributor


I'm rusty here - do we ever have a case where delayed-reschedule-eligible evals result in more follow-ups? For example, a delayed reschedule eval is created, but then on its processing attempt the cluster is full, and one more blocking eval is created. In such a case, would we factor in whether `.eval.WaitUntil` has passed, not just that it's zero?

@@ -87,6 +87,8 @@ type GenericScheduler struct {
ctx *EvalContext
stack *GenericStack

// followUpEvals are evals with WaitUntil set, which are delayed until that time
// before being rescheduled
Contributor


"rescheduled" sounds unclear to me - I believe in this context it means the scheduler re-processes the eval, not necessarily that these evals are for scheduled allocations due to client loss/drain/etc.

@langmartin (Contributor, Author)


They're submitted to the worker via RPC, which goes through the eval_endpoint, raft, the fsm, the state_store, and then evalbroker.processEnqueue, where the eval gets pushed onto the delayHeap. evalbroker.runDelayedEvalsWatcher checks the head of the delay heap and waits until the first eval is due before adding it to the regular eval queue. worker.run gets it from the channel and then creates a new scheduler to process it.
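
A stripped-down sketch of that watcher behavior, using a sorted slice and a sleep in place of the broker's actual delayHeap and resettable timer:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// eval is an invented stand-in for a delayed evaluation.
type eval struct {
	ID        string
	WaitUntil time.Time
}

// runDelayedWatcher processes delayed evals in WaitUntil order, sleeping until
// each one is due, then hands it to enqueue (standing in for the regular eval
// queue the worker pulls from).
func runDelayedWatcher(delayed []eval, enqueue func(eval)) {
	sort.Slice(delayed, func(i, j int) bool {
		return delayed[i].WaitUntil.Before(delayed[j].WaitUntil)
	})
	for _, e := range delayed {
		if d := time.Until(e.WaitUntil); d > 0 {
			// The real broker uses a timer it can reset when a new heap head arrives.
			time.Sleep(d)
		}
		enqueue(e)
	}
}

func main() {
	now := time.Now()
	runDelayedWatcher([]eval{
		{ID: "later", WaitUntil: now.Add(200 * time.Millisecond)},
		{ID: "soon", WaitUntil: now.Add(50 * time.Millisecond)},
	}, func(e eval) { fmt.Println("enqueue", e.ID) })
}
```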

The followupEvals are only used for these delays, which hold up all of the reschedule processing. After they're due, they may become blocked or otherwise stopped if the job is changed.

Does that make sense? The context isn't saved; they go all the way around the eval system. On client loss or drain, the node drain eval creates the plan that changes all the affected allocs to lost. If the reschedule rules don't prevent it, replacement allocs will also be in that plan request. Only if reschedule or stop_after_client_disconnect prevents creating an immediate replacement alloc do you get a followupEval.
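
A schematic sketch of that decision (an invented helper, not the reconciler's API): a lost alloc either gets an immediate replacement in the same plan, or, when reschedule or stop_after_client_disconnect imposes a delay, only a follow-up eval for later.

```go
package main

import (
	"fmt"
	"time"
)

// lostAlloc is an invented stand-in for an allocation on a lost or drained node;
// rescheduleDelay is zero when the job allows an immediate replacement.
type lostAlloc struct {
	ID              string
	rescheduleDelay time.Duration
}

// outcome separates immediate replacements from the delayed follow-up evals
// that stand in for them until their wait time passes.
type outcome struct {
	replaceNow     []string
	followUpEvalAt []time.Time
}

func reconcileLost(now time.Time, allocs []lostAlloc) outcome {
	var out outcome
	for _, a := range allocs {
		if a.rescheduleDelay == 0 {
			// No delay configured: the replacement goes into the same plan
			// that marks the original alloc lost.
			out.replaceNow = append(out.replaceNow, a.ID)
			continue
		}
		// A delay from reschedule or stop_after_client_disconnect: create a
		// follow-up eval instead of an immediate placement.
		out.followUpEvalAt = append(out.followUpEvalAt, now.Add(a.rescheduleDelay))
	}
	return out
}

func main() {
	now := time.Now()
	res := reconcileLost(now, []lostAlloc{
		{ID: "web-1"},
		{ID: "web-2", rescheduleDelay: 2 * time.Minute},
	})
	fmt.Println(res.replaceNow, len(res.followUpEvalAt)) // [web-1] 1
}
```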

@github-actions bot commented Jan 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2023