Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection #8099

Merged: 6 commits into master from b-heartyeet-evals, Jun 3, 2020

Conversation

@langmartin (Contributor) commented Jun 2, 2020

A couple of very small changes, plus the changes in generic_sched and the reconciler. The risk should be limited to the extra `delayInstead` condition check, which is also applied to evals that are delayed for the `reschedule` block.

Manual testing (e2e version coming): https://github.com/langmartin/nomad-dev/tree/master/cfg/heartyeet

Fixes #8098
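
A minimal, self-contained sketch of the `delayInstead` idea (hypothetical Go types, not the actual scheduler code): delayed follow-up evals are only created when the current eval was not itself a delayed one, so a delayed eval can't immediately spawn another delayed eval.

```go
package main

import (
	"fmt"
	"time"
)

// Eval is a stand-in for a Nomad evaluation; a non-zero WaitUntil means the
// eval itself was delayed and re-enqueued once that time passed.
type Eval struct {
	ID        string
	WaitUntil time.Time
}

// shouldDelayInstead captures the guard: only create delayed follow-up evals
// when the current eval was not itself a delayed one, so a delayed eval
// cannot immediately spawn another delayed eval.
func shouldDelayInstead(current *Eval, followUpEvals []*Eval) bool {
	return len(followUpEvals) > 0 && current.WaitUntil.IsZero()
}

func main() {
	now := time.Now()
	followUps := []*Eval{{ID: "follow-up", WaitUntil: now.Add(30 * time.Second)}}

	fresh := &Eval{ID: "fresh"}                     // a normal eval
	delayed := &Eval{ID: "delayed", WaitUntil: now} // an eval that was already delayed

	fmt.Println(shouldDelayInstead(fresh, followUps))   // true: submit the delayed follow-ups
	fmt.Println(shouldDelayInstead(delayed, followUps)) // false: avoid the delay loop
}
```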

@langmartin langmartin marked this pull request as ready for review June 2, 2020 19:58
@schmichael schmichael requested a review from notnoop June 2, 2020 21:46
@schmichael (Member) left a comment


Don't forget the changelog entry.

@langmartin langmartin merged commit 422493f into master Jun 3, 2020
@langmartin langmartin deleted the b-heartyeet-evals branch June 3, 2020 13:48
langmartin added a commit that referenced this pull request Jun 3, 2020
Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection (#8099)

* client/heartbeatstop: reversed time condition for startup grace

* scheduler/generic_sched: use `delayInstead` to avoid a loop

Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.

* scheduler/reconcile: lostLater are different than delayedReschedules

Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
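
A rough sketch of what "batched evaluations" means in the `lostLater` commit above (the types and the grouping key are illustrative assumptions, not the reconciler's actual code): group the lost-later allocations by the time their delayed handling becomes due and emit one follow-up eval per distinct time, rather than one per allocation.

```go
package main

import (
	"fmt"
	"time"
)

// delayedAlloc is an invented stand-in for a lost allocation whose handling
// is deferred until rescheduleTime.
type delayedAlloc struct {
	allocID        string
	rescheduleTime time.Time
}

// followUpEval is an invented stand-in for a delayed evaluation covering a
// batch of allocations that become due at the same time.
type followUpEval struct {
	WaitUntil time.Time
	AllocIDs  []string
}

// batchByTime emits one follow-up eval per distinct wait time rather than one
// eval per allocation.
func batchByTime(allocs []delayedAlloc) []*followUpEval {
	byTime := map[time.Time]*followUpEval{}
	var out []*followUpEval
	for _, a := range allocs {
		ev, ok := byTime[a.rescheduleTime]
		if !ok {
			ev = &followUpEval{WaitUntil: a.rescheduleTime}
			byTime[a.rescheduleTime] = ev
			out = append(out, ev)
		}
		ev.AllocIDs = append(ev.AllocIDs, a.allocID)
	}
	return out
}

func main() {
	due := time.Now().Add(time.Minute)
	evals := batchByTime([]delayedAlloc{
		{"web-1", due},
		{"web-2", due},
		{"web-3", due.Add(time.Minute)},
	})
	fmt.Println(len(evals)) // 2: web-1 and web-2 share one batched eval
}
```
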
@notnoop (Contributor) left a comment


The change makes sense - I have a couple of questions here. I'll dig into it a bit more and take it for a test drive later today to understand the implications.

Comment on lines +264 to +265
// a new eval to the planner in createBlockedEval. If rescheduling should
// be delayed, do that instead.
Contributor


I assume the delay clause is only relevant for the new evals? If the current evaluation is reused, its delay value will not change. Is that correct?

Comment on lines -364 to -368
// Allocs that are lost and delayed have an attributeUpdate that correctly links to
// the eval, but incorrectly has the current (running) status
for _, d := range lostLater {
a.result.attributeUpdates[d.allocID].SetStop(structs.AllocClientStatusLost, structs.AllocClientStatusLost)
}
Contributor


Context for why this is needed? Should this logic be moved to `handleDelayedLost`?

@langmartin (Contributor, Author)


Yes, status is handled in two places in the planner: the alloc needs to be marked with the correct status, but it also needs to be sent to the planner in the NodeUpdate collection, not the NodeAllocation collection. attributeUpdates all get added to the NodeAllocation part of the planner, which is wrong for our purposes. There's some follow-up to this behavior in #8105, which clarifies how the status gets applied.
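
A toy illustration of those two collections (invented types; Nomad's real plan structs differ): an alloc stopped through the attribute-update path travels in NodeAllocation, where the planner still counts it as placed, while the fix wants lost allocs to travel in NodeUpdate.

```go
package main

import "fmt"

// alloc and plan are toy types; Nomad's real plan structs differ.
type alloc struct {
	ID           string
	ClientStatus string
}

type plan struct {
	// NodeUpdate holds allocations being stopped or marked lost on a node.
	NodeUpdate map[string][]*alloc
	// NodeAllocation holds allocations being placed or updated in place; the
	// planner counts these as running capacity on the node.
	NodeAllocation map[string][]*alloc
}

func (p *plan) appendStopped(node string, a *alloc) {
	a.ClientStatus = "lost"
	p.NodeUpdate[node] = append(p.NodeUpdate[node], a)
}

func (p *plan) appendInPlaceUpdate(node string, a *alloc) {
	p.NodeAllocation[node] = append(p.NodeAllocation[node], a)
}

func main() {
	p := &plan{NodeUpdate: map[string][]*alloc{}, NodeAllocation: map[string][]*alloc{}}

	// Roughly what the removed attributeUpdates path amounted to: the alloc
	// carries a lost status but still travels in the NodeAllocation collection,
	// so the planner treats it as placed.
	p.appendInPlaceUpdate("node1", &alloc{ID: "a1", ClientStatus: "lost"})

	// What the fix wants instead: the lost alloc goes through NodeUpdate.
	p.appendStopped("node1", &alloc{ID: "a2"})

	fmt.Println(len(p.NodeAllocation["node1"]), len(p.NodeUpdate["node1"])) // 1 1
}
```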

I've been using this graph to follow the code: https://github.com/langmartin/nomad-dev/blob/master/doc/delayed-reschedules.svg. That's a hand-drawn graph, so it may have errors; red is control flow and green is data flow.

if len(s.followUpEvals) > 0 {
// Create follow up evals for any delayed reschedule eligible allocations, except in
// the case that this evaluation was already delayed.
if delayInstead {
Contributor


I'm rusty here - do we ever have a case where delayed-reschedule-eligible evals result in more follow-ups? For example, a delayed reschedule eval is created, but then on its processing attempt the cluster is full, and one more blocking eval is created. In such a case, would we factor in whether `.eval.WaitUntil` has passed, not just that it's zero?

@@ -87,6 +87,8 @@ type GenericScheduler struct {
ctx *EvalContext
stack *GenericStack

// followUpEvals are evals with WaitUntil set, which are delayed until that time
// before being rescheduled
Contributor


"rescheduled" sounds unclear to me - I believe in this context it means the scheduler re-processes the eval, not necessarily that these evals are for scheduled allocations due to client loss/drain/etc.

@langmartin (Contributor, Author)


They're submitted to the worker via RPC, which goes through the eval_endpoint, raft, the fsm, the state_store, and then evalbroker.processEnqueue, where the eval gets pushed onto the delayHeap. evalbroker.runDelayedEvalsWatcher checks the head of the delay heap and waits until the first eval is due before adding it to the regular eval queue. worker.run gets it from the channel and then creates a new scheduler to process it.
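
A stripped-down sketch of that watcher behavior, using a sorted slice and a sleep in place of the broker's actual delayHeap and resettable timer:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// eval is an invented stand-in for a delayed evaluation.
type eval struct {
	ID        string
	WaitUntil time.Time
}

// runDelayedWatcher processes delayed evals in WaitUntil order, sleeping until
// each one is due, then hands it to enqueue (standing in for the regular eval
// queue the worker pulls from).
func runDelayedWatcher(delayed []eval, enqueue func(eval)) {
	sort.Slice(delayed, func(i, j int) bool {
		return delayed[i].WaitUntil.Before(delayed[j].WaitUntil)
	})
	for _, e := range delayed {
		if d := time.Until(e.WaitUntil); d > 0 {
			// The real broker uses a timer it can reset when a new heap head arrives.
			time.Sleep(d)
		}
		enqueue(e)
	}
}

func main() {
	now := time.Now()
	runDelayedWatcher([]eval{
		{ID: "later", WaitUntil: now.Add(200 * time.Millisecond)},
		{ID: "soon", WaitUntil: now.Add(50 * time.Millisecond)},
	}, func(e eval) { fmt.Println("enqueue", e.ID) })
}
```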

The followupEvals are only used for these delays, which hold up all of the reschedule processing. After they're due, they may become blocked or otherwise stopped if the job is changed.

Does that make sense? The context isn't saved; they go all the way around the eval system. On client loss or drain, the node drain eval creates the plan that changes all the affected allocs to lost. If the reschedule rules don't prevent it, replacement allocs will also be in that plan request. Only if reschedule or stop_after_client_disconnect prevents creating an immediate replacement alloc do you get a followupEval.
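
A schematic sketch of that decision (an invented helper, not the reconciler's API): a lost alloc either gets an immediate replacement in the same plan, or, when reschedule or stop_after_client_disconnect imposes a delay, only a follow-up eval for later.

```go
package main

import (
	"fmt"
	"time"
)

// lostAlloc is an invented stand-in for an allocation on a lost or drained node;
// rescheduleDelay is zero when the job allows an immediate replacement.
type lostAlloc struct {
	ID              string
	rescheduleDelay time.Duration
}

// outcome separates immediate replacements from the delayed follow-up evals
// that stand in for them until their wait time passes.
type outcome struct {
	replaceNow     []string
	followUpEvalAt []time.Time
}

func reconcileLost(now time.Time, allocs []lostAlloc) outcome {
	var out outcome
	for _, a := range allocs {
		if a.rescheduleDelay == 0 {
			// No delay configured: the replacement goes into the same plan
			// that marks the original alloc lost.
			out.replaceNow = append(out.replaceNow, a.ID)
			continue
		}
		// A delay from reschedule or stop_after_client_disconnect: create a
		// follow-up eval instead of an immediate placement.
		out.followUpEvalAt = append(out.followUpEvalAt, now.Add(a.rescheduleDelay))
	}
	return out
}

func main() {
	now := time.Now()
	res := reconcileLost(now, []lostAlloc{
		{ID: "web-1"},
		{ID: "web-2", rescheduleDelay: 2 * time.Minute},
	})
	fmt.Println(res.replaceNow, len(res.followUpEvalAt)) // [web-1] 1
}
```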

@github-actions bot commented Jan 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2023