Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Back-pressure on Nacks and ensure scheduling progress on failures #2555

Merged
merged 5 commits into from
Apr 14, 2017

Conversation

dadgar
Copy link
Contributor

@dadgar dadgar commented Apr 12, 2017

This PR adds two things to increase robustness under high contention:

  1. Evaluation's that are Nack'd are not immediately re-enqueued, providing some back-pressure.
  2. If an evaluations hits its delivery limit and is marked as failed, when reaped, a follow up evaluation will be created with a fairly substantial delay. This gives the cluster time to recover and ensures progress for jobs with failed evaluations.

Add a delay when an evaluation is nacked that starts off small but
compounds to a larger delay for subsequent Nacks. This creates some
back pressure.
Create a follow up evaluation when reaping failed evaluations. This
ensures that a job will still make eventual progress.
@dadgar dadgar requested a review from armon April 12, 2017 21:53
var delay time.Duration

switch {
case prevDequeues <= 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make each case just return an explicit value? Makes it easier, especially since we don't post-process the value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean here. The number of retries is a config option so not sure how you enumerate all possibilities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean instead of setting delay in the outer block and an empty clause, do an explicit return in each and remove the outer variable.


// initialNackReenqueueDelay is the delay applied before re-enqueuing a
// Nacked evaluation for the first time
initialNackReenqueueDelay = time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's thread these through a config like EvalNackTimeout, that way there isn't a weird hack for testing either.

nomad/leader.go Outdated
// failedEvalFollowUpWaitRange defines the the range of additional time from
// the minimum in which to wait before retrying a failed evaluation. A value
// from this range should be selected using a uniform distribution.
failedEvalFollowUpWaitRange = 9 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a really wide window. I would start more conservative, like 1 minute baseline with 5 minute max.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might consider making these threaded through via config as well. I can see wanting to tune this potentially.

EvalTriggerNodeUpdate = "node-update"
EvalTriggerScheduled = "scheduled"
EvalTriggerRollingUpdate = "rolling-update"
EvalTriggerFailedFollowUp = "failed-eval-follow-up"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can omit "eval" -> "failed-follow-up", since it's on an eval, that is implied.

nomad/leader.go Outdated
// Update via Raft
req := structs.EvalUpdateRequest{
Evals: []*structs.Evaluation{newEval},
Evals: []*structs.Evaluation{newEval, followupEval},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename newEval to updateEval. It took me a while to realize it's not a "new" eval, just an update of the existing one.

Copy link
Member

@armon armon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, but LGTM!

@dadgar dadgar merged commit 477c97e into master Apr 14, 2017
@dadgar dadgar deleted the f-nack-delay branch April 14, 2017 22:27
@github-actions
Copy link

github-actions bot commented Apr 2, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants