
Improved scheduler retry logic under high contention #787

Merged: 2 commits merged into master from f-scheduler-retries on Feb 11, 2016

Conversation

@dadgar (Contributor) commented Feb 10, 2016:

This PR resets the retry count if progress is made during scheduling, and once retries are exhausted it fails by creating a blocked eval.

@armon
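
As an illustration of the behavior described above (a minimal sketch, not the PR's actual code; scheduler, computePlacements, createBlockedEval, and maxSchedulerRetries are hypothetical names): the retry budget resets whenever a pass makes progress, and the failure path queues a blocked evaluation instead of returning a hard error.

package sketch

// Hypothetical names; this only shows the shape of the retry behavior
// described in the PR text, not Nomad's implementation.
type scheduler interface {
	// computePlacements runs one scheduling pass and reports whether any
	// progress was made and whether all work is finished.
	computePlacements() (progress, done bool, err error)
	// createBlockedEval parks the remaining work as a blocked evaluation.
	createBlockedEval() error
}

const maxSchedulerRetries = 5 // illustrative limit

func runWithRetries(s scheduler) error {
	retries := 0
	for {
		progress, done, err := s.computePlacements()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if progress {
			retries = 0 // progress was made, so reset the retry budget
			continue
		}
		retries++
		if retries >= maxSchedulerRetries {
			// No progress within the budget: fail by creating a blocked
			// eval rather than spinning or returning a hard error.
			return s.createBlockedEval()
		}
	}
}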

@c4milo (Contributor) commented Feb 10, 2016:

I wonder if retry attempts should be randomized, in order to avoid overwhelming the server when too many blocked evaluations are queued. Or does the retry limit achieve the same effect?

}

e := s.ctx.Eligibility()
classes := e.GetClasses()
A project Member commented on this diff:

May as well not track this if HasEscaped
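
One reading of that suggestion, sketched against the snippet above (trackClasses is a hypothetical stand-in for whatever consumes classes; this is not necessarily the change that was made):

e := s.ctx.Eligibility()
if !e.HasEscaped() {
	// Only track per-class eligibility when the constraints have not
	// escaped; otherwise the class data would be collected and never used.
	trackClasses(e.GetClasses()) // trackClasses is hypothetical
}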

@armon (Member) commented Feb 11, 2016:

Minor feedback, LGTM

dadgar added a commit that referenced this pull request on Feb 11, 2016: Improved scheduler retry logic under high contention
dadgar merged commit 49b4d39 into master on Feb 11, 2016
dadgar deleted the f-scheduler-retries branch on February 11, 2016 at 17:49
@armon (Member) commented Feb 11, 2016:

@c4milo the retry limit is there to prevent overwhelming the servers, exactly as you said!

@c4milo (Contributor) commented Feb 11, 2016:

Nice! Shouldn't retries be randomized then, so that in the event of a general failure all the queued allocs aren't retried at the same time, DoSing the servers? Or is that unlikely to happen?

@c4milo (Contributor) commented Feb 11, 2016:

I've seen similar scenarios in other distributed systems, where a service was unable to recover because all clients retried at the same time, DoSing/overwhelming the service.
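
For context, the usual mitigation for that thundering-herd pattern is randomized (jittered) backoff on each retry; a generic sketch, not tied to anything in Nomad:

package sketch

import (
	"math/rand"
	"time"
)

// retryDelay computes an exponential backoff capped at maxDelay, then draws
// the actual wait uniformly from [0, backoff) ("full jitter") so that clients
// recovering from a shared failure spread their retries out instead of
// hitting the service in lockstep.
func retryDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	backoff := base << uint(attempt)
	if backoff <= 0 || backoff > maxDelay {
		backoff = maxDelay
	}
	return time.Duration(rand.Int63n(int64(backoff)))
}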

@armon (Member) commented Feb 12, 2016:

@c4milo The evaluation broker handles this case. The scheduler limits how many retries it does in a hot loop before yielding the scheduler thread and moving back into the evaluation broker. There is also randomization in the placement order to reduce contention under extremely high load.
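
A rough sketch of that flow, with hypothetical names (broker, evaluation, hotLoopLimit, scheduleOnce) standing in for Nomad's real types:

package sketch

// Hypothetical stand-ins for the evaluation broker and an evaluation.
type evaluation struct{ ID string }

type broker interface {
	ack(*evaluation)     // evaluation fully handled
	requeue(*evaluation) // hand the evaluation back for a later retry
}

const hotLoopLimit = 5 // illustrative bound on in-process retries

// process makes a bounded number of scheduling attempts in a hot loop, then
// yields the scheduler thread by returning the evaluation to the broker,
// which decides when it runs again.
func process(b broker, ev *evaluation, scheduleOnce func(*evaluation) bool) {
	for attempt := 0; attempt < hotLoopLimit; attempt++ {
		if scheduleOnce(ev) {
			b.ack(ev)
			return
		}
	}
	b.requeue(ev)
}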

@c4milo (Contributor) commented Feb 12, 2016:

Great! Thanks Armon for explaining further!

@github-actions (bot) commented:
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

The github-actions bot locked this pull request as resolved and limited conversation to collaborators on Apr 29, 2023.