Automatically down-rank or quarantine nodes with failing placements #12920

Closed
mikenomitch opened this issue May 6, 2022 · 1 comment · Fixed by #13421

Comments

@mikenomitch
Contributor

Proposal

When the scheduler attempts to place a task on a node and the allocation fails, that node should be ranked lower (or quarantined) the next time the scheduler looks for a node to place work on.
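
To make the idea concrete, here is a minimal sketch of how down-ranking and quarantining could work. It is not Nomad's scheduler code: `Node`, `FailureTracker`, the per-failure `penalty`, and the `quarantine` threshold are all hypothetical names and parameters chosen for illustration. Each recorded failure lowers a node's score, and a node that reaches the threshold is skipped entirely until a placement on it succeeds.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical stand-in for a scheduling candidate.
type Node struct {
	ID    string
	Score float64 // base score from the normal ranking; higher is better
}

// FailureTracker is a hypothetical helper that remembers failed placements.
type FailureTracker struct {
	failures   map[string]int
	penalty    float64 // assumed: score subtracted per recorded failure
	quarantine int     // assumed: failure count at which a node is skipped; 0 disables
}

func NewFailureTracker(penalty float64, quarantine int) *FailureTracker {
	return &FailureTracker{failures: map[string]int{}, penalty: penalty, quarantine: quarantine}
}

// RecordFailure notes one failed placement on the node.
func (t *FailureTracker) RecordFailure(nodeID string) { t.failures[nodeID]++ }

// RecordSuccess clears the node's failure history.
func (t *FailureTracker) RecordSuccess(nodeID string) { delete(t.failures, nodeID) }

// Rank returns a new slice with quarantined nodes dropped and the rest
// sorted by penalized score, best first. The input slice is not modified.
func (t *FailureTracker) Rank(nodes []Node) []Node {
	ranked := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		f := t.failures[n.ID]
		if t.quarantine > 0 && f >= t.quarantine {
			continue // quarantined: do not offer this node at all
		}
		n.Score -= float64(f) * t.penalty // n is a copy, so the caller's data is untouched
		ranked = append(ranked, n)
	}
	sort.SliceStable(ranked, func(i, j int) bool { return ranked[i].Score > ranked[j].Score })
	return ranked
}

func main() {
	tracker := NewFailureTracker(0.5, 3)
	nodes := []Node{
		{ID: "node-a", Score: 0.9},
		{ID: "node-b", Score: 0.8},
		{ID: "node-c", Score: 0.6},
	}

	tracker.RecordFailure("node-a")
	for _, n := range tracker.Rank(nodes) {
		fmt.Println(n.ID) // node-a now sorts last instead of first
	}
}
```

A real implementation would also need the failure counts to expire over time, so that a node that has recovered is eventually retried; the sketch leaves that out and only clears history on an explicit success.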

Use-cases

In some instances, the Nomad scheduler repeatedly selects the same node to schedule on, but for some reason deployments on that node fail repeatedly. This can be seen in this issue. In these cases, the entire cluster can become blocked by just a few bad nodes. This feature would make Nomad far more resilient when there are bad nodes that the scheduler does not know are bad.

Attempted Solutions

A person can monitor for failed placements and/or blocked evaluations and intervene. This takes a lot of effort and knowledge on the part of the Nomad operator, and shouldn't be necessary in the first place.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 22, 2022