Feature request: mark allocations pendingPlacement instead of failure if there aren't nodes available to place the allocations #1010

Closed
g0t4 opened this issue Mar 30, 2016 · 4 comments

Comments


g0t4 commented Mar 30, 2016

Nomad version

v0.3.1

Operating system and Environment details

CentOS 7, kernel 3.10.0-327.10.1.el7.x86_64
Docker version 1.10.3, build 20f81dd

Issue

When I have a batch job with allocations that can't be placed immediately, those allocations are marked failed. New allocations are eventually created, so the queue itself works great, but the allocation history becomes difficult to manage.

Often there are hundreds to thousands of failed allocations, each with the message "failed to find a node for placement". This makes status calls to the HTTP API slow to return, sometimes with responses as large as 8 MB. That isn't an issue for colocated queries, but for remote queries it can be a headache.
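To illustrate the size problem, here is a minimal sketch of collapsing an allocation listing into per-status counts on the client side. It assumes the response is a JSON array of objects carrying a `ClientStatus` field; the field name and the sample data are assumptions for illustration, not output captured from a real cluster.

```python
import json
from collections import Counter


def summarize_allocations(payload: str) -> dict:
    """Count allocations per ClientStatus in a JSON array of allocation objects.

    The "ClientStatus" key is an assumption about the response shape;
    entries without it are counted under "unknown".
    """
    allocations = json.loads(payload)
    counts = Counter(alloc.get("ClientStatus", "unknown") for alloc in allocations)
    return dict(counts)


# Hypothetical sample standing in for a much larger response body.
sample = json.dumps([
    {"ID": "a1", "ClientStatus": "running"},
    {"ID": "a2", "ClientStatus": "failed"},
    {"ID": "a3", "ClientStatus": "failed"},
])

print(summarize_allocations(sample))
```

This only hides the noise after the full response has been downloaded; it does not reduce the transfer size, which is why a server-side change is being requested.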

Would it make more sense to just keep the allocations pending, or maybe pendingPlacement? And then in the TaskStates, add an event entry that says "failed placement" with the last time placement was tried?
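A rough sketch of what that could look like, purely to show the idea. Every name here (`Allocation`, `PlacementEvent`, `record_failed_placement`, the `attempts` counter) is hypothetical and does not reflect Nomad's actual data structures: the allocation stays in one pending state, and a single event is updated in place on each failed attempt instead of a new failed allocation being created.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlacementEvent:
    message: str
    last_attempt: float  # Unix timestamp of the most recent placement attempt
    attempts: int = 1    # hypothetical counter, not part of the original proposal


@dataclass
class Allocation:
    name: str
    status: str = "pendingPlacement"
    events: List[PlacementEvent] = field(default_factory=list)


def record_failed_placement(alloc: Allocation, now: float) -> None:
    """Keep the allocation pending and update one event instead of adding history."""
    if alloc.events and alloc.events[-1].message == "failed placement":
        alloc.events[-1].last_attempt = now
        alloc.events[-1].attempts += 1
    else:
        alloc.events.append(PlacementEvent("failed placement", now))


alloc = Allocation(name="example.batch[0]")
for ts in (100.0, 160.0, 220.0):
    record_failed_placement(alloc, ts)

print(alloc.status, len(alloc.events), alloc.events[0].attempts, alloc.events[0].last_attempt)
```

With this shape, three failed attempts leave one allocation with one event, so the history stays constant in size no matter how long the job waits for capacity.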

FYI, I brought this up on the mailing list too: https://groups.google.com/forum/#!topic/nomad-tool/LcvMgHN_RPU

Reproduction steps

Create a batch job with enough tasks to saturate all nodes in a cluster, with some remaining allocations that can't be immediately placed.

Contributor

dadgar commented Mar 30, 2016

Hey @g0t4, thanks for bringing this up. We have some ideas to fix this and they will land in 0.4. The failed allocations contain debug information that we will be moving to the evaluation.

Contributor

dadgar commented May 28, 2016

Hey @g0t4, this has been fixed in master by PRs #1188 and #1199

dadgar closed this as completed May 28, 2016

Author

g0t4 commented May 31, 2016

You all rock, thanks!

On Fri, May 27, 2016 at 8:02 PM, Alex Dadgar notifications@github.com wrote:

Closed #1010.




I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022