
rpc: set the deregistration eval priority to the job priority. #11426

Merged
merged 2 commits into from
Nov 5, 2021

Conversation

@jrasell (Member) commented Nov 2, 2021

Previously, when creating an eval for job deregistration, the eval priority was set to the default value regardless of the job priority. In situations where an operator wanted to deregister a high priority job so they could re-register it, the evaluation could be blocked for some time on a busy cluster because of the deregister eval's priority.

If a job had a lower-than-default priority and was deregistered, the deregister eval would get a priority higher than that of the job. If we then attempted to register another job with a priority higher than the first job's, but still below the default, the deregister would be actioned before the register.

Both situations described above seem incorrect and unexpected from a user perspective.

This fix modifies the behaviour to set the deregister eval priority to that of the job, if available. Otherwise the default value is still used. This is the same behaviour found within BatchDeregister.
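
For illustration, here is a minimal, self-contained sketch of the described behaviour. The types, names, and the constant value of 50 are simplified stand-ins, not the actual Nomad structs package or RPC handler:

```go
package main

import "fmt"

// evalPriorityDefault stands in for the default priority used when the job
// cannot be looked up; the value 50 mirrors Nomad's default job priority.
const evalPriorityDefault = 50

// job and evaluation are simplified illustrations, not the real structs.
type job struct {
	ID       string
	Priority int
}

type evaluation struct {
	JobID       string
	Priority    int
	TriggeredBy string
}

// deregEval mirrors the behaviour described above: use the job's own
// priority when the job can still be looked up, otherwise fall back to
// the default.
func deregEval(j *job, jobID string) *evaluation {
	priority := evalPriorityDefault
	if j != nil {
		priority = j.Priority
	}
	return &evaluation{
		JobID:       jobID,
		Priority:    priority,
		TriggeredBy: "job-deregister",
	}
}

func main() {
	highPri := &job{ID: "api", Priority: 90}
	fmt.Println(deregEval(highPri, highPri.ID).Priority) // 90, not 50
	fmt.Println(deregEval(nil, "already-gone").Priority) // 50 (default)
}
```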

@DerekStrickland (Contributor) left a comment

This seems right.

@tgross (Member) left a comment

LGTM. But I'm also wondering if we should set a floor on deregistration priority to ensure that jobs are getting removed on a busy cluster?

@lgfa29 (Contributor) left a comment

The change LGTM, and it does what it says on the tin 😄, but I wonder if we're conflating three different priority values into one, not just in this PR but in general.

There's the job's priority, which is used for preemption, and then the job registration eval priority and the job deregistration eval priority, which are used for the eval broker queue order.

The problem I see in mixing these 3 is that the situation you described would only happen if job priorities were already being used, which is not always the case as preemption is not something everyone needs to think about.

So if a job doesn't have a priority set, operators would need to modify the jobspec, re-register it, and then stop it, potentially affecting preemption and introducing more load into the system before finally being able to reduce it.

I also think that the scheduler and the eval broker interpret priorities in different ways. For the scheduler, a high priority job should be "hard" to stop, meaning that run and stop priorities are effectively inverses of each other.

For the eval broker, the job registration priority shouldn't really impact the job deregistration priority since they are two somewhat isolated operations. The priority order is tied to the type of operation being performed. So an EvalTriggerJobDeregister eval should have a higher priority relative to an EvalTriggerJobRegister eval, because you usually want to be able to reduce load before adding more. So maybe EvalTriggerJobDeregister should always be 100? Though that sounds a bit drastic 😅 Or maybe job.Priority + 5?
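
A minimal sketch of the job.Priority + 5 idea, purely illustrative; the cap at 100 is an assumption based on the 1-100 job priority range, not something decided here:

```go
package main

import "fmt"

// maxPriority is assumed to match Nomad's 1-100 job priority range.
const maxPriority = 100

// deregPriority bumps the deregistration eval slightly above the job's own
// priority, capped at the assumed ceiling.
func deregPriority(jobPriority int) int {
	if p := jobPriority + 5; p < maxPriority {
		return p
	}
	return maxPriority
}

func main() {
	fmt.Println(deregPriority(50)) // 55
	fmt.Println(deregPriority(98)) // 100 (capped)
}
```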

But I think the real solution would be to allow operators to set eval priority independently of job priority. For example, add an -eval-priority flag to nomad job run and nomad job stop to override the default values.

@schmichael (Member) commented

tl;dr - Let's merge this. I don't think it will have a significant positive impact, but I do think it brings some behavior more in line with user expectations.

Due to https://github.com/hashicorp/nomad/blob/v1.1.6/nomad/eval_broker.go#L865-L873, all pending evals for the same job are ordered by CreateIndex (aka the total global submission order as determined by Raft), so I don't think there's an opportunity for misordering de-regs and re-regs-with-a-higher-priority for the same job.
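
For context, the ordering behind that link can be sketched roughly as follows (simplified types, not a verbatim copy of eval_broker.go): priority only decides the order between different jobs, while evals for the same job keep their Raft submission order.

```go
package main

import (
	"fmt"
	"sort"
)

// eval is a simplified stand-in for a pending evaluation in the broker.
type eval struct {
	JobID       string
	Priority    int
	CreateIndex uint64
}

// less approximates the comparison referenced above: higher priority wins
// across different jobs, but evals for the same job fall back to
// CreateIndex (submission) order regardless of priority.
func less(a, b eval) bool {
	if a.JobID != b.JobID && a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.CreateIndex < b.CreateIndex
}

func main() {
	pending := []eval{
		{JobID: "web", Priority: 80, CreateIndex: 11}, // later re-register at higher priority
		{JobID: "web", Priority: 50, CreateIndex: 10}, // earlier deregister at default priority
	}
	sort.Slice(pending, func(i, j int) bool { return less(pending[i], pending[j]) })
	fmt.Println(pending[0].CreateIndex) // 10: the earlier deregister is still dequeued first
}
```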

I think this PR does still help in the general sense of considering high priority jobs first. I think this matches user expectations and the docs for priority. Whether those expectations or docs are really the best way to model the various ways Nomad considers priorities (as @lgfa29 points out) is another question.

So maybe EvalTriggerJobDeregister should always be 100? Or maybe job.Priority + 5?

This is a great question. The only case I can think of for not hardcoding de-registrations to P=100 is to prevent DoSing or priority inversion. Say a single cluster runs both low-priority batch and mixed-priority service workloads with a few P=99 services constrained to a small set of nodes that batch workloads aren't allowed on. I think it's easy to imagine the low-priority batch jobs are submitted by code rather than a human, perhaps a CI/CD system or email sending.

If due to bug or backpressure or operator intervention the batch jobs aren't completing quickly and need to be stopped, Nomad could receive a flood of low-priority de-registrations.

If the P=99 service also needs updating, perhaps to react to the same systemwide incident, its evals are suddenly "stuck" behind the batch de-regs even though the batch de-regs can't affect anything related to the P=99 service's scheduling due to constraints!

I think this is the pathological case (please try to come up with a worse one!).

On the plus side, processing evals for stopped jobs is way less work than for pending jobs, so why not try to do stopped jobs first in all but the most pathological DoS situations?

But I think the real solution would be to allow operators to set eval priority independently of job priority. For example, add an -eval-priority flag to nomad job run and nomad job stop to override the default values.

I agree with this. While arbitrarily bumping priority of de-reg evals is an interesting idea to try to get stopped jobs processed first, I think this is a much safer approach. Let operators "skip the line" when desired, but otherwise use the logic proposed in this PR because it is intuitively what users will expect.

I'll open an issue for that enhancement.

@jrasell jrasell merged commit 2bb8313 into main Nov 5, 2021
@jrasell jrasell deleted the b-set-dereg-eval-priority-correctly branch November 5, 2021 14:53
lgfa29 pushed a commit that referenced this pull request Nov 15, 2021
…rrectly

rpc: set the deregistration eval priority to the job priority.
@github-actions (bot) commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2022