
rpc: set the deregistration eval priority to the job priority. #11426

Merged
merged 2 commits into from
Nov 5, 2021

Conversation

@jrasell (Member) commented Nov 2, 2021

Previously, when creating an eval for job deregistration, the eval priority was set to the default value regardless of the job priority. In situations where an operator wanted to deregister a high priority job so they could re-register it, the evaluation could be blocked for some time on a busy cluster because of the deregister eval's priority.

If a job had a lower-than-default priority and was deregistered, the deregister eval would get a priority higher than that of the job. If we then attempted to register another job with a priority higher than the first job's, but still below the default, the deregister would be actioned before the register.

Both situations described above seem incorrect and unexpected from a user perspective.

This fix modifies the behaviour to set the deregister eval priority to that of the job, if available. Otherwise the default value is still used. This is the same behaviour found within BatchDeregister.
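
For illustration, here is a minimal, self-contained sketch of the described behaviour. The types, names, and the constant value of 50 are simplified stand-ins, not the actual Nomad structs package or RPC handler:

```go
package main

import "fmt"

// evalPriorityDefault stands in for the default priority used when the job
// cannot be looked up; the value 50 mirrors Nomad's default job priority.
const evalPriorityDefault = 50

// job and evaluation are simplified illustrations, not the real structs.
type job struct {
	ID       string
	Priority int
}

type evaluation struct {
	JobID       string
	Priority    int
	TriggeredBy string
}

// deregEval mirrors the behaviour described above: use the job's own
// priority when the job can still be looked up, otherwise fall back to
// the default.
func deregEval(j *job, jobID string) *evaluation {
	priority := evalPriorityDefault
	if j != nil {
		priority = j.Priority
	}
	return &evaluation{
		JobID:       jobID,
		Priority:    priority,
		TriggeredBy: "job-deregister",
	}
}

func main() {
	highPri := &job{ID: "api", Priority: 90}
	fmt.Println(deregEval(highPri, highPri.ID).Priority) // 90, not 50
	fmt.Println(deregEval(nil, "already-gone").Priority) // 50 (default)
}
```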

@DerekStrickland (Contributor) left a comment

This seems right.

@tgross (Member) left a comment

LGTM. But I'm also wondering if we should set a floor on deregistration priority to ensure that jobs are getting removed on a busy cluster?

@lgfa29 (Contributor) left a comment

The change LGTM, and it does what it says on the tin 😄, but I wonder if we're conflating three different priority values into one, not just in this PR but in general.

There's the job's priority, which is used for preemption, and then the job registration eval priority and the job deregistration eval priority, which are used for the eval broker queue order.

The problem I see in mixing these 3 is that the situation you described would only happen if job priorities were already being used, which is not always the case as preemption is not something everyone needs to think about.

So if a job doesn't have a priority set, operators would need to modify the jobspec, re-register it, and then stop it, potentially affecting preemption and introducing more load into the system before finally being able to reduce it.

I also think that the scheduler and the eval broker interpret priorities in different ways. For the scheduler, a high priority job should be "hard" to stop, meaning that run and stop priorities are effectively inverses of each other.

For the eval broker, the job registration priority shouldn't really impact the job deregistration priority since they are two somewhat isolated operations. The priority order is tied to the type of operation being performed. So an EvalTriggerJobDeregister eval should have a higher priority relative to an EvalTriggerJobRegister eval, because you usually want to be able to reduce load before adding more. So maybe EvalTriggerJobDeregister should always be 100? Though that sounds a bit drastic 😅 Or maybe job.Priority + 5?
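
A minimal sketch of the job.Priority + 5 idea, purely illustrative; the cap at 100 is an assumption based on the 1-100 job priority range, not something decided here:

```go
package main

import "fmt"

// maxPriority is assumed to match Nomad's 1-100 job priority range.
const maxPriority = 100

// deregPriority bumps the deregistration eval slightly above the job's own
// priority, capped at the assumed ceiling.
func deregPriority(jobPriority int) int {
	if p := jobPriority + 5; p < maxPriority {
		return p
	}
	return maxPriority
}

func main() {
	fmt.Println(deregPriority(50)) // 55
	fmt.Println(deregPriority(98)) // 100 (capped)
}
```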

But I think the real solution would be to allow operators to set eval priority independently of job priority. For example, add an -eval-priority flag to nomad job run and nomad job stop to override the default values.

@schmichael (Member) commented

tl;dr - Let's merge this. I don't think it will have a significant positive impact, but I do think it brings some behavior more in line with user expectations.

Due to https://github.com/hashicorp/nomad/blob/v1.1.6/nomad/eval_broker.go#L865-L873, all pending evals for the same job are ordered by CreateIndex (aka the total global submission order as determined by Raft), so I don't think there's an opportunity for misordering de-regs and re-regs-with-a-higher-priority for the same job.
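
For context, the ordering behind that link can be sketched roughly as follows (simplified types, not a verbatim copy of eval_broker.go): priority only decides the order between different jobs, while evals for the same job keep their Raft submission order.

```go
package main

import (
	"fmt"
	"sort"
)

// eval is a simplified stand-in for a pending evaluation in the broker.
type eval struct {
	JobID       string
	Priority    int
	CreateIndex uint64
}

// less approximates the comparison referenced above: higher priority wins
// across different jobs, but evals for the same job fall back to
// CreateIndex (submission) order regardless of priority.
func less(a, b eval) bool {
	if a.JobID != b.JobID && a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.CreateIndex < b.CreateIndex
}

func main() {
	pending := []eval{
		{JobID: "web", Priority: 80, CreateIndex: 11}, // later re-register at higher priority
		{JobID: "web", Priority: 50, CreateIndex: 10}, // earlier deregister at default priority
	}
	sort.Slice(pending, func(i, j int) bool { return less(pending[i], pending[j]) })
	fmt.Println(pending[0].CreateIndex) // 10: the earlier deregister is still dequeued first
}
```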

I think this PR does still help in the general sense of considering high priority jobs first. I think this matches user expectations and the docs for priority. Whether those expectations or docs are really the best way to model the various ways Nomad considers priorities (as @lgfa29 points out) is another question.

So maybe EvalTriggerJobDeregister should always be 100? Or maybe job.Priority + 5?

This is a great question. The only case I can think of for not hardcoding de-registrations to P=100 is to prevent DoSing or priority inversion. Say a single cluster runs both low-priority batch and mixed-priority service workloads with a few P=99 services constrained to a small set of nodes that batch workloads aren't allowed on. I think it's easy to imagine the low-priority batch jobs are submitted by code rather than a human, perhaps a CI/CD system or email sending.

If due to bug or backpressure or operator intervention the batch jobs aren't completing quickly and need to be stopped, Nomad could receive a flood of low-priority de-registrations.

If the P=99 service also needs updating, perhaps to react to the same systemwide incident, its evals are suddenly "stuck" behind the batch de-regs even though the batch de-regs can't affect anything related to the P=99 service's scheduling due to constraints!

I think this is the pathological case (please try to come up with a worse one!).

On the plus side, processing evals for stopped jobs is way less work than for pending jobs, so why not try to do stopped jobs first in all but the most pathological DoS situations?

But I think the real solution would be to allow operators to set eval priority independently of job priority. For example, add an -eval-priority flag to nomad job run and nomad job stop to override the default values.

I agree with this. While arbitrarily bumping priority of de-reg evals is an interesting idea to try to get stopped jobs processed first, I think this is a much safer approach. Let operators "skip the line" when desired, but otherwise use the logic proposed in this PR because it is intuitively what users will expect.

I'll open an issue for that enhancement.

@jrasell jrasell merged commit 2bb8313 into main Nov 5, 2021
@jrasell jrasell deleted the b-set-dereg-eval-priority-correctly branch November 5, 2021 14:53
lgfa29 pushed a commit that referenced this pull request Nov 15, 2021
…rrectly

rpc: set the deregistration eval priority to the job priority.
@github-actions (bot) commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2022