System job with constraints fails to plan #12748

Open
chilloutman opened this issue Apr 22, 2022 · 16 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/scheduling · theme/system-scheduler · type/bug

Comments

@chilloutman

Nomad version

v1.2.6

(Nomad v1.2.6 has the problem described below, while Nomad v1.1.5 works as expected.)

Operating system and Environment details

Nomad nodes are running Ubuntu. Docker driver is used for all tasks.

A set of nodes has node.class set to worker, and there are a few other nodes in the cluster.
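
For context, node.class comes from the client configuration on each node; a minimal sketch of what the worker nodes' config might contain (the file path is illustrative, not taken from this setup):

# /etc/nomad.d/client.hcl (illustrative path)
client {
  enabled    = true
  node_class = "worker"   # exposed to the scheduler as ${node.class}
}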

Issue

A system job with constraints fails to plan.

Reproduction steps

A job with type = "system" is used to schedule tasks on the worker nodes, so the following constraint is added to the worker group:

constraint {
  attribute = "${node.class}"
  operator  = "="
  value     = "worker"
}
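
For reference, a minimal system job carrying this constraint might look like the sketch below (job name, datacenter, and image are placeholders, not taken from the original job):

job "worker" {
  datacenters = ["dc1"]
  type        = "system"

  group "worker" {
    constraint {
      attribute = "${node.class}"
      operator  = "="
      value     = "worker"
    }

    task "worker" {
      driver = "docker"

      config {
        # Placeholder image for illustration only
        image = "example/worker:latest"
      }
    }
  }
}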

Expected Result

All the worker nodes should run the worker task; all other nodes should not.

Actual Result

This works sometimes, in particular when there are no allocations on the cluster. But running nomad job plan after allocations are running displays the following warning:

Scheduler dry-run:
- WARNING: Failed to place allocations on all nodes.
  Task Group "worker" (failed to place 1 allocation):
    * Class "entry": 1 nodes excluded by filter
    * Constraint "${node.class} = worker": 1 nodes excluded by filter

This should not be a warning, since the planned allocations match the job definition once the constraints are taken into account.
nomad job run produces the desired state, and the job is displayed as “not scheduled” on all non-worker nodes.

Removing the constraints removes the warning, but it obviously schedules the worker task on non-worker nodes, which is unwanted.

The only workarounds seem to be to ignore the warnings, which defeats the purpose of nomad job plan, or to create an entirely separate cluster for the workers.

Possibly related:

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 22, 2022
@cr0c0dylus

cr0c0dylus commented Apr 25, 2022

I'm facing the same problem (1.2.6):

Job: "stage-cron"
Task Group: "cron" (1 ignore)
Task: "cron"

Scheduler dry-run:

  • WARNING: Failed to place allocations on all nodes.
    Task Group "cron" (failed to place 1 allocation):
    • Constraint "${meta.env} = stage": 5 nodes excluded by filter

But if I stop the job before submitting a new one, it works as expected:

$ nomad job stop stage-cron
==> 2022-04-25T18:45:07+03:00: Monitoring evaluation "86e8c675"
2022-04-25T18:45:07+03:00: Evaluation triggered by job "stage-cron"
==> 2022-04-25T18:45:08+03:00: Monitoring evaluation "86e8c675"
2022-04-25T18:45:08+03:00: Evaluation status changed: "pending" -> "complete"
==> 2022-04-25T18:45:08+03:00: Evaluation "86e8c675" finished with status "complete"

$ nomad job plan ...

+/- Job: "stage-cron"
+/- Stop: "true" => "false"
Task Group: "cron" (1 create)
Task: "cron"

Scheduler dry-run:

  • All tasks successfully allocated.

@cr0c0dylus

I have found a temporary workaround: add a 1.1.x server to the cluster and stop and restart the 1.2.6 leaders until the 1.1.x server becomes the leader.

@tgross
Member

tgross commented May 2, 2022

Hi @chilloutman! This definitely seems like it could be related to #12016. I'm not going to mark it as a duplicate just in case it's not, but I'll cross-reference here so that whoever tackles that issue will see this as well. I don't have a good workaround for you other than to ignore the warnings (they're warnings and not errors), but I realize that isn't ideal.

Just FYI @cr0c0dylus:

I have found a temporary workaround: add a 1.1.x server to the cluster and stop and restart the 1.2.6 leaders until the 1.1.x server becomes the leader.

This is effectively downgrading Nomad into a mixed-version cluster, which is not supported and highly likely to result in state store corruption. Doing so in order to suppress something that's only a warning is not advised.

@tgross tgross added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label May 2, 2022
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage May 2, 2022
@cr0c0dylus

Doing so in order to suppress something that's only a warning is not advised.

Unfortunately, it is not only a warning; it cannot allocate the job at all. Another trick is to change one of the limits in the resources stanza, for example adding +1 to the CPU limit. But that doesn't work with some of my jobs.

@ygersie
Contributor

ygersie commented May 31, 2022

I wonder if this is related to #11778 (comment). It really looks like a bug in the scheduler that incorrectly fails placement during the node feasibility check. It is almost as if it is not iterating through all nodes, but for some reason returns a placement failure before it has exhausted the full list.

@lssilva

lssilva commented Jun 7, 2022

I am also facing this issue, and I had to downgrade Nomad.

@chilloutman
Author

I'm wondering if this could be the cause: https://github.com/hashicorp/nomad/pull/11111/files#diff-c4e3135b7aa83ba07d59d003a8ab006915207425b8728c4cf070eee20ab9157a

"// track node filtering, to only report an error if all nodes have been filtered" might not be working as intended. Or maybe instead of only warnings #11111 ended up causing errors.

@jmwilkinson
Contributor

Verified we hit this with constraints on 1.2.6 as well.

Mitigation was reverting to 1.1.5.

I do not know how bugs are prioritized, but this one should probably be pretty high.

@cr0c0dylus

By the way, it would be great if those warnings could be completely disabled in the config. If I have 50 nodes in the cluster and a constraint that matches 3 of them, what is the sense of seeing "47 Not Scheduled"? System jobs are very useful for scaling in an HA configuration: I don't need to modify the job stanza, I just add or remove nodes with a special meta variable.
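
For anyone unfamiliar with that pattern, the idea is roughly as follows: tag the relevant clients with a meta value and constrain the system job on it. A sketch, with an illustrative key and value based on the ${meta.env} = stage constraint shown earlier:

# Client configuration on the nodes that should run the job
client {
  enabled = true

  meta {
    env = "stage"   # matched by the job via ${meta.env}
  }
}

# Constraint inside the system job's group
constraint {
  attribute = "${meta.env}"
  operator  = "="
  value     = "stage"
}

Adding or removing nodes carrying that meta value then scales the system job without touching the job file.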

@dext0r

dext0r commented Jun 30, 2022

I'm wondering if this could be the cause: https://github.com/hashicorp/nomad/pull/11111/files#diff-c4e3135b7aa83ba07d59d003a8ab006915207425b8728c4cf070eee20ab9157a

"// track node filtering, to only report an error if all nodes have been filtered" might not be working as intended. Or maybe instead of only warnings #11111 ended up causing errors.

It's the cause indeed. Reverting this pull request fixed the issue for me on 1.3.1.

@cr0c0dylus

Nomad v1.2.9 (86192e4)

The problem persists. I still need to stop the 1.2.9 masters in sequence until 1.0.18 becomes the leader and allows deployment.

@jmwilkinson
Contributor

There may be a fix in 1.3.2, at least it looks that way: https://github.com/hashicorp/nomad/blob/v1.3.2/scheduler/scheduler_system.go#L298

@seanamos

The issue still exists in v1.5.3; we frequently run into this when upgrading system jobs.

While the Nomad CLI reports this error, the rollout does still actually happen in Nomad.

@nCrazed

nCrazed commented Dec 20, 2023

I am seeing the same behavior as @seanamos in v1.6.3.

@cr0c0dylus

The problem continues to occur in v1.7.3.

@elgatopanzon

Can confirm still present in Nomad v1.7.7.
