Pending batch jobs not run when resources become available #17278

Closed
mikenomitch opened this issue May 22, 2023 · 3 comments · Fixed by #18838
Labels
hcc/cst Admin - internal · theme/batch Issues related to batch jobs and scheduling · type/bug

Comments

mikenomitch (Contributor) commented May 22, 2023

Opening bug on behalf of a customer:

Description

We have observed that Nomad 1.1.2+ent (with namespaces) will put jobs into the pending state when not enough resources are available to execute them; however, when resources become available, it does not automatically run those pending jobs. Kicking the scheduler, either by marking any node as ineligible and then eligible again or by running nomad job eval -force-reschedule, brings it back to life and another single set of tasks runs. On completion, the remaining pending tasks stay stuck.
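For reference, the two workarounds described above look roughly like this (placeholder IDs; a sketch, not output captured from the affected cluster):

# Toggle any node's eligibility to kick the scheduler
nomad node eligibility -disable <node-id>
nomad node eligibility -enable <node-id>

# Or force a re-evaluation of a stuck dispatch job
nomad job eval -force-reschedule -namespace <namespace> <job-id>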

Example

ID                                        Namespace  Type                 Priority  Status   Submit Date
dummy-batch                               ns    batch/parameterized  50        running  2021-12-11T00:56:41Z
dummy-batch/dispatch-1639184302-6ff8ba90  ns    batch                50        dead     2021-12-11T00:58:22Z
dummy-batch/dispatch-1639184308-10976bc8  ns    batch                50        dead     2021-12-11T00:58:28Z
dummy-batch/dispatch-1639184312-fc11df5f  ns    batch                50        dead     2021-12-11T00:58:32Z
dummy-batch/dispatch-1639184334-dfb747a1  ns    batch                50        dead     2021-12-11T00:58:54Z
dummy-batch/dispatch-1639184343-629df0b0  ns    batch                50        dead     2021-12-11T00:59:03Z
dummy-batch/dispatch-1639184357-69745c20  ns    batch                50        dead     2021-12-11T00:59:17Z
dummy-batch/dispatch-1639184364-c11562cf  ns    batch                50        dead     2021-12-11T00:59:24Z
dummy-batch/dispatch-1639184703-36ea2cae  ns    batch                50        dead     2021-12-11T01:05:03Z
dummy-batch/dispatch-1639184726-6594e44b  ns    batch                50        dead     2021-12-11T01:05:26Z
dummy-batch/dispatch-1639184727-be0b74d8  ns    batch                50        dead     2021-12-11T01:05:27Z
dummy-batch/dispatch-1639184728-e0bdf5e2  ns    batch                50        pending  2021-12-11T01:05:28Z
dummy-batch/dispatch-1639184729-d95f3413  ns    batch                50        pending  2021-12-11T01:05:29Z
dummy-batch/dispatch-1639184730-9a59c7fc  ns    batch                50        pending  2021-12-11T01:05:30Z
dummy-batch/dispatch-1639184730-c95a4559  ns    batch                50        pending  2021-12-11T01:05:30Z
dummy-batch/dispatch-1639184731-3c10abb4  ns    batch                50        pending  2021-12-11T01:05:31Z
dummy-batch/dispatch-1639184732-5da2e4e0  ns    batch                50        pending  2021-12-11T01:05:32Z
dummy-batch/dispatch-1639184734-1e8033f0  ns    batch                50        pending  2021-12-11T01:05:34Z
dummy-batch/dispatch-1639184734-58f7528d  ns    batch                50        pending  2021-12-11T01:05:34Z
dummy-batch/dispatch-1639184734-5c383b5b  ns    batch                50        pending  2021-12-11T01:05:34Z
dummy-batch/dispatch-1639184734-5d4f2f37  ns    batch                50        pending  2021-12-11T01:05:34Z
dummy-batch/dispatch-1639184734-77500782  ns    batch                50        pending  2021-12-11T01:05:34Z
dummy-batch/dispatch-1639184735-592db3e0  ns    batch                50        pending  2021-12-11T01:05:35Z
dummy-batch/dispatch-1639184736-ed84e2fa  ns    batch                50        pending  2021-12-11T01:05:36Z
dummy-batch/dispatch-1639184737-93fd8d42  ns    batch                50        pending  2021-12-11T01:05:37Z
dummy-batch/dispatch-1639184738-0074dbe9  ns    batch                50        pending  2021-12-11T01:05:38Z
dummy-batch/dispatch-1639184738-6042c481  ns    batch                50        pending  2021-12-11T01:05:38Z
dummy-batch/dispatch-1639184739-34bff7f2  ns    batch                50        pending  2021-12-11T01:05:39Z
dummy-batch/dispatch-1639184740-2ed98ebf  ns    batch                50        pending  2021-12-11T01:05:40Z

vagrant@ubuntu-focal:~$ nomad job status dummy-batch/dispatch-1639184728-e0bdf5e2
ID            = dummy-batch/dispatch-1639184728-e0bdf5e2
Name          = dummy-batch/dispatch-1639184728-e0bdf5e2
Submit Date   = 2021-12-11T01:05:28Z
Type          = batch
Priority      = 50
Datacenters   = london
Namespace     = ns
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
default     2       0         0        0       0         0

Placement Failure
Task Group "default":
  * Resources exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes

Allocations
No allocations placed

vagrant@ubuntu-focal:~$ nomad job eval -verbose -force-reschedule -namespace brobert dummy-batch/dispatch-1639184728-e0bdf5e2
==> 2021-12-11T01:18:54Z: Monitoring evaluation "74be850f-2d8a-9ce1-8c7d-57c29e1ff88a"
    2021-12-11T01:18:54Z: Evaluation triggered by job "dummy-batch/dispatch-1639184728-e0bdf5e2"
    2021-12-11T01:18:54Z: Allocation "a5a208f1-af49-ae98-032e-bab5ef71cf58" created: node "38826b89-4e80-a867-c92b-b8391008fe89", group "default"
    2021-12-11T01:18:54Z: Allocation "0ffd5b2b-0382-63f2-a39c-ab7851a915d8" created: node "38826b89-4e80-a867-c92b-b8391008fe89", group "default"
==> 2021-12-11T01:18:55Z: Monitoring evaluation "74be850f-2d8a-9ce1-8c7d-57c29e1ff88a"
    2021-12-11T01:18:55Z: Evaluation status changed: "pending" -> "complete"
==> 2021-12-11T01:18:55Z: Evaluation "74be850f-2d8a-9ce1-8c7d-57c29e1ff88a" finished with status "complete"
vagrant@ubuntu-focal:~$ nomad job status dummy-batch/dispatch-1639184728-e0bdf5e2
ID            = dummy-batch/dispatch-1639184728-e0bdf5e2
Name          = dummy-batch/dispatch-1639184728-e0bdf5e2
Submit Date   = 2021-12-11T01:05:28Z
Type          = batch
Priority      = 50
Datacenters   = london
Namespace     = brobert
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
default     0       0         2        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
0ffd5b2b  38826b89  default     0        run      running  17s ago  13s ago
a5a208f1  38826b89  default     0        run      running  17s ago  13s ago

vagrant@ubuntu-focal:~$ nomad job status dummy-batch/dispatch-1639184728-e0bdf5e2
ID            = dummy-batch/dispatch-1639184728-e0bdf5e2
Name          = dummy-batch/dispatch-1639184728-e0bdf5e2
Submit Date   = 2021-12-11T01:05:28Z
Type          = batch
Priority      = 50
Datacenters   = london
Namespace     = brobert
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
default     0       0         0        0       2         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
0ffd5b2b  38826b89  default     0        run      complete  30s ago  6s ago
a5a208f1  38826b89  default     0        run      complete  30s ago  6s ago
mikenomitch added the type/bug, theme/batch, and hcc/cst labels on May 22, 2023
gulducat (Member) commented Oct 9, 2023

I was not able to reproduce this with the following script and jobs, against a single agent in both server/client mode:

repro.sh
#!/bin/bash

set -x

# check nomad version
nomad version
curl -sS -H "X-Nomad-Token: $NOMAD_TOKEN" "$NOMAD_ADDR/v1/agent/self" | jq .config.Version

# issue mentions a namespace, so set one up
nomad namespace apply 'my-ns'
export NOMAD_NAMESPACE='my-ns'

# take up all the cpu
nomad run cpu-eater.nomad.hcl

# run and dispatch parameterized job,
# will sit pending due to cpu-eater
nomad run batch.nomad.hcl
nomad job dispatch bat
# i tried running a bunch, but got the same result.
#for _ in {1..20}; do
#  nomad job dispatch -detach bat
#done

# free up cpu
nomad stop -detach cpu-eater

# wait for pending job to run
for _ in {1..3}; do
  nomad status bat | grep pending || break
  sleep 10
done

# confirm 'hello' from batch job output
nomad logs -job $(nomad status bat | awk '/dead/ {print$1}' | head -n1)

# cleanup
nomad stop bat
nomad system gc
cpu-eater.nomad.hcl
job "cpu-eater" {
  datacenters = ["dc1"]
  namespace   = "my-ns"

  group "g" {
    task "t" {
      resources {
        cpu = 92000 # my machine is pretty beefy.
      }
      driver = "docker"
      config {
        image = "busybox:1"
        command = "sleep"
        args = ["infinity"]
      }
    }
  }
}
batch.nomad.hcl
job "bat" {
  datacenters = ["dc1"]
  namespace   = "my-ns"

  type = "batch"
  parameterized {}

  group "bat" {
    task "bat" {
      driver = "docker"
      config {
        image = "busybox:1"
        command = "echo"
        args = ["hello"]
      }
    }
  }
}

resulting output shows that the pending batch job did get run:

$ ./repro.sh
+ nomad version
Nomad v1.1.2+ent (466cfb4a246f76bfe8d745630d58c95106c63797)
+ curl -sS -H 'X-Nomad-Token: xxx' https://localhost:4646/v1/agent/self
+ jq .config.Version
{
  "Revision": "466cfb4a246f76bfe8d745630d58c95106c63797",
  "Version": "1.1.2",
  "VersionMetadata": "ent",
  "VersionPrerelease": ""
}
+ nomad namespace apply my-ns
Successfully applied namespace "my-ns"!
+ export NOMAD_NAMESPACE=my-ns
+ NOMAD_NAMESPACE=my-ns
+ nomad run cpu-eater.nomad.hcl
... etc etc ....
    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    g           1        1       1        0          2023-10-09T15:43:06-05:00
+ nomad run batch.nomad.hcl
Job registration successful
+ nomad job dispatch bat
Dispatched Job ID = bat/dispatch-1696883587-a339a083
Evaluation ID     = 1f244183

==> 2023-10-09T15:33:07-05:00: Monitoring evaluation "1f244183"
    2023-10-09T15:33:07-05:00: Evaluation triggered by job "bat/dispatch-1696883587-a339a083"
==> 2023-10-09T15:33:08-05:00: Monitoring evaluation "1f244183"
    2023-10-09T15:33:08-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-10-09T15:33:08-05:00: Evaluation "1f244183" finished with status "complete" but failed to place all allocations:
    2023-10-09T15:33:08-05:00: Task Group "bat" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Dimension "cpu" exhausted on 1 nodes
    2023-10-09T15:33:08-05:00: Evaluation "e8ec97e4" waiting for additional capacity to place remainder
+ nomad stop -detach cpu-eater
56d6be13-ed44-1846-0ecf-b41407ec03aa
+ for _ in {1..3}
+ nomad status bat
+ grep pending
bat/dispatch-1696883587-a339a083  pending
+ sleep 10
+ for _ in {1..3}
+ nomad status bat
+ grep pending
+ break
++ nomad status bat
++ head -n1
++ awk '/dead/ {print$1}'
+ nomad logs -job bat/dispatch-1696883526-11107904
hello

Perhaps I'm missing a component of the reproduction steps?

gulducat (Member) commented:

Working with @louievandyke today, we narrowed down the extra piece required to reproduce this; it reproduces all the way up to the current latest Nomad 1.6.2+ent and enterprise main!

Not only do we need a namespace, but that namespace needs a quota on it (an enterprise feature), specifically a quota where the resource (cpu, in this case) is set to zero (0).

Modifying the namespace piece of repro.sh above like so:

nomad quota apply quota.hcl
nomad namespace apply -quota 'my-quota' 'my-ns'

where quota.hcl is

name = "my-quota"
limit {
  region = "global"
  region_limit {} # default zero everything
}

results in the dispatched job getting stuck pending:

$ nomad status
ID                                Namespace  Type                 Priority  Status          Submit Date
bat                               my-ns      batch/parameterized  50        running         2023-10-18T16:35:00-05:00
bat/dispatch-1697664900-14f2784a  my-ns      batch                50        pending         2023-10-18T16:35:00-05:00
cpu-eater                         my-ns      service              50        dead (stopped)  2023-10-18T16:34:48-05:00

Specifically, this is due to the eval getting blocked until something triggers a re-eval.

There will be a server log entry like

[DEBUG] scheduler/generic_sched.go:307: worker.batch_sched: failed to place all allocations, blocked eval created: eval_id=16cce1bf-dc3a-9cc4-9e1f-0586c490537c job_id=bat/dispatch-1697664900-14f2784a namespace=my-ns worker_id=ae4c34d3-7a0c-c97b-17a2-4a4a59f3cf1c blocked_eval_id=ec85a56a-28f0-d4e1-b149-d76c21f344e1

The blocked eval can be seen with:

$ nomad eval list | grep blocked
ec85a56a  50        queued-allocs       bat/dispatch-1697664900-14f2784a  my-ns      <none>   blocked   N/A - In Progress

These metrics are pretty neat too, and show that the reason in this case is CPU (this can also happen with memory):

$ nomad operator metrics -format=prometheus | grep ^nomad_nomad_blocked_evals
nomad_nomad_blocked_evals_cpu{datacenter="dc1",host="cool-host",node_class=""} 100
nomad_nomad_blocked_evals_job_cpu{host="cool-host",job="bat/dispatch-1697664900-14f2784a",namespace="my-ns"} 100
nomad_nomad_blocked_evals_job_memory{host="cool-host",job="bat/dispatch-1697664900-14f2784a",namespace="my-ns"} 0
nomad_nomad_blocked_evals_memory{datacenter="dc1",host="cool-host",node_class=""} 0
nomad_nomad_blocked_evals_total_blocked{host="cool-host"} 1
nomad_nomad_blocked_evals_total_escaped{host="cool-host"} 0
nomad_nomad_blocked_evals_total_quota_limit{host="cool-host"} 0

I haven't quite tracked down yet why quotas with zero values produce this result, but I have confirmed that setting CPU in the quota results in jobs being serialized appropriately based on the quota, with a (temporary) placement message like:

Task Group "bat" (failed to place 1 allocation):
      * Quota limit hit "cpu exhausted (41300 needed > 41200 limit)"
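For reference, the quota that produces that (correct) behavior is shaped like the earlier quota.hcl but with cpu set to a nonzero value; presumably something along these lines (the 41200 figure simply mirrors the message above):

name = "my-quota"
limit {
  region = "global"
  region_limit {
    # nonzero limit: evals block on the quota and unblock correctly
    cpu = 41200
  }
}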

I'll pick this back up tomorrow and try to narrow down where we might handle the zero value differently to avoid this.

gulducat (Member) commented:

I believe I've found the cause. An evaluation getting blocked is totally normal when resources are occupied, but such evals should be _un_blocked automatically when the resources are freed up. So something is preventing them from being unblocked when the previous alloc(s) complete.

Seems this here is the culprit:
https://github.com/hashicorp/nomad/blob/v1.6.2/nomad/blocked_evals.go#L570

func (b *BlockedEvals) unblock(computedClass, quota string, index uint64) {
// .....
	if quota != "" && wrapped.eval.QuotaLimitReached != quota {
		// We are unblocking based on quota and this eval doesn't match
		continue
// .....

The quota that gets passed in there is the quota that just happens to be on the namespace of the job / desired allocation (here), and QuotaLimitReached would be equal to it if the block was due to the quota. But if the block is not due to the quota, we get something like

if "my-quota" != "" && "" != "my-quota" { // don't unblock }
// ^ quota on the ns   ^ quota that caused the block (none!)

This can also happen when the quota is very large, like 999999999, or at least large enough that the cluster-wide available CPU (or memory) is more restrictive than the quota. The quota itself is behaving properly in the sense that 0 is effectively unlimited, but the existence of the quota on the namespace prevents the blocked eval from being auto-unblocked.
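To make the effect of that condition concrete, here is a small standalone sketch (illustrative only; not Nomad code, and not necessarily the shape of the fix that landed in #18838) contrasting the current guard with a hypothetical one that only skips evals actually blocked by a different quota:

package main

import "fmt"

// shouldSkipUnblock models the guard quoted above (simplified, illustrative
// only): nsQuota is the quota attached to the namespace whose resources were
// just freed, and quotaLimitReached is the quota (if any) that caused the
// eval to block in the first place.
func shouldSkipUnblock(nsQuota, quotaLimitReached string) bool {
	// Current behavior: any eval whose QuotaLimitReached does not literally
	// equal the namespace's quota is skipped, including evals that were
	// blocked purely on node capacity (QuotaLimitReached == "").
	return nsQuota != "" && quotaLimitReached != nsQuota
}

// shouldSkipUnblockAdjusted is a hypothetical variant: only skip evals that
// were actually blocked by some *other* quota.
func shouldSkipUnblockAdjusted(nsQuota, quotaLimitReached string) bool {
	return nsQuota != "" && quotaLimitReached != "" && quotaLimitReached != nsQuota
}

func main() {
	// A capacity-blocked eval in a namespace that merely has a quota attached:
	fmt.Println(shouldSkipUnblock("my-quota", ""))         // true  -> eval stays blocked (the bug)
	fmt.Println(shouldSkipUnblockAdjusted("my-quota", "")) // false -> eval gets unblocked
}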

I have a tentative fix for it, will validate some more and write tests for it next week!
