Can't schedule a job because "resources exhausted" #146

Closed
adriaandejonge opened this issue Sep 29, 2015 · 6 comments

@adriaandejonge

I tried starting Nomad on both CoreOS (local and GCE) and Debian (GCE) and running a cluster (both with -dev and with separate client/server agents). For installation, I followed the steps described in the Vagrantfile.

As soon as I try to run:

nomad init
nomad run example.nomad

I get a message that my resources are exhausted:

username@instance-2:~$ nomad run example.nomad
==> Monitoring evaluation "bac37878-9a39-bd3e-5ef2-acc51cde7981"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "2263a54c-ea0e-ab68-adee-30f8d2e0827a" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: bandwidth exceeded" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bac37878-9a39-bd3e-5ef2-acc51cde7981" finished with status "complete"

Or this variant:

core@core-01 ~ $ ./nomad run example.nomad 
==> Monitoring evaluation "ee52b439-49e1-f3e4-a3d2-0ad6a1fc48d2"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "c23ca37e-2de5-b10d-d90a-d36bc468092c" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: no networks available" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ee52b439-49e1-f3e4-a3d2-0ad6a1fc48d2" finished with status "complete"

This is the first job I am scheduling, so it is unlikely that the network resources are actually exhausted. Am I missing some kind of configuration telling Nomad about the network?

@kelseyhightower

I think the problem here is that Nomad assumes eth0 for all systems, which is not the case on systems using systemd's predictable interface names. See the code here: https://github.com/hashicorp/nomad/blob/master/client/fingerprint/network_unix.go#L38
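
For illustration only, here is a minimal Go sketch of what a fingerprint step that assumes a fixed interface name amounts to. The hard-coded "eth0" is the only detail taken from the linked source; everything else is a simplified stand-in, not the actual Nomad code:

    package main

    import (
        "fmt"
        "net"
    )

    // findDefaultInterface mimics a fingerprint step that assumes a fixed
    // interface name. On hosts using systemd/udev predictable names
    // (e.g. ens4v1 on GCE), the lookup for "eth0" fails and no usable
    // network would be fingerprinted.
    func findDefaultInterface(name string) (*net.Interface, error) {
        iface, err := net.InterfaceByName(name)
        if err != nil {
            return nil, fmt.Errorf("interface %q not found: %v", name, err)
        }
        if iface.Flags&net.FlagUp == 0 {
            return nil, fmt.Errorf("interface %q is not up", name)
        }
        return iface, nil
    }

    func main() {
        if _, err := findDefaultInterface("eth0"); err != nil {
            fmt.Println(err)
        }
    }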

@sethvargo
Contributor

Cross-linking with #158

@adriaandejonge
Author

Thanks for your replies, @kelseyhightower & @sethvargo.

To test this hypothesis, I changed the hard-coded interface name in the network fingerprinting code to ens4v1 and recompiled. I chose ens4v1 because:

username@instance-group-1-w000 /tmp $ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens4v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc fq_codel state UP group default qlen 1000
    link/ether 42:01:0a:f0:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.0.0/32 brd 10.240.0.0 scope global ens4v1
       valid_lft forever preferred_lft forever
    inet6 fe80::4001:aff:fef0:0/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0@NONE: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::50d3:89ff:fe54:12e8/64 scope link 
       valid_lft forever preferred_lft forever

After compiling the modified version of Nomad and running it, I still get this result:

username@instance-group-1-w000 /tmp $ ./nomad run example.nomad 
==> Monitoring evaluation "f59bcb83-a2fb-7100-0ca5-426918f97e11"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "af0cd1a2-199e-04cc-a8f7-f8dfe4323760" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: no networks available" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f59bcb83-a2fb-7100-0ca5-426918f97e11" finished with status "complete"

Of course this is just a quick hack to test the hypothesis and not an actual fix. Are my steps above correct? If so, the problem might be different.
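
As a side note, a standalone Go program (not Nomad code) can be used to cross-check what raw interface data the Go standard library reports on such a host, which is roughly the material any Go-based fingerprinter has to work with:

    package main

    import (
        "fmt"
        "net"
    )

    // Prints every interface with its flags and addresses -- roughly the raw
    // material a Go-based network fingerprinter has available.
    func main() {
        ifaces, err := net.Interfaces()
        if err != nil {
            panic(err)
        }
        for _, iface := range ifaces {
            fmt.Printf("%s (flags: %v)\n", iface.Name, iface.Flags)
            addrs, err := iface.Addrs()
            if err != nil {
                continue
            }
            for _, addr := range addrs {
                fmt.Printf("  %s\n", addr.String())
            }
        }
    }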

@ghost

ghost commented Oct 1, 2015

+1. The Nomad 0.1.0 Docker images do not currently function on CentOS 7 or CoreOS stable (766.4.0); I get a similar error (using example.nomad):

==> Monitoring evaluation "129a2fab-8d5a-4ff6-71ab-e6463c12e854"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "9c141829-63bb-81ae-8267-802a7154c4e4" status "failed" (0/3 nodes filtered)
      * Resources exhausted on 3 nodes
      * Dimension "network: no networks available" exhausted on 3 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "129a2fab-8d5a-4ff6-71ab-e6463c12e854" finished with status "complete"

Installing net-tools (which provides ifconfig) and setting net.ifnames=0 to rename the interface back to eth0 also makes no difference on CentOS 7. Nomad server/agent log:

Sep 30 23:19:45 nomad.example.org nomad[2114]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Sep 30 23:19:45 nomad.example.org nomad[2114]: ==> Starting Nomad agent...
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:47 [ERR] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
Sep 30 23:19:47 nomad.example.org nomad[2114]: ==> Nomad agent configuration:
Sep 30 23:19:47 nomad.example.org nomad[2114]: Atlas: <disabled>
Sep 30 23:19:47 nomad.example.org nomad[2114]: Client: true
Sep 30 23:19:47 nomad.example.org nomad[2114]: Log Level: INFO
Sep 30 23:19:47 nomad.example.org nomad[2114]: Region: global (DC: dc1)
Sep 30 23:19:47 nomad.example.org nomad[2114]: Server: true
Sep 30 23:19:47 nomad.example.org nomad[2114]: ==> Nomad agent started! Log data will stream in below:
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] serf: EventMemberJoin: nomad.example.org.global 10.42.0.60
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] nomad: starting 1 scheduling worker(s) for [batch service _core]
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] client: using state directory /tmp/nomad/client
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] client: using alloc directory /tmp/nomad/alloc
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] raft: Node at 10.42.0.60:4647 [Follower] entering Follower state
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [WARN] serf: Failed to re-join any previously known node
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] nomad: adding server nomad.example.org.global (Addr: 10.42.0.60:4647) (DC: dc1)
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [ERR] fingerprint.network: Error calling ifconfig (/usr/sbin/ifconfig): %!s(<nil>)
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [WARN] raft: Heartbeat timeout reached, starting election
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Node at 10.42.0.60:4647 [Candidate] entering Candidate state
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Election won. Tally: 1
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Node at 10.42.0.60:4647 [Leader] entering Leader state
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] nomad: cluster leadership acquired
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Disabling EnableSingleNode (bootstrap)

The interesting line:

Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [ERR] fingerprint.network: Error calling ifconfig (/usr/sbin/ifconfig): %!s(<nil>)
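
The "%!s(<nil>)" suffix is Go's fmt package reporting that a nil value was formatted with a %s verb, meaning the code logged an error message without an actual error value attached, so the underlying failure detail is lost. A minimal, standalone reproduction of that formatting behaviour:

    package main

    import "fmt"

    func main() {
        var err error // nil error value
        // Formatting a nil interface with %s produces "%!s(<nil>)", which is
        // exactly the suffix seen in the fingerprint.network log line above.
        fmt.Printf("[ERR] fingerprint.network: Error calling ifconfig (/usr/sbin/ifconfig): %s\n", err)
    }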

@cdrage
Contributor

cdrage commented Oct 2, 2015

It seems I am getting the same issue on Debian:

▶ cat /etc/debian_version 
8.1
▶ sudo ./nomad agent -dev
==> Starting Nomad agent...
2015/10/02 05:52:23 [ERR] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2015/10/02 05:52:20 [INFO] serf: EventMemberJoin: wikus.global 127.0.0.1
    2015/10/02 05:52:20 [INFO] nomad: starting 4 scheduling worker(s) for [batch service _core]
    2015/10/02 05:52:20 [INFO] client: using alloc directory /tmp/NomadClient180079229
    2015/10/02 05:52:20 [INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state
    2015/10/02 05:52:20 [INFO] nomad: adding server wikus.global (Addr: 127.0.0.1:4647) (DC: dc1)
    2015/10/02 05:52:21 [ERR] fingerprint.network: Error calling ifconfig (/sbin/ifconfig): %!s(<nil>)
    2015/10/02 05:52:21 [WARN] fingerprint.network: Ethtool output did not match regex
    2015/10/02 05:52:21 [WARN] fingerprint.network: Ethtool not found, checking /sys/net speed file
    2015/10/02 05:52:22 [WARN] raft: Heartbeat timeout reached, starting election
    2015/10/02 05:52:22 [INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
    2015/10/02 05:52:22 [DEBUG] raft: Votes needed: 1
    2015/10/02 05:52:22 [DEBUG] raft: Vote granted. Tally: 1
    2015/10/02 05:52:22 [INFO] raft: Election won. Tally: 1
    2015/10/02 05:52:22 [INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
    2015/10/02 05:52:22 [INFO] raft: Disabling EnableSingleNode (bootstrap)
    2015/10/02 05:52:22 [DEBUG] raft: Node 127.0.0.1:4647 updated peer set (2): [127.0.0.1:4647]
    2015/10/02 05:52:22 [INFO] nomad: cluster leadership acquired
    2015/10/02 05:52:23 [DEBUG] client: applied fingerprints [arch cpu host memory storage network]
    2015/10/02 05:52:23 [DEBUG] client: available drivers [exec java qemu docker]
    2015/10/02 05:52:23 [DEBUG] client: node registration complete
    2015/10/02 05:52:23 [DEBUG] client: updated allocations at index 1 (0 allocs)
    2015/10/02 05:52:23 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/02 05:52:23 [DEBUG] client: state updated to ready
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/jobs (584.169µs)
    2015/10/02 05:52:26 [DEBUG] worker: dequeued evaluation f6991d44-1e21-ec4e-aedf-81a010e725ff
    2015/10/02 05:52:26 [DEBUG] sched: <Eval 'f6991d44-1e21-ec4e-aedf-81a010e725ff' JobID: 'example'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0)
    2015/10/02 05:52:26 [DEBUG] worker: submitted plan for evaluation f6991d44-1e21-ec4e-aedf-81a010e725ff
    2015/10/02 05:52:26 [DEBUG] sched: <Eval 'f6991d44-1e21-ec4e-aedf-81a010e725ff' JobID: 'example'>: setting status to complete
    2015/10/02 05:52:26 [DEBUG] worker: updated evaluation <Eval 'f6991d44-1e21-ec4e-aedf-81a010e725ff' JobID: 'example'>
    2015/10/02 05:52:26 [DEBUG] worker: ack for evaluation f6991d44-1e21-ec4e-aedf-81a010e725ff
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/evaluation/f6991d44-1e21-ec4e-aedf-81a010e725ff (56.99µs)
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/evaluation/f6991d44-1e21-ec4e-aedf-81a010e725ff/allocations (80.542µs)
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/allocation/25924f64-41c9-9376-a1fe-f88ef6af81af (166.524µs)
▶ sudo ./nomad run example.nomad
==> Monitoring evaluation "f6991d44-1e21-ec4e-aedf-81a010e725ff"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "25924f64-41c9-9376-a1fe-f88ef6af81af" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: no networks available" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f6991d44-1e21-ec4e-aedf-81a010e725ff" finished with status "complete"

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 30, 2022