
Nomad v0.6.0-dev arm32v7 "network: no networks available" #3005

Closed
minusdelta opened this issue Aug 10, 2017 · 12 comments

@minusdelta

Nomad version

Nomad v0.6.0-dev c075349, from #2963 (comment)

Operating system and Environment details

  • Ubuntu 16.04.3 LTS
  • test setup, 1 server (amd64), 1 client (arm32v7)

Issue

nomad plan http-test.job

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "http" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "network: no networks available" exhausted on 1 nodes

Verifying via the node API:

curl -s 127.0.0.1:4646/v1/node/0d7d067d-2abf-44a9-87b4-f4461ae79061 |jq -rM .Resources
{
  "CPU": 5472,
  "MemoryMB": 1985,
  "DiskMB": 27064,
  "IOPS": 0,
  "Networks": []
}

but on the client:

ip -o l sh eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP ...

ip -o -4 a sh eth0
2: eth0    inet 169.254.155.20/16 brd 169.254.255.255 scope global eth0 ...

ethtool eth0 |tail -9
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 0
Transceiver: external
Auto-negotiation: on
Current message level: 0x00000000 (0)
Link detected: yes

cat /sys/class/net/eth0/speed
1000

Even setting the link speed explicitly doesn't help:

/etc/nomad.d/client.hcl

client {
  enabled       = true
  servers       = [ "nomad.service.consul:4647" ]
  network_interface = "eth0"
  network_speed = 100
}
@dadgar
Contributor

dadgar commented Aug 10, 2017

Can you set your client's log level to DEBUG, start it up, and provide the logs?

@minusdelta
Author

@dadgar, sorry, I forgot those, but there's nothing important there (to me). Here is a replay:

     Loaded configuration from /etc/nomad.d/client.hcl
 ==> Starting Nomad agent...
 ==> Nomad agent configuration:
                 Client: true
              Log Level: DEBUG
                 Region: global (DC: dc1)
                 Server: false
                Version: 0.6.0dev
 ==> Nomad agent started! Log data will stream in below:
     2017/08/11 10:51:40.015429 [INFO] client: using state directory /var/lib/nomad/client
     2017/08/11 10:51:40.016443 [INFO] client: using alloc directory /var/lib/nomad/alloc
     2017/08/11 10:51:40.036089 [DEBUG] client: built-in fingerprints: [arch cgroup consul cpu host memory network nomad signal storage vault env_aws env_gc
     2017/08/11 10:51:40.037582 [INFO] fingerprint.cgroups: cgroups are available
     2017/08/11 10:51:40.038201 [DEBUG] client: fingerprinting cgroup every 15s
     2017/08/11 10:51:40.068199 [INFO] fingerprint.consul: consul agent is available
     2017/08/11 10:51:40.068566 [DEBUG] client: fingerprinting consul every 15s
     2017/08/11 10:51:40.072231 [DEBUG] fingerprint.cpu: frequency: 1368 MHz
     2017/08/11 10:51:40.072341 [DEBUG] fingerprint.cpu: core count: 4
     2017/08/11 10:51:40.105728 [DEBUG] fingerprint.network: setting link speed to user configured speed: 100
     2017/08/11 10:51:40.120143 [DEBUG] client: fingerprinting vault every 15s
     2017/08/11 10:51:42.120788 [DEBUG] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
     2017/08/11 10:51:43.266089 [DEBUG] fingerprint.env_gce: Could not read value for attribute "machine-type"
     2017/08/11 10:51:43.266186 [DEBUG] fingerprint.env_gce: Error querying GCE Metadata URL, skipping
     2017/08/11 10:51:43.266323 [DEBUG] client: applied fingerprints [arch cgroup consul cpu host memory network nomad signal storage]
     2017/08/11 10:51:43.267257 [DEBUG] driver.docker: using client connection initialized from environment
     2017/08/11 10:51:43.267510 [DEBUG] client: fingerprinting rkt every 15s
     2017/08/11 10:51:43.286995 [DEBUG] driver.exec: exec driver is enabled
     2017/08/11 10:51:43.287218 [DEBUG] client: available drivers [docker exec]
     2017/08/11 10:51:43.287248 [DEBUG] client: fingerprinting docker every 15s
     2017/08/11 10:51:43.287328 [DEBUG] client: fingerprinting exec every 15s
     2017/08/11 10:51:43.307753 [INFO] client: Node ID "ce94e4d8-c62f-7eba-bf4a-293edca206cc"
     2017/08/11 10:51:43.315813 [DEBUG] client: updated allocations at index 4361 (total 0) (pulled 0) (filtered 0)
     2017/08/11 10:51:43.316628 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
     2017/08/11 10:51:43.345182 [INFO] client: node registration complete
     2017/08/11 10:51:43.345605 [DEBUG] client: periodically checking for node changes at duration 5s
     2017/08/11 10:51:43.568818 [DEBUG] consul.sync: registered 1 services, 1 checks; deregistered 0 services, 0 checks
     2017/08/11 10:51:50.293970 [DEBUG] http: Request /v1/agent/servers (4.350197ms)
     2017/08/11 10:51:51.706383 [DEBUG] client: state updated to ready
     2017/08/11 10:52:00.304277 [DEBUG] http: Request /v1/agent/servers (2.173911ms)
     2017/08/11 10:52:10.316009 [DEBUG] http: Request /v1/agent/servers (3.750804ms)

... the last two lines repeat every 10s.

@minusdelta
Author

OK, a short follow-up [after accepting the fact that Nomad on ARM wants to be babysat]:
going back to v0.5.6 on the client, the job starts up immediately.

@jstoja

jstoja commented Aug 16, 2017

Hello guys,

I'm having the same issue on amd64; the hosts have four Ethernet interfaces in bonding mode. After the upgrade from 0.5.6 to 0.6.0, all the allocs stayed, but if we try to plan new ones, we get the same message:

- WARNING: Failed to place all allocations.
  Task Group "http" (failed to place 1 allocation):
    * Resources exhausted on 5 nodes
    * Dimension "network: no networks available" exhausted on 5 nodes

The logs in debug show the following lines:

[...]
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.448166 [INFO] fingerprint.cgroups: cgroups are available
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.448281 [DEBUG] client: fingerprinting cgroup every 15s
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.449814 [INFO] fingerprint.consul: consul agent is available
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.449957 [DEBUG] client: fingerprinting consul every 15s
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.451236 [DEBUG] fingerprint.cpu: frequency: 2397 MHz
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.451243 [DEBUG] fingerprint.cpu: core count: 16
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.521172 [DEBUG] fingerprint.network: link speed for eth0 set to 1000
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:15.524414 [DEBUG] client: fingerprinting vault every 15s
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:17.524523 [DEBUG] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.524732 [DEBUG] fingerprint.env_gce: Could not read value for attribute "machine-type"
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.524745 [DEBUG] fingerprint.env_gce: Error querying GCE Metadata URL, skipping
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.524761 [DEBUG] client: applied fingerprints [arch cgroup consul cpu host memory network nomad signal storage]
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.566736 [DEBUG] driver.docker: using client connection initialized from environment
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.566836 [DEBUG] client: fingerprinting rkt every 15s
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.569636 [DEBUG] driver.exec: exec driver is enabled
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.569658 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.569668 [DEBUG] client: available drivers [qemu docker exec raw_exec]
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.569724 [DEBUG] client: fingerprinting docker every 15s
Aug 16 10:51:25 lux4 nomad[4004]: 2017/08/16 10:51:19.569769 [DEBUG] client: fingerprinting exec every 15s
[...]

I wasn't having the issue with 0.5.6.
Edit: after downgrading to 0.5.6, the jobs are being scheduled again.

@jstoja

jstoja commented Aug 18, 2017

@minusdelta What was the network interfaces configuration on your side?

@minusdelta
Author

@jstoja Absolutely nothing fancy like a 4x aggregation here ... (abbreviated version):

docker0   Link encap:Ethernet  HWaddr 02:42: ..
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0

eth0      Link encap:Ethernet  HWaddr 02:81: ..
          inet addr:169.254.155.20  Bcast:169.254.255.255  Mask:255.255.0.0

eth0.1816 Link encap:Ethernet  HWaddr 02:81: ..

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0

eth0.1816 is a macvlan interface, therefore it has no IP.

The only special things here are:

  • eth0 with an IPv4LL (link-local) address
  • the first octet of the MAC address means "locally administered"

I observed exactly the same issue with a CoreOS test instance (VM) with only one (v)NIC and no macvlan.

@dadgar I think the "platform-arm" label doesn't fit here anymore, unfortunately.

@nealmchugh

I ran into this problem on Ubuntu 16.04.3 on amd64. I solved it on Nomad 0.6.0 by remembering that I use LXC as well and had a bridge interface. Once I switched from the equivalent of eth0 to br0 in the Nomad config file, all was well.

@dadgar
Contributor

dadgar commented Aug 21, 2017

Anyone have a Vagrant image or Terraform config that brings something up to reproduce this? I don't have any Raspberry Pis lying around.

@minusdelta
Author

@dadgar IMHO most "nomad-on-arm" users (#1693, #2291) aren't on RPi (depending on the model, that could also mean ARMv6) but come from https://www.scaleway.com/baremetal-cloud-servers : ARMv7 = "C1".
Perhaps that could also be a way to reproduce (a 64-bit/ARMv8 offering is there too).

@dadgar
Contributor

dadgar commented Aug 22, 2017

@minusdelta So you are using "C1" and experiencing this?

@angrycub
Contributor

It would seem that this commit might have something to do with it, if it's always a link-local address.

ad00ec8

dadgar added a commit that referenced this issue Aug 23, 2017
This PR changes the fingerprint handling of network interfaces that only
contain link local addresses. The new behavior is to prefer globally
routable addresses and if none are detected, to fall back to link local
addresses if the operator hasn't disallowed it. This gives us pre 0.6
behavior for interfaces with only link local addresses but 0.6+ behavior
for IPv6 interfaces that will always have a link-local address.

Fixes #3005

/cc diptanuc
@github-actions

github-actions bot commented Dec 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2022