
Failed to start allocation due to binding to non-existent address #3487

Closed
lovwal opened this issue Nov 2, 2017 · 11 comments

lovwal commented Nov 2, 2017

Nomad version

Nomad v0.7.0

Operating system and Environment details

CentOS 7 3.10.0-514.el7.x86_64

Issue

Related to #3159
After upgrading to v0.7.0, Nomad tries to bind the Docker proxy to a non-existent IPv6 link-local address (this only affected one node in our cluster):

Nov  2 14:34:55 x nomad: 2017/11/02 14:34:55.293399 [WARN] client: failed to start task "x" for alloc "72f48c1c-6e58-449c-d1cc-a6a451465b7f": Failed to start container cc3eda2db8b870293bdb30fe998a5c614d0f82177cf8a01cfc21aef51cfa6432: API error (500): {"message":"driver failed programming external connectivity on endpoint frontend-72f48c1c-6e58-449c-d1cc-a6a451465b7f (01ec695d1e87502807245627b6e073e8561826493c3a2a055beb4a5c4e7fd283): Error starting userland proxy: listen tcp [fe80::24e7:83ff:fe50:bdd1]:21085: bind: invalid argument"}

fe80::24e7:83ff:fe50:bdd1 does not exist on the host.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 fe80::5054:ff:fe60:4642/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN 
    inet6 fe80::42:33ff:febd:3c73/64 scope link 
       valid_lft forever preferred_lft forever

One possible source of the address is Docker's virtual (veth) interfaces, for example:
43: veth874dca5@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP 
    inet6 fe80::943e:e8ff:fe12:f5cb/64 scope link 
       valid_lft forever preferred_lft forever

If you restart Nomad, the Docker containers continue to run through the shutdown and are only cleaned up after startup, which means the Docker veth interfaces are still present for fingerprinting when Nomad starts.
As for example:
vagrant@n4:~$ ip -6 addr
...
16: veth9891048: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::105c:d9ff:fed2:667d/64 scope link 
       valid_lft forever preferred_lft forever
18: veth9e352ad: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::74c7:81ff:fe89:a0e9/64 scope link 
       valid_lft forever preferred_lft forever
20: veth0f9cb10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::2c54:feff:fe2f:dfc4/64 scope link 
       valid_lft forever preferred_lft forever
22: vethe12d485: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::bc54:1aff:fe55:e8f3/64 scope link 
       valid_lft forever preferred_lft forever
vagrant@n4:~$ sudo systemctl restart nomad
vagrant@n4:~$ ip -6 addr
...
16: veth9891048: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::105c:d9ff:fed2:667d/64 scope link 
       valid_lft forever preferred_lft forever
18: veth9e352ad: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::74c7:81ff:fe89:a0e9/64 scope link 
       valid_lft forever preferred_lft forever
20: veth0f9cb10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::2c54:feff:fe2f:dfc4/64 scope link 
       valid_lft forever preferred_lft forever
22: vethe12d485: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::bc54:1aff:fe55:e8f3/64 scope link 
       valid_lft forever preferred_lft forever
vagrant@n4:~$ ip -6 addr
...
28: vethf5ee1e5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::60d6:eaff:fe3f:fa4d/64 scope link 
       valid_lft forever preferred_lft forever
30: veth2d1a267: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::4855:61ff:fec3:66d8/64 scope link 
       valid_lft forever preferred_lft forever
32: veth78a706a: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::307d:64ff:fe2f:4ce8/64 scope link 
       valid_lft forever preferred_lft forever
34: veth152cfdb: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 fe80::5c63:2aff:fe15:b506/64 scope link 
       valid_lft forever preferred_lft forever

The affected node is in drain mode; I can pull further information if required.

Reproduction steps

Restart Nomad while Docker containers are running; it is unclear when or how often the behavior appears.


diptanu commented Nov 2, 2017

@lovwal which interface did Nomad fingerprint after you restarted?


dadgar commented Nov 2, 2017

@lovwal Could you share the client's logs from startup? If you don't have them and you can reproduce, we'd appreciate the logs at DEBUG level.


lovwal commented Nov 3, 2017

@diptanu Is there any way to pull this information from nomad? Can't see it in the logs.
@dadgar Journal logs from shutdown/startup until the drain are attached, at INFO level unfortunately.

nomad.txt


dadgar commented Nov 3, 2017

@lovwal Hmm, unfortunately all of that will be in the DEBUG level logs. The INFO level logs don't provide much in this case. Would it be possible to collect some DEBUG logs?


lovwal commented Nov 3, 2017

@dadgar I'll see if I can get something next week.


lovwal commented Nov 6, 2017

@dadgar As suspected, it fingerprinted the correct interface after the restart. However, it still attempts to start Docker containers with the faulty address. Do note that I removed the drain before the restart, so this is most likely old saved state.
nomad.txt


dadgar commented Nov 7, 2017

@lovwal What is the networking like for this node? It looks like in those logs it picked up an IPv4 address, while in your earlier dumps of the interfaces there were only link-local IPv6 addresses. Is it not possible that when Nomad started the first time you didn't have a routable address, and that the veth pair was valid at the time?


lovwal commented Nov 7, 2017

@dadgar I doubt it (although it's not impossible, of course), since this machine has been online for ~100 days and I had an active ssh session to it.
Either way, I found the root cause. net.Interfaces() from the standard library does not always return interfaces in if_index order, which means veth interfaces can appear before eth0 in the returned list, as for example on the same machine (intf just calls the stdlib function and prints the struct):

$ ./intf 

[]net.Interface{
    net.Interface{Index:7937, MTU:1500, Name:"vethaa51c15", HardwareAddr:net.HardwareAddr{0x4a, 0xdc, 0xea, 0xa0, 0x2c, 0xfa}, Flags:0x13},
    net.Interface{Index:1, MTU:65536, Name:"lo", HardwareAddr:net.HardwareAddr(nil), Flags:0x5},
    net.Interface{Index:2, MTU:1500, Name:"eth0", HardwareAddr:net.HardwareAddr{0x52, 0x54, 0x0, 0x60, 0x46, 0x42}, Flags:0x13},
    net.Interface{Index:3, MTU:1500, Name:"docker0", HardwareAddr:net.HardwareAddr{0x2, 0x42, 0x33, 0xbd, 0x3c, 0x73}, Flags:0x13},
    net.Interface{Index:7951, MTU:1500, Name:"vethec5db7b", HardwareAddr:net.HardwareAddr{0xd6, 0xf5, 0xa5, 0x2e, 0xd0, 0xab}, Flags:0x13},
    net.Interface{Index:7953, MTU:1500, Name:"veth96ad95d", HardwareAddr:net.HardwareAddr{0x9e, 0x21, 0xa, 0xfe, 0xf9, 0x35}, Flags:0x13},
    net.Interface{Index:7955, MTU:1500, Name:"veth0706d56", HardwareAddr:net.HardwareAddr{0x22, 0xec, 0x8f, 0x64, 0xc7, 0x3c}, Flags:0x13}}

The docker0 interface will pass the check here: https://github.com/hashicorp/nomad/blob/master/client/fingerprint/network.go#L204
Because the first interface in the resulting list of valid interfaces is returned, the veth is selected, as it is processed first due to the net.Interfaces() ordering.

In case it is of interest as well, I broke out the network fingerprint module and printed out the return values on the machine:

The resulting list:
[]net.Interface{
    net.Interface{Index:7937, MTU:1500, Name:"vethaa51c15", HardwareAddr:net.HardwareAddr{0x4a, 0xdc, 0xea, 0xa0, 0x2c, 0xfa}, Flags:0x13},
    net.Interface{Index:2, MTU:1500, Name:"eth0", HardwareAddr:net.HardwareAddr{0x52, 0x54, 0x0, 0x60, 0x46, 0x42}, Flags:0x13},
    net.Interface{Index:3, MTU:1500, Name:"docker0", HardwareAddr:net.HardwareAddr{0x2, 0x42, 0x33, 0xbd, 0x3c, 0x73}, Flags:0x13},
    net.Interface{Index:7951, MTU:1500, Name:"vethec5db7b", HardwareAddr:net.HardwareAddr{0xd6, 0xf5, 0xa5, 0x2e, 0xd0, 0xab}, Flags:0x13},
    net.Interface{Index:7953, MTU:1500, Name:"veth96ad95d", HardwareAddr:net.HardwareAddr{0x9e, 0x21, 0xa, 0xfe, 0xf9, 0x35}, Flags:0x13},
    net.Interface{Index:7955, MTU:1500, Name:"veth0706d56", HardwareAddr:net.HardwareAddr{0x22, 0xec, 0x8f, 0x64, 0xc7, 0x3c}, Flags:0x13}}

&{7937 1500 vethaa51c15 4a:dc:ea:a0:2c:fa up|broadcast|multicast} <- returned interface
[fe80::48dc:eaff:fea0:2cfa/64] <nil> <- interface address

dadgar added a commit that referenced this issue Nov 13, 2017
This PR introduces a better interface selection heuristic such that we
select interfaces with globally routable unicast addresses over link
local addresses.

Fixes #3487

dadgar commented Nov 13, 2017

@lovwal PR #3546 should prevent this going forward! It will now prefer interfaces that have global unicast addresses, regardless of the order the interfaces are returned in.


lovwal commented Nov 15, 2017

@dadgar Thanks for the quick fix! Cheers!


github-actions bot commented Dec 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2022