
Nomad errors on TCP lookups for retry_join entries when IPv6 DNS resolvers are used #1317

Closed
The-Loeki opened this issue Jun 20, 2016 · 17 comments
Labels
stage/needs-verification (Issue needs verifying it still exists) · theme/ipv6 · type/bug

Comments

@The-Loeki

The-Loeki commented Jun 20, 2016

Nomad version

Nomad v0.4.0

Operating system and Environment details

CentOS 7 / Ubuntu 16.04 up-to-date

Issue

When IPv6 resolvers are used in /etc/resolv.conf and entries are in the retry_join config, Nomad will always report:

nomad[21903]: 2016/06/20 16:25:08 [DEBUG] memberlist: TCP-first lookup failed for 'nomad01.example.com:4648', falling back to UDP: dial tcp: too many colons in address 2a00:f10:ff04:153::53

(2a00:f10:ff04:153::53 is the DNS server in resolv.conf)

Reproduction steps

Set up and use IPv6 resolvers, start a Nomad cluster, and add some servers in the retry_join section.

I'm not entirely sure how, but the first part of the error message seems to stem from
https://github.com/hashicorp/nomad/blob/v0.4.0-rc1/vendor/github.com/hashicorp/memberlist/memberlist.go#L301
and the second part from
https://github.com/golang/go/blob/master/src/net/ipsock.go
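
For illustration, here's a minimal Go sketch of what I suspect is happening (my assumption, not the actual memberlist code): handing an unbracketed IPv6 literal straight to net.Dial fails while splitting host and port, before any connection is even attempted, which would match the "too many colons" part of the message.

package main

import (
    "fmt"
    "net"
)

func main() {
    resolver := "2a00:f10:ff04:153::53" // nameserver from /etc/resolv.conf

    // Dialing the bare IPv6 literal fails in host/port parsing:
    // "too many colons in address".
    if _, err := net.Dial("tcp", resolver); err != nil {
        fmt.Println(err)
    }

    // Bracketing the literal (what net.JoinHostPort does for IPv6 hosts)
    // produces a dialable address.
    fmt.Println(net.JoinHostPort(resolver, "53")) // [2a00:f10:ff04:153::53]:53
}

So my guess is that somewhere in the TCP-first lookup path the resolver address taken from resolv.conf is used without brackets.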

@dadgar
Contributor

dadgar commented Jun 20, 2016

@slackpad An issue resolving IPv6 addresses in memberlist?

@jovandeginste

@The-Loeki we are working on some IPv6 issues, but I can't reproduce this one. Is it still an issue with the current master? We built with Go 1.7rc3 - maybe it's an issue in (older) Go?

@jovandeginste

+@42wim

@The-Loeki
Author

Hi,
I can still reproduce this one using Nomad 0.4.0 on an up-to-date CentOS 7 rig.

Mind the specifics: it's the DNS resolver being an IPv6 address that causes this bug, not the Nomad server or any of its config per se (I got the other bugs for those ;) )

@jovandeginste

Yes, I have a single IPv6 address in resolv.conf (but I'm running 0.4.1-dev)

@The-Loeki
Author

Alrighty then, maybe it got semi-accidentally fixed along with e.g. #1465 et al.?

Unfortunately I have no access to master binaries at this time, but I'll report back on this one as soon as a new release hits the wires.

In the meantime the bug seems of little real consequence; there's a possibility that DNSSEC validation fails if UDP is forced, but other than that I'm more than happy to sit it out till that new release ;)

@jovandeginste

I tried reproducing with this release but it seems to work too ...

@The-Loeki
Author

hmmm... That's odd...
Other specifics I can think of:

  • (of course) we have 2 IPv6 resolvers in /etc/resolv.conf
  • Servers are IPv6-only (no A-records, no IPv4 addresses)
  • We bind on [::1]
  • We advertise like [2a00:f10:123:345:1c2:2d3:fe00:79a]:4647

@jovandeginste

I tried with a Nomad client now (not a server); are you using Consul?

@The-Loeki
Author

The-Loeki commented Jul 26, 2016

You might be on to something here; looking at the logs, it does seem to happen only on server nodes. And yes, same setup (IPv6-only with socket binds).

@jovandeginste

On a server there's also no problem. I now have two IPv6 addresses in resolv.conf, am running 0.4.0, and get this (redacted) log:

2016/07/26 15:17:02 [INFO] raft: Node at [2a02:xxx::997]:4647 [Follower] entering Follower state (Leader: "")
2016/07/26 15:17:02 [WARN] memberlist: Binding to public address without encryption!
2016/07/26 15:17:02 [INFO] serf: EventMemberJoin: server-1.global 2a02:xxx::997
2016/07/26 15:17:02.797868 [INFO] nomad: starting 2 scheduling worker(s) for [system service batch _core]
2016/07/26 15:17:02 [INFO] serf: Attempting re-join to previously known node: server-3.global: [2a02:xxx::1001]:4648
2016/07/26 15:17:02.798561 [INFO] nomad: adding server server-1.global (Addr: [2a02:xxx::997]:4647) (DC: dc1)
2016/07/26 15:17:02 [DEBUG] memberlist: Initiating push/pull sync with: [2a02:xxx::1001]:4648
2016/07/26 15:17:02.799449 [DEBUG] server.consul: lost contact with Nomad quorum, falling back to Consul for server list
2016/07/26 15:17:02 [INFO] serf: EventMemberJoin: server-3.global 2a02:xxx::1001
2016/07/26 15:17:02 [WARN] memberlist: Refuting an alive message
2016/07/26 15:17:02 [INFO] serf: EventMemberJoin: server-2.global 2a02:xxx::999
2016/07/26 15:17:02 [INFO] serf: Re-joined to previously known node: server-3.global: [2a02:xxx::1001]:4648
2016/07/26 15:17:02.800759 [INFO] nomad: adding server server-3.global (Addr: [2a02:xxx::1001]:4647) (DC: dc1)
2016/07/26 15:17:02.800780 [INFO] nomad: adding server server-2.global (Addr: [2a02:xxx::999]:4647) (DC: dc1)
2016/07/26 15:17:02 [DEBUG] memberlist: Failed to join 2a02:xxx::997: dial tcp [2a02:xxx::997]:0: getsockopt: connection refused
2016/07/26 15:17:02 [DEBUG] memberlist: Initiating push/pull sync with: [2a02:xxx::1001]:4648
2016/07/26 15:17:02 [DEBUG] memberlist: Initiating push/pull sync with: [2a02:xxx::999]:4648
2016/07/26 15:17:02.806711 [INFO] server.consul: successfully contacted 2 Nomad Servers
2016/07/26 15:17:03 [DEBUG] raft-net: [2a02:xxx::997]:4647 accepted connection from: [2a02:xxx::1001]:55878
2016/07/26 15:17:03 [DEBUG] raft: Node [2a02:xxx::997]:4647 updated peer set (2): [[2a02:xxx::999]:4647 [2a02:xxx::1001]:4647 [2a02:xxx::997]:4647]
2016/07/26 15:17:03 [DEBUG] raft: Node [2a02:xxx::997]:4647 updated peer set (2): [[2a02:xxx::997]:4647 [2a02:xxx::999]:4647 [2a02:xxx::1001]:4647]
2016/07/26 15:17:03 [DEBUG] raft: Node [2a02:xxx::997]:4647 updated peer set (2): [[2a02:xxx::1001]:4647 [2a02:xxx::997]:4647 [2a02:xxx::999]:4647]
2016/07/26 15:17:03 [DEBUG] serf: messageJoinType: server-1.global
2016/07/26 15:17:03 [DEBUG] serf: messageJoinType: server-1.global
2016/07/26 15:17:03 [DEBUG] raft-net: [2a02:xxx::997]:4647 accepted connection from: [2a02:xxx::1001]:55880
2016/07/26 15:17:03 [DEBUG] serf: messageJoinType: server-1.global
2016/07/26 15:17:03 [DEBUG] serf: messageJoinType: server-1.global

@The-Loeki
Author

This is really odd then...

[root@man02 ~]# nomad --version
Nomad v0.4.0
[root@man02 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search example.com
nameserver 2a00:f10:ff04:153::53
nameserver 2a00:f10:ff04:253::53

[root@man02 nomad]# cat /etc/nomad/* (slightly edited for brevity) 
###
### SALTSTACK MANAGED
###
# https://www.nomadproject.io/docs/agent/config.html

leave_on_interrupt = true
data_dir = "/var/nomad"

### LOCALIZATION
region = "ams"
datacenter = "ams03"

### NETWORK STUFF
bind_addr = "[::]"

# We want HTTP to only listen on localhost
# No sockets are supported yet
# https://github.com/hashicorp/nomad/issues/741
addresses {
    http = "[::1]"
}

advertise {
    # We need to specify our host's IP because we can't
    # advertise 0.0.0.0 to other nodes in our cluster.
    # This should be the IP of THIS MACHINE and must be routable by every node
    # in your cluster

    http = "[::1]:4646"

    rpc = "[2001:db8:123:2:aabb:c12a:5578:345]:4647"
    serf = "[2001:db8:123:2:aabb:c12a:5578:345]:4648"
}

### CONSUL INTEGRATION
consul {
    address = "unix:///var/consul/http.sock"
}

### SERVER STUFF
server {
    enabled = true
    bootstrap_expect = 3
    retry_join = [
      "man01.example.com",
      "man02.example.com",
      "man03.example.com",
    ]
}

"log_level": "DEBUG"

Nomad log:

Jul 28 11:34:52 man02.example.com nomad[3083]: Loaded configuration from /etc/nomad/10-common.hcl, /etc/nomad/50-server.hcl, /etc/nomad/80-from-pillar.json
Jul 28 11:34:52 man02.example.com nomad[3083]: ==> Starting Nomad agent...
Jul 28 11:34:52 man02.example.com nomad[3083]: ==> Nomad agent configuration:
Jul 28 11:34:52 man02.example.com nomad[3083]: Atlas: <disabled>
Jul 28 11:34:52 man02.example.com nomad[3083]: Client: false
Jul 28 11:34:52 man02.example.com nomad[3083]: Log Level: DEBUG
Jul 28 11:34:52 man02.example.com nomad[3083]: Region: ams (DC: ams03)
Jul 28 11:34:52 man02.example.com nomad[3083]: Server: true
Jul 28 11:34:52 man02.example.com nomad[3083]: ==> Nomad agent started! Log data will stream in below:
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [INFO] raft: Restored from snapshot 720-90125-1469304978037
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [INFO] raft: Node at [2001:db8:123:2:aabb:c12a:5578:345]:4647 [Follower] entering Follower state (Leader: "")
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [WARN] memberlist: Binding to public address without encryption!
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [INFO] serf: EventMemberJoin: man02.example.com.ams 2001:db8:123:2:aabb:c12a:5578:345
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52.844232 [INFO] nomad: starting 2 scheduling worker(s) for [system service batch _core]
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52.845473 [INFO] nomad: adding server man02.example.com.ams (Addr: [2001:db8:123:2:aabb:c12a:5578:345]:4647) (DC: ams03)
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52.847206 [INFO] agent: Joining cluster...
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: TCP-first lookup failed for 'man01.example.com:4648', falling back to UDP: dial tcp: too many colons in address 2a00:f10:ff04:153::53
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [INFO] serf: Attempting re-join to previously known node: man03.example.com.ams: [2001:db8:123:2:1f24:a13c:3233:aafb]:4648
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: Initiating push/pull sync with: [2001:db8:123:2:1f24:a13c:3233:aafb]:4648
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52.850227 [DEBUG] server.consul: lost contact with Nomad quorum, falling back to Consul for server list
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [INFO] serf: EventMemberJoin: man01.example.com.ams 2001:db8:123:2:8d12:4daa:1213:81b
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [INFO] serf: EventMemberJoin: man03.example.com.ams 2001:db8:123:2:1f24:a13c:3233:aafb
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [WARN] memberlist: Refuting an alive message
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [INFO] serf: Re-joined to previously known node: man03.example.com.ams: [2001:db8:123:2:1f24:a13c:3233:aafb]:4648
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52.851805 [INFO] nomad: adding server man01.example.com.ams (Addr: [2001:db8:123:2:8d12:4daa:1213:81b]:4647) (DC: ams03)
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52.851823 [INFO] nomad: adding server man03.example.com.ams (Addr: [2001:db8:123:2:1f24:a13c:3233:aafb]:4647) (DC: ams03)
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: Initiating push/pull sync with: [2001:db8:123:2:8d12:4daa:1213:81b]:4648
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: TCP-first lookup failed for 'man02.example.com:4648', falling back to UDP: dial tcp: too many colons in address 2a00:f10:ff04:153::53
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: Initiating push/pull sync with: [2001:db8:123:2:aabb:c12a:5578:345]:4648
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: TCP connection from=[2001:db8:123:2:aabb:c12a:5578:345]:33410
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: TCP-first lookup failed for 'man03.example.com:4648', falling back to UDP: dial tcp: too many colons in address 2a00:f10:ff04:153::53
Jul 28 11:34:52 man02.example.com nomad[3083]: 2016/07/28 11:34:52 [DEBUG] memberlist: Initiating push/pull sync with: [2001:db8:123:2:1f24:a13c:3233:aafb]:4648

What can I do to help?

@jovandeginste

OK, some major differences are:

  • we use consul for discovery, so no retry_join entries
  • we use routable/reachable addresses as advertise, not localhost6

I'm on holiday from today, so it may take a while before I can try to reproduce your setup...

@The-Loeki
Author

The-Loeki commented Jul 28, 2016

Thanks, that was the hint that was needed; the retry_join entries are the culprits.
Removing them from the config removes the error from the log.

I'm not sure I'm following how this works here; the Consul services don't get registered properly as per #1318 ;)

Anyway, we're closer to the source of the bug. I've adjusted the description of this issue to match
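
In case it helps anyone hitting the same thing, an untested workaround sketch (my assumption, I haven't verified it): skip the DNS lookup entirely by putting the servers' bracketed IPv6 literals in retry_join instead of hostnames, e.g.

server {
    enabled          = true
    bootstrap_expect = 3
    # Hypothetical literal-address form of the retry_join list from my config
    # above; bracketed IPv6 literals need no resolver lookup at all.
    retry_join = [
        "[2001:db8:123:2:8d12:4daa:1213:81b]:4648",
        "[2001:db8:123:2:aabb:c12a:5578:345]:4648",
        "[2001:db8:123:2:1f24:a13c:3233:aafb]:4648",
    ]
}

That obviously gives up the hostname-based config, so it's only a stopgap until the TCP-first lookup handles IPv6 resolvers properly.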

@The-Loeki The-Loeki changed the title from "Nomad errors on TCP lookups when IPv6 DNS resolvers are used" to "Nomad errors on TCP lookups for retry_join entries when IPv6 DNS resolvers are used" Jul 28, 2016
@jovandeginste

jovandeginste commented Jul 28, 2016 via email

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross added the stage/needs-verification label Mar 3, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 3, 2021
@mikenomitch
Contributor

Hi, we're closing this out as there hasn't been any action on this ticket in a long time and there have been a lot of changes to Nomad networking since then.

If this is still an issue for anybody, please feel free to open a new Issue and link to this one!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2022