
Nomad agent ignores retry-join server address and uses consul discovery instead on boot. #11404

Closed · YafimK opened this issue Oct 28, 2021 · 3 comments · Fixed by #11585


YafimK commented Oct 28, 2021

Nomad version

Client - Nomad v1.1.0 (2678c36)
Server -

Operating system and Environment details

Linux (Ubuntu focal) - DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
Linux 127 5.4.0-89-generic #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

On boot, the Nomad client agent tries to connect to servers and uses the Consul-discovered servers with their private addresses instead of the advertised public IPs.
The Nomad servers use a public IP for the RPC interface, while Serf and HTTP are exposed via a private IP (the Nomad servers and clients don't reside in the same cloud provider).
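For illustration, a server advertise stanza matching this setup could look roughly like the following (a sketch with placeholder addresses borrowed from the logs below, not our exact configuration):

advertise {
  http = "10.0.241.2" # private address, reachable only inside the servers' network
  rpc  = "1.2.3.1"    # public address that clients in the other cloud should use
  serf = "10.0.241.2" # private address used for server-to-server gossip
}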

We have almost been able to work around this by restarting both Nomad and Consul, and also by putting the following in our Nomad client configuration:

consul {
  client_auto_join = false
}
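Put together with the static server list, the relevant client configuration looks roughly like this (a sketch using the public addresses from the logs below, not our verbatim nomad.hcl):

client {
  enabled = true

  # Static list of server addresses used by retry join on the client.
  server_join {
    retry_join = ["1.2.3.1", "1.2.3.2", "1.2.3.3"]
  }
}

# Workaround: disable Consul-based server discovery on the client.
consul {
  client_auto_join = false
}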

Reproduction steps

Set up a Nomad cluster as described above, where the Nomad client agent is only reachable by the Nomad servers via public addresses, and vice versa.

Expected Result

The Nomad client agent should use the statically defined servers block in the nomad.hcl config file and not use Consul discovery, or at the very least use the publicly advertised RPC address of the Nomad server rather than the private one.

Actual Result

The Nomad client agent gets the wrong IPs for the Nomad servers (see logs below).

Nomad Client logs (if appropriate)

We see the following in the Nomad agent logs (1.2.3.1-3 are the public addresses):

Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.131Z [WARN] client.server_mgr: no servers available
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.131Z [WARN] client.server_mgr: no servers available
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.131Z [INFO] agent.joiner: starting retry join: servers="1.2.3.1 1.2.3.2 1.2.3.3"
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.445Z [INFO] agent.joiner: retry join completed: initial_servers=3 agent_mode=client
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.563Z [INFO] client.consul: discovered following servers: servers=[10.0.241.2:4647, 10.0.4.1:4647, 10.0.5.1:4647]
Oct 28 11:22:28 127 nomad[859]: 2021-10-28T11:22:28.059Z [INFO] client.fingerprint_mgr.consul: consul agent is available

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 8, 2021

tgross commented Nov 8, 2021

Hi @YafimK!

I dug into the client code a bit and although client_auto_join defaults to true and does appear to control the initial discovery (see client/client.go#L507-L514), there are a few other cases where clients can trigger it as well.

  • (1) If clients miss a heartbeat they'll trigger Consul discovery. We're missing a conditional there to check for the client_auto_join config value. So that's a bug for sure. But if we're hitting that case I would expect to see logs about the failed heartbeat, which I don't see in the logs you've provided.
  • (2) If the Node.UpdateStatus RPC fails, we trigger Consul discovery and we're missing the check for the config value there too.
  • (3) It happens again a bit after that RPC if there's no Nomad leader, again without checking the config value.

Those behaviors aren't captured by client_auto_join today, but from my reading of the documentation I would expect this config value to cover them as well:

Specifies if the Nomad clients should automatically discover servers in the same region by searching for the Consul service name defined in the server_service_name option. The search occurs if the client is not registered with any servers or it is unable to heartbeat to the leader of the region, in which case it may be partitioned and searches for other servers.

So in the scenario you've described my hunch is that the client can't reach the server and the Node.UpdateStatus RPC is timing out after 5 sec, and then it's trying to find a server and falls back to Consul as described in (2) above. We'll probably want to fix the configuration issue by moving or copying the config check into the consulDiscoveryImpl function so that no one can miss it in the future. Or maybe we can move the check before we send on the channel in triggerDiscovery and then just never spin up the goroutine that does that work.

That being said, can you share the full client and server configuration (redacted as needed)? In the meantime there may be a workaround for the public/private IP that's getting advertised to Consul at least.


tgross commented Nov 29, 2021

I've opened #11585 with the fix for this.

tgross added this to the 1.2.3 milestone Nov 29, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Nov 30, 2021
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022