
Nomad agent ignores retry-join server address and uses consul discovery instead on boot. #11404

Closed · YafimK opened this issue Oct 28, 2021 · 3 comments · Fixed by #11585


YafimK commented Oct 28, 2021

Nomad version

Client - Nomad v1.1.0 (2678c36)
Server -

Operating system and Environment details

Linux (Ubuntu focal) - DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
Linux 127 5.4.0-89-generic #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

On boot, the Nomad client agent tries to connect to servers and uses the Consul-discovered servers with their private addresses instead of the advertised public IPs.
The Nomad servers use a public IP for the RPC interface, while Serf and HTTP are exposed via a private IP (the Nomad servers and clients don't reside in the same cloud provider).
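For illustration, a server advertise stanza matching this setup could look roughly like the following (a sketch with placeholder addresses borrowed from the logs below, not our exact configuration):

advertise {
  http = "10.0.241.2" # private address, reachable only inside the servers' network
  rpc  = "1.2.3.1"    # public address that clients in the other cloud should use
  serf = "10.0.241.2" # private address used for server-to-server gossip
}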

We have almost been able to work around this by restarting both Nomad and Consul, and also by putting the following in our Nomad client configuration:

consul {
  client_auto_join = false
}
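Put together with the static server list, the relevant client configuration looks roughly like this (a sketch using the public addresses from the logs below, not our verbatim nomad.hcl):

client {
  enabled = true

  # Static list of server addresses used by retry join on the client.
  server_join {
    retry_join = ["1.2.3.1", "1.2.3.2", "1.2.3.3"]
  }
}

# Workaround: disable Consul-based server discovery on the client.
consul {
  client_auto_join = false
}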

Reproduction steps

Set up a Nomad cluster as described above, where the Nomad client agent is only reachable by the Nomad servers via public addresses, and vice versa.

Expected Result

The Nomad client agent should use the statically defined servers block in the nomad.hcl config file and not use Consul discovery, or at the very least use the publicly advertised RPC address of the Nomad server rather than the private one.

Actual Result

The Nomad client agent gets the wrong IPs for the Nomad servers (see logs below).

Nomad Client logs (if appropriate)

We see the following in the Nomad agent logs (1.2.3.1-3 are the public addresses):

Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.131Z [WARN] client.server_mgr: no servers available
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.131Z [WARN] client.server_mgr: no servers available
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.131Z [INFO] agent.joiner: starting retry join: servers="1.2.3.1 1.2.3.2 1.2.3.3"
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.445Z [INFO] agent.joiner: retry join completed: initial_servers=3 agent_mode=client
Oct 28 11:22:21 127 nomad[859]: 2021-10-28T11:22:21.563Z [INFO] client.consul: discovered following servers: servers=[10.0.241.2:4647, 10.0.4.1:4647, 10.0.5.1:4647]
Oct 28 11:22:28 127 nomad[859]: 2021-10-28T11:22:28.059Z [INFO] client.fingerprint_mgr.consul: consul agent is available

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 8, 2021

tgross commented Nov 8, 2021

Hi @YafimK!

I dug into the client code a bit and although client_auto_join defaults to true and does appear to control the initial discovery (see client/client.go#L507-L514), there are a few other cases where clients can trigger it as well.

  • (1) If clients miss a heartbeat they'll trigger Consul discovery. We're missing a conditional there to check for the client_auto_join config value. So that's a bug for sure. But if we're hitting that case I would expect to see logs about the failed heartbeat, which I don't see in the logs you've provided.
  • (2) If the Node.UpdateStatus RPC fails, we trigger Consul discovery and we're missing the check for the config value there too.
  • (3) It happens again a bit after that RPC if there's no Nomad leader, again without checking the config value.

Those behaviors aren't captured by client_auto_join today, but from my reading of the documentation I would expect this config value to cover them as well:

Specifies if the Nomad clients should automatically discover servers in the same region by searching for the Consul service name defined in the server_service_name option. The search occurs if the client is not registered with any servers or it is unable to heartbeat to the leader of the region, in which case it may be partitioned and searches for other servers.

So in the scenario you've described my hunch is that the client can't reach the server and the Node.UpdateStatus RPC is timing out after 5 sec, and then it's trying to find a server and falls back to Consul as described in (2) above. We'll probably want to fix the configuration issue by moving or copying the config check into the consulDiscoveryImpl function so that no one can miss it in the future. Or maybe we can move the check before we send on the channel in triggerDiscovery and then just never spin up the goroutine that does that work.

That being said, can you share the full client and server configuration (redacted as needed)? In the meantime there may be a workaround for the public/private IP that's getting advertised to Consul at least.


tgross commented Nov 29, 2021

I've opened #11585 with the fix for this.

tgross added this to the 1.2.3 milestone Nov 29, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Nov 30, 2021
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022