Unable to update list of servers after replacing servers #1590

Closed
mlafeldt opened this issue Aug 15, 2016 · 9 comments

@mlafeldt
Contributor

mlafeldt commented Aug 15, 2016

Nomad version

v0.4.0

Operating system and Environment details

  • CoreOS stable 1068.8.0
  • AWS

Issue

Nomad clients are unable to register with servers configured via nomad client-config -update-servers after replacing all servers.

Reproduction steps

I have a working Nomad cluster consisting of 3 clients and these 3 server nodes:

$ nomad server-members
Name                                            Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
ip-10-8-3-95.eu-west-1.compute.internal.global  10.8.3.95  4648  alive   true    2         0.4.0  eu-west-1   global
ip-10-8-3-96.eu-west-1.compute.internal.global  10.8.3.96  4648  alive   false   2         0.4.0  eu-west-1   global
ip-10-8-4-30.eu-west-1.compute.internal.global  10.8.4.30  4648  alive   false   2         0.4.0  eu-west-1   global

Afterwards, I terminate the 3 server nodes and recreate them from scratch:

$ nomad server-members
Name                                             Address     Port  Status  Leader  Protocol  Build  Datacenter  Region
ip-10-8-3-123.eu-west-1.compute.internal.global  10.8.3.123  4648  alive   true    2         0.4.0  eu-west-1   global
ip-10-8-4-93.eu-west-1.compute.internal.global   10.8.4.93   4648  alive   false   2         0.4.0  eu-west-1   global
ip-10-8-4-94.eu-west-1.compute.internal.global   10.8.4.94   4648  alive   false   2         0.4.0  eu-west-1   global

However, now I cannot get the clients to register with the new servers, even after running nomad client-config -update-servers. In fact, the agent still tries to contact the old/dead server nodes:

$ nomad client-config -update-servers 10.8.3.123:4647 10.8.4.93:4647 10.8.4.94:4647
Updated server list
$ nomad client-config -servers
10.8.3.123:4647
10.8.4.30:4647
10.8.3.96:4647
10.8.4.93:4647
10.8.3.95:4647
10.8.4.94:4647

Nomad Client logs

From what I can see in the client logs, the agent still tries to connect to the old/dead cluster leader:

Aug 15 14:43:38 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:37.889545 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:41 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:40.895560 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:44 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:43.901578 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:47 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:46.907597 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:50 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:49.913596 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:53 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:52.919560 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:53 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:52.919592 [DEBUG] client.rpcproxy: No healthy servers during rebalance, aborting

It appears that the agent does not even attempt to connect to all servers returned by nomad client-config -servers.

Background

We want our infrastructure to be self-healing. While Nomad provides retry_join on the server side, there's no such thing for clients. I know that servers will push the current list of healthy servers to clients. However, this does not work if all server nodes are replaced at once or if the client nodes are bootstrapped before any server. That's why we want to periodically push discovered servers via the /v1/agent/servers endpoint on clients.
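
To make this concrete, here is a minimal sketch of the kind of periodic push we have in mind. The discover-nomad-servers helper is hypothetical and stands in for whatever discovery mechanism is available (e.g. EC2 tags); it is not a real command.

#!/bin/bash
# Sketch only: discover-nomad-servers is a hypothetical helper that prints
# one "ip:4647" address per line for the currently running server nodes.
set -euo pipefail

servers=$(discover-nomad-servers)

if [ -n "$servers" ]; then
    # Intentionally unquoted so each address becomes a separate argument.
    # This pushes the list to the local client agent, i.e. the
    # /v1/agent/servers endpoint mentioned above.
    nomad client-config -update-servers $servers
fi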

/cc @denderello

@mlafeldt mlafeldt changed the title nomad client-config -update-servers fails after replacing servers Unable to update list of servers after replacing servers Aug 15, 2016
@mlafeldt
Contributor Author

PS: Of course, we normally do rolling updates of both our server and client clusters. Having to replace the entire server cluster is still a scenario I'd like to handle (by decoupling both clusters as much as possible).

@dadgar
Contributor

dadgar commented Aug 16, 2016

So it was actually using the updated list; the problem was that the client was not re-registering itself, since the normal path is to register once and then just heartbeat. So when the new servers came up, they were rejecting its heartbeats.

@mlafeldt
Contributor Author

I'm not sure about the internals and what is going wrong. In the logs, I can't see that the new servers are contacted at all. What I can say is that we need to restart the client agent and give it the new server list for it to register successfully. Updating the list via nomad client-config -update-servers does not work here.

@mlafeldt
Contributor Author

For the time being, we managed to decouple deployment of Nomad clients from servers by using a watchdog unit that periodically checks whether there's a valid server among the list reported by nomad client-config -servers. If not, we update the client configuration file and restart the agent.
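
Roughly, the check looks like this (a sketch, not our exact unit; the config-writing helper and service name are assumptions):

#!/bin/bash
# Watchdog sketch: treat the client as healthy if at least one server in its
# current list accepts a TCP connection on the RPC port.
healthy=0
for addr in $(nomad client-config -servers); do
    host=${addr%:*}
    port=${addr##*:}
    if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
        healthy=1
        break
    fi
done

if [ "$healthy" -eq 0 ]; then
    # No reachable server left: rewrite the client config with freshly
    # discovered servers and restart the agent.
    /opt/bin/write-nomad-client-config   # hypothetical helper
    systemctl restart nomad-client.service
fi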

I still think that nomad client-config -update-servers should support this use case, so that people aren't forced to use Consul.

@dadgar
Contributor

dadgar commented Aug 16, 2016

Are you saying this after the PR I opened?

@mlafeldt
Contributor Author

mlafeldt commented Aug 16, 2016

Ah! Totally missed that one. Thanks.

I can run some tests with the PR on our cluster. Just need to add a way to roll out custom builds. Is this ready for testing?

(We're not going to install non-released Nomad builds in production, so the watchdog workaround will still be required for some time.)

@dadgar
Contributor

dadgar commented Aug 16, 2016

Yeah, it is ready! This will fix the case of having to restart the client if all the servers are rolled. But if the client gets a heartbeat, the set of servers there will override what was set via the CLI.

Along those lines, some additional work needs to be done to make the update set the list of servers rather than append to it.

@mlafeldt
Contributor Author

@dadgar I'm happy to report that your fix actually works for us.

The test scenario:

  • Have a working server cluster
  • Bootstrap a client cluster with custom Nomad version
  • Re-create server cluster from scratch
  • Run nomad client-config -update-servers to tell clients about new nodes

After the heartbeat, the clients successfully re-registered with the servers and showed up in nomad node-status as well.

With this fix in hand, we're able to use a systemd timer that periodically pushes discovered servers via nomad client-config -update-servers as a fallback mechanism to the initial discovery on boot-up.
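
For anyone wanting to replicate this, here is a sketch of such a timer setup; the unit names and the push-nomad-servers script are illustrative, not our exact units.

# nomad-update-servers.service (illustrative name)
[Unit]
Description=Push discovered Nomad servers to the local client agent

[Service]
Type=oneshot
# Hypothetical script that discovers the current server addresses and runs
# `nomad client-config -update-servers <addr>...` against the local agent.
ExecStart=/opt/bin/push-nomad-servers

# nomad-update-servers.timer (illustrative name)
[Unit]
Description=Periodically refresh the Nomad client's server list

[Timer]
OnBootSec=1min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target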

Thanks!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022