
raft: Allow Join to be called multiple times for the same cluster member #2198

Merged 2 commits on Jul 20, 2017

Conversation

aaronlehmann
Collaborator

If Join is called and the member already exists in the cluster, instead
of returning an error, it will update the node's address with the new
one provided to the RPC.

This will allow managers to update their addresses automatically on
startup, if they were configured to autodetect the addresses.

It will also be possible to manually repeat the "docker swarm join"
command, to specify a different advertise address, or rejoin a cluster
when all known manager IPs have changed.
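The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual swarmkit API: the `Cluster`, `Member`, and `Join` names here are stand-ins for the real raft membership types.

```go
package main

import "fmt"

// Member is a simplified stand-in for a raft cluster member.
type Member struct {
	ID   uint64
	Addr string
}

// Cluster holds members keyed by raft ID.
type Cluster struct {
	members map[uint64]*Member
}

// Join sketches the new behavior: if the member already exists,
// update its address instead of returning an error.
func (c *Cluster) Join(id uint64, addr string) error {
	if m, ok := c.members[id]; ok {
		m.Addr = addr // repeated Join: refresh the advertise address
		return nil
	}
	c.members[id] = &Member{ID: id, Addr: addr}
	return nil
}

func main() {
	c := &Cluster{members: map[uint64]*Member{}}
	c.Join(1, "10.0.0.1:2377")
	c.Join(1, "10.0.0.2:2377") // second Join updates the address
	fmt.Println(c.members[1].Addr)
}
```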

cc @aluzzardi @cyli

@codecov

codecov bot commented May 23, 2017

Codecov Report

Merging #2198 into master will increase coverage by 0.12%.
The diff coverage is 67.44%.

@@            Coverage Diff             @@
##           master    #2198      +/-   ##
==========================================
+ Coverage   61.07%   61.19%   +0.12%     
==========================================
  Files         128      128              
  Lines       20556    20581      +25     
==========================================
+ Hits        12554    12595      +41     
+ Misses       6627     6598      -29     
- Partials     1375     1388      +13

@aaronlehmann
Collaborator Author

@aluzzardi: This will handle the case where the user wants to manually update the advertise address, or rejoin a cluster where all the IPs have changed, as we discussed.

However, for automatic IP address updating, I'm thinking now that it's better to include the advertise address in the ProcessRaftMessageRequest that gets sent with RPCs for normal Raft communications. If a receiving node ever notices an address that doesn't match the one it expected, it will update that address in Raft. I think this is better than calling Join automatically on startup because:

  • It handles the case where the IP address changes after the daemon is already running
  • It will make the code a lot simpler. There would be no need to invoke the Join RPC against different candidate nodes until it succeeds against one - Raft already has logic that sends heartbeat responses or vote requests, and if we include the address in these, we can make sure it gets seen by some other node.
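The mismatch-detection idea in the bullets above might look roughly like this. All names are illustrative; the real ProcessRaftMessageRequest handling in swarmkit differs.

```go
package main

import "fmt"

// peers maps raft node IDs to their last known advertise addresses.
// (Illustrative stand-in for the real membership state.)
var peers = map[uint64]string{2: "10.0.0.2:2377"}

// handleRaftMessage sketches the proposal: every normal raft RPC
// carries the sender's advertise address, and the receiver updates
// its view when the address no longer matches what it expected.
func handleRaftMessage(fromID uint64, advertiseAddr string) (updated bool) {
	if known, ok := peers[fromID]; ok && known != advertiseAddr {
		peers[fromID] = advertiseAddr // record the new address
		return true
	}
	return false
}

func main() {
	fmt.Println(handleRaftMessage(2, "10.0.0.2:2377")) // matches: false
	fmt.Println(handleRaftMessage(2, "10.0.0.9:2377")) // changed: true
}
```

Because heartbeats and vote requests flow constantly, an address change would be noticed by some peer without any explicit Join call.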

@aaronlehmann
Collaborator Author

One potential issue with automatically updating the IP addresses (or encouraging people to script it) is that if someone wrongly copies the /var/lib/docker/swarm directory to another machine, or clones a VM, the two nodes would fight each other for the identity. I'm not sure there's a good solution to this.

@aaronlehmann
Collaborator Author

I've added a commit that avoids skipping Join when a node is already part of a cluster, so the IP can be updated and a current list of peers can be fetched. Note this will require some changes in moby. I'm working on those next.

@aaronlehmann
Collaborator Author

aaronlehmann commented May 24, 2017

I have some moby changes related to this at https://github.com/aaronlehmann/docker/tree/swarmkit-rejoin. It also covers #2199.

I'm questioning whether allowing swarm join while already part of a cluster, to manually update the address, makes sense. The only situation where it would be useful is when all the other managers' IPs have changed while one of the nodes was down. But you would need at least 5 managers to have a situation where these IPs could change incrementally with one node down, and not lose quorum. If quorum was lost because of too many IP changes at once, you would have to use --force-new-cluster.

So now I'm leaning towards reverting the ability to run swarm join to update the address manually in that moby branch. If the changes are so extreme that a node doesn't know of any other nodes, the user should just use --force-new-cluster, since in most of these cases it would be necessary anyway.

@cyli
Contributor

cyli commented May 24, 2017

@aaronlehmann It could be a worker that was down while the other managers changed IPs, or alternatively a worker that went down knowing only one manager, which has since changed its IP. However, it seems less important for workers to be able to rejoin the cluster as-is; spinning up a new node seems fine if it's a worker.

Other than that, I think I agree that making swarm join just change the IP address is not very useful.

@aaronlehmann
Collaborator Author

There's one use case I remembered where repeating swarm join may be useful - when you want to change from a fixed advertise address to an autodetected one, or the other way around.

I tested this use case with the linked moby fork and it seems to work.

So, the question is whether repeated swarm join is worth supporting for this use case, or if we should just tell people to demote the node, take it out of the swarm, and join again as a fresh node.

@cyli
Contributor

cyli commented May 24, 2017

@aaronlehmann Wondering if it'd make sense to put that in node update rather than in swarm join?

@aaronlehmann
Collaborator Author

aaronlehmann commented May 24, 2017 via email

@cyli
Contributor

cyli commented May 24, 2017

@aaronlehmann I was thinking that if you call it on another manager, you could either specify the address for that node, or tell the cluster to pick the address that the node is connecting from?

@cyli
Contributor

cyli commented May 24, 2017

Sorry, I don't mean to bikeshed this PR - I'm fine with not supporting it at all as well. If we did want to support it, "updating" the address of the node makes more sense than re-joining the cluster as an operation, but I don't feel very strongly.

@aaronlehmann
Collaborator Author

Using node update is a little more complex because node update submits a new NodeSpec. We would need to add address information to the NodeSpec and add a reconciliation loop that updates the Raft member list based on the NodeSpec.

This would be in addition to supporting some other mechanism (such as repeated Join calls) for a node to update its own address when it changes (see #2199), unless we're okay with a node automatically issuing a request to the controlapi to change its own spec.

Despite being more technically complex, node update may make more sense from a UX perspective, though.

I'd be curious to hear @aluzzardi's thoughts, since he was the one who suggested allowing swarm join on a node that's already part of a cluster.

@cyli
Contributor

cyli commented May 25, 2017

The code mostly LGTM - I have 2 non-blocking questions though:

  1. Would it be worth trying to join that node one more time in the test, but make sure that if ForceJoin is not set, joining a 3rd time will fail?

  2. Node already has SetAddr, which seems to require that the node is leader in order to set anything: https://github.com/docker/swarmkit/blob/master/manager/state/raft/raft.go#L281.

    There's a note in there that suggests that maybe it should be modified to allow nodes to change their address later. Not sure if this would be a good use case to modify it, since it looks pretty similar to https://github.com/docker/swarmkit/pull/2198/files#diff-d0df97935a320420c7f148954ec53bf4R952?

@aaronlehmann
Collaborator Author

Would it be worth trying to join that node one more time in the test, but make sure that if ForceJoin is not set, joining a 3rd time will fail?

It won't fail in this case; it just won't call Join.

Node already has SetAddr, which seems to require that the node is leader in order to set anything.

That's a good point, but I'm not sure I see a good way to unify them. The use case for SetAddr is when raft is running before a port is bound (not a mode that's currently supported, but this was preliminary work towards it). This only makes sense on a single-node cluster - you wouldn't be able to join an existing cluster without binding a port first.

In theory SetAddr could call Join on the leader if this node is not the leader, but I don't see a use case for that, so I'm hesitant to add the extra code. One tempting thought is that an exposed SetAddr method on Node could replace the (admittedly ugly) ForceJoin flag, and swarm join when already part of a cluster could use this after creating the node, but the semantics aren't quite the same. When you use swarm join, you specify an address to join (which is the remote peer that the Join RPC is invoked on), and SetAddr doesn't have a concept of a remote address. So I guess this boils down to the question of whether the UX to update the address is running swarm join again, or some other command that only updates the address.

@cyli
Contributor

cyli commented May 25, 2017

Ok, thank you for explaining! LGTM if swarm join UX is preferred

@aaronlehmann
Collaborator Author

@aluzzardi: I think if we get this in, it will resolve the remaining problems being seen with e2e promotion/demotion tests.

The old code won't let a node that is still part of the raft cluster join it again. If a node is demoted and removes its state, but is immediately re-promoted before it actually gets removed from the raft cluster, it won't be able to rejoin. With this PR there is no such problem, because repeated joins are allowed.

@aluzzardi
Member

One potential issue with automatically updating the IP addresses (or encouraging people to script it) is that if someone wrongly copies the /var/lib/docker/swarm directory to another machine, or clones a VM, the two nodes would fight each other for the identity. I'm not sure there's a good solution to this.

We had (have?) the same issue with classic Swarm. I believe @wsong had a workaround?

@wsong

wsong commented Jun 13, 2017

In Swarm's case, we were using the engine ID (which I believe is defined in /etc/docker/daemon.json) to differentiate nodes. I made a change to use the engine ID concatenated with the IP address to handle the case where somebody copied /etc/docker/daemon.json onto two different nodes by accident.
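The workaround described above amounts to deriving the node identity from both values, so two hosts that accidentally share a copied engine ID still look distinct. A minimal sketch, with illustrative names:

```go
package main

import "fmt"

// nodeIdentity sketches the classic-Swarm workaround: concatenate the
// engine ID with the node's IP so a copied engine ID alone does not
// produce a duplicate identity. (Names are hypothetical.)
func nodeIdentity(engineID, ip string) string {
	return engineID + "|" + ip
}

func main() {
	a := nodeIdentity("abcd-1234", "10.0.0.1")
	b := nodeIdentity("abcd-1234", "10.0.0.2") // same copied ID, different host
	fmt.Println(a == b)                        // prints "false"
}
```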

@aluzzardi
Member

I think the use cases to support multiple joins are:

  1. Run swarm join at startup regardless. Helps automation
  2. Re-join a manager when all other managers have changed IP
  3. Re-join a worker when all other workers have changed IP
  4. Specify a different advertise and/or listen address

However, for automatic IP address updating, I'm thinking now that it's better to include the advertise address in the ProcessRaftMessageRequest that gets sent with RPCs for normal Raft communications. If a receiving node ever notices an address that doesn't match the one it expected, it will update that address in Raft.

I think both make sense: user > detection > failure.

If the user passes an address, then we use that one no matter what. Bypass all detection.
If the user doesn't pass an address, perform auto detection.
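The precedence described above (user > detection > failure) could be sketched like this; the function and parameter names are illustrative, not the real swarmkit API.

```go
package main

import "fmt"

// resolveAdvertiseAddr sketches the proposed precedence: an address
// explicitly passed by the user always wins and bypasses detection;
// otherwise fall back to autodetection.
func resolveAdvertiseAddr(userAddr string, detect func() string) string {
	if userAddr != "" {
		return userAddr // user-specified: bypass all detection
	}
	return detect()
}

func main() {
	detected := func() string { return "192.168.1.5:2377" }
	fmt.Println(resolveAdvertiseAddr("10.0.0.1:2377", detected)) // user wins
	fmt.Println(resolveAdvertiseAddr("", detected))              // autodetect
}
```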

Thoughts?

/cc @binman-docker @dhiltgen @wsong

@aaronlehmann
Collaborator Author

Thoughts?

This sounds good to me if it's the approach people are most comfortable with. I have some older thoughts here: #2198 (comment) #2198 (comment)

Basically, of the use cases above, (1) doesn't seem very important to me, (2) is an extreme case that might require force-new-cluster anyway, and (3) is easy to fix with a leave/join cycle, since temporarily removing a worker is not disruptive (I assume this means "all other managers have changed IP"). So (4), specifying a different fixed advertise address or listen address, or turning on or off address autodetection, is the only use case that I see as valuable for repeated swarm joins. And my question is whether this is important/common enough to justify the UX change, or if we should just tell people to remove the node from the cluster and add it again in that situation.

@aaronlehmann
Collaborator Author

ping @aluzzardi any thoughts on the above?

@aluzzardi
Member

aluzzardi commented Jun 26, 2017

@binman-docker @jakegdocker @thaJeztah: Any thoughts on the UX?

@binman-docker

I like the behavior and the option; I can see it being handy in some environments (especially changing the IP without removing the node from the Swarm, if people don't want to churn the containers).

For our use case it's not really relevant: IPs don't change for the lifetime of the instance, we just replace hosts, and they only join once. I'm not sure about the cases where a node would need to rejoin a cluster after all of the other IPs changed - is their change of IP not something that would have been communicated to the node as part of the normal gossip? Or are we targeting the case where those other nodes lost contact and could not gossip the change of IP?

@aaronlehmann
Collaborator Author

Gossip is only used for networking. The use case here is for a node to find the managers when it doesn't know any current manager IP addresses.

@aaronlehmann
Collaborator Author

I think we should move forward with this PR. It fixes some actual problems; for example, see #2196 (comment). The UX discussion is more related to #2199, so we shouldn't let it block this PR, which is purely internals and doesn't change UX.

The necessary changes in moby/moby have already been merged some time ago in moby/moby#33361.

@cyli
Contributor

cyli commented Jul 12, 2017

This still LGTM, although since it's been a while, would it be worth trying to rebase + CI to make sure there are no semantic conflicts?

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann
Collaborator Author

This still LGTM, although since it's been a while, would it be worth trying to rebase + CI to make sure there are no semantic conflicts?

Rebased
