
raft: Allow Join to be called multiple times for the same cluster member #2198

Merged 2 commits on Jul 20, 2017

Conversation

aaronlehmann
Collaborator

If Join is called and the member already exists in the cluster, instead
of returning an error, it will update the node's address with the new
one provided to the RPC.

This will allow managers to update their addresses automatically on
startup, if they were configured to autodetect the addresses.

It will also be possible to manually repeat the "docker swarm join"
command, to specify a different advertise address, or rejoin a cluster
when all known manager IPs have changed.
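The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual swarmkit API: the `Cluster`, `Member`, and `Join` names here are stand-ins for the real raft membership types.

```go
package main

import "fmt"

// Member is a simplified stand-in for a raft cluster member.
type Member struct {
	ID   uint64
	Addr string
}

// Cluster holds members keyed by raft ID.
type Cluster struct {
	members map[uint64]*Member
}

// Join sketches the new behavior: if the member already exists,
// update its address instead of returning an error.
func (c *Cluster) Join(id uint64, addr string) error {
	if m, ok := c.members[id]; ok {
		m.Addr = addr // repeated Join: refresh the advertise address
		return nil
	}
	c.members[id] = &Member{ID: id, Addr: addr}
	return nil
}

func main() {
	c := &Cluster{members: map[uint64]*Member{}}
	c.Join(1, "10.0.0.1:2377")
	c.Join(1, "10.0.0.2:2377") // second Join updates the address
	fmt.Println(c.members[1].Addr)
}
```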

cc @aluzzardi @cyli

@codecov

codecov bot commented May 23, 2017

Codecov Report

Merging #2198 into master will increase coverage by 0.12%.
The diff coverage is 67.44%.

@@            Coverage Diff             @@
##           master    #2198      +/-   ##
==========================================
+ Coverage   61.07%   61.19%   +0.12%     
==========================================
  Files         128      128              
  Lines       20556    20581      +25     
==========================================
+ Hits        12554    12595      +41     
+ Misses       6627     6598      -29     
- Partials     1375     1388      +13

@aaronlehmann
Collaborator Author

@aluzzardi: This will handle the case where the user wants to manually update the advertise address, or rejoin a cluster where all the IPs have changed, as we discussed.

However, for automatic IP address updating, I'm thinking now that it's better to include the advertise address in the ProcessRaftMessageRequest that gets sent with RPCs for normal Raft communications. If a receiving node ever notices an address that doesn't match the one it expected, it will update that address in Raft. I think this is better than calling Join automatically on startup because:

  • It handles the case where the IP address changes after the daemon is already running
  • It will make the code a lot simpler. There would be no need to invoke the Join RPC against different candidate nodes until it succeeds against one - Raft already has logic that sends heartbeat responses or vote requests, and if we include the address in these, we can make sure it gets seen by some other node.
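The mismatch-detection idea in the bullets above might look roughly like this. All names are illustrative; the real ProcessRaftMessageRequest handling in swarmkit differs.

```go
package main

import "fmt"

// peers maps raft node IDs to their last known advertise addresses.
// (Illustrative stand-in for the real membership state.)
var peers = map[uint64]string{2: "10.0.0.2:2377"}

// handleRaftMessage sketches the proposal: every normal raft RPC
// carries the sender's advertise address, and the receiver updates
// its view when the address no longer matches what it expected.
func handleRaftMessage(fromID uint64, advertiseAddr string) (updated bool) {
	if known, ok := peers[fromID]; ok && known != advertiseAddr {
		peers[fromID] = advertiseAddr // record the new address
		return true
	}
	return false
}

func main() {
	fmt.Println(handleRaftMessage(2, "10.0.0.2:2377")) // matches: false
	fmt.Println(handleRaftMessage(2, "10.0.0.9:2377")) // changed: true
}
```

Because heartbeats and vote requests flow constantly, an address change would be noticed by some peer without any explicit Join call.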

@aaronlehmann
Collaborator Author

One potential issue with automatically updating the IP addresses (or encouraging people to script it) is that if someone wrongly copies the /var/lib/docker/swarm directory to another machine, or clones a VM, the two nodes would fight each other for the identity. I'm not sure there's a good solution to this.

@aaronlehmann
Collaborator Author

I've added a commit that avoids skipping Join when a node is already part of a cluster, so the IP can be updated and a current list of peers can be fetched. Note this will require some changes in moby. I'm working on those next.

@aaronlehmann
Collaborator Author

aaronlehmann commented May 24, 2017

I have some moby changes related to this at https://github.com/aaronlehmann/docker/tree/swarmkit-rejoin. It also covers #2199.

I'm questioning whether allowing swarm join while already part of a cluster, to manually update the address, makes sense. The only situation where it would be useful is when all the other managers' IPs have changed while one of the nodes was down. But you would need at least 5 managers to have a situation where these IPs could change incrementally with one node down, and not lose quorum. If quorum was lost because of too many IP changes at once, you would have to use --force-new-cluster.

So now I'm leaning towards reverting the ability to run swarm join to update the address manually in that moby branch. If the changes are so extreme that a node doesn't know of any other nodes, the user should just use --force-new-cluster, since in most of these cases it would be necessary anyway.

@cyli
Contributor

cyli commented May 24, 2017

@aaronlehmann It could be a worker that was down while the other managers changed IPs, or alternatively a worker that went down knowing only one manager, which has since changed its IP. However, it seems less important for workers to be able to rejoin the cluster as-is; spinning up a new node seems fine if it's a worker.

Other than that, I think I agree that making swarm join just change the IP address is not very useful.

@aaronlehmann
Collaborator Author

There's one use case I remembered where repeating swarm join may be useful - when you want to change from a fixed advertise address to an autodetected one, or the other way around.

I tested this use case with the linked moby fork and it seems to work.

So, the question is whether repeated swarm join is worth supporting for this use case, or if we should just tell people to demote the node, take it out of the swarm, and join again as a fresh node.

@cyli
Contributor

cyli commented May 24, 2017

@aaronlehmann Wondering if it'd make sense to put that in node update rather than in swarm join?

@aaronlehmann
Collaborator Author

aaronlehmann commented May 24, 2017 via email

@cyli
Contributor

cyli commented May 24, 2017

@aaronlehmann I was thinking that if you call it on another manager, you could either specify the address for that node, or tell the cluster to pick the address that the node is connecting from?

@cyli
Contributor

cyli commented May 24, 2017

Sorry, I don't mean to bikeshed this PR - I'm fine with not supporting it at all as well. If we did want to support it, "updating" the address of the node makes more sense than re-joining the cluster as an operation, but I don't feel very strongly.

@aaronlehmann
Collaborator Author

Using node update is a little more complex because node update submits a new NodeSpec. We would need to add address information to the NodeSpec and add a reconciliation loop that updates the Raft member list based on the NodeSpec.

This would be in addition to supporting some other mechanism (such as repeated Join calls) for a node to update its own address when it changes (see #2199), unless we're okay with a node automatically issuing a request to the controlapi to change its own spec.

Despite being more technically complex, node update may make more sense from a UX perspective, though.

I'd be curious to hear @aluzzardi's thoughts, since he was the one who suggested allowing swarm join on a node that's already part of a cluster.

@cyli
Contributor

cyli commented May 25, 2017

The code mostly LGTM - I have 2 non-blocking questions though:

  1. Would it be worth trying to join that node one more time in the test, but make sure that if ForceJoin is not set, joining a 3rd time will fail?

  2. Node already has SetAddr, which seems to require that the node is leader in order to set anything: https://github.com/docker/swarmkit/blob/master/manager/state/raft/raft.go#L281.

    There's a note in there that suggests that maybe it should be modified to allow nodes to change their address later. Not sure if this would be a good use case to modify it, since it looks pretty similar to https://github.com/docker/swarmkit/pull/2198/files#diff-d0df97935a320420c7f148954ec53bf4R952?

@aaronlehmann
Collaborator Author

Would it be worth trying to join that node one more time in the test, but make sure that if ForceJoin is not set, joining a 3rd time will fail?

It won't fail in this case; it just won't call Join.

Node already has SetAddr, which seems to require that the node is leader in order to set anything.

That's a good point, but I'm not sure I see a good way to unify them. The use case for SetAddr is when raft is running before a port is bound (not a mode that's currently supported, but this was preliminary work towards it). This only makes sense on a single-node cluster - you wouldn't be able to join an existing cluster without binding a port first.

In theory SetAddr could call Join on the leader if this node is not the leader, but I don't see a use case for that, so I'm hesitant to add the extra code. One tempting thought is that an exposed SetAddr method on Node could replace the (admittedly ugly) ForceJoin flag, and swarm join when already part of a cluster could use this after creating the node, but the semantics aren't quite the same. When you use swarm join, you specify an address to join (which is the remote peer that the Join RPC is invoked on), and SetAddr doesn't have a concept of a remote address. So I guess this boils down to the question of whether the UX to update the address is running swarm join again, or some other command that only updates the address.

@cyli
Contributor

cyli commented May 25, 2017

Ok, thank you for explaining! LGTM if swarm join UX is preferred

@aaronlehmann
Collaborator Author

@aluzzardi: I think if we get this in, it will resolve the remaining problems being seen with e2e promotion/demotion tests.

The old code won't let a node that is still part of the raft cluster join it again. If a node is demoted and removes its state, but is immediately re-promoted before it actually gets removed from the raft cluster, it won't be able to rejoin. With this PR there is no such problem, because repeated joins are allowed.

@aluzzardi
Member

One potential issue with automatically updating the IP addresses (or encouraging people to script it) is that if someone wrongly copies the /var/lib/docker/swarm directory to another machine, or clones a VM, the two nodes would fight each other for the identity. I'm not sure there's a good solution to this.

We had (have?) the same issue with classic Swarm. I believe @wsong had a workaround?

@wsong

wsong commented Jun 13, 2017

In Swarm's case, we were using the engine ID (which I believe is defined in /etc/docker/daemon.json) to differentiate nodes. I made a change to use the engine ID concatenated with the IP address to handle the case where somebody copied /etc/docker/daemon.json onto two different nodes by accident.
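The workaround described above amounts to deriving the node identity from both values, so two hosts that accidentally share a copied engine ID still look distinct. A minimal sketch, with illustrative names:

```go
package main

import "fmt"

// nodeIdentity sketches the classic-Swarm workaround: concatenate the
// engine ID with the node's IP so a copied engine ID alone does not
// produce a duplicate identity. (Names are hypothetical.)
func nodeIdentity(engineID, ip string) string {
	return engineID + "|" + ip
}

func main() {
	a := nodeIdentity("abcd-1234", "10.0.0.1")
	b := nodeIdentity("abcd-1234", "10.0.0.2") // same copied ID, different host
	fmt.Println(a == b)                        // prints "false"
}
```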

@aluzzardi
Member

I think the use cases to support multiple joins are:

  1. Run swarm join at startup regardless. Helps automation
  2. Re-join a manager when all other managers have changed IP
  3. Re-join a worker when all other workers have changed IP
  4. Specify a different advertise and/or listen address

However, for automatic IP address updating, I'm thinking now that it's better to include the advertise address in the ProcessRaftMessageRequest that gets sent with RPCs for normal Raft communications. If a receiving node ever notices an address that doesn't match the one it expected, it will update that address in Raft.

I think both make sense: user > detection > failure.

If the user passes an address, then we use that one no matter what. Bypass all detection.
If the user doesn't pass an address, perform auto detection.
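The precedence described above (user > detection > failure) could be sketched like this; the function and parameter names are illustrative, not the real swarmkit API.

```go
package main

import "fmt"

// resolveAdvertiseAddr sketches the proposed precedence: an address
// explicitly passed by the user always wins and bypasses detection;
// otherwise fall back to autodetection.
func resolveAdvertiseAddr(userAddr string, detect func() string) string {
	if userAddr != "" {
		return userAddr // user-specified: bypass all detection
	}
	return detect()
}

func main() {
	detected := func() string { return "192.168.1.5:2377" }
	fmt.Println(resolveAdvertiseAddr("10.0.0.1:2377", detected)) // user wins
	fmt.Println(resolveAdvertiseAddr("", detected))              // autodetect
}
```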

Thoughts?

/cc @binman-docker @dhiltgen @wsong

@aaronlehmann
Collaborator Author

Thoughts?

This sounds good to me if it's the approach people are most comfortable with. I have some older thoughts here: #2198 (comment) #2198 (comment)

Basically, of the use cases above, (1) doesn't seem very important to me, (2) is an extreme case that might require force-new-cluster anyway, and (3) is easy to fix with a leave/join cycle, since temporarily removing a worker is not disruptive (I assume this means "all other managers have changed IP"). So (4), specifying a different fixed advertise address or listen address, or turning on or off address autodetection, is the only use case that I see as valuable for repeated swarm joins. And my question is whether this is important/common enough to justify the UX change, or if we should just tell people to remove the node from the cluster and add it again in that situation.

@aaronlehmann
Collaborator Author

ping @aluzzardi any thoughts on the above?

@aluzzardi
Member

aluzzardi commented Jun 26, 2017

@binman-docker @jakegdocker @thaJeztah: Any thoughts on the UX?

@binman-docker

I like the behavior and the option; I can see it being handy in some environments (especially changing the IP without removing the node from the Swarm, if people don't want to churn the containers).

For our use case it's not really relevant: IPs don't change for the lifetime of the instance, we just replace hosts, and they only join once. I'm not sure about the cases where a node would need to rejoin a cluster after all of the other IPs changed - is their change of IP not something that would have been communicated to the node as part of the normal gossip? Or are we targeting the case where those other nodes lost contact and could not gossip the change of IP?

@aaronlehmann
Collaborator Author

Gossip is only used for networking. The use case here is for a node to find the managers when it doesn't know any current manager IP addresses.

@aaronlehmann
Collaborator Author

I think we should move forward with this PR. It fixes some actual problems; for example, see #2196 (comment). The UX discussion is more related to #2199, so we shouldn't let it block this PR, which is purely internals and doesn't change UX.

The necessary changes in moby/moby have already been merged some time ago in moby/moby#33361.

@cyli
Contributor

cyli commented Jul 12, 2017

This still LGTM, although since it's been a while, would it be worth trying to rebase + CI to make sure there are no semantic conflicts?

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann
Collaborator Author

This still LGTM, although since it's been a while, would it be worth trying to rebase + CI to make sure there are no semantic conflicts?

Rebased
