Node Promotion/Demotion workflow review #2565
Comments
I think edge cases (2) and (3) can be solved by #2551 if I add watching for node deletion events. I don't think updating the removal logic to disallow removal of a manager node before demotion has completed will fix (3), although it will prevent (2), because we can't tell if a node has completed demotion or if it has just failed to complete promotion.
I think edge case 3 should not be a possible scenario.
I think this will not happen, since in RemoveNode we check whether the desired state is NodeRoleManager. So we'll fail the node removal if the node was promoted but the promotion has not completed yet.
From the IRL discussion with @anshulpundir: it will fail to remove if the desired state is …
Maybe there is room to simplify this workflow? Some questions/ideas:
In this case, we can just fail the demote instead of retrying indefinitely. Basically, I think we should try to reduce the number of invariants as much as possible by disallowing operations until reconciliation is complete. Thoughts?
From our previous IRL discussion, I mentioned that the desired role was added later - I misremembered; actually, it was the observed role that was added later: #1829. The desired role has been there since near the beginning, I think: #690.
We'd still have to provide backwards compatibility, so while we could delete it in future versions, we'd still have to support checking it and reconciling it. Also, I think removing it breaks the pattern used everywhere else in swarm, where the spec is the desired state the user would like and the rest of the object stores the current state. If we did want to remove one of the two states, though, it would probably be the observed state.
We do fail the demote if we can detect that it breaks quorum - there's a quorum check in the control API, and another in the reconciliation loop. #1829 mentions that the GRPC calls for demotion can happen concurrently: both calls will check whether demoting will break quorum, both checks will pass (allowing demotion), and then demoting 2 nodes at once will break quorum. As per the IRL discussion just a second ago, though, you mentioned that maybe we could just grab a lock to serialize demotion calls, which could fix this issue. It looks like previously, prior to #1829, the …
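A minimal sketch of that serialization idea, with invented names (cluster, demote) and a deliberately simplified quorum check - not swarmkit's actual control API code. The only point is that the quorum check and the desired-role update happen under one lock, so two concurrent demotions can't both pass the check against the same membership:

```go
package main

import (
	"fmt"
	"sync"
)

type cluster struct {
	mu       sync.Mutex          // serializes demotions end to end
	managers map[string]struct{} // node IDs whose desired role is manager
}

func (c *cluster) demote(nodeID string) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if _, ok := c.managers[nodeID]; !ok {
		return fmt.Errorf("node %s is not a manager", nodeID)
	}
	// The quorum check and the desired-role update happen under the same
	// lock, so the membership we checked is the membership we modify.
	remaining := len(c.managers) - 1
	if remaining < len(c.managers)/2+1 {
		return fmt.Errorf("demoting %s would break quorum", nodeID)
	}
	delete(c.managers, nodeID) // stands in for "set desired role to worker"
	return nil
}

func main() {
	c := &cluster{managers: map[string]struct{}{"a": {}, "b": {}, "c": {}}}
	fmt.Println(c.demote("c")) // allowed: two of the original three remain
	fmt.Println(c.demote("b")) // rejected by the (illustrative) quorum check
}
```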
Ah wait, I think we definitely will hit the failure case when demoting the current leader. When we demote the leader, we transfer leadership, and once that's done we won't be able to write the store update to change the role, which means that users will effectively have to demote a node twice if it's the leader. The mismatch between the desired and observed state then functions sort of as a message to the next leader to say "hey, finish demoting me now".
I agree that the workflow would be simpler to understand as a serial operation, and I will continue to think about it, but at the moment I'm not sure we can get cluster membership correct that way either. The raft cluster membership cannot be immediately synced with the node list on control API changes - they're 2 different things, so they cannot be changed atomically, so we have to account for a failure in the middle of the two changes causing inconsistent state. We can try to roll back, but rollback has the same issue. Retrying to converge to a consistent view of the world seems to be the best strategy; we'd have to work through all of these cases anyway if we remove the reconciliation loop, and advise the user what to do in each case (because they would have to be the ones that would retry, or have to re-initialize the cluster if they accidentally break quorum).
Just thinking through all the cases here out loud:
Possibly we can provide better UI or introspection into promote/demote progress - currently users promote or demote a node, and we have no indication that a promotion or demotion is in progress, or what might be blocking it. Users just demote a node and don't see that anything is happening.
Discussion with @anshulpundir IRL:
Do the issues shown in this comment potentially relate to the race conditions mentioned, particularly around demotion and removal?
We've been seeing some flakiness around node promotion/demotion, so here is a writeup of how it works, along with some possible issues.
Node promotion
When a node is promoted, the desired state of the node is set to manager in the control API.
The roleManager (github.com/docker/swarmkit/manager/role_manager.go) is a service running on the leader which watches for updates to nodes and reconciles the desired role with the observed role. When it gets an update about a promotion, it simply updates the node's observed role to manager, provided the node's desired role, observed role, or existence hasn't changed in the meantime.

However, the raft membership isn't updated yet. The node is added to the raft cluster membership when it makes a call to another manager node's raft API (which gets forwarded to the leader) and requests to join; the leader then handles the join request and adds the node to the raft membership.
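As a rough illustration of the reconciliation just described - the names below (Node, reconcilePromotion, the in-memory store) are invented for the sketch, not the real role_manager.go code - the role manager copies the desired role into the observed role for a promotion after re-checking the node's current state, and leaves raft membership to the node's own join call:

```go
package main

import "fmt"

type Role int

const (
	RoleWorker Role = iota
	RoleManager
)

type Node struct {
	ID       string
	Desired  Role // what the user asked for via the control API
	Observed Role // what the cluster has reconciled to so far
}

// reconcilePromotion runs inside the role manager's single event loop, so two
// reconciliations for the same node can never race with each other.
func reconcilePromotion(store map[string]*Node, nodeID string) {
	n, ok := store[nodeID]
	if !ok {
		return // the node was deleted in the meantime; nothing to do
	}
	if n.Desired == RoleManager && n.Observed != RoleManager {
		// Note: raft membership is not touched here. The node joins raft
		// itself later, by calling the raft join API on an existing manager.
		n.Observed = RoleManager
		fmt.Printf("node %s: observed role is now manager\n", n.ID)
	}
}

func main() {
	store := map[string]*Node{"n1": {ID: "n1", Desired: RoleManager, Observed: RoleWorker}}
	reconcilePromotion(store, "n1")
}
```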
We don't want to pre-add the raft node to the cluster, because:
Node demotion
When a node is demoted, we do a couple of sanity checks before we allow the demotion:
The roleManager, when it gets an update about a demotion, will attempt to reconcile the role in the following manner:

Node removal
When a node is removed, we do some sanity checks before we remove it:
Once a node is removed, its node ID is added to a list of blacklisted certs - no node with this ID will be allowed to connect to another manager again (we have a custom GRPC authorizer that checks whether the node ID is in the blacklist). If the cluster has a stored certificate for this node, an expiry date is added so that the list of blacklisted certs can be cleaned up after the last valid cert for the node expires. It's a blacklist rather than a whitelist because the failure modes differ: if a blacklist fails to propagate in time, a node will, for a time, be able to connect to the cluster when it shouldn't; if a whitelist fails to propagate in time, a node won't be able to connect when it should, and that could destabilize the cluster.
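A small sketch of the blacklist behaviour described above, with invented types and names (blacklist, authorize, prune) rather than swarmkit's real authorizer; the idea is just that removed node IDs are rejected at connection time and that entries can be pruned once their last possible cert has expired:

```go
package main

import (
	"fmt"
	"time"
)

// blacklist maps a removed node's ID to the expiry of its last known cert.
type blacklist map[string]time.Time

// authorize rejects any node whose ID has been blacklisted; in swarmkit this
// kind of check lives in a custom GRPC authorizer.
func (b blacklist) authorize(nodeID string) error {
	if _, banned := b[nodeID]; banned {
		return fmt.Errorf("node %s was removed from the cluster", nodeID)
	}
	return nil
}

// prune drops entries whose last possible certificate has already expired;
// past that point the node can no longer present a valid cert anyway.
func (b blacklist) prune(now time.Time) {
	for id, expiry := range b {
		if !expiry.IsZero() && now.After(expiry) {
			delete(b, id)
		}
	}
}

func main() {
	b := blacklist{"removed-node": time.Now().Add(90 * 24 * time.Hour)}
	fmt.Println(b.authorize("removed-node")) // rejected
	fmt.Println(b.authorize("other-node"))   // nil: allowed
	b.prune(time.Now())
}
```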
Renewing the TLS certs/manager startup/shutdown after a demotion or promotion
When a node's desired role changes (or the certificate status is IssuanceStateRotate), the dispatcher pushes node changes down to the agent. The agent, upon seeing a node change that includes a desired role change, will renew its certificate, and will keep trying to renew until it gets a certificate with the expected (desired) role. The CA server running on the leader only issues certs for the observed node role, so if the node's role hasn't been reconciled yet, the CA server will issue a cert for the previous (still-observed) role.
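A sketch of the agent-side renewal loop described above, with assumed helper names (renewUntilRole, certIssuer) rather than swarmkit's actual certificate code: the agent keeps requesting a new cert until the role in the issued cert matches the desired role it was told about.

```go
package main

import (
	"fmt"
	"time"
)

type Role string

const (
	RoleWorker  Role = "worker"
	RoleManager Role = "manager"
)

// certIssuer stands in for a request to the CA server, which signs for the
// node's observed role; until reconciliation finishes it returns the old role.
type certIssuer func() Role

// renewUntilRole retries until the issued cert carries the desired role, or
// gives up after maxAttempts.
func renewUntilRole(issue certIssuer, desired Role, retry time.Duration, maxAttempts int) (Role, error) {
	for i := 0; i < maxAttempts; i++ {
		if got := issue(); got == desired {
			return got, nil
		}
		time.Sleep(retry)
	}
	return "", fmt.Errorf("never received a %s certificate", desired)
}

func main() {
	attempts := 0
	issuer := func() Role {
		attempts++
		if attempts < 3 {
			return RoleWorker // role not reconciled yet: CA still signs the old role
		}
		return RoleManager
	}
	fmt.Println(renewUntilRole(issuer, RoleManager, 10*time.Millisecond, 10))
}
```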
When a node gets a cert for a new role, only then does it officially either start up the manager if it's been promoted, or shut down the manager if it's been demoted. This makes sense for promotions, because there is no point in starting up manager services without a manager cert; the node will not be able to perform its role as a manager without one.
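To make the ordering concrete, a tiny sketch (ManagerSupervisor and onCertRenewed are invented names, not swarmkit's): the manager component is started or stopped based on the role carried by the freshly issued certificate, not on the desired role alone.

```go
package main

import "fmt"

// ManagerSupervisor is a stand-in for the component that runs the manager.
type ManagerSupervisor struct{ running bool }

func (m *ManagerSupervisor) Start() { m.running = true; fmt.Println("manager started") }
func (m *ManagerSupervisor) Stop()  { m.running = false; fmt.Println("manager stopped") }

// onCertRenewed reacts to the role embedded in the new certificate; a
// promotion with no manager cert in hand therefore changes nothing yet.
func onCertRenewed(certRole string, mgr *ManagerSupervisor) {
	switch certRole {
	case "manager":
		if !mgr.running {
			mgr.Start()
		}
	case "worker":
		if mgr.running {
			mgr.Stop()
		}
	}
}

func main() {
	mgr := &ManagerSupervisor{}
	onCertRenewed("manager", mgr) // promotion completed: start manager services
	onCertRenewed("worker", mgr)  // demotion completed: shut the manager down
}
```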
Edge cases
Manager quick demotion-promotion
I think this is not actually an edge case, but it is involved and kind of quirky; if I'm wrong, it's possible the manager will be removed from the cluster but try to rejoin with the same raft ID, which could cause issues where the cluster thinks there are 2 leaders (see https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration).
If a manager node is demoted and then promoted before the reconciler can successfully demote it (possibly due to quorum issues), the reconciler should see that the node's desired state matches its current state, and do nothing (not demote). This all works because there is a single event loop, so two reconciliations can't happen at the same time (assuming there's a single role manager running at any given time).
If the reconciler managed to remove the node from the raft consensus, but hadn't gotten around to updating the observed role yet, it will never update the observed role. The node will probably stop trying to get a new cert. The node's raft instance will detect the conf change about it being removed from raft, and the manager will shut down with an error that the node was removed from raft, and wipe out all its data.
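To spell out why this window exists, here is a linear sketch of the two writes involved (invented names, not the real reconciler): eviction from raft and the observed-role update are separate operations that cannot happen atomically, and a quick re-promotion after the first one means the second is never performed, leaving a node that is out of raft but still recorded as an observed manager.

```go
package main

import (
	"errors"
	"fmt"
)

type raftCluster struct{ members map[string]bool }

func (r *raftCluster) removeMember(id string) error {
	if !r.members[id] {
		return errors.New("not a raft member")
	}
	delete(r.members, id)
	return nil
}

type nodeRecord struct{ observedRole string }

// reconcileDemotion shows the two writes in order; in the real system a quick
// re-promotion between step 1 and step 2 means step 2 is never performed,
// which is the mismatch described in the edge case above.
func reconcileDemotion(r *raftCluster, store map[string]*nodeRecord, id string) error {
	// Step 1: propose the conf change that evicts the member from raft.
	if err := r.removeMember(id); err != nil {
		return err
	}
	// Step 2: record the reconciled (observed) role in the node store.
	store[id].observedRole = "worker"
	return nil
}

func main() {
	r := &raftCluster{members: map[string]bool{"n1": true, "n2": true, "n3": true}}
	store := map[string]*nodeRecord{"n3": {observedRole: "manager"}}
	fmt.Println(reconcileDemotion(r, store, "n3"), store["n3"].observedRole)
}
```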
The superviseManager function in github.com/docker/swarmkit/node/node.go will try to get a worker cert, because the manager was evicted, but will time out after a little while (since it will fail to get a worker cert) and then restart the manager.

Worker quick promotion-demotion
A manager can't be demoted if it's not part of the raft cluster, so the demotion will fail until the node has successfully joined the raft cluster, at which point the demotion logic should kick in.
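A one-function sketch of that guard (names are illustrative, not swarmkit's): the demotion path simply does nothing until the node shows up in the raft membership, and retries on a later event.

```go
package main

import "fmt"

// canReconcileDemotion reports whether a demotion can proceed: if the node
// never finished joining raft there is nothing to remove yet, so the
// reconciler waits for a later node or membership event and tries again.
func canReconcileDemotion(raftMembers map[string]bool, nodeID string) bool {
	return raftMembers[nodeID]
}

func main() {
	members := map[string]bool{"n1": true}
	fmt.Println(canReconcileDemotion(members, "n2")) // false: join not complete, demotion waits
	fmt.Println(canReconcileDemotion(members, "n1")) // true: proceed with demotion
}
```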