Promoting 2 nodes in short succession sometimes leaves one node unreachable #2196
Also noteworthy, the node object's ManagerStatus is equal to nil at this point in the test, and we get "node 0d1lke3s3zbwnaopuqqhdlull is not a manager" as the error.

```go
mgrs, err := cli.NodeList(ctx, types.NodeListOptions{Filters: mf})
if err != nil {
	return errors.Wrap(err, "error listing managers")
}
// check that the number of managers we just got is the same as the
// number we had before plus the 2 we just added
if len(mgrs) != expected {
	return errors.Errorf("expected %v managers, got %v", expected, len(mgrs))
}
// check each manager is healthy
for _, m := range mgrs {
	if m.Status.State != swarm.NodeStateReady {
		return errors.Errorf("manager %v is not ready", m.ID)
	}
	if m.ManagerStatus == nil {
		return errors.Errorf("node %v is not a manager", m.ID)
	}
	if m.ManagerStatus.Reachability != swarm.ReachabilityReachable {
		return errors.Errorf("manager %v is not reachable", m.ID)
	}
}
```
Whoops, I accidentally closed the issue for a sec there.
Can you please provide logs associated with this?
Does it remain unreachable indefinitely?
You may just need to wait for convergence here. A node doesn't become a manager instantly; a few steps need to happen behind the scenes first.
This is @sanimej's cluster, so I can't get them. I can say that when I encountered this issue, nothing egregiously stood out in the logs. No obvious errors.
Yes, and it also no longer responds to promote/demote instructions.
I've elided the specifics, but that code is polled in a loop and only returns an error if it hasn't converged to be error-free by the end of a timeout.
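For context, a poll-until-deadline wrapper like the one described might look like the following minimal sketch. The `waitForConverged` name and the one-second poll interval are illustrative assumptions, not the actual test harness:

```go
import (
	"context"
	"time"

	"github.com/pkg/errors"
)

// waitForConverged polls check until it returns nil or the timeout
// elapses, in which case the last error seen is returned.
// Illustrative only; the real test harness is elided above.
func waitForConverged(ctx context.Context, check func() error, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil // converged: all health checks passed
		}
		if time.Now().After(deadline) {
			return errors.Wrapf(err, "not converged after %v", timeout)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```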
Also relevant: when I encountered this error, the node in question did refuse to execute swarm commands, saying it was not a manager, instead of timing out like I would expect an errant manager to do.
I don't understand how this would happen if node ls shows "unreachable". The "reachability" field is inside ManagerStatus. The CLI shouldn't print anything there if ManagerStatus is nil.
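For reference, in the Docker API types the reachability field hangs off a pointer, which is why a nil ManagerStatus and a printed "unreachable" shouldn't coexist. A minimal sketch of the guard the CLI has to perform (`printReachability` is an illustrative name, not CLI code):

```go
import (
	"fmt"

	"github.com/docker/docker/api/types/swarm"
)

// printReachability guards the ManagerStatus pointer before reading
// Reachability from it, mirroring what node ls must do.
func printReachability(n swarm.Node) {
	if n.ManagerStatus == nil {
		fmt.Println("node is not a manager; no reachability to print")
		return
	}
	fmt.Println(n.ManagerStatus.Reachability) // e.g. reachable, unreachable
}
```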
Manager status likely converges to "unreachable" some time after the test times out.
@aaronlehmann I didn't save the logs. But the node was permanently stuck in the Unreachable state.
A stack trace from the affected node may also be helpful.
@dperny: Are you still seeing this?
@aaronlehmann I also encountered the same problem.
1. swarm1: docker swarm init (details elided)
docker log (contents elided)
Thanks very much for providing the log. I think I understand the problem:
1. swarm3 gets promoted first.
2. swarm2 gets promoted, but this waits for consensus.
3. Consensus is reached on swarm3 becoming a manager.
4. At this point, swarm1 is transferring its log to swarm3. This process can involve several heartbeats to figure out which entries the new node needs. Sometimes it involves substantial data transfer. It can take some time.
5. swarm3 didn't end up as a fully operational manager in time, so swarm2's promotion timed out waiting for consensus.
6. swarm3 is finally ready now, and the promotion of swarm2 went through. But we had already returned an error to swarm2 from the join call.

Basically, when we go from one manager to two, it can take some time for the second manager to become operational, and in the meantime it's like having a loss of quorum, because only one of the two managers is processing log appends (see the sketch below). Joining raft has a timeout, and that timeout can be exceeded when the cluster is in this state. I believe that is what was triggered by these promotions.
Also, when a node joins as a manager directly (instead of being promoted), it goes through the same raft join path, so the same timeout can be hit.
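To make the quorum point concrete: raft requires a majority of the member set, so adding a second manager raises the quorum from 1 to 2, and until the new member is caught up only one of the two can append, which behaves like a loss of quorum. A minimal sketch of the arithmetic (an assumed illustration, not swarmkit code):

```go
package main

import "fmt"

// quorum returns the majority threshold for n raft members.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("%d manager(s) -> quorum %d, tolerates %d failure(s)\n",
			n, quorum(n), n-quorum(n))
	}
	// 1 manager  -> quorum 1, tolerates 0 failures
	// 2 managers -> quorum 2, tolerates 0 failures  <- the window described above
	// 3 managers -> quorum 2, tolerates 1 failure
}
```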
Thanks.
I want to know how the node status changes during a promote/demote operation.
Hi @yangfeiyu20102011: I just opened #2318 which includes some information on promotion and demotion. Does it answer your questions? |
@aaronlehmann: With raft.md, I think I understand now.
In the course of writing some end-to-end tests for Docker, I've found that promoting 2 nodes in short succession sometimes fails. One of the promoted nodes ends up in the "Unreachable" state and has to be removed from the cluster and rejoined.
The failure is intermittent; it does not happen every time. It may or may not be related to previous promote/demote cycles failing silently.
The code snippet that causes this:
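(The snippet itself is elided in this copy of the thread. Below is a minimal sketch of what promoting two nodes back to back looks like with the Docker Go client; the `promote` helper and node-ID variables are assumptions for illustration, not the reporter's actual test code.)

```go
import (
	"context"

	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
	"github.com/pkg/errors"
)

// promote flips a node's role to manager via the Docker API.
// Illustrative only: the reporter's real test code is elided above.
func promote(ctx context.Context, cli *client.Client, nodeID string) error {
	node, _, err := cli.NodeInspectWithRaw(ctx, nodeID)
	if err != nil {
		return errors.Wrap(err, "error inspecting node")
	}
	spec := node.Spec
	spec.Role = swarm.NodeRoleManager
	// NodeUpdate takes the current version so concurrent spec
	// updates are not clobbered.
	return cli.NodeUpdate(ctx, nodeID, node.Version, spec)
}

// Promoting two nodes in quick succession, as the tests do:
//   if err := promote(ctx, cli, node2ID); err != nil { ... }
//   if err := promote(ctx, cli, node3ID); err != nil { ... }
```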
And the result:
As you can see, one node is Unreachable.
/cc @sanimej