Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don’t attempt to reconnect swarm on failed join after timeout #27123

Merged
merged 1 commit into from
Nov 8, 2016

Conversation

tonistiigi
Copy link
Member

fixes #26646

The reproducible part of the bug was already fixed with the grpc changes in swarmkit, but this makes it more robust and makes it not rely on swarmkit timeouts.

The issue appeared because reconnecting expects state from remote hosts. There was no state because the join failed.

cc @mrjana

Signed-off-by: Tonis Tiigi tonistiigi@gmail.com

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
@thaJeztah
Copy link
Member

@tonistiigi is this only on master, or a fix for 1.12.1?

@tonistiigi
Copy link
Member Author

@thaJeztah I think this can wait for v1.13

@LK4D4
Copy link
Contributor

LK4D4 commented Oct 5, 2016

LGTM

@LK4D4 LK4D4 added this to the 1.13.0 milestone Oct 5, 2016
@aaronlehmann
Copy link
Contributor

Can we consider changing join behavior so that it doesn't keep retrying after the timeout? I think it's really unexpected for a node to keep trying to join a swarm after docker swarm join returned failure. Having it succeed potentially weeks or months later isn't useful behavior IMHO.

@tonistiigi
Copy link
Member Author

@aaronlehmann I remember @aluzzardi saw it as an important feature. Actually, it seems that swarmkit has already started to move away from that model as for example in #26646 swarmkit doesn't try to connect until network returns but fails out quite soon so we never even reach the timeout anymore. I'm not sure if this is the case with all the possible scenarios. If it is then in Docker side we should just remove the timeout completely and swarmkit either has to join in a meaningful time or give up with an error.

@aaronlehmann
Copy link
Contributor

@aluzzardi: Any thoughts?

@thaJeztah
Copy link
Member

ping @aluzzardi PTAL!

@aluzzardi
Copy link
Member

@aaronlehmann Well, right now we're half sync half async and we all agreed to move one way or another since right now automation is really painful.

I believe we had a chat offline a while ago where, if I remember correctly, decided to go the async route (and make the CLI look synchronous?)

@aluzzardi
Copy link
Member

LGTM

1 similar comment
@aaronlehmann
Copy link
Contributor

LGTM

@aaronlehmann aaronlehmann merged commit 0ccbae0 into moby:master Nov 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Worker node fails to recover from a join failure because of transient networking issue
6 participants