Worker node fails to recover from a join failure because of transient networking issue #26646

Closed
mrjana opened this issue Sep 16, 2016 · 0 comments · Fixed by #27123
Labels: area/swarm, kind/bug, version/master

mrjana commented Sep 16, 2016

Description

If a transient networking issue occurs while a swarm join command is issued on a worker node, and the issue persists until the join command times out, the worker node never recovers, even if the engine instance is restarted.

Steps to reproduce the issue:
Reproduced in a set of dind instances:

  1. Created two dind instances running the docker master version of the engine.
  2. Ran swarm init on one of them.
  3. Found the outer veth of the dind container hosting the engine where I ran swarm init, and brought it down.
  4. Tried to join the swarm from the other dind instance as a worker.
  5. The join fails, as expected, because the link is down.
  6. The join command times out.
  7. Brought the downed veth link back up.
  8. The worker is still never able to join the cluster.
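
The steps above can be sketched as a script. This is a minimal sketch, not the exact commands used for this report: the container names, the veth interface name, and the `DRY_RUN` switch are assumptions, the two dind instances are assumed to exist already, and the manager address is the `RemoteAddrs` value that appears in the logs. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Reproduction sketch. MANAGER, WORKER and VETH are hypothetical names;
# substitute the ones from your own dind setup.
MANAGER=${MANAGER:-dind-manager}              # dind container where swarm init runs
WORKER=${WORKER:-dind-worker}                 # dind container that tries to join
VETH=${VETH:-veth-manager}                    # host-side veth of the manager container
MANAGER_ADDR=${MANAGER_ADDR:-172.17.0.2:2377} # address the worker joins
DRY_RUN=${DRY_RUN:-1}                         # 1 = print commands only

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

repro() {
  local token='<worker-token>'   # placeholder used in dry-run mode
  run docker exec "$MANAGER" docker swarm init
  if [ "$DRY_RUN" != "1" ]; then
    token=$(docker exec "$MANAGER" docker swarm join-token -q worker)
  fi
  run ip link set "$VETH" down   # break connectivity to the manager
  # The join is expected to time out while the link is down.
  run docker exec "$WORKER" docker swarm join --token "$token" "$MANAGER_ADDR" || true
  run ip link set "$VETH" up     # restore connectivity
  run docker exec "$WORKER" docker info   # inspect the worker's swarm status
}
```

Calling `repro` prints the five commands in order; running it with `DRY_RUN=0` executes them and needs root on the host for the `ip link` calls.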

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

root@87208c93f51b:/go/src/github.com/docker/docker# ./bundles/latest/binary-client/docker version
Client:
 Version:      1.13.0-dev
 API version:  1.25
 Go version:   go1.7.1
 Git commit:   c9fb551-unsupported
 Built:        Fri Sep 16 18:09:07 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0-dev
 API version:  1.25
 Go version:   go1.7.1
 Git commit:   c9fb551-unsupported
 Built:        Fri Sep 16 18:09:07 2016
 OS/Arch:      linux/amd64

Logs

I just get these in the logs:
DEBU[0361] form data: {"AdvertiseAddr":"","JoinToken":"*****","ListenAddr":"0.0.0.0:2377","RemoteAddrs":["172.17.0.2:2377"]}
DEBU[0361] no valid local CA certificate found: local root CA certificate does not exist
WARN[0366] failed to retrieve remote root CA certificate: rpc error: code = 4 desc = context deadline exceeded
WARN[0371] failed to retrieve remote root CA certificate: rpc error: code = 4 desc = context deadline exceeded
WARN[0376] failed to retrieve remote root CA certificate: rpc error: code = 4 desc = context deadline exceeded
ERRO[0381] Handler for POST /v1.25/swarm/join returned error: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
WARN[0381] failed to retrieve remote root CA certificate: rpc error: code = 4 desc = context deadline exceeded
WARN[0386] failed to retrieve remote root CA certificate: rpc error: code = 4 desc = context deadline exceeded
ERRO[0386] cluster exited with error: rpc error: code = 4 desc = context deadline exceeded
WARN[0386] Restarting swarm in 0.20 seconds
DEBU[0386] successfully loaded the Root CA: /var/lib/docker/swarm/certificates/swarm-root-ca.crt
DEBU[0386] successfully loaded the Root CA: /var/lib/docker/swarm/certificates/swarm-root-ca.crt
DEBU[0386] loaded local CA certificate: /var/lib/docker/swarm/certificates/swarm-root-ca.crt.
DEBU[0386] loaded local TLS credentials: /var/lib/docker/swarm/certificates/swarm-node.crt.
DEBU[0386] Requesting certificate for NodeID: dflye3tzgrsvg9r1cqpafsvk3
DEBU[0386] successfully loaded the Root CA: /var/lib/docker/swarm/certificates/swarm-root-ca.crt
INFO[0386] Listening for connections                     addr=[::]:2377 proto=tcp
DEBU[0386] (*Agent).run                                  module=node/agent
DEBU[0386] (*session).start                              module=node/agent
INFO[0386] Listening for local connections               addr=/var/run/docker/swarm/control.sock proto=unix
ERRO[0391] agent: session failed                         error=rpc error: code = 2 desc = grpc: the client connection is closing module=node/agent
DEBU[0391] agent: rebuild session                        module=node/agent
DEBU[0391] (*session).start                              module=node/agent
ERRO[0396] agent: session failed                         error=rpc error: code = 13 desc = transport is closing module=node/agent
DEBU[0396] agent: rebuild session                        module=node/agent
DEBU[0396] (*session).start                              module=node/agent
ERRO[0397] error reestabilishing connection to leader    error=raft: no elected cluster leader
ERRO[0399] agent: session failed                         error=rpc error: code = 2 desc = grpc: the client connection is closing module=node/agent
DEBU[0399] agent: rebuild session                        module=node/agent
DEBU[0399] (*session).start                              module=node/agent
ERRO[0401] agent: session failed                         error=rpc error: code = 2 desc = grpc: the client connection is closing module=node/agent
DEBU[0401] agent: rebuild session                        module=node/agent
DEBU[0402] (*session).start                              module=node/agent
ERRO[0406] agent: session failed                         error=rpc error: code = 2 desc = grpc: the client connection is closing module=node/agent
DEBU[0406] agent: rebuild session                        module=node/agent
DEBU[0406] (*session).start                              module=node/agent
ERRO[0407] error reestabilishing connection to leader    error=raft: no elected cluster leader
ERRO[0409] agent: session failed                         error=rpc error: code = 2 desc = grpc: the client connection is closing module=node/agent
DEBU[0409] agent: rebuild session                        module=node/agent
DEBU[0410] (*session).start                              module=node/agent
ERRO[0413] agent: session failed                         error=rpc error: code = 2 desc = grpc: the client connection is closing module=node/agent
DEBU[0413] agent: rebuild session                        module=node/agent
ERRO[0417] error reestabilishing connection to leader    error=raft: no elected cluster leader
DEBU[0418] (*session).start                              module=node/agent
ERRO[0418] agent: session failed                         error=rpc error: code = 2 desc = grpc: the client connection is closing module=node/agent
DEBU[0418] agent: rebuild session                        module=node/agent
DEBU[0421] (*session).start                              module=node/agent
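
To read the retry cadence out of a log like the one above, the bracketed number on each line (seconds since the daemon started, in logrus's default terminal format) can be differenced. A minimal sketch using awk; the function name `retry_gaps` is hypothetical.

```shell
# retry_gaps PATTERN: read log lines on stdin and print the number of
# seconds between consecutive lines that contain PATTERN.
retry_gaps() {
  awk -v pat="$1" '
    index($0, pat) && match($0, /\[[0-9]+\]/) {
      t = substr($0, RSTART + 1, RLENGTH - 2) + 0   # strip the brackets
      if (seen) print t - prev
      prev = t
      seen = 1
    }'
}
```

Fed the five `failed to retrieve remote root CA certificate` warnings above (0366 through 0386), it prints 5 four times.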

Node ls on first node

./bundles/latest/binary-client/docker node ls
ID                           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
cyfszpq11c1r5adlckfqhdevd *  3f11ed585915  Ready   Active        Leader

Certificates on worker node

root@87208c93f51b:/go/src/github.com/docker/docker# ls /var/lib/docker/swarm/certificates/swarm-
swarm-node.crt     swarm-node.key     swarm-root-ca.crt  swarm-root-ca.key

root@87208c93f51b:/go/src/github.com/docker/docker# openssl x509 -in /var/lib/docker/swarm/certificates/swarm-node.crt -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            69:ef:4c:10:ba:97:e2:57:80:f1:30:b2:f6:c6:aa:29:5f:5c:e0:97
    Signature Algorithm: ecdsa-with-SHA256
        Issuer: CN=swarm-ca
        Validity
            Not Before: Sep 16 17:21:00 2016 GMT
            Not After : Dec 15 18:21:00 2016 GMT
        Subject: O=esvbwebnuleplc926zltf0x5b, OU=swarm-manager, CN=dflye3tzgrsvg9r1cqpafsvk3
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:0b:68:e7:74:79:1b:18:d9:92:60:17:54:a7:f8:
                    c7:04:d9:16:29:6d:15:c5:9a:f2:b0:e0:d1:2b:f5:
                    4c:2b:7c:9f:9c:f8:2b:a7:20:00:da:ca:e0:c9:d2:
                    68:09:09:c9:fe:1a:4a:a2:db:91:35:cb:50:db:e7:
                    2a:c9:9e:05:b2
                ASN1 OID: prime256v1
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier:
                66:30:AC:EA:9B:7A:27:A7:E2:D5:CC:53:B6:C0:83:CE:EC:2C:74:70
            X509v3 Authority Key Identifier:
                keyid:40:00:AA:2E:08:A6:F0:FB:CF:83:88:C9:1C:6A:56:4D:79:10:9F:7D

            X509v3 Subject Alternative Name:
                DNS:swarm-manager, DNS:dflye3tzgrsvg9r1cqpafsvk3, DNS:swarm-ca
    Signature Algorithm: ecdsa-with-SHA256
         30:46:02:21:00:8f:92:d8:49:04:36:00:cc:b5:db:4f:6b:8b:
         80:34:75:c0:0b:00:0e:19:07:0a:27:64:06:68:9e:70:26:11:
         8b:02:21:00:ff:46:4f:76:3a:4d:c9:97:d0:e4:2b:b8:d0:7a:
         f0:8b:bb:0f:ca:32:fd:24:d2:fd:34:dd:9e:fc:f1:5d:bd:6c
-----BEGIN CERTIFICATE-----
MIICNjCCAdugAwIBAgIUae9MELqX4leA8TCy9saqKV9c4JcwCgYIKoZIzj0EAwIw
EzERMA8GA1UEAxMIc3dhcm0tY2EwHhcNMTYwOTE2MTcyMTAwWhcNMTYxMjE1MTgy
MTAwWjBgMSIwIAYDVQQKExllc3Zid2VibnVsZXBsYzkyNnpsdGYweDViMRYwFAYD
VQQLEw1zd2FybS1tYW5hZ2VyMSIwIAYDVQQDExlkZmx5ZTN0emdyc3ZnOXIxY3Fw
YWZzdmszMFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEC2jndHkbGNmSYBdUp/jH
BNkWKW0VxZrysODRK/VMK3yfnPgrpyAA2srgydJoCQnJ/hpKotuRNctQ2+cqyZ4F
sqOBvzCBvDAOBgNVHQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsG
AQUFBwMCMAwGA1UdEwEB/wQCMAAwHQYDVR0OBBYEFGYwrOqbeien4tXMU7bAg87s
LHRwMB8GA1UdIwQYMBaAFEAAqi4IpvD7z4OIyRxqVk15EJ99MD0GA1UdEQQ2MDSC
DXN3YXJtLW1hbmFnZXKCGWRmbHllM3R6Z3Jzdmc5cjFjcXBhZnN2azOCCHN3YXJt
LWNhMAoGCCqGSM49BAMCA0kAMEYCIQCPkthJBDYAzLXbT2uLgDR1wAsADhkHCidk
BmiecCYRiwIhAP9GT3Y6TcmX0OQruNB68Iu7D8oy/STS/TTdnvzxXb1s
-----END CERTIFICATE-----
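
The certificate subject above carries the node's role in its OU field, and here that field reads `swarm-manager` even though the node was joined as a worker. A minimal sketch for pulling just that field out of `openssl x509 -text` output; the function name `cert_role` is hypothetical, and the sed pattern assumes the comma-separated subject layout shown above.

```shell
# cert_role: read `openssl x509 -text` output on stdin and print the OU
# value from the Subject line.
cert_role() {
  sed -n 's/^[[:space:]]*Subject:.*OU *= *\([^,]*\).*/\1/p'
}
```

For example: `openssl x509 -in /var/lib/docker/swarm/certificates/swarm-node.crt -noout -text | cert_role`.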

/cc @tonistiigi

tonistiigi self-assigned this Sep 16, 2016
tonistiigi added the kind/bug label Sep 16, 2016