Promoting 2 nodes in short succession sometimes leaves one node unreachable #2196

dperny opened this issue May 19, 2017 · 18 comments

dperny commented May 19, 2017

In the course of writing some end-to-end tests for Docker, I've found that promoting 2 nodes in short succession sometimes fails. One of the promoted nodes ends up in the "Unreachable" state and has to be removed from the cluster and rejoined.

The failure is intermittent; it does not happen every time. It may or may not be related to previous promote/demote cycles failing silently.

The code snippet that causes this:

	// now promote 2 workers at the same time
	// first, pick 2 workers
	candidates := workers[0:2]
	// update both of their specs
	for _, candidate := range candidates {
		// get the latest version of the candidate
		c, _, err := cli.NodeInspectWithRaw(testContext, candidate.ID)
		require.NoError(t, err, "error getting latest version of candidate %v", candidate.ID)
		c.Spec.Role = swarm.NodeRoleManager
		err = cli.NodeUpdate(testContext, c.ID, c.Version, c.Spec)
		require.NoError(t, err, "error with node update")
	}

And the result:

ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
0d1lke3s3zbwnaopuqqhdlull     sanimej-e2e-1       Ready               Active              Unreachable
m3tj6cxmvpkg8th29w6kr4uoc *   sanimej-e2e-0       Ready               Active              Leader
vn4jsthinmzit3ckvvb3ipttx     sanimej-e2e-2       Ready               Active              Reachable

As you can see, one node is Unreachable.
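
For reference, here is a minimal sketch of a sequential workaround: promote one node at a time and poll until it reports Reachable before touching the next. The helper name and poll interval are illustrative, not part of the test suite; it assumes the same Docker Go client API as the snippet above.

	// promoteAndWait is an illustrative helper (not from the test suite):
	// it promotes a single node to manager and polls until that node
	// reports itself as a reachable manager, or until ctx expires.
	// Assumes imports: "context", "time",
	// "github.com/docker/docker/client",
	// "github.com/docker/docker/api/types/swarm".
	func promoteAndWait(ctx context.Context, cli *client.Client, id string) error {
		n, _, err := cli.NodeInspectWithRaw(ctx, id)
		if err != nil {
			return err
		}
		n.Spec.Role = swarm.NodeRoleManager
		if err := cli.NodeUpdate(ctx, n.ID, n.Version, n.Spec); err != nil {
			return err
		}
		for {
			// re-inspect until the node shows up as a reachable manager
			n, _, err := cli.NodeInspectWithRaw(ctx, id)
			if err == nil && n.ManagerStatus != nil &&
				n.ManagerStatus.Reachability == swarm.ReachabilityReachable {
				return nil
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(time.Second):
			}
		}
	}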

/cc @sanimej


dperny commented May 19, 2017

Also noteworthy: the node object's ManagerStatus is nil at this point in the test, and we get "node 0d1lke3s3zbwnaopuqqhdlull is not a manager" as the error.

mgrs, err := cli.NodeList(ctx, types.NodeListOptions{Filters: mf})
if err != nil {
	return errors.Wrap(err, "error listing managers")
}
// check that the number of managers we just got is the same as the
// number we had before plus the 2 we just added
if len(mgrs) != expected {
	return errors.Errorf("expected %v managers got %v", expected, len(mgrs))
}
// check each manager is healthy
for _, m := range mgrs {
	if m.Status.State != swarm.NodeStateReady {
		return errors.Errorf("manager %v is not ready", m.ID)
	}
	if m.ManagerStatus == nil {
		return errors.Errorf("node %v is not a manager", m.ID)
	}
	if m.ManagerStatus.Reachability != swarm.ReachabilityReachable {
		return errors.Errorf("manager %v is not reachable", m.ID)
	}
}

dperny closed this as completed May 19, 2017
dperny reopened this May 19, 2017

dperny commented May 19, 2017

Whoops, I accidentally closed the issue for a sec there.

@aaronlehmann

Can you please provide logs associated with this?

As you can see, one node is Unreachable.

Does it remain unreachable indefinitely?

Also noteworthy: the node object's ManagerStatus is nil at this point in the test, and we get "node 0d1lke3s3zbwnaopuqqhdlull is not a manager" as the error.

You may just need to wait for convergence here. A node doesn't become a manager instantly; a few steps need to happen behind the scenes first.


dperny commented May 20, 2017

Can you please provide logs associated with this?

This is @sanimej's cluster, so I can't get them. I can say that when I encountered this issue, nothing egregious stood out in the logs. No obvious errors.

Does it remain unreachable indefinitely?

Yes, and it also no longer responds to promote/demote instructions.

You may just need to wait for convergence here. A node doesn't become a manager instantly, but a few steps need to happen behind the scenes first.

I've elided the specifics, but that code is polled in a loop and only returns an error if it hasn't converged to an error-free state by the end of a timeout.
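
Roughly, the wrapper looks like this (a sketch; the name, interval, and timeout are illustrative, and check stands in for the manager-health snippet above):

	// pollCheck retries check once a second, returning nil as soon as it
	// succeeds; once the timeout elapses it surfaces the last error seen.
	// Assumes import: "time".
	func pollCheck(timeout time.Duration, check func() error) error {
		deadline := time.Now().Add(timeout)
		for {
			err := check()
			if err == nil {
				return nil
			}
			if time.Now().After(deadline) {
				return err
			}
			time.Sleep(time.Second)
		}
	}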


dperny commented May 20, 2017

Also relevant: when I encountered this error, the node in question refused to execute swarm commands, saying it was not a manager, instead of timing out as I would expect an errant manager to do.

@aaronlehmann

Also noteworthy: the node object's ManagerStatus is nil at this point in the test, and we get "node 0d1lke3s3zbwnaopuqqhdlull is not a manager" as the error.

I don't understand how this would happen if node ls shows "unreachable". The "reachability" field is inside ManagerStatus. The CLI shouldn't print anything there if ManagerStatus is nil.
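
For context, the MANAGER STATUS column is derived roughly like this (a simplified sketch of the CLI's rendering logic, not the exact code), which is why a nil ManagerStatus should produce a blank column rather than "Unreachable":

	// managerStatus sketches how `docker node ls` fills the MANAGER STATUS
	// column. Assumes imports: "strings",
	// "github.com/docker/docker/api/types/swarm".
	func managerStatus(n swarm.Node) string {
		if n.ManagerStatus == nil {
			return "" // workers: column is left blank
		}
		if n.ManagerStatus.Leader {
			return "Leader"
		}
		// Reachability constants are lowercase ("reachable"/"unreachable");
		// the CLI capitalizes them for display.
		return strings.Title(string(n.ManagerStatus.Reachability))
	}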


dperny commented May 20, 2017 via email


sanimej commented May 20, 2017

@aaronlehmann I didn't save the logs, but the node was permanently stuck in the unreachable state and a force leave didn't work either. I think it can be reproduced if the e2e test suite is run in a loop. If it happens again I will save the debug logs. Do you need any other info from the cluster when it's in this state?

@aaronlehmann

A stack trace from the affected node may also be helpful.

@aaronlehmann

@dperny @sanimej: When you have time, I'd definitely appreciate more information to help track this down.

@aaronlehmann

@dperny: Are you still seeing this?

@yangfeiyu20102011

@aaronlehmann I also encountered the same problem.

1. On swarm1: docker swarm init
2. swarm2 and swarm3 join as workers
3. On swarm1: docker node promote swarm2 swarm3

Details:

swarm1:~ # docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
9rc4hrbnfgw5oqbbzbr157uls     swarm3              Ready               Active              
qm0ryc2j7h3o76ijq0niyg8ke *   swarm1              Ready               Active              Leader
umafjbqluc0fxn5r04sfri9d8     swarm2              Ready               Active              
swarm1:~ # docker node promote 9 u
Node 9 promoted to a manager in the swarm.
Node u promoted to a manager in the swarm.
swarm1:~ # docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
9rc4hrbnfgw5oqbbzbr157uls     swarm3              Ready               Active              Reachable
qm0ryc2j7h3o76ijq0niyg8ke *   swarm1              Ready               Active              Leader
umafjbqluc0fxn5r04sfri9d8     swarm2              Ready               Active              
swarm1:~ # docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
9rc4hrbnfgw5oqbbzbr157uls     swarm3              Ready               Active              Reachable
qm0ryc2j7h3o76ijq0niyg8ke *   swarm1              Ready               Active              Leader
umafjbqluc0fxn5r04sfri9d8     swarm2              Ready               Active              
swarm1:~ # docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
9rc4hrbnfgw5oqbbzbr157uls     swarm3              Ready               Active              Reachable
qm0ryc2j7h3o76ijq0niyg8ke *   swarm1              Ready               Active              Leader
umafjbqluc0fxn5r04sfri9d8     swarm2              Ready               Active              
swarm1:~ # docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
9rc4hrbnfgw5oqbbzbr157uls     swarm3              Ready               Active              Reachable
qm0ryc2j7h3o76ijq0niyg8ke *   swarm1              Ready               Active              Leader
umafjbqluc0fxn5r04sfri9d8     swarm2              Ready               Active              Unreachable

@yangfeiyu20102011
Copy link

Docker daemon log:

Aug 13 12:28:17 swarm1 dockerd: time="2018-08-13T12:28:17.433188028Z" level=debug msg="swarm1-8ddbf165a098: Initiating bulk sync with nodes [swarm3-c6f28b1ed394]"
Aug 13 12:28:17 swarm1 dockerd: time="2018-08-13T12:28:17.433287939Z" level=debug msg="swarm1-8ddbf165a098: Initiating unsolicited bulk sync for networks [6nf9mjgkhz9kpdysb0ysz5p4n 2pr277ylr20s6zre1ov5cyjsm] with node swarm3-c6f28b1ed394"
Aug 13 12:28:17 swarm1 dockerd: time="2018-08-13T12:28:17.435064724Z" level=debug msg="memberlist: TCP connection from=*.*.*.*:39256"
Aug 13 12:28:17 swarm1 dockerd: time="2018-08-13T12:28:17.435673848Z" level=debug msg="swarm1-8ddbf165a098: Bulk sync to node swarm3-c6f28b1ed394 took 1.665798ms"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.435479845Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.436424429Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.438693707Z" level=debug msg="Calling GET /v1.29/nodes/9"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.441237722Z" level=debug msg="Calling POST /v1.29/nodes/9rc4hrbnfgw5oqbbzbr157uls/update?version=1197"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.441325453Z" level=debug msg="form data: {\"Availability\":\"active\",\"Labels\":{},\"Role\":\"manager\"}"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.447365433Z" level=debug msg="Calling GET /v1.29/nodes/u"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.449222268Z" level=debug msg="Calling POST /v1.29/nodes/umafjbqluc0fxn5r04sfri9d8/update?version=1189"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.449293928Z" level=debug msg="form data: {\"Availability\":\"active\",\"Labels\":{},\"Role\":\"manager\"}"
Aug 13 12:28:24 swarm1 [/bin/bash]: [docker node promote 9 u] return code=[0], execute success by [root] from [pts/1 (*.*.*.*)]
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.471144818Z" level=debug msg="node certificate updated" cert.cn=9rc4hrbnfgw5oqbbzbr157uls cert.role=MANAGER method=issueRenewCertificate
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.471947858Z" level=debug msg="started watching for certificate updates" method=NodeCertificateStatus node.id=9rc4hrbnfgw5oqbbzbr157uls status={RENEW }
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.473251762Z" level=debug msg="node certificate updated" cert.cn=umafjbqluc0fxn5r04sfri9d8 cert.role=MANAGER method=issueRenewCertificate
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.483150290Z" level=debug msg="certificate issued" method="(*Server).signNodeCert" module=ca node.id=9rc4hrbnfgw5oqbbzbr157uls node.role=MANAGER
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.484898057Z" level=debug msg="certificate issued" method="(*Server).signNodeCert" module=ca node.id=umafjbqluc0fxn5r04sfri9d8 node.role=MANAGER
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.523340355Z" level=debug method="(*Node).Join" node.id=9rc4hrbnfgw5oqbbzbr157uls raft_id=3f15bf204e8ce86d
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.527270864Z" level=debug msg="node certificate updated" cert.cn=9rc4hrbnfgw5oqbbzbr157uls cert.role=MANAGER method=issueRenewCertificate
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.528719994Z" level=debug msg="node certificate updated" cert.cn=umafjbqluc0fxn5r04sfri9d8 cert.role=MANAGER method=issueRenewCertificate
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.535820624Z" level=debug method="(*Node).Join" node.id=umafjbqluc0fxn5r04sfri9d8 raft_id=3f15bf204e8ce86d
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.539195086Z" level=debug msg="certificate issued" method="(*Server).signNodeCert" module=ca node.id=9rc4hrbnfgw5oqbbzbr157uls node.role=MANAGER
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.539810680Z" level=debug msg="transport: add peer 79c0024ac8b22557 with address *.*.*.*:2377"
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.542809700Z" level=debug msg="node joined" method="(*Node).Join" node.id=9rc4hrbnfgw5oqbbzbr157uls raft_id=3f15bf204e8ce86d
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.565963968Z" level=debug msg="3f15bf204e8ce86d received msgApp rejection(lastindex: 0) from 79c0024ac8b22557 for index 1210" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.566015435Z" level=debug msg="3f15bf204e8ce86d decreased progress of 79c0024ac8b22557 to [next = 1, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:25 swarm1 dockerd: time="2018-08-13T12:28:25.024103172Z" level=debug msg="memberlist: TCP connection from=*.*.*.*:39276"
Aug 13 12:28:26 swarm1 dockerd: time="2018-08-13T12:28:26.824541693Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:26 swarm1 dockerd: time="2018-08-13T12:28:26.825099315Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:26 swarm1 dockerd: time="2018-08-13T12:28:26.825636098Z" level=debug msg="Calling GET /v1.29/nodes"
Aug 13 12:28:26 swarm1 dockerd: time="2018-08-13T12:28:26.827283254Z" level=debug msg="Calling GET /v1.29/info"
Aug 13 12:28:26 swarm1 [/bin/bash]: [docker node ls] return code=[0], execute success by [root] from [pts/1 (*.*.*.*)]
Aug 13 12:28:31 swarm1 dockerd: time="2018-08-13T12:28:31.062752967Z" level=debug msg="memberlist: Initiating push/pull sync with: *.*.*.*:7946"
Aug 13 12:28:32 swarm1 dockerd: time="2018-08-13T12:28:32.108289007Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:32 swarm1 dockerd: time="2018-08-13T12:28:32.111056187Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:32 swarm1 dockerd: time="2018-08-13T12:28:32.113748736Z" level=debug msg="Calling GET /v1.29/nodes"
Aug 13 12:28:32 swarm1 dockerd: time="2018-08-13T12:28:32.115607892Z" level=debug msg="Calling GET /v1.29/info"
Aug 13 12:28:32 swarm1 [/bin/bash]: [docker node ls] return code=[0], execute success by [root] from [pts/1 (*.*.*.*)]
Aug 13 12:28:35 swarm1 dockerd: time="2018-08-13T12:28:35.531455828Z" level=error msg="failed to add member 4c3ad8d93b5c51db" error="context deadline exceeded" method="(*Node).Join" node.id=umafjbqluc0fxn5r04sfri9d8 raft_id=3f15bf204e8ce86d
Aug 13 12:28:36 swarm1 dockerd: time="2018-08-13T12:28:36.515310913Z" level=debug msg="memberlist: TCP connection from=*.*.*.*:39302"
Aug 13 12:28:36 swarm1 dockerd: time="2018-08-13T12:28:36.516475551Z" level=debug msg="swarm1-8ddbf165a098: Initiating  bulk sync for networks [6nf9mjgkhz9kpdysb0ysz5p4n 2pr277ylr20s6zre1ov5cyjsm] with node swarm3-c6f28b1ed394"
Aug 13 12:28:40 swarm1 dockerd: time="2018-08-13T12:28:40.723157565Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:40 swarm1 dockerd: time="2018-08-13T12:28:40.723826581Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:40 swarm1 dockerd: time="2018-08-13T12:28:40.724305449Z" level=debug msg="Calling GET /v1.29/nodes"
Aug 13 12:28:40 swarm1 dockerd: time="2018-08-13T12:28:40.727785997Z" level=debug msg="Calling GET /v1.29/info"
Aug 13 12:28:40 swarm1 [/bin/bash]: [docker node ls] return code=[0], execute success by [root] from [pts/1 (*.*.*.*)]
Aug 13 12:28:41 swarm1 dockerd: time="2018-08-13T12:28:41.280709860Z" level=debug msg="transport: add peer 4c3ad8d93b5c51db with address *.*.*.*:2377"
Aug 13 12:28:41 swarm1 dockerd: time="2018-08-13T12:28:41.281647272Z" level=debug msg="certificate issued" method="(*Server).signNodeCert" module=ca node.id=umafjbqluc0fxn5r04sfri9d8 node.role=MANAGER
Aug 13 12:28:42 swarm1 dockerd: time="2018-08-13T12:28:42.272441299Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:42 swarm1 dockerd: time="2018-08-13T12:28:42.272506907Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:43 swarm1 dockerd: time="2018-08-13T12:28:43.272568369Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:43 swarm1 dockerd: time="2018-08-13T12:28:43.272652469Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:44 swarm1 dockerd: time="2018-08-13T12:28:44.272396938Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:44 swarm1 dockerd: time="2018-08-13T12:28:44.272461215Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:45 swarm1 dockerd: time="2018-08-13T12:28:45.272864946Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:45 swarm1 dockerd: time="2018-08-13T12:28:45.272923202Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:46 swarm1 dockerd: time="2018-08-13T12:28:46.272350420Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:46 swarm1 dockerd: time="2018-08-13T12:28:46.272417910Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.272521041Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.272583262Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.432984038Z" level=debug msg="swarm1-8ddbf165a098: Initiating bulk sync with nodes [swarm2-b39e61c8b22b]"
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.433073133Z" level=debug msg="swarm1-8ddbf165a098: Initiating unsolicited bulk sync for networks [6nf9mjgkhz9kpdysb0ysz5p4n] with node swarm2-b39e61c8b22b"
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.435948234Z" level=debug msg="memberlist: TCP connection from=9.81.254.204:51910"
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.436438034Z" level=debug msg="swarm1-8ddbf165a098: Bulk sync to node swarm2-b39e61c8b22b took 2.318012ms"
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.436486044Z" level=debug msg="swarm1-8ddbf165a098: Initiating bulk sync with nodes [swarm3-c6f28b1ed394]"
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.436521681Z" level=debug msg="swarm1-8ddbf165a098: Initiating unsolicited bulk sync for networks [6nf9mjgkhz9kpdysb0ysz5p4n 2pr277ylr20s6zre1ov5cyjsm] with node swarm3-c6f28b1ed394"
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.439147636Z" level=debug msg="memberlist: TCP connection from=*.*.*.*:39344"
Aug 13 12:28:47 swarm1 dockerd: time="2018-08-13T12:28:47.439800580Z" level=debug msg="swarm1-8ddbf165a098: Bulk sync to node swarm3-c6f28b1ed394 took 1.836498ms"
Aug 13 12:28:48 swarm1 dockerd: time="2018-08-13T12:28:48.272304693Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:48 swarm1 dockerd: time="2018-08-13T12:28:48.272369010Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:49 swarm1 dockerd: time="2018-08-13T12:28:49.272890874Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:49 swarm1 dockerd: time="2018-08-13T12:28:49.272953990Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:50 swarm1 dockerd: time="2018-08-13T12:28:50.272507443Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:50 swarm1 dockerd: time="2018-08-13T12:28:50.272564241Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:50 swarm1 [/bin/bash]: [cat messages] return code=[0], execute success by [root] from [pts/2 (*.*.*.*)]
Aug 13 12:28:51 swarm1 dockerd: time="2018-08-13T12:28:51.272968396Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:51 swarm1 dockerd: time="2018-08-13T12:28:51.273114320Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:52 swarm1 dockerd: time="2018-08-13T12:28:52.272291260Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:52 swarm1 dockerd: time="2018-08-13T12:28:52.272369936Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:53 swarm1 dockerd: time="2018-08-13T12:28:53.272587439Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:53 swarm1 dockerd: time="2018-08-13T12:28:53.272656396Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:54 swarm1 dockerd: time="2018-08-13T12:28:54.272488450Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:54 swarm1 dockerd: time="2018-08-13T12:28:54.272570608Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:55 swarm1 dockerd: time="2018-08-13T12:28:55.272451010Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:55 swarm1 dockerd: time="2018-08-13T12:28:55.272534038Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:56 swarm1 dockerd: time="2018-08-13T12:28:56.272358900Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:56 swarm1 dockerd: time="2018-08-13T12:28:56.272415523Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:57 swarm1 dockerd: time="2018-08-13T12:28:57.272799343Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:57 swarm1 dockerd: time="2018-08-13T12:28:57.272859615Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:57 swarm1 dockerd: time="2018-08-13T12:28:57.943773369Z" level=debug msg="memberlist: TCP connection from=*.*.*.*:51912"
Aug 13 12:28:57 swarm1 dockerd: time="2018-08-13T12:28:57.944686609Z" level=debug msg="swarm1-8ddbf165a098: Initiating  bulk sync for networks [6nf9mjgkhz9kpdysb0ysz5p4n] with node swarm2-b39e61c8b22b"
Aug 13 12:28:58 swarm1 dockerd: time="2018-08-13T12:28:58.272460355Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:58 swarm1 dockerd: time="2018-08-13T12:28:58.272523609Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:59 swarm1 dockerd: time="2018-08-13T12:28:59.061235626Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:59 swarm1 dockerd: time="2018-08-13T12:28:59.061862745Z" level=debug msg="Calling GET /_ping"
Aug 13 12:28:59 swarm1 dockerd: time="2018-08-13T12:28:59.064930342Z" level=debug msg="Calling GET /v1.29/nodes"
Aug 13 12:28:59 swarm1 dockerd: time="2018-08-13T12:28:59.074846163Z" level=debug msg="Calling GET /v1.29/info"
Aug 13 12:28:59 swarm1 [/bin/bash]: [docker node ls] return code=[0], execute success by [root] from [pts/1 (*.*.*.*)]
Aug 13 12:28:59 swarm1 dockerd: time="2018-08-13T12:28:59.272329140Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:28:59 swarm1 dockerd: time="2018-08-13T12:28:59.272455826Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:00 swarm1 dockerd: time="2018-08-13T12:29:00.272232426Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:00 swarm1 dockerd: time="2018-08-13T12:29:00.272314845Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:01 swarm1 dockerd: time="2018-08-13T12:29:01.064842333Z" level=debug msg="memberlist: Initiating push/pull sync with: *.*.*.*:7946"
Aug 13 12:29:01 swarm1 dockerd: time="2018-08-13T12:29:01.272247943Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:01 swarm1 dockerd: time="2018-08-13T12:29:01.272314367Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:02 swarm1 dockerd: time="2018-08-13T12:29:02.272457847Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:02 swarm1 dockerd: time="2018-08-13T12:29:02.272527477Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:03 swarm1 dockerd: time="2018-08-13T12:29:03.272469214Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:03 swarm1 dockerd: time="2018-08-13T12:29:03.272545706Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:04 swarm1 dockerd: time="2018-08-13T12:29:04.272531172Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:04 swarm1 dockerd: time="2018-08-13T12:29:04.272603997Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:05 swarm1 dockerd: time="2018-08-13T12:29:05.272330806Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:05 swarm1 dockerd: time="2018-08-13T12:29:05.272386981Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:06 swarm1 dockerd: time="2018-08-13T12:29:06.272380681Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:06 swarm1 dockerd: time="2018-08-13T12:29:06.272458479Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:06 swarm1 dockerd: time="2018-08-13T12:29:06.526217585Z" level=debug msg="memberlist: TCP connection from=*.*.*.*:39354"
Aug 13 12:29:06 swarm1 dockerd: time="2018-08-13T12:29:06.526764442Z" level=debug msg="swarm1-8ddbf165a098: Initiating  bulk sync for networks [6nf9mjgkhz9kpdysb0ysz5p4n 2pr277ylr20s6zre1ov5cyjsm] with node swarm3-c6f28b1ed394"
Aug 13 12:29:07 swarm1 dockerd: time="2018-08-13T12:29:07.272381366Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:07 swarm1 dockerd: time="2018-08-13T12:29:07.272445873Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:08 swarm1 dockerd: time="2018-08-13T12:29:08.272367983Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:08 swarm1 dockerd: time="2018-08-13T12:29:08.272440352Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:09 swarm1 dockerd: time="2018-08-13T12:29:09.272503015Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:09 swarm1 dockerd: time="2018-08-13T12:29:09.272577733Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:10 swarm1 dockerd: time="2018-08-13T12:29:10.272448349Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:10 swarm1 dockerd: time="2018-08-13T12:29:10.272514623Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:29:11 swarm1 dockerd: time="2018-08-13T12:29:11.272469400Z" level=debug msg="failed to send message MsgHeartbeat" error="rpc error: code = 14 desc = grpc: the connection is unavailable" peer_id=4c3ad8d93b5c51db
Aug 13 12:29:11 swarm1 dockerd: time="2018-08-13T12:29:11.272538166Z" level=debug msg="3f15bf204e8ce86d failed to send message to 4c3ad8d93b5c51db because it is unreachable [next = 1213, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke


aaronlehmann commented Jul 12, 2017

Thanks very much for providing the log. I think I understand the problem.

Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.523340355Z" level=debug method="(*Node).Join" node.id=9rc4hrbnfgw5oqbbzbr157uls raft_id=3f15bf204e8ce86d

swarm3 gets promoted first.

Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.535820624Z" level=debug method="(*Node).Join" node.id=umafjbqluc0fxn5r04sfri9d8 raft_id=3f15bf204e8ce86d

swarm2 gets promoted, but this waits for membershipLock. Only one promotion or demotion can happen at a given time.

Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.542809700Z" level=debug msg="node joined" method="(*Node).Join" node.id=9rc4hrbnfgw5oqbbzbr157uls raft_id=3f15bf204e8ce86d

Consensus is reached on swarm3 becoming a manager. membershipLock is released so swarm2's promotion can proceed.

Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.565963968Z" level=debug msg="3f15bf204e8ce86d received msgApp rejection(lastindex: 0) from 79c0024ac8b22557 for index 1210" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke
Aug 13 12:28:24 swarm1 dockerd: time="2018-08-13T12:28:24.566015435Z" level=debug msg="3f15bf204e8ce86d decreased progress of 79c0024ac8b22557 to [next = 1, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" module=raft node.id=qm0ryc2j7h3o76ijq0niyg8ke

At this point, swarm1 is transferring its log to swarm3. This process can involve several heartbeats to figure out what entries the new node needs. Sometimes it involves substantial data transfer. It can take some time.

Aug 13 12:28:35 swarm1 dockerd: time="2018-08-13T12:28:35.531455828Z" level=error msg="failed to add member 4c3ad8d93b5c51db" error="context deadline exceeded" method="(*Node).Join" node.id=umafjbqluc0fxn5r04sfri9d8 raft_id=3f15bf204e8ce86d

swarm3 didn't end up as a fully operational manager in time, so swarm2's promotion timed out waiting for consensus.

Aug 13 12:28:41 swarm1 dockerd: time="2018-08-13T12:28:41.280709860Z" level=debug msg="transport: add peer 4c3ad8d93b5c51db with address *.*.*.*:2377"

swarm3 is finally ready now, and the promotion of swarm2 went through. But we already returned an error to swarm2 from Join, so swarm2 has given up on being a manager, and is no longer listening on port 2377.

Basically, when we go from one manager to two, it can take some time for the second manager to become operational, and in the meantime it's like having a loss of quorum, because only one of the two managers is processing log appends. Joining raft has a timeout, and this timeout can be exceeded when the cluster is in this state.
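
To make the quorum arithmetic concrete (standard raft majority, nothing swarmkit-specific):

	// quorum is the number of raft members that must acknowledge a log
	// entry before it can be committed: a strict majority.
	func quorum(members int) int { return members/2 + 1 }

	// quorum(1) == 1: a lone manager commits on its own.
	// quorum(2) == 2: both managers must participate, so until the new
	//                 member finishes joining, nothing can be committed.
	// quorum(3) == 2: one member may be down or lagging.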

I believe that when this is triggered with docker node promote, the join to raft would be retried later on, but:

  • I suspect that in the cases seen in this issue, the deadlock in manager: Fix hanging Stop method #2203 prevented that from happening. This has been fixed in 17.06.
  • Joining raft again won't necessarily work correctly. In this case, the join went through after the timeout, but swarm2 has no way of knowing that. If it tries to join again, it would get the error "a raft member with this node ID already exists". This would be fixed by raft: Allow Join to be called multiple times for the same cluster member #2198. We should really merge that, as it has been open for a long time and will fix a whole class of problems.

Also, when a node joins as a manager directly (instead of using docker node promote), the timeout in this situation would cause the swarm join to fail, and I believe there would not be a retry. We should decide whether this is expected behavior, or if it makes sense to increase the timeout, retry joining a few times, etc.
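
If we do retry, the obvious shape is a capped backoff around the join attempt. Purely a sketch: join is a placeholder for the raft join RPC, not an existing swarmkit function, and the constants are arbitrary. Note that blind retries would also depend on #2198, since today a repeated join for the same member is rejected.

	// joinWithRetry is an illustrative retry loop around a raft join
	// attempt. Assumes imports: "context", "errors", "time".
	func joinWithRetry(ctx context.Context, join func(context.Context) error) error {
		backoff := time.Second
		for attempt := 0; attempt < 5; attempt++ {
			// each attempt gets its own deadline, like the current join path
			attemptCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
			err := join(attemptCtx)
			cancel()
			if err == nil {
				return nil
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(backoff):
			}
			if backoff < 8*time.Second {
				backoff *= 2 // cap the exponential backoff
			}
		}
		return errors.New("raft join: retries exhausted")
	}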

cc @tonistiigi @cyli @aluzzardi

@yangfeiyu20102011

Thanks. This had confused me for a few days; now I have a better understanding.

@yangfeiyu20102011

I want to know how the node status changes during the promote/demote operation.
I have read some of the related code but I still don't have a clear understanding, especially of the asynchronous parts.
Could you explain the process in detail?
In the process, what functions are called and how does the node status change?
Thanks, @aaronlehmann

@aaronlehmann

Hi @yangfeiyu20102011: I just opened #2318, which includes some information on promotion and demotion. Does it answer your questions?

@yangfeiyu20102011

@aaronlehmann: After reading raft.md, I think I understand now.
