Node removed from a cluster without any (visible) reason #843

bjaglin · 2014-06-12T16:50:25Z

I had a cluster of 3 nodes running on EC2 (etcd 0.4.3, CoreOS alpha 343.0.0). One node was removed from the cluster for no reason that I can see in the logs (and with no known network partition). Rebooting that node made it rejoin the cluster.

node 9c3d9d039dad434089d9c9cc4a352f0a, which was removed from the cluster:

Jun 12 15:02:55 ip-10-0-1-50 etcd[3082]: [etcd] Jun 12 15:02:55.803 WARNING   | [ae] Error: nil response
Jun 12 15:02:55 ip-10-0-1-50 etcd[3082]: [etcd] Jun 12 15:02:55.803 INFO      | 9c3d9d039dad434089d9c9cc4a352f0a: state changed from 'follower' to 'stopped'.
Jun 12 15:02:55 ip-10-0-1-50 etcd[3082]: [etcd] Jun 12 15:02:55.803 INFO      | 9c3d9d039dad434089d9c9cc4a352f0a: state changed from 'stopped' to 'stopped'.

node 4afd42696815438da5175179fa29e235, which is the leader:

Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.702 INFO      | 4afd42696815438da5175179fa29e235: removing: 9c3d9d039dad434089d9c9cc4a352f0a
Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.738 INFO      | 4afd42696815438da5175179fa29e235: peer removed: '9c3d9d039dad434089d9c9cc4a352f0a'
Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.800 WARNING   | transporter.ae.decoding.error:proto: field/encoding mismatch: wrong type for field
Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.801 INFO      | 4afd42696815438da5175179fa29e235: warning: heartbeat time out peer="9c3d9d039dad434089d9c9cc4a352f0a" missed=1 backoff="4s"

node a1433478ab9548dba9d3e97ec5838b22, which is a follower:

Jun 12 15:02:55 ip-10-0-101-160 etcd[5673]: [etcd] Jun 12 15:02:55.783 INFO      | a1433478ab9548dba9d3e97ec5838b22: peer removed: '9c3d9d039dad434089d9c9cc4a352f0a'

Any idea?


yichengq · 2014-06-12T17:06:43Z

@bjaglin The log is ambiguous, and I have sent a PR to improve it.
It indicates that the cluster removed the node because the number of peer-mode machines in the cluster is larger than the activeSize setting in the cluster config.
Does this fit your case?
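
For readers following along, the rule described above can be sketched as a small Go function. This is only an illustration of the stated condition (more peer-mode machines than activeSize), not etcd's actual code: the name pickForDemotion is hypothetical, and taking the excess from the end of the list is an assumption, since the thread does not say how etcd chooses which node to remove.

```go
package main

import "fmt"

// pickForDemotion is a hypothetical helper, not part of etcd.
// Given the current peer-mode machines and the configured activeSize,
// it returns the machines that exceed activeSize and would therefore
// have to leave peer mode.
func pickForDemotion(peers []string, activeSize int) []string {
	if activeSize < 0 {
		activeSize = 0
	}
	if len(peers) <= activeSize {
		// The cluster is within its limit: nothing should be removed.
		return nil
	}
	// Assumption: the excess is taken from the end of the list.
	// The real selection policy is not described in this thread.
	return append([]string(nil), peers[activeSize:]...)
}

func main() {
	// Shortened node names from the logs, used only as sample data.
	peers := []string{"4afd4269", "a1433478", "9c3d9d03"}

	fmt.Println(len(pickForDemotion(peers, 9))) // activeSize above the peer count
	fmt.Println(len(pickForDemotion(peers, 3))) // activeSize equal to the peer count
	fmt.Println(pickForDemotion(peers, 2))      // activeSize below the peer count
}
```

Under this reading, a 3-node cluster with activeSize set to 3 has no excess peers, so the rule by itself would not select any node for removal.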

bjaglin · 2014-06-12T17:25:52Z

I did update the activeSize, but only to decrease it from 9 to 3. If that node had been demoted, it would still return an up-to-date :7001/v2/admin/machines, right? It didn't.

Thanks for the PR, I will close this for now and will reopen with more info if I reproduce.

yichengq · 2014-06-12T17:33:31Z

If the node goes into standby mode, it will reject all requests to endpoints on 7001/peer-addr, because that address is only used by peer-mode instances.
Thanks for reporting!

bjaglin · 2014-06-12T21:01:53Z

Right - then there was definitely something fishy, as that URL on the demoted node was returning a successful payload, although it described increasingly outdated peers as time passed and peers were updated in the rest of the cluster. As said, there is not much else to do without further logs, so closing this for now!

yichengq mentioned this issue Jun 12, 2014

chore(peer_server): improve log for auto removal #844

Merged

bjaglin closed this as completed Jun 12, 2014
