Node removed from a cluster without any (visible) reason #843

Closed
bjaglin opened this issue Jun 12, 2014 · 4 comments

Comments

@bjaglin

bjaglin commented Jun 12, 2014

I had a cluster of 3 nodes running on EC2 (etcd 0.4.3, CoreOS alpha 343.0.0). One node was removed from the cluster for no reason I can find in the logs (and with no known network partition). Rebooting that node made it rejoin the cluster.

node 9c3d9d039dad434089d9c9cc4a352f0a, which was removed from the cluster:

Jun 12 15:02:55 ip-10-0-1-50 etcd[3082]: [etcd] Jun 12 15:02:55.803 WARNING   | [ae] Error: nil response
Jun 12 15:02:55 ip-10-0-1-50 etcd[3082]: [etcd] Jun 12 15:02:55.803 INFO      | 9c3d9d039dad434089d9c9cc4a352f0a: state changed from 'follower' to 'stopped'.
Jun 12 15:02:55 ip-10-0-1-50 etcd[3082]: [etcd] Jun 12 15:02:55.803 INFO      | 9c3d9d039dad434089d9c9cc4a352f0a: state changed from 'stopped' to 'stopped'.

node 4afd42696815438da5175179fa29e235, which is the leader:

Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.702 INFO      | 4afd42696815438da5175179fa29e235: removing: 9c3d9d039dad434089d9c9cc4a352f0a
Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.738 INFO      | 4afd42696815438da5175179fa29e235: peer removed: '9c3d9d039dad434089d9c9cc4a352f0a'
Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.800 WARNING   | transporter.ae.decoding.error:proto: field/encoding mismatch: wrong type for field
Jun 12 15:02:55 ip-10-0-1-254 etcd[5507]: [etcd] Jun 12 15:02:55.801 INFO      | 4afd42696815438da5175179fa29e235: warning: heartbeat time out peer="9c3d9d039dad434089d9c9cc4a352f0a" missed=1 backoff="4s"

node a1433478ab9548dba9d3e97ec5838b22, which is a follower:

Jun 12 15:02:55 ip-10-0-101-160 etcd[5673]: [etcd] Jun 12 15:02:55.783 INFO      | a1433478ab9548dba9d3e97ec5838b22: peer removed: '9c3d9d039dad434089d9c9cc4a352f0a'

Any idea?

@yichengq
Contributor

@bjaglin The log is ambiguous, and I have sent a PR to improve it.
It indicates that the node was removed by the cluster because the number of peer-mode machines in the cluster was larger than the activeSize setting in the cluster config.
Does this fit your case?
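
The rule described above can be sketched as follows. This is only an illustration of the stated condition (more peer-mode machines than activeSize), not etcd's actual code: the function name `pick_standby_candidates` is hypothetical, and taking the surplus from the end of the list is an assumption, since the thread does not say how etcd chooses which peer to remove. The node IDs are shortened from the logs above.

```python
def pick_standby_candidates(peers, active_size):
    """Return the peers in excess of active_size.

    Assumption: the surplus is taken from the end of the list. How etcd
    really selects the peer to remove is not stated in this thread.
    """
    if active_size < 1:
        raise ValueError("active_size must be at least 1")
    surplus = len(peers) - active_size
    # No peer is demoted unless the peer count exceeds active_size.
    return list(peers[-surplus:]) if surplus > 0 else []


# Shortened node IDs from the logs in this issue.
peers = ["4afd4269", "a1433478", "9c3d9d03"]
print(pick_standby_candidates(peers, 3))
print(pick_standby_candidates(peers, 2))
```

Under this sketch, three peers with an activeSize of 3 give no surplus; a peer is only selected once activeSize drops below the number of peers.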

@bjaglin
Author

bjaglin commented Jun 12, 2014

I did update the activeSize, but only to decrease it from 9 to 3. If that node had been demoted, it would still return an up-to-date :7001/v2/admin/machines, right? It wasn't up to date.

Thanks for the PR, I will close this for now and will reopen with more info if I reproduce.

@yichengq
Contributor

If the node goes into standby mode, it will reject all requests to endpoints on 7001/peer-addr, because these are only used for peer-mode instances.
Thanks for reporting!

@bjaglin
Author

bjaglin commented Jun 12, 2014

Right - then there was definitely something fishy, as that URL on the demoted node was returning a successful payload, although the peers it described became outdated as time passed and peers were updated in the rest of the cluster. As I said, there is not much else to do without further logs, so closing this for now!
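
For anyone trying to reproduce this, one way to spot a node serving an outdated view is to compare what each node returns from :7001/v2/admin/machines. Below is a minimal sketch, assuming the endpoint returns a JSON array of objects with a `name` field (the payload shape is an assumption, not confirmed in this thread). The helper names are hypothetical and the sample payloads are made up for illustration; only the node IDs and addresses are taken from the logs above.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_machines(host, port=7001, timeout=2.0):
    """Fetch one node's machine list (assumed to be a JSON array)."""
    url = f"http://{host}:{port}/v2/admin/machines"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def find_divergent_views(views):
    """Return the nodes whose machine names differ from the most common view.

    `views` maps a node address to its parsed payload. On a tie, the view
    seen first counts as the reference.
    """
    name_sets = {
        node: frozenset(m["name"] for m in payload)
        for node, payload in views.items()
    }
    reference, _ = Counter(name_sets.values()).most_common(1)[0]
    return sorted(node for node, names in name_sets.items() if names != reference)


# Made-up payloads: the removed node still lists itself as a cluster member.
current = [{"name": "4afd4269"}, {"name": "a1433478"}]
stale = current + [{"name": "9c3d9d03"}]
views = {"10.0.1.254": current, "10.0.101.160": current, "10.0.1.50": stale}
print(find_divergent_views(views))
```

In practice `views` would be built by calling `fetch_machines` on each node, and a node that rejects the request could be recorded separately from one that answers with a stale list.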

bjaglin closed this as completed Jun 12, 2014