etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021
Comments
Just realised this might be a dupe of #936 |
journalctl -t shows this extra info:
Dec 07 12:20:32 ip-172-24-107-204.eu-west-1.compute.internal etcd2[788]: invalid datadir. Both member and proxy directories exist.
I don't know how to recover from that? |
We are possibly hitting etcd-io/etcd#3827 |
@martingartonft You just need to remove everything in the data directory (default: |
Thanks. That does work, but the problem keeps recurring (today I have another machine doing the same). This is happening on all my alpha clusters, but none of my stable ones. I'd like to be able to verify that these machines have once been members. I don't see any evidence of that. |
Do you keep recreating machines? |
No, the clusters are generally static once created. I have re-created some of the clusters entirely as a last resort to see whether the problem goes away. The new cluster is fine as expected, but after a time it happens again. |
@martingartonft All the failures are due to bootstrap (at least from the logging you provided). The double data dir is caused by an issue in etcd: when a machine fails to finish discovery, it falls back to proxy mode on the second discovery attempt because the cluster is already full. Can you verify that the cluster is actually static after the first successful bootstrap? |
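The "double data dir" state described above can be checked for directly. The following is a minimal sketch, not part of etcd itself: it assumes the data directory holds `member` and `proxy` subdirectories (as the error message and the `/var/lib/etcd2/member` path mentioned in this thread suggest), and the function name `datadir_state` is made up for illustration.

```python
import os


def datadir_state(data_dir):
    """Classify an etcd2 data directory by the subdirectories it contains.

    Assumption: member state lives in <data_dir>/member and proxy state
    in <data_dir>/proxy. Having both matches the "invalid datadir" error.
    """
    has_member = os.path.isdir(os.path.join(data_dir, "member"))
    has_proxy = os.path.isdir(os.path.join(data_dir, "proxy"))
    if has_member and has_proxy:
        return "invalid"  # both exist: etcd2 refuses to start
    if has_member:
        return "member"
    if has_proxy:
        return "proxy"
    return "empty"


if __name__ == "__main__":
    # /var/lib/etcd2 is the data directory seen later in this thread;
    # adjust it if your units configure a different path.
    print(datadir_state("/var/lib/etcd2"))
```

The check only reads the directory layout; it does not remove anything, so it is safe to run on a machine that is still restarting.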
@xiang90 By "static" do you mean no new machines getting added, none removed, and none being replaced or IP addresses changing? If so, I can confirm it is static. If there is anything else I can tell you or logs that will help, please let me know. |
@martingartonft
Exactly.
Can you show me the full log of all members, from when they successfully bootstrapped to when they started to restart constantly? I have actually never seen this before and could not reproduce it. |
Unfortunately the start of the logs is no longer available. I will try to catch the issue next time it happens and grab the logs while they are still there. |
@xiang90 Okay it happened again and I got more of the logs. The problem started after a reboot as you can see below:
|
I'm seeing the same problem. I created a cluster of two machines and have since added two more ESXi virtual machines to the cluster, which initially come up fine. With the original two machines, fleetctl works fine, but if I reboot the last two machines I added, fleetctl returns the following error:
Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etcd
sudo journalctl -u etcd2 shows that etcd2 failed to start:
Dec 22 22:49:49 localhost etcd2[654]: invalid datadir. Both member and proxy directories exist.
Removing the /var/lib/etcd2/member folder allows etcd2 to start properly and allows CoreOS to be rebooted without problems. |
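To spot affected machines without reading the journal by hand, the error line quoted in this thread can be matched with a small filter. This is only a sketch: the helper name `find_invalid_datadir` is hypothetical, and the pattern is built from the two log lines posted here, so other etcd versions may word the message differently.

```python
import re

# Pattern derived from the log lines quoted in this thread.
INVALID_DATADIR = re.compile(
    r"etcd2\[\d+\]: invalid datadir\. "
    r"Both member and proxy directories exist\."
)


def find_invalid_datadir(lines):
    """Return the journal lines that report the double data dir error."""
    return [line for line in lines if INVALID_DATADIR.search(line)]


if __name__ == "__main__":
    sample = [
        "Dec 22 22:49:49 localhost etcd2[654]: invalid datadir. "
        "Both member and proxy directories exist.",
        "some unrelated journal line",  # hypothetical filler, not from the thread
    ]
    for hit in find_invalid_datadir(sample):
        print(hit)
```

In practice the input would be the saved output of `journalctl -u etcd2`, split into lines.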
This issue is fixed by etcd-io/etcd#4089 in the etcd 2.2.3 release. @crawford I think this can be closed now. |
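Since the fix is tied to etcd 2.2.3, a quick version comparison can tell whether a given machine should already have it. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` version strings with an optional leading `v`; the helper names are made up for illustration, and pre-release suffixes are not handled.

```python
def parse_version(text):
    """Turn '2.2.3' or 'v2.2.3' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def has_datadir_fix(etcd_version):
    """True if the version is at least 2.2.3, the release named above."""
    return parse_version(etcd_version) >= (2, 2, 3)


if __name__ == "__main__":
    for version in ("2.2.2", "v2.2.3", "2.3.0"):
        print(version, has_datadir_fix(version))
```

Comparing tuples of integers avoids the string-comparison trap where "2.10.0" would sort before "2.2.3".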
Fixed in 921.0.0. |
@crawford can you include etcd v2.2.3 in the latest stable? A lot of users are having problems bootstrapping etcd clusters. |
It looks like 899.4.0 (which has etcd 2.2.3) will be the base for the next Stable. So yes, it should be in there. |
Constant stream of the same error over and over again in "journalctl -u etcd2":
I looked around for how to enable further debug information but didn't find anything obvious. Any help appreciated.