Skip to content
This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021

Closed
ghost opened this issue Dec 8, 2015 · 17 comments
Closed

etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021

ghost opened this issue Dec 8, 2015 · 17 comments

Comments

@ghost
Copy link

ghost commented Dec 8, 2015

Constant stream of the same error over and over again in "journalctl -u etcd2":

Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32252]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32252]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32252]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32259]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32259]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32259]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32284]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32284]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32284]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.

I looked around for how to enable further debug information but didn't find anything obvious. Any help appreciated.

@ghost
Copy link
Author

ghost commented Dec 8, 2015

Just realised this might be a dupe of #936

@ghost
Copy link
Author

ghost commented Dec 8, 2015

journalctl -t shows this extra info:

Dec 07 12:20:32 ip-172-24-107-204.eu-west-1.compute.internal etcd2[788]: invalid datadir. Both member and proxy directories exist.

I don't know how to recover from that?

@ghost
Copy link
Author

ghost commented Dec 8, 2015

We are possibly hitting etcd-io/etcd#3827

@crawford
Copy link
Contributor

crawford commented Dec 8, 2015

@martingartonft You just need to remove everything in the data directory (default: /var/lib/etcd2). Somehow that node was once a member and is now a proxy.

@ghost
Copy link
Author

ghost commented Dec 9, 2015

Thanks. That does work, but the problem keeps re-occuring. (today I have another machine doing the same). This is happening on all my alpha clusters, but none of my stable ones.

I'd like to be able to verify that these machines have once been a member? I don't see any evidence of that.

@xiang90
Copy link

xiang90 commented Dec 9, 2015

but the problem keeps re-occuring.

Are you keeping on recreating machines?

@ghost
Copy link
Author

ghost commented Dec 9, 2015

No, the clusters are generally static once created. I have re-created some of the clusters entirely as a last resort to see whether the problem goes away. The new cluster is fine as expected, but after a time it happens again.

@xiang90
Copy link

xiang90 commented Dec 9, 2015

@martingartonft All the failures are due to bootstrap (at least from the logging you provided). The double data dir is caused by an issue in etcd when a machine failed to finish discovery and fell back to a proxy for the second discovery when the cluster is full.

Can you verify that the cluster is actually static after the first successful bootstrap?

@ghost
Copy link
Author

ghost commented Dec 10, 2015

@xiang90 By "static" do you mean no new machines getting added, none removed, and none being replaced or IP addresses changing? If so, I can confirm it is static.

If there is anything else I can tell you or logs that will help, please let me know.

@xiang90
Copy link

xiang90 commented Dec 10, 2015

@martingartonft

By "static" do you mean no new machines getting added, none removed, and none being replaced or IP addresses changing? If so, I can confirm it is static.

Exactly.

If there is anything else I can tell you or logs that will help, please let me know.

Can you show me the full log of all members? From they successfully bootstrapped to when they started to restart constantly? I actually have never seen this before and could not reproduce it.

@ghost
Copy link
Author

ghost commented Dec 11, 2015

Unfortunately the start of the logs is no longer available. I will try to catch the issue happening next time and grab the logs before while they are still there.

@ghost
Copy link
Author

ghost commented Dec 15, 2015

@xiang90 Okay it happened again and I got more of the logs. The problem started after a reboot as you can see below:

ec 11 21:44:18 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:57892 closed request prematurely
Dec 11 21:44:38 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58272 closed request prematurely
Dec 11 21:44:58 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58452 closed request prematurely
Dec 11 21:45:18 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58528 closed request prematurely
Dec 11 21:45:38 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58948 closed request prematurely
Dec 11 21:45:58 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:59138 closed request prematurely
Dec 11 21:46:18 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:59200 closed request prematurely
Dec 11 21:46:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:59580 closed request prematurely
-- Reboot --
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal etcd2[755]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal etcd2[755]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal etcd2[755]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal etcd2[797]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal etcd2[797]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal etcd2[797]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[804]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[804]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[804]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal etcd2[814]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal etcd2[814]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal etcd2[814]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:57 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:57 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:57 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...

@df-ryan
Copy link

df-ryan commented Dec 22, 2015

I'm seeing the same problem. I created a cluster of two machines and have since added two more ESXi virtual machines to the cluster which initially come up fine. With the original two machines, fleetctl works fine, but if I reboot the last two machines I added, fleetclt returns the following error:

Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etcd

sudo journalctl -u etcd2

Shows that etcd2 failed to start:

Dec 22 22:49:49 localhost etcd2[654]: invalid datadir. Both member and proxy directories exist.

Removing /var/lib/etcd2/member folder allows the etcd2 to start properly and allows CoreOS to be rebooted without problem.

@xiang90
Copy link

xiang90 commented Jan 12, 2016

@df-ryan

This issue is fixed by etcd-io/etcd#4089 in etcd 2.2.3 release.
We have confirmed this is fixed. Also etcd 2.2.3 is shipped in latest CoreOS alpha. User has confirmed it does fix the issue too. (etcd-io/etcd#3971 (comment))

@crawford I think this can be closed now.

@crawford
Copy link
Contributor

Fixed in 921.0.0.

@kayrus
Copy link

kayrus commented Jan 21, 2016

@crawford can you include etcd v2.2.3 into latest stable? lot of users face problem of bootstrapping etcd cluster.

@crawford
Copy link
Contributor

It looks like 899.4.0 (which has etcd 2.2.3) will be the base for the next Stable. So yes, it should be in there.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants