etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021

ghost · 2015-12-08T11:26:46Z

Constant stream of the same error over and over again in "journalctl -u etcd2":

Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32252]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32252]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32252]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 08 11:28:40 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32259]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32259]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32259]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 08 11:28:50 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32284]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32284]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal etcd2[32284]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 08 11:29:00 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.

I looked around for how to enable further debug information but didn't find anything obvious. Any help appreciated.

The text was updated successfully, but these errors were encountered:

ghost · 2015-12-08T12:22:45Z

Just realised this might be a dupe of #936

ghost · 2015-12-08T12:24:13Z

journalctl -t shows this extra info:

Dec 07 12:20:32 ip-172-24-107-204.eu-west-1.compute.internal etcd2[788]: invalid datadir. Both member and proxy directories exist.

I don't know how to recover from that?

ghost · 2015-12-08T12:29:14Z

We are possibly hitting etcd-io/etcd#3827

crawford · 2015-12-08T18:37:30Z

@martingartonft You just need to remove everything in the data directory (default: /var/lib/etcd2). Somehow that node was once a member and is now a proxy.

ghost · 2015-12-09T15:16:58Z

Thanks. That does work, but the problem keeps re-occuring. (today I have another machine doing the same). This is happening on all my alpha clusters, but none of my stable ones.

I'd like to be able to verify that these machines have once been a member? I don't see any evidence of that.

xiang90 · 2015-12-09T22:28:43Z

but the problem keeps re-occuring.

Are you keeping on recreating machines?

ghost · 2015-12-09T22:43:51Z

No, the clusters are generally static once created. I have re-created some of the clusters entirely as a last resort to see whether the problem goes away. The new cluster is fine as expected, but after a time it happens again.

xiang90 · 2015-12-09T22:47:57Z

@martingartonft All the failures are due to bootstrap (at least from the logging you provided). The double data dir is caused by an issue in etcd when a machine failed to finish discovery and fell back to a proxy for the second discovery when the cluster is full.

Can you verify that the cluster is actually static after the first successful bootstrap?

ghost · 2015-12-10T09:53:57Z

@xiang90 By "static" do you mean no new machines getting added, none removed, and none being replaced or IP addresses changing? If so, I can confirm it is static.

If there is anything else I can tell you or logs that will help, please let me know.

xiang90 · 2015-12-10T15:57:07Z

@martingartonft

By "static" do you mean no new machines getting added, none removed, and none being replaced or IP addresses changing? If so, I can confirm it is static.

Exactly.

If there is anything else I can tell you or logs that will help, please let me know.

Can you show me the full log of all members? From they successfully bootstrapped to when they started to restart constantly? I actually have never seen this before and could not reproduce it.

ghost · 2015-12-11T09:55:43Z

Unfortunately the start of the logs is no longer available. I will try to catch the issue happening next time and grab the logs before while they are still there.

ghost · 2015-12-15T10:24:05Z

@xiang90 Okay it happened again and I got more of the logs. The problem started after a reboot as you can see below:

ec 11 21:44:18 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:57892 closed request prematurely
Dec 11 21:44:38 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58272 closed request prematurely
Dec 11 21:44:58 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58452 closed request prematurely
Dec 11 21:45:18 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58528 closed request prematurely
Dec 11 21:45:38 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:58948 closed request prematurely
Dec 11 21:45:58 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:59138 closed request prematurely
Dec 11 21:46:18 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:59200 closed request prematurely
Dec 11 21:46:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[3168]: proxy: client 172.17.0.5:59580 closed request prematurely
-- Reboot --
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal etcd2[755]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal etcd2[755]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:16 ip-172-24-107-204.eu-west-1.compute.internal etcd2[755]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:17 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal etcd2[797]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal etcd2[797]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal etcd2[797]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:27 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[804]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[804]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal etcd2[804]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:37 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal etcd2[814]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.24.107.204:2379,http://172.24.107.204:4001
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal etcd2[814]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal etcd2[814]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/a0c6f0e23c444fb7c9df6b9f59652ba1
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Failed to start etcd2.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Unit entered failed state.
Dec 11 21:48:47 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Failed with result 'exit-code'.
Dec 11 21:48:57 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: etcd2.service: Service hold-off time over, scheduling restart.
Dec 11 21:48:57 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Stopped etcd2.
Dec 11 21:48:57 ip-172-24-107-204.eu-west-1.compute.internal systemd[1]: Starting etcd2...

df-ryan · 2015-12-22T23:01:13Z

I'm seeing the same problem. I created a cluster of two machines and have since added two more ESXi virtual machines to the cluster which initially come up fine. With the original two machines, fleetctl works fine, but if I reboot the last two machines I added, fleetclt returns the following error:

Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etcd

sudo journalctl -u etcd2

Shows that etcd2 failed to start:

Dec 22 22:49:49 localhost etcd2[654]: invalid datadir. Both member and proxy directories exist.

Removing /var/lib/etcd2/member folder allows the etcd2 to start properly and allows CoreOS to be rebooted without problem.

xiang90 · 2016-01-12T21:27:33Z

@df-ryan

This issue is fixed by etcd-io/etcd#4089 in etcd 2.2.3 release.
We have confirmed this is fixed. Also etcd 2.2.3 is shipped in latest CoreOS alpha. User has confirmed it does fix the issue too. (etcd-io/etcd#3971 (comment))

@crawford I think this can be closed now.

crawford · 2016-01-12T21:35:36Z

Fixed in 921.0.0.

kayrus · 2016-01-21T15:01:40Z

@crawford can you include etcd v2.2.3 into latest stable? lot of users face problem of bootstrapping etcd cluster.

crawford · 2016-01-21T16:07:23Z

It looks like 899.4.0 (which has etcd 2.2.3) will be the base for the next Stable. So yes, it should be in there.

crawford added component/etcd kind/question labels Dec 8, 2015

crawford closed this as completed Jan 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021

etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021

ghost commented Dec 8, 2015

ghost commented Dec 8, 2015

ghost commented Dec 8, 2015

ghost commented Dec 8, 2015

crawford commented Dec 8, 2015

ghost commented Dec 9, 2015

xiang90 commented Dec 9, 2015

ghost commented Dec 9, 2015

xiang90 commented Dec 9, 2015

ghost commented Dec 10, 2015

xiang90 commented Dec 10, 2015

ghost commented Dec 11, 2015

ghost commented Dec 15, 2015

df-ryan commented Dec 22, 2015

xiang90 commented Jan 12, 2016

crawford commented Jan 12, 2016

kayrus commented Jan 21, 2016

crawford commented Jan 21, 2016

etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021

etcd2 constantly restarting on some machines (CoreOS alpha (884.0.0)) #1021

Comments

ghost commented Dec 8, 2015

ghost commented Dec 8, 2015

ghost commented Dec 8, 2015

ghost commented Dec 8, 2015

crawford commented Dec 8, 2015

ghost commented Dec 9, 2015

xiang90 commented Dec 9, 2015

ghost commented Dec 9, 2015

xiang90 commented Dec 9, 2015

ghost commented Dec 10, 2015

xiang90 commented Dec 10, 2015

ghost commented Dec 11, 2015

ghost commented Dec 15, 2015

df-ryan commented Dec 22, 2015

xiang90 commented Jan 12, 2016

crawford commented Jan 12, 2016

kayrus commented Jan 21, 2016

crawford commented Jan 21, 2016