This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Potential Issue with Newer Instance Types and etcd #1230

Closed
kylegoch opened this issue Apr 9, 2018 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@kylegoch

kylegoch commented Apr 9, 2018

I was standing up a production cluster and wanted to use an m5.large instance type for my etcd nodes. However, they never came up. I went back to a t2 instance type, the nodes came up fine, and I could carry on setting everything up.

Here are the logs from the failed etcd node:

Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Starting cfn-signal.service...
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemctl[1852]: inactive
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Control process exited, code=exited status=3
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Failed with result 'exit-code'.
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Failed to start cfn-signal.service.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Service hold-off time over, scheduling restart.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Scheduled restart job, restart counter is at 108.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Starting etcdadm reconfigure runner...
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Stopped cfn-signal.service.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemctl[1855]: inactive
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcdadm-reconfigure.service: Control process exited, code=exited status=3
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcdadm-reconfigure.service: Failed with result 'exit-code'.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Failed to start etcdadm reconfigure runner.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Dependency failed for etcd (System Application Container).
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Dependency failed for etcdadm update status.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcdadm-update-status.service: Job etcdadm-update-status.service/start failed with result 'dependency'.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcd-member.service: Job etcd-member.service/start failed with result 'dependency'.

Digging around, I traced it to the var-lib-etcd2.mount unit, which is a requirement of etcd-member.service. It was returning these errors:

Apr 09 17:34:22 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: Dependency failed for /var/lib/etcd2.
Apr 09 17:34:22 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: var-lib-etcd2.mount: Job var-lib-etcd2.mount/start failed with result 'dependency'.
Apr 09 17:35:54 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: Dependency failed for /var/lib/etcd2.
Apr 09 17:35:54 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: var-lib-etcd2.mount: Job var-lib-etcd2.mount/start failed with result 'dependency'.

Not sure if it's something with the way the drives are set up or what, but I figured I would pass it along.
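
For context, the failure chain looks roughly like the sketch below. This is an illustrative unit, not the exact one kube-aws generates; the device path and filesystem type are assumptions. The mount can only start once the expected block device appears, so on instance types where it never does, everything that depends on the mount (including etcd-member.service, per the logs above) fails with 'dependency':

# var-lib-etcd2.mount (illustrative sketch, not the exact generated unit)
[Unit]
# Waits for the data volume to be formatted; that in turn waits for the
# attached EBS device, which never shows up under this name on NVMe instances.
Requires=format-etcd2-volume.service
After=format-etcd2-volume.service

[Mount]
What=/dev/xvdf        # assumed attachment device name
Where=/var/lib/etcd2
Type=ext4             # assumed filesystem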

@mumoshu
Contributor

mumoshu commented Apr 13, 2018

Seems like the same issue for me too, with c5.large:

Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: dev-xvdf.device: Job dev-xvdf.device/start timed out.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: Timed out waiting for device dev-xvdf.device.
-- Subject: Unit dev-xvdf.device has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit dev-xvdf.device has failed.
--
-- The result is RESULT.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: Dependency failed for Formats etcd2 ebs volume.
-- Subject: Unit format-etcd2-volume.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit format-etcd2-volume.service has failed.
--
-- The result is RESULT.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: Dependency failed for /var/lib/etcd2.
-- Subject: Unit var-lib-etcd2.mount has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit var-lib-etcd2.mount has failed.
--
-- The result is RESULT.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: var-lib-etcd2.mount: Job var-lib-etcd2.mount/start failed with result 'dependency'.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: format-etcd2-volume.service: Job format-etcd2-volume.service/start failed with result 'dependency'.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: dev-xvdf.device: Job dev-xvdf.device/start failed with result 'timeout'.

@mumoshu
Contributor

mumoshu commented Apr 13, 2018

I vaguely remember someone telling me that device names are different on newer instance types, since their EBS volumes are exposed as NVMe block devices:

$ ls -lah /dev/ | grep 'nvme\|xvdf'
crw-------.  1 root root    249,   0 Apr 13 07:55 nvme0
brw-rw----.  1 root disk    259,   0 Apr 13 07:56 nvme0n1
brw-rw----.  1 root disk    259,   1 Apr 13 07:56 nvme0n1p1
brw-rw----.  1 root disk    259,   2 Apr 13 07:56 nvme0n1p2
brw-rw----.  1 root disk    259,   3 Apr 13 07:56 nvme0n1p3
brw-rw----.  1 root disk    259,   4 Apr 13 07:56 nvme0n1p4
brw-rw----.  1 root disk    259,   5 Apr 13 07:56 nvme0n1p6
brw-rw----.  1 root disk    259,   6 Apr 13 07:56 nvme0n1p7
brw-rw----.  1 root disk    259,   7 Apr 13 07:56 nvme0n1p9
crw-------.  1 root root    249,   1 Apr 13 07:55 nvme1
brw-rw----.  1 root disk    259,   8 Apr 13 07:55 nvme1n1
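
If you want to confirm which EBS attachment an NVMe device corresponds to, one way (assuming nvme-cli is available on the host; this is a generic EBS trick, not something kube-aws ships) is to read the original attachment name out of the controller's vendor-specific data:

# Prints the name the volume was attached as, e.g. "/dev/xvdf" or "xvdf".
# Bytes 3072-3103 of the NVMe Identify Controller data carry the EBS attachment name.
sudo nvme id-ctrl --raw-binary /dev/nvme1n1 | cut -c3073-3104 | tr -d '[:space:]'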

@mumoshu
Contributor

mumoshu commented Apr 13, 2018

Found it! 😉 #1048

Would you mind implementing the workaround shared in the upstream CoreOS issue in kube-aws?
Editing all the cloud-configs under core/controlplane/config/templates/cloud-config-* should probably do it.

You can also add the workaround to the userdata/cloud-config-* files under the directory where you run kube-aws render.
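
As a sketch of what that could look like (the rule file name, script path, and the assumption that nvme-cli is present on the host are all hypothetical, so treat this as an illustration rather than the actual fix), something along these lines in the etcd cloud-config would recreate the expected /dev/xvdf name as a symlink:

write_files:
  # Hypothetical udev rule: for every EBS NVMe disk, ask the helper script for the
  # original attachment name (e.g. xvdf) and create a compatibility symlink to it.
  - path: /etc/udev/rules.d/90-ebs-nvme.rules
    permissions: "0644"
    content: |
      ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/opt/bin/ebs-nvme-mapping /dev/%k", SYMLINK+="%c"
  # Hypothetical helper script: reads the attachment name embedded in the NVMe
  # Identify Controller data and prints it without the /dev/ prefix.
  - path: /opt/bin/ebs-nvme-mapping
    permissions: "0755"
    content: |
      #!/bin/bash
      set -euo pipefail
      name="$(nvme id-ctrl --raw-binary "${1}" | cut -c3073-3104 | tr -d '[:space:]')"
      echo "${name#/dev/}"

With something like that in place, units such as format-etcd2-volume.service and var-lib-etcd2.mount could keep referring to /dev/xvdf via the symlink.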

@kylegoch
Author

We just used a different instance type for the time being. Should I close this or leave it open?

@mumoshu
Contributor

mumoshu commented Apr 16, 2018

@kylegoch Thanks for the confirmation 👍
Let's keep this open so we don't forget to add support for the c5/m5 series.

@kiich
Contributor

kiich commented Mar 11, 2019

Just found this issue and wanted to chime in: I spun up a cluster with c5.9xlarge as one of my worker node pools and the cluster seems to have come up OK.
Does this issue only appear if I have an external mount point or something along those lines? Or should I have run into it with the root volume as well?

It seems this is fixed now?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 9, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
