This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Potential Issue with Newer Instance Types and etcd #1230

Closed
kylegoch opened this issue Apr 9, 2018 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@kylegoch

kylegoch commented Apr 9, 2018

I was standing up a production cluster and wanted to use an m5.large instance type for my etcd nodes. However, they never came up. I went back to a t2 instance type, the nodes came up fine, and I could carry on setting everything up.

Here are the logs from the failed etcd node:

Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Starting cfn-signal.service...
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemctl[1852]: inactive
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Control process exited, code=exited status=3
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Failed with result 'exit-code'.
Apr 09 14:23:12 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Failed to start cfn-signal.service.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Service hold-off time over, scheduling restart.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: cfn-signal.service: Scheduled restart job, restart counter is at 108.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Starting etcdadm reconfigure runner...
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Stopped cfn-signal.service.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemctl[1855]: inactive
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcdadm-reconfigure.service: Control process exited, code=exited status=3
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcdadm-reconfigure.service: Failed with result 'exit-code'.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Failed to start etcdadm reconfigure runner.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Dependency failed for etcd (System Application Container).
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: Dependency failed for etcdadm update status.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcdadm-update-status.service: Job etcdadm-update-status.service/start failed with result 'dependency'.
Apr 09 14:23:22 ip-10-100-9-140.us-east-2.compute.internal systemd[1]: etcd-member.service: Job etcd-member.service/start failed with result 'dependency'.

Digging around, I traced it to the var-lib-etcd2.mount unit, which is a requirement of etcd-member.service. It was returning these errors:

Apr 09 17:34:22 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: Dependency failed for /var/lib/etcd2.
Apr 09 17:34:22 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: var-lib-etcd2.mount: Job var-lib-etcd2.mount/start failed with result 'dependency'.
Apr 09 17:35:54 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: Dependency failed for /var/lib/etcd2.
Apr 09 17:35:54 ip-10-100-7-66.us-east-2.compute.internal systemd[1]: var-lib-etcd2.mount: Job var-lib-etcd2.mount/start failed with result 'dependency'.

Not sure if it's something with the way the drives are set up or what, but I figured I would pass it along.
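
For context, the failure chain looks roughly like the sketch below. This is an illustrative unit, not the exact one kube-aws generates; the device path and filesystem type are assumptions. The mount can only start once the expected block device appears, so on instance types where it never does, everything that depends on the mount (including etcd-member.service, per the logs above) fails with 'dependency':

# var-lib-etcd2.mount (illustrative sketch, not the exact generated unit)
[Unit]
# Waits for the data volume to be formatted; that in turn waits for the
# attached EBS device, which never shows up under this name on NVMe instances.
Requires=format-etcd2-volume.service
After=format-etcd2-volume.service

[Mount]
What=/dev/xvdf        # assumed attachment device name
Where=/var/lib/etcd2
Type=ext4             # assumed filesystem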

@mumoshu
Contributor

mumoshu commented Apr 13, 2018

Seems like the same issue for me too, with c5.large:

Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: dev-xvdf.device: Job dev-xvdf.device/start timed out.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: Timed out waiting for device dev-xvdf.device.
-- Subject: Unit dev-xvdf.device has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit dev-xvdf.device has failed.
--
-- The result is RESULT.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: Dependency failed for Formats etcd2 ebs volume.
-- Subject: Unit format-etcd2-volume.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit format-etcd2-volume.service has failed.
--
-- The result is RESULT.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: Dependency failed for /var/lib/etcd2.
-- Subject: Unit var-lib-etcd2.mount has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit var-lib-etcd2.mount has failed.
--
-- The result is RESULT.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: var-lib-etcd2.mount: Job var-lib-etcd2.mount/start failed with result 'dependency'.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: format-etcd2-volume.service: Job format-etcd2-volume.service/start failed with result 'dependency'.
Apr 13 07:58:38 ip-10-0-0-152.ap-northeast-1.compute.internal systemd[1]: dev-xvdf.device: Job dev-xvdf.device/start failed with result 'timeout'.

@mumoshu
Contributor

mumoshu commented Apr 13, 2018

I vaguely remember someone telling me that device names are different on newer instance types, since their EBS volumes are exposed as NVMe block devices:

$ ls -lah /dev/ | grep 'nvme\|xvdf'
crw-------.  1 root root    249,   0 Apr 13 07:55 nvme0
brw-rw----.  1 root disk    259,   0 Apr 13 07:56 nvme0n1
brw-rw----.  1 root disk    259,   1 Apr 13 07:56 nvme0n1p1
brw-rw----.  1 root disk    259,   2 Apr 13 07:56 nvme0n1p2
brw-rw----.  1 root disk    259,   3 Apr 13 07:56 nvme0n1p3
brw-rw----.  1 root disk    259,   4 Apr 13 07:56 nvme0n1p4
brw-rw----.  1 root disk    259,   5 Apr 13 07:56 nvme0n1p6
brw-rw----.  1 root disk    259,   6 Apr 13 07:56 nvme0n1p7
brw-rw----.  1 root disk    259,   7 Apr 13 07:56 nvme0n1p9
crw-------.  1 root root    249,   1 Apr 13 07:55 nvme1
brw-rw----.  1 root disk    259,   8 Apr 13 07:55 nvme1n1
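
If you want to confirm which EBS attachment an NVMe device corresponds to, one way (assuming nvme-cli is available on the host; this is a generic EBS trick, not something kube-aws ships) is to read the original attachment name out of the controller's vendor-specific data:

# Prints the name the volume was attached as, e.g. "/dev/xvdf" or "xvdf".
# Bytes 3072-3103 of the NVMe Identify Controller data carry the EBS attachment name.
sudo nvme id-ctrl --raw-binary /dev/nvme1n1 | cut -c3073-3104 | tr -d '[:space:]'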

@mumoshu
Contributor

mumoshu commented Apr 13, 2018

Found it! 😉 #1048

Would you mind implementing the workaround shared in the upstream CoreOS issue in kube-aws?
Editing all the cloud-configs under core/controlplane/config/templates/cloud-config-* should probably do it.

You can also add the workaround to the userdata/cloud-config-* files under the directory where you run kube-aws render.
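
As a sketch of what that could look like (the rule file name, script path, and the assumption that nvme-cli is present on the host are all hypothetical, so treat this as an illustration rather than the actual fix), something along these lines in the etcd cloud-config would recreate the expected /dev/xvdf name as a symlink:

write_files:
  # Hypothetical udev rule: for every EBS NVMe disk, ask the helper script for the
  # original attachment name (e.g. xvdf) and create a compatibility symlink to it.
  - path: /etc/udev/rules.d/90-ebs-nvme.rules
    permissions: "0644"
    content: |
      ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/opt/bin/ebs-nvme-mapping /dev/%k", SYMLINK+="%c"
  # Hypothetical helper script: reads the attachment name embedded in the NVMe
  # Identify Controller data and prints it without the /dev/ prefix.
  - path: /opt/bin/ebs-nvme-mapping
    permissions: "0755"
    content: |
      #!/bin/bash
      set -euo pipefail
      name="$(nvme id-ctrl --raw-binary "${1}" | cut -c3073-3104 | tr -d '[:space:]')"
      echo "${name#/dev/}"

With something like that in place, units such as format-etcd2-volume.service and var-lib-etcd2.mount could keep referring to /dev/xvdf via the symlink.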

@kylegoch
Author

We just used a different instance type for the time being. Should I close this or leave it open?

@mumoshu
Contributor

mumoshu commented Apr 16, 2018

@kylegoch Thanks for the confirmation 👍
Let's keep this open so we don't forget to add support for the c5/m5 series.

@kiich
Contributor

kiich commented Mar 11, 2019

Just found this issue and wanted to chime in: I spun up a cluster with c5.9xlarge as one of my worker node pools and the cluster seems to have come up OK.
Does this issue only appear if I have an external mount point or something along those lines? Or should I have run into it with the root volume as well?

It seems this is fixed now?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 9, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
