
Start efs stunnel watch dog #104

Closed
leakingtapan opened this issue Nov 26, 2019 · 8 comments · Fixed by #113

@leakingtapan
Contributor

leakingtapan commented Nov 26, 2019

Is your feature request related to a problem? Please describe.
The EFS stunnel watchdog is not started properly because the efs mount helper is installed within a container environment. We need to start the watchdog so stunnel can be recovered if it crashes.

Error message:

Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"

That is because there is no proper init system present in the container, which causes the efs stunnel watchdog startup to fail here. Although this looks like a problem, it doesn't seem to be the cause of this issue, since the watchdog was never started even in the initially successful mount:

bash-4.2# cat /var/log/amazon/efs/mount.log
2019-11-26 20:49:30,695 - INFO - version=1.9 options={'tls': None, 'rw': None}
2019-11-26 20:49:30,700 - WARNING - Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
2019-11-26 20:49:30,737 - INFO - Starting TLS tunnel: "stunnel /var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.390d7c5f-108e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv.mount.20388"
2019-11-26 20:49:30,768 - INFO - Started TLS tunnel, pid: 8083
2019-11-26 20:49:30,769 - INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:/ /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount -o rw,noresvport,nfsvers=4.1,retrans=2,hard,wsize=1048576,timeo=600,rsize=1048576,port=20388"
2019-11-26 20:49:31,089 - INFO - Successfully mounted fs-e8a95a42.efs.us-west-2.amazonaws.com at /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount

Originally posted by @leakingtapan in #103 (comment)
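The odd name in the error message is consistent with the mount helper inspecting the comm name of PID 1, which the Linux kernel truncates to 15 bytes (TASK_COMM_LEN minus the trailing NUL): "aws-efs-csi-driver" truncated that way is exactly "aws-efs-csi-dri". A minimal sketch of that truncation, assuming this detection mechanism and that the container's PID 1 is the driver binary:

```go
package main

import "fmt"

// commName mimics the kernel's truncation of a process name to
// TASK_COMM_LEN-1 (15) bytes, as reported in /proc/<pid>/comm.
func commName(processName string) string {
	const maxComm = 15 // TASK_COMM_LEN (16) minus the trailing NUL
	if len(processName) > maxComm {
		return processName[:maxComm]
	}
	return processName
}

func main() {
	// PID 1 inside the driver container is the plugin binary itself, so the
	// mount helper sees a truncated, unrecognized "init system" name.
	fmt.Println(commName("aws-efs-csi-driver")) // aws-efs-csi-dri
}
```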

@leakingtapan
Contributor Author

leakingtapan commented Dec 9, 2019

I did some quick tests by starting amazon-efs-mount-watchdog from the efs mount helper, and there are several challenges with using the existing watchdog. The watchdog is designed for a non-containerized environment, where systemd or initd is required to monitor and restart the process if it crashes, and running systemd in a Docker container is not trivial.

There are two ways I can think of to solve the problem:

  1. Create a new container, efs-watch-dog, in the efs node daemonset pod. The container will start the watchdog. This approach leverages the kubelet as the init system, so the container is restarted once it crashes. However, it requires sharing the PID namespace so that the watchdog can kill stunnel processes that live in the efs-plugin container's namespace, which is different from the watchdog's. Also, because the watchdog works against shared EFS state file(s) under /var/run/efs, those files must be shared between the efs-plugin and watchdog containers. This approach could work, but it is a bit messy and reduces the process isolation that namespaces provide.

  2. Manage the watchdog as a subprocess of the efs-plugin. The efs-plugin will restart the watchdog if it crashes. This approach is as secure as the current container isolation and much cleaner, since it doesn't require sharing the PID namespace or the EFS state files across containers. But it requires more time to implement some process-monitoring facility.

@adammw

adammw commented Jan 31, 2020

@leakingtapan I'm seeing this behaviour still on amazon/aws-efs-csi-driver@sha256:2ebe856c6fa58b63b45f011f562c25a61aca6b54f479f5eb4b636f649ea58fe0, which should have been after your fix was merged. Any ideas?

I0131 03:16:23.060907       1 node.go:50] NodePublishVolume: called with args volume_id:"fs-123456" target_path:"/var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount" volume_capability:<mount:<mount_flags:"tls" mount_flags:"iam" mount_flags:"accesspoint=fsap-987654321" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > >
I0131 03:16:23.061014       1 node.go:119] NodePublishVolume: creating dir /var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount
I0131 03:16:23.061043       1 node.go:124] NodePublishVolume: mounting fs-123456:/ at /var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount with options [tls iam accesspoint=fsap-987654321]
I0131 03:16:23.061063       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t efs -o tls,iam,accesspoint=fsap-987654321 fs-123456:/ /var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount])
I0131 03:16:58.161943       1 reaper.go:61] Waited for child process 0
E0131 03:16:58.161952       1 mount_linux.go:140] Mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o tls,iam,accesspoint=fsap-987654321 fs-123456:/ /var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
mount.nfs4: an incorrect mount option was specified
Failed to initialize TLS tunnel for fs-123456

E0131 03:16:58.162033       1 driver.go:74] GRPC error: rpc error: code = Internal desc = Could not mount "fs-123456:/" at "/var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount": mount failed: exit status 1
2020-01-31 03:29:32,359 - INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:/ /var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount -o iam,rw,noresvport,nfsvers=4.1,accesspoint=fsap-987654321,retrans=2,hard,wsize=1048576,timeo=600,rsize=1048576,port=20236"
2020-01-31 03:29:33,881 - ERROR - Failed to mount fs-123456.efs.us-west-2.amazonaws.com at /var/lib/kubelet/pods/746329cb-6e15-4eee-9214-0f133d3837c5/volumes/kubernetes.io~csi/k8s-efs-test-pv/mount: returncode=32, stderr="mount.nfs4: an incorrect mount option was specified

@allamand

Hello,

I have a similar issue, but trying to mount the EFS volume on Fargate

Mounting arguments: -t efs -o tls fs-4c960478:/ /var/lib/kubelet/pods/c0279054-bebb-4ffe-9432-60dabdc58fcd/volumes/kubernetes.io~csi/test/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "bash"
mount.nfs4: Connection reset by peer

@AidanWenzel

@allamand I am getting the same error as you - did you find a solution?

@lemmikens

> Hello,
>
> I have a similar issue, but trying to mount the EFS volume on Fargate
>
> Mounting arguments: -t efs -o tls fs-4c960478:/ /var/lib/kubelet/pods/c0279054-bebb-4ffe-9432-60dabdc58fcd/volumes/kubernetes.io~csi/test/mount
> Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "bash"
> mount.nfs4: Connection reset by peer

One of two things fixed this, and I'm not sure which because I did both at the same time. The first (and most likely culprit) was creating an IAM service account for K8s (scroll down to "Create an IAM policy and role" and look at step 2).

The second (and less likely) was changing the SGs on the mount targets. Literally the only change I made was the port... I had the SG open to "All Traffic" before, and when I narrowed it down to the specific NFS port (2049), it seemed to work.

@damiangene

> Hello,
> I have a similar issue, but trying to mount the EFS volume on Fargate
>
> Mounting arguments: -t efs -o tls fs-4c960478:/ /var/lib/kubelet/pods/c0279054-bebb-4ffe-9432-60dabdc58fcd/volumes/kubernetes.io~csi/test/mount
> Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "bash"
> mount.nfs4: Connection reset by peer
>
> One of two things fixed this, and I'm not sure which because I did both at the same time. The first (and most likely culprit) was creating an IAM service account for K8s (scroll down to "Create an IAM policy and role" and look at step 2).
>
> The second (and less likely) was changing the SGs on the mount targets. Literally the only change I made was the port... I had the SG open to "All Traffic" before, and when I narrowed it down to the specific NFS port (2049), it seemed to work.

I was having the same issue, and it was your less likely change that resolved my problem. Although I will say I did the first fix you mentioned initially, then did some testing before implementing the second fix.

Thank you so much for your comment; I would still be pulling my hair out if it weren't for it.

@balbatross

> Hello,
> I have a similar issue, but trying to mount the EFS volume on Fargate
>
> Mounting arguments: -t efs -o tls fs-4c960478:/ /var/lib/kubelet/pods/c0279054-bebb-4ffe-9432-60dabdc58fcd/volumes/kubernetes.io~csi/test/mount
> Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "bash"
> mount.nfs4: Connection reset by peer
>
> One of two things fixed this, and I'm not sure which because I did both at the same time. The first (and most likely culprit) was creating an IAM service account for K8s (scroll down to "Create an IAM policy and role" and look at step 2).
>
> The second (and less likely) was changing the SGs on the mount targets. Literally the only change I made was the port... I had the SG open to "All Traffic" before, and when I narrowed it down to the specific NFS port (2049), it seemed to work.

If you ever start an infrastructure provider please let me know, this answer was more helpful than 8 hours of AWS documentation, Jah Bless

@phyzical

@lemmikens Thanks!!! This was driving me mad.

For me it was the SGs on the mount targets; I didn't need the service account (but I'm not using the CSI driver).
