Driver fails to release ports on unmount #281

Closed
spohner opened this issue Nov 16, 2020 · 12 comments
Labels
kind/bug (Categorizes issue or PR as related to a bug), lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale)

Comments

@spohner

spohner commented Nov 16, 2020

/kind bug

What happened?
We noticed that pods were unable to mount EFS PVCs, and got stuck in ContainerCreating. The logs showed
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
We logged into the node and found with netstat that all 400 ports in the range were occupied by stunnel processes.

The watchdog log shows that it fails to kill processes on unmount. These log lines repeat for several PIDs.
2020-11-12 16:12:16,530 - INFO - Unmount grace period expired for fs-6a3285fb.var.lib.kubelet.pods.c3015949-346d-42cf-9594-3be561ca30c8.volumes.kubernetes.io~csi.pvc-7ef93798-9182-469f-b35a-72cd13ecfcac.mount.20402
2020-11-12 16:12:16,530 - INFO - Terminating running TLS tunnel - PID: 2773, group ID: 2773
2020-11-12 16:12:16,530 - INFO - TLS tunnel: 2773 is still running, will retry termination
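
A rough way to confirm this state on an affected node (a sketch; it assumes the default port range and that the tunnels show up as stunnel or stunnel4 in the process list):

# Count listening ports held by stunnel; each TLS mount holds one port in
# [20049, 20449] by default, so roughly 400 means the range is exhausted.
sudo ss -tlnp | grep -ci stunnel
# List the stunnel PIDs holding ports, to compare with the PIDs the watchdog
# says it cannot terminate.
sudo ss -tlnp | grep -i stunnel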

What you expected to happen?
Ports are freed upon unmount and pods on all nodes are able to mount EFS PVCs.

How to reproduce it (as minimally and precisely as possible)?
Not sure how to reproduce as it seems random.

Anything else we need to know?:
We have seen this issue several times; however, which node fails to release its ports seems random. We hit it in our medium-sized cluster maybe once a week. Other nodes keep working fine when this happens. A quick fix is to replace the bad node.
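
For reference, the "replace the bad node" quick fix amounts to something like this (a sketch only; the node name is a placeholder and the drain flags depend on your kubectl version):

# Cordon and drain the node whose port range is exhausted, then terminate the
# underlying instance so the ASG / managed node group replaces it.
kubectl cordon <bad-node-name>
kubectl drain <bad-node-name> --ignore-daemonsets --delete-emptydir-data
# e.g. aws ec2 terminate-instances --instance-ids <instance-id>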

Environment

  • Kubernetes version (use kubectl version): v1.17.13
  • Driver version: master as of 22.10.2020
  • OS: Ubuntu 20.04 LTS, 5.4.0
@k8s-ci-robot added the kind/bug label on Nov 16, 2020
@reyntjensw

reyntjensw commented Jan 29, 2021

We are experiencing the same issue in an environment where a lot of pod autoscaling is happening.

In our experience it happens every 7 to 10 days. The quick fix here is to replace all nodes, but in a production environment that is not behavior we want to have.

Right after hitting this issue we opened a support case, but they asked us to update this issue first.

This is the part of the error we are seeing

Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

E0107 09:34:53.755515       1 driver.go:75] GRPC error: rpc error: code = Internal desc = Could not mount "fs-0f780057:/" at "/var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 29, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 29, 2021
@michaelswierszcz

Facing a similar issue. The pattern I've identified so far is that when the node reaches high CPU usage, the efs-csi-driver crashes. If this happens 3+ times, the node can no longer mount any EFS PVs.
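
A quick way to look for that pattern (a sketch; the namespace and label assume a default Helm install of the driver and may differ in your setup):

# Find the driver pod on a given node and check its restart counts.
kubectl -n kube-system get pods -l app=efs-csi-node -o wide \
  --field-selector spec.nodeName=<node-name>
kubectl -n kube-system describe pod <efs-csi-node-pod> | grep 'Restart Count'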

@wongma7
Contributor

wongma7 commented Jun 2, 2021

/remove-lifecycle rotten

@wongma7 removed the lifecycle/rotten label on Jun 2, 2021
@smrutiranjantripathy

It is fixed by this PR

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 18, 2021
@spohner closed this as completed on Oct 28, 2021
@jumping

jumping commented Mar 24, 2023

The issue appeared again: "Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf".

Kubernetes: 1.23
aws-efs-csi-driver: v1.3.7
OS: AMI 1.23.16-20230304

@usulkies

I saw the issue today:
Failed to locate an available port in the range [20049, 20449]

It should happen any time we have more than 401 pods on the same node all trying to mount an EFS volume over TLS (one tunnel port per mount).
A workaround could be setting the maxPods value on the kubelet, but another approach might be allowing a wider range through the Helm chart values.
Is it possible to let the user set these two values?
https://github.com/aws/efs-utils/blob/62fde08f790a1ab50f25b81f85940bec6f4b92e9/src/mount_efs/__init__.py#L959C50-L959C72
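
For reference, the bounds come from /etc/amazon/efs/efs-utils.conf. Widening them looks roughly like this (a sketch; the key names are taken from the efs-utils source linked above, so verify them against your installed file, and note that with the CSI driver the file lives inside the efs-plugin container rather than on the host):

# Widen the TLS tunnel port range (the upper bound here is only an example;
# pick one that fits the node's expected number of TLS mounts).
sudo sed -i \
  -e 's/^port_range_lower_bound.*/port_range_lower_bound = 20049/' \
  -e 's/^port_range_upper_bound.*/port_range_upper_bound = 21449/' \
  /etc/amazon/efs/efs-utils.conf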

@balusarakesh

This is happening for us too. Is there a workaround for this in AWS EKS?

Thank you

@neoakris

Encountered on v2.0.1 (released ~April 2024), so this might still be a thing. It seems there are related issues (https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues?q=is%3Aissue+ports+is%3Aclosed), so I'll try updating to the latest version, v2.0.7 (as of ~Aug 2024).

@JonTheNiceGuy

Glad it's not just me @neoakris!
