Driver fails to release ports on unmount #281

Closed
spohner opened this issue Nov 16, 2020 · 12 comments
Labels
kind/bug (Categorizes issue or PR as related to a bug), lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale)

Comments

@spohner

spohner commented Nov 16, 2020

/kind bug

What happened?
We noticed that pods were unable to mount EFS PVCs, and got stuck in ContainerCreating. The logs showed
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
We logged into the node and found with netstat that all 400 ports in the range were occupied by stunnel processes.

The watchdog log shows that it fails to kill processes on unmount. These log lines repeat for several PIDs.
2020-11-12 16:12:16,530 - INFO - Unmount grace period expired for fs-6a3285fb.var.lib.kubelet.pods.c3015949-346d-42cf-9594-3be561ca30c8.volumes.kubernetes.io~csi.pvc-7ef93798-9182-469f-b35a-72cd13ecfcac.mount.20402
2020-11-12 16:12:16,530 - INFO - Terminating running TLS tunnel - PID: 2773, group ID: 2773
2020-11-12 16:12:16,530 - INFO - TLS tunnel: 2773 is still running, will retry termination
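
A rough way to confirm this state on an affected node (a sketch; it assumes the default port range and that the tunnels show up as stunnel or stunnel4 in the process list):

# Count listening ports held by stunnel; each TLS mount holds one port in
# [20049, 20449] by default, so roughly 400 means the range is exhausted.
sudo ss -tlnp | grep -ci stunnel
# List the stunnel PIDs holding ports, to compare with the PIDs the watchdog
# says it cannot terminate.
sudo ss -tlnp | grep -i stunnel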

What you expected to happen?
Ports are freed upon unmount and pods on all nodes are able to mount EFS PVCs.

How to reproduce it (as minimally and precisely as possible)?
Not sure how to reproduce as it seems random.

Anything else we need to know?:
We have seen this issue several times; however, which node fails to release its ports seems random. We hit it in our medium-sized cluster maybe once a week. Other nodes keep working fine when this happens. A quick fix is to replace the bad node.
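
For reference, the "replace the bad node" quick fix amounts to something like this (a sketch only; the node name is a placeholder and the drain flags depend on your kubectl version):

# Cordon and drain the node whose port range is exhausted, then terminate the
# underlying instance so the ASG / managed node group replaces it.
kubectl cordon <bad-node-name>
kubectl drain <bad-node-name> --ignore-daemonsets --delete-emptydir-data
# e.g. aws ec2 terminate-instances --instance-ids <instance-id>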

Environment

  • Kubernetes version (use kubectl version): v1.17.13
  • Driver version: master as of 22.10.2020
  • OS: Ubuntu 20.04 LTS, 5.4.0
@k8s-ci-robot added the kind/bug label on Nov 16, 2020
@reyntjensw

reyntjensw commented Jan 29, 2021

We are experiencing the same issue in an environment where a lot of pod autoscaling is happening.

In our experience it happens every 7 to 10 days. The quick fix here is to replace all nodes, but in a production environment that is not behavior we want to have.

Right after hitting this issue we opened a support case, but they asked us to update this issue first.

This is the part of the error we are seeing

Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

E0107 09:34:53.755515       1 driver.go:75] GRPC error: rpc error: code = Internal desc = Could not mount "fs-0f780057:/" at "/var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 29, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 29, 2021
@michaelswierszcz

Facing a similar issue. The pattern I've identified so far is that when the node reaches high CPU usage, the efs-csi-driver crashes. If this happens 3+ times, the node can no longer mount any EFS PVs.
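
A quick way to look for that pattern (a sketch; the namespace and label assume a default Helm install of the driver and may differ in your setup):

# Find the driver pod on a given node and check its restart counts.
kubectl -n kube-system get pods -l app=efs-csi-node -o wide \
  --field-selector spec.nodeName=<node-name>
kubectl -n kube-system describe pod <efs-csi-node-pod> | grep 'Restart Count'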

@wongma7
Contributor

wongma7 commented Jun 2, 2021

/remove-lifecycle rotten

@wongma7 removed the lifecycle/rotten label on Jun 2, 2021
@smrutiranjantripathy

It is fixed by this PR

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 18, 2021
@spohner closed this as completed on Oct 28, 2021
@jumping

jumping commented Mar 24, 2023

The issue appeared again: "Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf".

Kubernetes: 1.23
aws-efs-csi-driver: v1.3.7
OS: AMI 1.23.16-20230304

@usulkies

I saw the issue today:
Failed to locate an available port in the range [20049, 20449]

It should happen any time we have more than 401 pods on the same node all trying to mount an EFS volume over TLS (one tunnel port per mount).
A workaround could be setting the maxPods value on the kubelet, but another approach might be allowing a wider range through the Helm chart values.
Is it possible to let the user set these two values?
https://github.com/aws/efs-utils/blob/62fde08f790a1ab50f25b81f85940bec6f4b92e9/src/mount_efs/__init__.py#L959C50-L959C72
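
For reference, the bounds come from /etc/amazon/efs/efs-utils.conf. Widening them looks roughly like this (a sketch; the key names are taken from the efs-utils source linked above, so verify them against your installed file, and note that with the CSI driver the file lives inside the efs-plugin container rather than on the host):

# Widen the TLS tunnel port range (the upper bound here is only an example;
# pick one that fits the node's expected number of TLS mounts).
sudo sed -i \
  -e 's/^port_range_lower_bound.*/port_range_lower_bound = 20049/' \
  -e 's/^port_range_upper_bound.*/port_range_upper_bound = 21449/' \
  /etc/amazon/efs/efs-utils.conf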

@balusarakesh

This is happening for us too. Is there a workaround for this in AWS EKS?

Thank you

@neoakris

Encountered on v2.0.1 (released ~April 2024), so this might still be a thing. It seems there are related issues (https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues?q=is%3Aissue+ports+is%3Aclosed), so I'll try updating to the latest version, v2.0.7 (as of ~Aug 2024).

@JonTheNiceGuy

Glad it's not just me @neoakris!
