
Stuck in Still connecting to unix:///csi/csi.sock #1301

Closed
philnielsen opened this issue Apr 2, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@philnielsen

/kind bug

What happened?

Sometimes the efs-csi-node pod crashloops forever, leading to a deadlock requiring manual intervention.
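
For reference, the manual intervention is usually just deleting the stuck pod so the DaemonSet recreates it. A rough sketch (the kube-system namespace and app=efs-csi-node label selector are assumptions based on a default chart install):

```sh
# Find the crashlooping efs-csi-node pod (label selector assumed from the default chart).
kubectl -n kube-system get pods -l app=efs-csi-node -o wide

# Delete the stuck pod; the DaemonSet controller recreates it on the same node.
kubectl -n kube-system delete pod efs-csi-node-g6tzk
```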

What you expected to happen?
If the EFS driver never starts, don't schedule pods on the node.

How to reproduce it (as minimally and precisely as possible)?

As this only happens intermittently, I can't reliably force it to reproduce.

Anything else we need to know?:

I think kubernetes-csi/livenessprobe#236 is the underlying issue. The EBS driver fixed it in kubernetes-sigs/aws-ebs-csi-driver#1935, but I'm not sure the EFS driver has picked that up yet; could this be the same problem? I'm going to test some things and will update this issue with what I find.
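
To check whether a given driver release has picked up the fix, the sidecar images on the node DaemonSet can be listed; a quick sketch, assuming the DaemonSet is named efs-csi-node in kube-system (the chart default):

```sh
# List the containers and images in the node DaemonSet; the livenessprobe
# sidecar should be v2.12.0 or newer once the fix is picked up.
kubectl -n kube-system get daemonset efs-csi-node \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
```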

Environment

  • Kubernetes version (use kubectl version): v1.25.11
  • Driver version: 1.7.4

Please also attach debug logs to help us better diagnose

csi-driver-registrar I0402 15:37:14.562063       1 main.go:135] Version: v2.9.3
csi-driver-registrar I0402 15:37:14.562088       1 main.go:136] Running node-driver-registrar in mode=
csi-driver-registrar I0402 15:37:14.562093       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
csi-driver-registrar W0402 15:37:24.563155       1 connection.go:233] Still connecting to unix:///csi/csi.sock
csi-driver-registrar W0402 15:37:34.562773       1 connection.go:233] Still connecting to unix:///csi/csi.sock
csi-driver-registrar W0402 15:37:44.562326       1 connection.go:233] Still connecting to unix:///csi/csi.sock
csi-driver-registrar E0402 15:37:44.562353       1 main.go:160] error connecting to CSI driver: context deadline exceeded
liveness-probe W0402 15:37:24.640667       1 connection.go:183] Still connecting to unix:///csi/csi.sock
liveness-probe W0402 15:37:34.641369       1 connection.go:183] Still connecting to unix:///csi/csi.sock
liveness-probe W0402 15:37:44.640935       1 connection.go:183] Still connecting to unix:///csi/csi.sock
liveness-probe F0402 15:37:44.640998       1 main.go:146] failed to establish connection to CSI driver: context deadline exceeded
Stream closed EOF for kube-system/efs-csi-node-g6tzk (liveness-probe)
Stream closed EOF for kube-system/efs-csi-node-g6tzk (csi-driver-registrar)
stream logs failed pods "efs-csi-node-g6tzk" not found for kube-system/efs-csi-node-g6tzk (efs-plugin)
stream logs failed pods "efs-csi-node-g6tzk" not found for kube-system/efs-csi-node-g6tzk (efs-plugin)
stream logs failed pods "efs-csi-node-g6tzk" not found for kube-system/efs-csi-node-g6tzk (efs-plugin)
stream logs failed pods "efs-csi-node-g6tzk" not found for kube-system/efs-csi-node-g6tzk (efs-plugin)
stream logs failed pods "efs-csi-node-g6tzk" not found for kube-system/efs-csi-node-g6tzk (efs-plugin)
  • Instructions to gather debug logs can be found here
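
For completeness, something like the following collects logs from each container of the stuck pod (container names taken from the log prefixes above; --previous shows the last terminated instance and errors out if a container has not restarted):

```sh
# Grab logs from every container of the stuck node pod.
for c in efs-plugin csi-driver-registrar liveness-probe; do
  kubectl -n kube-system logs efs-csi-node-g6tzk -c "$c" --previous
done

# Pod events often show why the efs-plugin container never came up.
kubectl -n kube-system describe pod efs-csi-node-g6tzk
```
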
@k8s-ci-robot added the kind/bug label Apr 2, 2024
@brian-provenzano commented Apr 5, 2024

We are seeing this as well... I can provide additional logs if needed

@mskanth972 (Contributor)

Hi @philnielsen, it seems to be the same problem. We will update the livenessprobe version to 2.12.0, which should have this fix.
The ECD for the new release is 4/12.
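
Until that release lands, affected users could try overriding just the sidecar image via Helm. A hedged sketch; the release name, repo alias, and the sidecars.livenessProbe.image.tag values path are assumptions to be verified against your install and the chart's values.yaml:

```sh
# Bump only the livenessprobe sidecar to v2.12.0, keeping all other values as-is.
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  -n kube-system --reuse-values \
  --set sidecars.livenessProbe.image.tag=v2.12.0
```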

@mskanth972 (Contributor)

This PR should resolve the issue. We will merge and release by the same ECD mentioned above.

@mskanth972 (Contributor)

The new version is released. Closing the issue; please feel free to reopen if you are still facing the issue.

@philnielsen (Author)

@mskanth972 is there any chance of backporting the change to a 1.x release? I'm still reading up on the implications of upgrading to 2.x, but it seems like this might be worth backporting.

@brian-provenzano

I am still seeing this issue even in the v2.0.0 release (which includes the new liveness-probe sidecar upgrade, AFAIK).

I am not able to reproduce it reliably; it just seems to happen occasionally. See logs from an efs-csi-node pod below.

k8s version: 1.28
efs-csi driver version: chart 3.0.0 / appVersion 2.0.0 (efs-csi-node)

2024-04-24T11:56:16-06:00	I0424 17:56:15.999927       1 main.go:135] Version: v2.10.0
2024-04-24T11:56:14-06:00	I0424 17:56:14.754964       1 main.go:170] "ServeMux listening" address="0.0.0.0:9809"
2024-04-24T11:56:14-06:00	I0424 17:56:14.754942       1 main.go:141] "CSI driver name" driver="efs.csi.aws.com"
2024-04-24T11:56:14-06:00	I0424 17:56:14.517720       1 main.go:133] "Calling CSI driver to discover driver name"
2024-04-24T11:56:13-06:00	I0424 17:56:13.446853       1 driver.go:137] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
2024-04-24T11:56:13-06:00	I0424 17:56:13.359868       1 driver.go:127] Starting reaper
2024-04-24T11:56:13-06:00	I0424 17:56:13.336807       1 efs_watch_dog.go:230] Copying /etc/amazon/efs/efs-utils.crt 
2024-04-24T11:56:13-06:00	I0424 17:56:13.336788       1 efs_watch_dog.go:235] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
2024-04-24T11:56:13-06:00	I0424 17:56:13.336270       1 driver.go:121] Starting efs-utils watchdog
2024-04-24T11:56:13-06:00	I0424 17:56:13.336256       1 driver.go:118] Registering Controller Server
2024-04-24T11:56:13-06:00	I0424 17:56:13.336239       1 driver.go:116] Registering Node Server
2024-04-24T11:56:13-06:00	I0424 17:56:13.333118       1 driver.go:150] Did not find any input tags.
2024-04-24T11:56:13-06:00	I0424 17:56:13.225572       1 metadata.go:70] retrieving metadata from EC2 metadata service
2024-04-24T11:56:13-06:00	I0424 17:56:13.166643       1 metadata.go:65] getting MetadataService...
2024-04-24T11:56:13-06:00	I0424 17:56:13.160591       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
2024-04-24T11:56:12-06:00	W0424 17:56:12.077567       1 connection.go:234] Still connecting to unix:///csi/csi.sock
2024-04-24T11:56:02-06:00	W0424 17:56:02.078031       1 connection.go:234] Still connecting to unix:///csi/csi.sock
2024-04-24T11:56:01-06:00	E0424 17:56:01.186042       1 main.go:160] error connecting to CSI driver: context deadline exceeded
2024-04-24T11:56:01-06:00	W0424 17:56:01.185999       1 connection.go:234] Still connecting to unix:///csi/csi.sock
2024-04-24T11:55:52-06:00	W0424 17:55:52.076965       1 connection.go:234] Still connecting to unix:///csi/csi.sock
2024-04-24T11:55:51-06:00	W0424 17:55:51.176737       1 connection.go:234] Still connecting to unix:///csi/csi.sock
2024-04-24T11:55:49-06:00	E0424 17:55:49.620261       1 connection.go:193] Lost connection to unix:///csi/csi.sock.
2024-04-24T11:55:49-06:00	E0424 17:55:49.611502       1 connection.go:193] Lost connection to unix:///csi/csi.sock.
2024-04-24T11:55:42-06:00	W0424 17:55:42.080342       1 connection.go:234] Still connecting to unix:///csi/csi.sock
2024-04-24T11:55:41-06:00	W0424 17:55:41.184471       1 connection.go:234] Still connecting to unix:///csi/csi.sock
2024-04-24T11:55:31-06:00	I0424 17:55:31.175517       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
2024-04-24T11:55:31-06:00	I0424 17:55:31.175510       1 main.go:136] Running node-driver-registrar in mode=
2024-04-24T11:55:31-06:00	I0424 17:55:31.175473       1 main.go:135] Version: v2.10.0
2024-04-24T11:55:30-06:00	I0424 17:55:30.306192       1 driver.go:127] Starting reaper
...

@ekeih commented Apr 26, 2024

We had the same issue with the 3.0.1 chart, which uses the 2.0.1 image, on k8s 1.29.

But while debugging, we noticed that the 2.0.1 image was silently removed from Docker Hub, so for now we downgraded the chart back to 3.0.0, which uses the 2.0.0 image and works fine for us.
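
For anyone else hitting the missing image, the downgrade is a one-liner; a sketch assuming the release name aws-efs-csi-driver and the upstream Helm repo alias:

```sh
# Pin the chart back to 3.0.0 (appVersion 2.0.0) until the 2.0.1 image is available again.
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  -n kube-system --version 3.0.0 --reuse-values
```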
