csi: controller plugin timeouts #7629

Closed · tgross opened this issue Apr 4, 2020 · 4 comments · Fixed by #7632, #7794, and #7840
tgross commented Apr 4, 2020

With the fixes in #7626 I've got most of the unpublish workflow working again.

But we're getting controller timeouts when trying to unpublish with the EBS plugin. The controller plugin sends the AWS API call and the volume does get detached, but the RPC never returns to Nomad; the call times out, the error bubbles up, and we never release the claim.

Logs from the controller plugin:

I0404 19:01:57.186105       1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0564768ab4da32abb NodeId:i-0eb5420cb5752a46e Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0404 19:02:08.010019       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0404 19:02:38.010766       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0404 19:03:08.011877       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0404 19:03:38.012855       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0404 19:04:08.013598       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
E0404 19:04:14.836283       1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0564768ab4da32abb" from node "i-0eb5420cb5752a46e": RequestCanceled: request context canceled
caused by: context canceled
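
A minimal Go sketch of what these logs suggest is happening: the deadline on the caller's context is propagated over gRPC to the plugin, which threads it into the AWS detach request, so when the caller times out, the in-flight detach is canceled and surfaces as "RequestCanceled: request context canceled" even though AWS may finish detaching anyway. The `controllerClient` interface, the `unpublishWithTimeout` helper, and the 30-second value are hypothetical stand-ins, not Nomad's actual code.

```go
package csidemo

import (
	"context"
	"fmt"
	"time"
)

// controllerClient is a hypothetical stand-in for the CSI controller gRPC
// client; only the one call that matters here is modeled.
type controllerClient interface {
	ControllerUnpublishVolume(ctx context.Context, volumeID, nodeID string) error
}

// unpublishWithTimeout illustrates the failure mode: the deadline on ctx is
// carried over gRPC to the plugin and into its AWS SDK request. If the detach
// takes longer than the deadline, the request is canceled mid-flight and the
// caller sees an error, even though the volume may end up detached anyway.
func unpublishWithTimeout(c controllerClient, volumeID, nodeID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := c.ControllerUnpublishVolume(ctx, volumeID, nodeID); err != nil {
		// The error bubbles up to the caller, so the volume claim is never released.
		return fmt.Errorf("controller unpublish failed: %w", err)
	}
	return nil
}
```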
tgross commented Apr 5, 2020

After a lot of tedious debugging, I've conclusively figured out that the Job.Deregister call blocks on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping asynchronous from Job.Deregister. I've opened #7632 to partially fix this.
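
To make the shape of that change concrete, here's a minimal sketch (not Nomad's actual code): Job.Deregister only enqueues a GC evaluation and returns, and the claim reaping happens later, once the client has had a chance to stop the alloc. The `server`, `evaluation`, and `newVolumeClaimGCEval` types/helpers below are hypothetical.

```go
package csidemo

// evaluation and server are hypothetical minimal types for illustration only.
type evaluation struct {
	Type  string
	JobID string
}

type server struct {
	pendingEvals []*evaluation
}

// enqueueEval hands the eval off for later processing. In a real broker this
// would be a durable queue; a slice keeps the sketch small.
func (s *server) enqueueEval(e *evaluation) {
	s.pendingEvals = append(s.pendingEvals, e)
}

// newVolumeClaimGCEval builds the GC evaluation that will release the volume
// claims for a job's allocations at some later point.
func newVolumeClaimGCEval(jobID string) *evaluation {
	return &evaluation{Type: "csi-volume-claim-gc", JobID: jobID}
}

// jobDeregister no longer calls the client CSI controller RPCs inline (which
// blocked while the alloc still existed on the client). It just enqueues the
// GC eval and returns, so the deregister RPC can't be held up by the plugin.
func (s *server) jobDeregister(jobID string) error {
	// ... delete the job from state here ...
	s.enqueueEval(newVolumeClaimGCEval(jobID))
	return nil
}
```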

However, in the case of nomad job stop -purge, we're consistently losing a race between the volume GC eval being fired and the Nomad client picking up that its job has been purged and shutting down the alloc. This means that the first volume GC attempt will fail, but once the client picks up the change it will pass. The operator experience of this is not awesome, but it's currently safe... just slower than we'd like.
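
As a rough illustration of why losing that race is safe but slow, the reaper just retries: the first pass fails while the purged job's alloc is still running on the client, and a later pass succeeds once the client has caught up and stopped it. The `gcVolumeClaim` callback here is a hypothetical stand-in for a single GC pass, not a Nomad API.

```go
package csidemo

import (
	"context"
	"time"
)

// reapWithRetry retries a single GC pass until it succeeds or the context is
// canceled. Each failed pass just means the claim is tried again on the next
// GC interval, which is the "slower than we'd like" behavior described above.
func reapWithRetry(ctx context.Context, gcVolumeClaim func(context.Context) error, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := gcVolumeClaim(ctx); err == nil {
			return nil // claim released once the client has stopped the alloc
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// try again on the next GC interval
		}
	}
}
```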

Maybe there's room to suggest that nomad job stop -purge should trigger the client events faster?

@tgross tgross added the type/bug label Apr 5, 2020
@tgross tgross reopened this Apr 6, 2020
tgross commented Apr 6, 2020

Although #7632 partially fixes this, I'm re-opening it so we can make further improvements in the 0.11.1 cycle.

tgross commented Apr 30, 2020

With #7794 merged, there's one last item to do before this can be closed: restoring the original timeout length we had in the client.
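
For context, a minimal sketch of what that restore looks like (reusing the hypothetical `controllerClient` interface from the first sketch above): now that the server no longer blocks on claim reaping, the client-side controller RPC can get its longer deadline back. The 2-minute constant is illustrative only, not the value Nomad actually uses.

```go
package csidemo

import (
	"context"
	"time"
)

// controllerRPCTimeout is an illustrative value, not Nomad's actual constant.
const controllerRPCTimeout = 2 * time.Minute

// detachVolume wraps the controller RPC in the longer client-side deadline
// again, so a slow-but-successful detach is no longer canceled mid-flight.
func detachVolume(parent context.Context, c controllerClient, volumeID, nodeID string) error {
	ctx, cancel := context.WithTimeout(parent, controllerRPCTimeout)
	defer cancel()
	return c.ControllerUnpublishVolume(ctx, volumeID, nodeID)
}
```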


github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022