CSI: skip node unpublish on GC'd or down nodes #13301

Merged
merged 1 commit into from
Jun 9, 2022

Conversation

tgross
Member

@tgross tgross commented Jun 8, 2022

Fixes #13264

If the node has been GC'd or is down, we can't send it a node
unpublish. The CSI spec requires that we don't send the controller
unpublish before the node unpublish, but in the case where a node is
gone we can't know the final fate of the node unpublish step.

The `csi_hook` on the client will unpublish if the allocation has
stopped, and if the host is terminated there's no mount for the volume
anyway. So we'll now assume that the node has unpublished on its
end. If it hasn't, any controller unpublish will potentially hang or
error and need to be retried.

(Note that while this behavior isn't ideal, it appears to match user
expectations and the behavior reported by k8s users.)
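
The ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `Node`, `Unpublisher`, `Recorder`, and `ReleaseClaim` are hypothetical names, a nil `*Node` stands in for a GC'd node, and the status strings are placeholders.

```go
package main

import "fmt"

// Node is a hypothetical node record. A nil *Node models a node that
// has been garbage collected.
type Node struct {
	ID     string
	Status string // "ready", "down", ... (placeholder values)
}

// Unpublisher is a hypothetical abstraction over the two CSI unpublish RPCs.
type Unpublisher interface {
	NodeUnpublish(nodeID, volumeID string) error
	ControllerUnpublish(nodeID, volumeID string) error
}

// Recorder is a fake Unpublisher that records the order of calls.
type Recorder struct {
	Calls         []string
	ControllerErr error // returned from ControllerUnpublish if set
}

func (r *Recorder) NodeUnpublish(nodeID, volumeID string) error {
	r.Calls = append(r.Calls, "node")
	return nil
}

func (r *Recorder) ControllerUnpublish(nodeID, volumeID string) error {
	r.Calls = append(r.Calls, "controller")
	return r.ControllerErr
}

// ReleaseClaim sends node unpublish before controller unpublish, as the
// CSI spec requires. If the node is GC'd (nil) or down, the node step is
// skipped and assumed to have happened on the node's end. It reports
// whether the node step was skipped.
func ReleaseClaim(u Unpublisher, node *Node, nodeID, volumeID string) (bool, error) {
	skipped := node == nil || node.Status == "down"
	if !skipped {
		if err := u.NodeUnpublish(node.ID, volumeID); err != nil {
			// Never proceed to the controller step after a failed node step.
			return false, fmt.Errorf("node unpublish: %w", err)
		}
	}
	if err := u.ControllerUnpublish(nodeID, volumeID); err != nil {
		// If the skipped-node assumption was wrong, this may fail and
		// the caller is expected to retry.
		return skipped, fmt.Errorf("controller unpublish (retry later): %w", err)
	}
	return skipped, nil
}
```

Calling `ReleaseClaim(&Recorder{}, nil, "n1", "vol")` in this sketch records only the controller call, while passing a ready node records the node call first and the controller call second.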

@tgross tgross force-pushed the csi-discard-claims-on-gcd-nodes branch from e1d6b40 to 9de444e Compare June 8, 2022 20:54
@@ -0,0 +1,3 @@
```release-note:bug
csi: Fixed a bug where volume claims on lost or garbage collected nodes could not be freed
```
Member Author

Note for reviewers: I'm torn on whether to call this a bug or improvement but calling it a bug makes it something we can backport so I'm leaning that way.

Contributor

@lgfa29 lgfa29 left a comment

I was thinking about other possible node statuses (like initializing or draining), but those would be fine since the node is still around to handle the request, right?

@tgross
Member Author

tgross commented Jun 9, 2022

> I was thinking about other possible node statuses (like initializing or draining), but those would be fine since the node is still around to handle the request, right?

Exactly! The disconnected state is a little weird as well, but in that case we're keeping the claims if the node can be marked disconnected, because the allocations will still be up and running, just temporarily unavailable.
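
The cases discussed in this exchange can be summarized in a small decision function. This is a hedged sketch only: `Action`, `DecideAction`, and the status strings are hypothetical names taken from the wording of the conversation, not identifiers from the Nomad code base.

```go
package main

// Action is a hypothetical enum for what to do with a volume claim.
type Action int

const (
	SendNodeUnpublish Action = iota // node is around to handle the request
	AssumeUnpublished               // node is GC'd or down: skip the node step
	KeepClaim                       // node is disconnected: allocations may still run
)

// DecideAction maps a node's existence and status to an Action.
// The status strings are assumptions mirroring the discussion.
func DecideAction(nodeExists bool, status string) Action {
	switch {
	case !nodeExists, status == "down":
		// Nothing can receive the RPC, so assume the unpublish happened.
		return AssumeUnpublished
	case status == "disconnected":
		// Temporarily unreachable, but the workload may still hold the mount.
		return KeepClaim
	default:
		// "initializing", "draining", "ready", and so on.
		return SendNodeUnpublish
	}
}
```

In this sketch, only a missing or down node leads to the node step being skipped; every other status either keeps the claim or sends the request as usual.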

@tgross tgross merged commit dd1bbbe into main Jun 9, 2022
@tgross tgross deleted the csi-discard-claims-on-gcd-nodes branch June 9, 2022 15:33
tbehling pushed a commit that referenced this pull request Jun 29, 2022
@tgross tgross added the backport/1.3.x backport to 1.3.x release line label Aug 23, 2022
@tgross tgross modified the milestones: 1.3.2, 1.3.4 Aug 23, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2022
Successfully merging this pull request may close these issues.

CSI: discard claims on GC'd nodes