CSI: reorder controller volume detachment #12387

tgross · 2022-03-25T21:01:36Z

Ran into this while working on #12384

In #12112 and #12113 we solved for the problem of races in releasing
volume claims, but there was a case that we missed. During a node
drain with a controller attach/detach, we can hit a race where we call
controller publish before the unpublish has completed. This is
discouraged in the spec but plugins are supposed to handle it
safely. But if the storage provider's API is slow enough and the
plugin doesn't handle the case safely, the volume can get "locked"
into a state where the provider's API won't detach it cleanly.

Check the claim before making any external controller publish RPC
calls so that Nomad is responsible for the canonical information about
whether a volume is currently claimed.
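
As a rough illustration of the ordering described above (not the code from this PR), the sketch below checks the server's own claim state first and only calls out to the plugin if the volume is free. `Volume`, `Publisher`, `Claim`, and `ErrVolumeInUse` are hypothetical names invented for this example and are not Nomad's types.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrVolumeInUse is a hypothetical error returned when the claim check fails.
var ErrVolumeInUse = errors.New("volume has an unreleased claim")

// Volume is a simplified, hypothetical stand-in for the server's view of a CSI volume.
type Volume struct {
	ID string
	// ClaimedBy holds the ID of the node with an unreleased claim, or "" if free.
	ClaimedBy string
}

// Publisher is a hypothetical wrapper around the external controller publish RPC.
type Publisher interface {
	ControllerPublish(volID, nodeID string) error
}

// Claim consults local claim state first and only then calls the plugin,
// so a publish is never sent while another node's claim is still outstanding.
func Claim(vol *Volume, nodeID string, p Publisher) error {
	if vol.ClaimedBy != "" && vol.ClaimedBy != nodeID {
		return fmt.Errorf("%w: %s held by %s", ErrVolumeInUse, vol.ID, vol.ClaimedBy)
	}
	if err := p.ControllerPublish(vol.ID, nodeID); err != nil {
		return err
	}
	vol.ClaimedBy = nodeID
	return nil
}

// countingPublisher is a fake plugin that records how many RPCs were attempted.
type countingPublisher struct{ calls int }

func (c *countingPublisher) ControllerPublish(volID, nodeID string) error {
	c.calls++
	return nil
}

func main() {
	p := &countingPublisher{}
	vol := &Volume{ID: "vol-1", ClaimedBy: "node-a"}

	// The old claim is still held, so no RPC should be made.
	err := Claim(vol, "node-b", p)
	fmt.Println(errors.Is(err, ErrVolumeInUse), p.calls) // true 0

	// Once the claim is released, the publish goes through.
	vol.ClaimedBy = ""
	err = Claim(vol, "node-b", p)
	fmt.Println(err == nil, p.calls) // true 1
}
```

The point of the sketch is only the ordering: the rejected claim never reaches the storage provider, so a slow detach cannot overlap with a new attach.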

This has a couple of side effects that also had to get fixed here:

* Changing the order means that the volume will have a past claim
  without a valid external node ID because it came from the client, and
  this uncovered a separate bug where we didn't assert the external node
  ID was valid before returning it. Fall through to getting the ID from
  the plugins in the state store in this case. We avoided this
  originally because of concerns around plugins getting lost during node
  drain, but now that we've fixed that we may want to revisit it in
  future work.
* We should make sure we're handling `FailedPrecondition` cases from
  the controller plugin the same way we handle other retryable cases.
* Several tests had to be updated because they assumed we fail
  in a particular order that no longer holds.
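
To make the `FailedPrecondition` point concrete, here is a minimal sketch of classifying that status as retryable alongside other transient errors. It is an assumption-laden illustration, not the change in this PR: `Code`, `PluginError`, `IsRetryable`, and `WithRetry` are hypothetical names, and the local `Code` type merely stands in for gRPC status codes so the example has no dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a local stand-in for a gRPC status code (assumption: kept local
// so the sketch needs nothing outside the standard library).
type Code int

const (
	Unavailable Code = iota
	FailedPrecondition
	InvalidArgument
)

// PluginError is a hypothetical error type carrying a status code from the plugin.
type PluginError struct {
	Code Code
	Msg  string
}

func (e *PluginError) Error() string { return e.Msg }

// IsRetryable reports whether the caller should try the RPC again.
// FailedPrecondition is grouped with the other transient codes.
func IsRetryable(err error) bool {
	var pe *PluginError
	if !errors.As(err, &pe) {
		return false
	}
	switch pe.Code {
	case Unavailable, FailedPrecondition:
		return true
	default:
		return false
	}
}

// WithRetry calls fn up to attempts times, stopping on success or on a
// non-retryable error. A real caller would also back off between attempts.
func WithRetry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !IsRetryable(err) {
			return err
		}
	}
	return err
}

func main() {
	calls := 0
	err := WithRetry(3, func() error {
		calls++
		if calls < 3 {
			return &PluginError{Code: FailedPrecondition, Msg: "volume still attached elsewhere"}
		}
		return nil
	})
	fmt.Println(err, calls) // <nil> 3
}
```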

No changelog entry because this is updating code that hasn't yet shipped.

Fixed E2E test from #12384

$ go test -v . -suite CSI -run 'TestE2E/CSI/\*csi\.CSIControllerPluginEBSTest/TestNodeDrain'
=== RUN   TestE2E
=== RUN   TestE2E/CSI
=== RUN   TestE2E/CSI/*csi.CSIControllerPluginEBSTest
=== RUN   TestE2E/CSI/*csi.CSIControllerPluginEBSTest/TestNodeDrain
--- PASS: TestE2E (114.61s)
    --- PASS: TestE2E/CSI (114.61s)
        --- PASS: TestE2E/CSI/*csi.CSIControllerPluginEBSTest (114.60s)
            --- PASS: TestE2E/CSI/*csi.CSIControllerPluginEBSTest/TestNodeDrain (42.92s)
PASS
ok      github.com/hashicorp/nomad/e2e  115.513s

shoenig

LGTM!

github-actions · 2022-10-16T02:47:30Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

fix log formatting in csi_hook

91ead5d

tgross force-pushed the b-csi-volume-detach-order branch from 40cde4c to 4a46d82 Compare March 28, 2022 19:08

vercel bot temporarily deployed to Preview – nomad March 28, 2022 19:08 Inactive

vercel bot deployed to Preview – nomad-storybook-and-ui March 28, 2022 19:08 View deployment

tgross force-pushed the b-csi-volume-detach-order branch from 4a46d82 to 2639770 Compare March 28, 2022 19:49

vercel bot temporarily deployed to Preview – nomad March 28, 2022 19:49 Inactive

vercel bot deployed to Preview – nomad-storybook-and-ui March 28, 2022 19:49 View deployment

tgross mentioned this pull request Mar 28, 2022

E2E: test exercising node drain behavior for CSI volumes #12384

Merged

tgross marked this pull request as ready for review March 28, 2022 20:19

tgross requested review from shoenig, jazzyfresh and DerekStrickland March 28, 2022 20:19

tgross added this to the 1.3.0 milestone Mar 28, 2022

shoenig approved these changes Mar 29, 2022

tgross merged commit 98e122c into main Mar 29, 2022

tgross deleted the b-csi-volume-detach-order branch March 29, 2022 13:44

tgross mentioned this pull request Mar 29, 2022

CSI: noisy logs for claim releasing operations #11963

Closed

lgfa29 added the backport/1.1.x and backport/1.2.x labels Apr 20, 2022

This was referenced Apr 20, 2022

Backport of CSI: reorder controller volume detachment into release/1.2.x #12700

Merged

Backport of CSI: reorder controller volume detachment into release/1.1.x #12701

Merged

github-actions bot locked as resolved and limited conversation to collaborators Oct 16, 2022
