CSI: ensure initial unpublish state is checkpointed #14675

tgross · 2022-09-23T16:02:14Z

A test flake revealed a bug in the CSI unpublish workflow: when an unpublish
request comes from a client that has already completed the node-unpublish step,
the claim is not checkpointed if the controller-unpublish step fails. As a
result, the volume claim is not released until the next GC.

This changeset also ensures we're using a new snapshot after each write to raft,
and fixes two timing issues in the test: the volume watcher could unpublish
before the unpublish RPC was sent, and we didn't wait long enough in
resource-restricted environments like GHA.
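
As a rough illustration of why a new snapshot is needed after each write, here is a small sketch. It assumes a toy in-memory store with copy-on-snapshot semantics; `Store`, `Snapshot`, `Write`, and `Claim` are hypothetical names for this example and do not reflect Nomad's state store API.

```go
package main

import "fmt"

// Store is a hypothetical in-memory stand-in for replicated state.
type Store struct {
	index  uint64
	claims map[string]string
}

// Snapshot is a point-in-time copy; later writes are not visible in it.
type Snapshot struct {
	Index  uint64
	claims map[string]string
}

func NewStore() *Store { return &Store{claims: map[string]string{}} }

// Snapshot copies the current claims so the result is immutable.
func (s *Store) Snapshot() *Snapshot {
	c := make(map[string]string, len(s.claims))
	for k, v := range s.claims {
		c[k] = v
	}
	return &Snapshot{Index: s.index, claims: c}
}

// Write applies a change and bumps the index, as a committed write would.
func (s *Store) Write(id, state string) uint64 {
	s.index++
	s.claims[id] = state
	return s.index
}

// Claim returns the claim state as of the snapshot.
func (snap *Snapshot) Claim(id string) string { return snap.claims[id] }

func main() {
	store := NewStore()
	store.Write("alloc1", "taken")

	stale := store.Snapshot()
	store.Write("alloc1", "node-detached")
	fresh := store.Snapshot() // re-snapshot after the write

	// The stale snapshot still reports the old state; the fresh one does not.
	fmt.Println(stale.Claim("alloc1"), fresh.Claim("alloc1"))
}
```

Any decision made from the stale snapshot after the second write would be based on a claim state that is no longer current.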

Two notes for reviewers:

- This is somewhat related to #14484 ("CSI: failed allocation should not block its own controller unpublish") but is really a distinct bug, so I've given it its own changelog and it'll need backports.
- This doesn't fix the longstanding issue of Nomad thumbing its nose at the concurrency requirements of the CSI spec, in that the volumewatcher can kick off after we've checkpointed. That doesn't actually result in bad behavior so long as plugins are mostly spec-compliant themselves, but... I've got some thoughts about fixing that, but it's not going to land in 1.4.0.

DerekStrickland

LGTM

github-actions · 2023-01-26T02:16:06Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

vercel bot deployed to Preview – nomad-storybook-and-ui September 23, 2022 16:05

tgross changed the title ~~CSI: ensure we're using new snapshots after checkpoint~~ CSI: fix test flake in TestCSIVolumeEndpoint_Unpublish Sep 23, 2022

tgross force-pushed the csi-flaky branch from 87d22e7 to b721642 Compare September 23, 2022 17:15

vercel bot deployed to Preview – nomad-storybook-and-ui September 23, 2022 17:18

tgross force-pushed the csi-flaky branch from b721642 to 18e1cc4 Compare September 26, 2022 20:53

tgross changed the title ~~CSI: fix test flake in TestCSIVolumeEndpoint_Unpublish~~ CSI: ensure initial unpublish state is checkpointed Sep 26, 2022

tgross added this to the 1.4.x milestone Sep 26, 2022

tgross added the theme/storage and type/bug labels Sep 26, 2022

vercel bot deployed to Preview – nomad-storybook-and-ui September 26, 2022 20:56

tgross force-pushed the csi-flaky branch from 18e1cc4 to 1ad9d40 Compare September 27, 2022 11:30

tgross force-pushed the csi-flaky branch from 1ad9d40 to ff845bd Compare September 27, 2022 11:33

vercel bot deployed to Preview – nomad-storybook-and-ui September 27, 2022 11:36

tgross added the backport/1.1.x, backport/1.2.x, and backport/1.3.x labels Sep 27, 2022

tgross modified the milestones: 1.4.x, 1.4.0 Sep 27, 2022

tgross marked this pull request as ready for review September 27, 2022 11:46

tgross requested review from DerekStrickland and lgfa29 September 27, 2022 11:46

DerekStrickland approved these changes Sep 27, 2022

tgross merged commit b4cd9af into main Sep 27, 2022

tgross deleted the csi-flaky branch September 27, 2022 12:43

github-actions bot locked as resolved and limited conversation to collaborators Jan 26, 2023
