Cleanup snapshots before volumes #289

timoreimann
Contributor

@timoreimann timoreimann commented Aug 31, 2020

What type of PR is this?

/kind bug

What this PR does / why we need it:

#261 accidentally swapped the cleanup order, deleting volumes before snapshots. This represents a deviation from the previous behavior that not all CSI drivers may be able to handle. This change restores the original order.

Does this PR introduce a user-facing change?:

Restore snapshots-before-volumes cleanup order

/assign @pohly

@apurv15 thanks for reporting

[1] accidentally swapped the cleanup order which represents a deviation
to the previous behavior that not all CSI drivers may be able to handle.
This change restores the original order.

[1]: kubernetes-csi#261
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 31, 2020
@k8s-ci-robot
Contributor

Hi @timoreimann. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 31, 2020
@xing-yang
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 1, 2020
@pohly
Contributor

pohly commented Sep 1, 2020

> This represents a deviation from the previous behavior that not all CSI drivers may be able to handle.

Should they be able to handle this? @xing-yang ?

If yes, then some explicit testing of both sequences (delete snapshot first vs. delete volume first) would be useful.

Having said that, the PR itself makes sense.

@timoreimann
Contributor Author

> Should they be able to handle this?

I asked myself the same question. Happy to help work on a dedicated test, though I also think that might be best done as a separate, follow-up PR.

@xing-yang
Contributor

> > This represents a deviation from the previous behavior that not all CSI drivers may be able to handle.
>
> Should they be able to handle this? @xing-yang ?
> If yes, then some explicit testing of both sequences (delete snapshot first vs. delete volume first) would be useful.
> Having said that, the PR itself makes sense.

Actually, most CSI drivers won't be able to handle this unless they add some logic in their drivers to do soft delete. This is because for most storage systems, a snapshot is local and depends on its volume. So I don't think we should add tests to csi-sanity that most CSI drivers may fail.

@pohly
Contributor

pohly commented Sep 2, 2020

> Actually, most CSI drivers won't be able to handle this unless they add some logic in their drivers to do soft delete.

And do the CSI standard and Kubernetes take that into account, i.e. is a CSI driver that expects a certain order compliant with the spec? What is the user experience in Kubernetes around this?

@apurv15

apurv15 commented Sep 2, 2020

> > Actually, most CSI drivers won't be able to handle this unless they add some logic in their drivers to do soft delete.
>
> And do the CSI standard and Kubernetes take that into account, i.e. is a CSI driver that expects a certain order compliant with the spec? What is the user experience in Kubernetes around this?

Yes. The CSI spec says that when a plugin cannot delete a volume without affecting its snapshots, the volume must not be deleted while a dependent snapshot exists. For such drivers, the order of deletion should be snapshot first and then the source volume. I am referring to the spec at https://github.com/container-storage-interface/spec/blob/master/spec.md

> CSI plugins SHOULD treat volumes independent from their snapshots.
>
> If the Controller Plugin supports deleting a volume without affecting its existing snapshots, then these snapshots MUST still be fully operational and acceptable as sources for new volumes as well as appear on ListSnapshot calls once the volume has been deleted.
>
> When a Controller Plugin does not support deleting a volume without affecting its existing snapshots, then the volume MUST NOT be altered in any way by the request and the operation must return the FAILED_PRECONDITION error code and MAY include meaningful human-readable information in the status.message field.
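
For illustration only, here is a minimal Go sketch of the two DeleteVolume behaviors the quoted spec text describes. Everything in it is an assumption made for the example: `fakeBackend`, its fields, and `ErrFailedPrecondition` are hypothetical stand-ins (a real driver would return a gRPC status with the FAILED_PRECONDITION code), and none of this is taken from csi-test or from any actual driver.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrFailedPrecondition is a hypothetical stand-in for the gRPC
// FAILED_PRECONDITION status code a real CSI driver would return.
var ErrFailedPrecondition = errors.New("FAILED_PRECONDITION")

// fakeBackend is a hypothetical in-memory storage backend.
type fakeBackend struct {
	// independentSnapshots is true if snapshots survive deletion of their source volume.
	independentSnapshots bool
	volumes              map[string]bool   // volume ID -> exists
	snapshots            map[string]string // snapshot ID -> source volume ID
}

// DeleteVolume removes a volume. If snapshots depend on their source volume,
// the volume is left untouched and a FAILED_PRECONDITION-style error is returned.
func (b *fakeBackend) DeleteVolume(volID string) error {
	if !b.volumes[volID] {
		return nil // already gone; deletion is idempotent
	}
	if !b.independentSnapshots {
		for snapID, src := range b.snapshots {
			if src == volID {
				return fmt.Errorf("%w: volume %s has dependent snapshot %s",
					ErrFailedPrecondition, volID, snapID)
			}
		}
	}
	delete(b.volumes, volID)
	return nil
}

// isFailedPrecondition reports whether err wraps ErrFailedPrecondition.
func isFailedPrecondition(err error) bool {
	return errors.Is(err, ErrFailedPrecondition)
}

func main() {
	b := &fakeBackend{
		volumes:   map[string]bool{"vol-1": true},
		snapshots: map[string]string{"snap-1": "vol-1"},
	}
	fmt.Println(b.DeleteVolume("vol-1")) // rejected: snap-1 still depends on vol-1
	delete(b.snapshots, "snap-1")        // delete the snapshot first
	fmt.Println(b.DeleteVolume("vol-1")) // now succeeds
}
```

With `independentSnapshots` set to true, the same call succeeds even while `snap-1` exists, which is the first case in the quoted text.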

@pohly
Contributor

pohly commented Sep 2, 2020

So the CSI spec defines the expected behavior. For csi-sanity that means that we need to have two variants of a "delete volume before snapshot" test: if the driver supports that, the expected outcome is "success". If not, the outcome is "FAILED_PRECONDITION".

But as there is no capability that describes this (right?), we have to make this a tri-state config option: driver supports it/doesn't support it/unknown (= skip test).
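
A rough sketch of what such a tri-state option could look like in Go follows. The type name, constant names, and the `expectedOutcome` helper are all hypothetical; csi-sanity does not necessarily expose anything like this, and the outcome strings are just labels for the example.

```go
package main

import "fmt"

// DeleteOrderSupport is a hypothetical tri-state setting describing whether a
// driver can delete a volume while snapshots of it still exist.
type DeleteOrderSupport int

const (
	SupportUnknown DeleteOrderSupport = iota // zero value: skip the test
	SupportYes                               // driver supports volume-first deletion
	SupportNo                                // driver rejects volume-first deletion
)

// expectedOutcome maps the setting to the result a "delete volume before
// snapshot" test would expect. skip is true when the test should not run.
func expectedOutcome(s DeleteOrderSupport) (code string, skip bool) {
	switch s {
	case SupportYes:
		return "OK", false
	case SupportNo:
		return "FAILED_PRECONDITION", false
	default:
		return "", true
	}
}

func main() {
	for _, s := range []DeleteOrderSupport{SupportUnknown, SupportYes, SupportNo} {
		code, skip := expectedOutcome(s)
		fmt.Printf("setting=%d expected=%q skip=%v\n", s, code, skip)
	}
}
```

Making "unknown" the zero value means a driver that never sets the option simply skips the test, so existing configurations would not start failing.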

But as mentioned before, that's just an idea for a future test. The PR itself is fine, so:
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 2, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, timoreimann

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 2, 2020
@k8s-ci-robot k8s-ci-robot merged commit 935b857 into kubernetes-csi:master Sep 2, 2020
@apurv15

apurv15 commented Sep 7, 2020

A couple of tests are still failing even after this change. Tests that consume a snapshot to create a new PVC or clone fail because snapshot deletion is attempted before the new PVC or clone that uses the snapshot as its source has been cleaned up.

Test 1 failure signature:
Controller Service [Controller Server] CreateVolume
should create volume from an existing source snapshot
/root/go/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:568
STEP: reusing connection to CSI driver at /var/lib/kubelet/plugins/org.veritas.infoscale/csi.sock
STEP: creating mount and staging directories
STEP: creating a snapshot
STEP: creating a volume from source snapshot
cleanup snapshots: deleting snap_d8yu2qym6a776eck2ogk
cleanup snapshots: DeleteSnapshot failed: rpc error: code = FailedPrecondition desc = Could not delete volume snap_d8yu2qym6a776eck2ogk, as it has dependent snapshots ['snapres_g1c52af426281958i969'] with sync in progress

Test 2 failure signature:
Controller Service [Controller Server] CreateVolume
should create volume from an existing source volume
/root/go/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:613
STEP: reusing connection to CSI driver at /var/lib/kubelet/plugins/org.veritas.infoscale/csi.sock
STEP: creating mount and staging directories
STEP: creating a volume
STEP: creating a volume from source volume
cleanup volumes: deleting vol_1rx6i2c780nqcv623pe0
cleanup volumes: DeleteVolume failed: rpc error: code = FailedPrecondition desc = Could not delete vol_1rx6i2c780nqcv623pe0 as it has dependent snapshots ['snap_d8yu2qym6a776eck2ogk']
cleanup volumes: deleting clone_2172w7wiq2y4d00bbu8h

The CSI spec says the following for the case where a snapshot is used as the source for a new PVC and deletion of that snapshot is attempted:
https://github.com/container-storage-interface/spec/blob/master/spec.md

| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Snapshot in use | 9 FAILED_PRECONDITION | Indicates that the snapshot corresponding to the specified snapshot_id could not be deleted because it is in use by another resource. | Caller SHOULD ensure that there are no other resources using the snapshot, and then retry with exponential back off. |
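
The "retry with exponential back off" recovery behavior could be sketched in Go roughly as below. This is an illustration under assumptions, not code from csi-test: `deleteWithBackoff` and `errSnapshotInUse` are hypothetical names, and the sentinel error stands in for a gRPC FAILED_PRECONDITION status.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSnapshotInUse is a hypothetical stand-in for a FAILED_PRECONDITION
// response to DeleteSnapshot.
var errSnapshotInUse = errors.New("FAILED_PRECONDITION: snapshot in use")

// deleteWithBackoff calls del up to attempts times, doubling the wait between
// tries. It retries only on errSnapshotInUse; any other result is returned at once.
func deleteWithBackoff(del func() error, attempts int, initial time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		err = del()
		if err == nil || !errors.Is(err, errSnapshotInUse) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	// Simulated DeleteSnapshot: the snapshot stays in use for the first two calls.
	del := func() error {
		calls++
		if calls < 3 {
			return errSnapshotInUse
		}
		return nil
	}
	err := deleteWithBackoff(del, 5, 10*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

Note that backoff only helps if something else releases the snapshot in the meantime. In the failing tests above nothing does, so the caller still has to delete the volume created from the snapshot first.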

@timoreimann
Contributor Author

timoreimann commented Sep 7, 2020

@apurv15 looking at the diff from my PR #261, I notice that the previous implementation of the first failing test you mentioned deleted the volume created from the snapshot before cleaning up the snapshot. Likewise, the second test deleted the volume created from the source volume first. (You might need to expand the controller.go file in the PR to see the referenced lines.)

Do you think that might be the reason why things don't work for you?

@apurv15

apurv15 commented Sep 9, 2020

@timoreimann
Snapshots, persistent volumes, and volumes created from snapshots have to be deleted in a certain order because these resources depend on each other.
Your first check-in for the cleanup infrastructure broke both tests:

  1. It reversed the order of deletion of the volume and the snapshot.
  2. It reversed the order of deletion of the volume created from the snapshot and the snapshot itself.

Your second check-in fixed the first issue by correcting the order of deletion of the volume and the snapshot. However, the test that creates a volume from the snapshot still fails because the snapshot is deleted before the volume created from it.

A generic cleanup function might not suit all tests, as the required deletion order differs subtly between tests. You could allow passing hints to the cleanup function so that it deletes volumes in a specific order for different tests.
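
One possible shape for such a hint, sketched in Go under assumptions: each tracked resource records the ID of the resource it was created from, and the cleanup derives a deletion order in which dependents always go first. The `resource` type and `cleanupOrder` function are hypothetical and are not the csi-test cleanup implementation.

```go
package main

import "fmt"

// resource is a hypothetical record of something a test created.
type resource struct {
	ID     string
	Kind   string // "volume" or "snapshot"
	Source string // ID of the resource this one was created from; empty if none
}

// cleanupOrder returns resource IDs in an order where every resource is
// deleted before the resource it was created from.
func cleanupOrder(resources []resource) []string {
	// Index the direct dependents of each resource.
	dependents := make(map[string][]string)
	for _, r := range resources {
		if r.Source != "" {
			dependents[r.Source] = append(dependents[r.Source], r.ID)
		}
	}

	var order []string
	visited := make(map[string]bool)

	// visit emits all dependents of id (depth-first) before id itself.
	var visit func(id string)
	visit = func(id string) {
		if visited[id] {
			return
		}
		visited[id] = true
		for _, dep := range dependents[id] {
			visit(dep)
		}
		order = append(order, id)
	}

	for _, r := range resources {
		visit(r.ID)
	}
	return order
}

func main() {
	created := []resource{
		{ID: "vol-1", Kind: "volume"},
		{ID: "snap-1", Kind: "snapshot", Source: "vol-1"},
		{ID: "vol-2", Kind: "volume", Source: "snap-1"},
	}
	fmt.Println(cleanupOrder(created)) // [vol-2 snap-1 vol-1]
}
```

For the scenario in the first failing test this yields: the volume created from the snapshot, then the snapshot, then the source volume.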

@timoreimann
Contributor Author

timoreimann commented Sep 9, 2020

The implementation still allows resources to be deleted explicitly, which some tests already do. It was an oversight that the tests in question regressed, which is why I wanted to confirm with you that the current (deficient) test implementation I pointed at is, in fact, the culprit here.

I'll submit a PR to make the necessary changes and ping you again. I'll run it against our driver for sure; it'd be great if you could do the same for further validation once the fix proposal is out.

@apurv15

apurv15 commented Sep 9, 2020

@timoreimann : Yes, we will test with the diff that you provide. Thanks.

@apurv15

apurv15 commented Sep 28, 2020

@timoreimann : I have not tracked the new changes in csi-test. Did you submit a new PR?

@timoreimann
Contributor Author

@apurv15 sorry, I haven't gotten to it. I'll make sure to submit something tomorrow.

@timoreimann timoreimann deleted the cleanup-snapshots-before-volumes branch January 18, 2021 18:00
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
5 participants