Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems dealing with snapshot create requests timing out #134

Closed
ShyamsundarR opened this issue Jul 2, 2019 · 33 comments · Fixed by #261
Closed

Problems dealing with snapshot create requests timing out #134

ShyamsundarR opened this issue Jul 2, 2019 · 33 comments · Fixed by #261
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ShyamsundarR
Copy link

When a CSI plugin is passed a CreateSnapshot request and the caller (snapshotter sidecar) times out, the snapshotter sidecar marks this as an error and does not retry the snapshot. Further, as the call only timed out and did not fail, the storage provider may have actually created the said snapshot (although delayed).

When such an snapshot is deleted, there are no requests to the CSI plugin to delete the same, which cannot be issued by the sidecar as it does not have the SnapID.

The end result of this is that the snapshot is leaked on the storage provider.

The question/issue hence is as follows,

Should the snapshot be retried on timeouts from the CreateSnapshot call?

Based on the ready_to_use parameter in the CSI spec [1] and possibilities of application freeze as the snapshot is taken, I would assume this operation cannot be done indefinitely. But, also as per the spec timeout errors, the behavior should be a retry, as implemented for volume create and delete operations in the provisioner sidecar [2].

So to fix the potential snapshot leak by the storage provider, should the snapshotter sidecar retry till it gets an error from the plugin or a success with a SnapID, but mark the snapshot as bad/unusable as it was not completed in time (to honor the application freeze times and such)?

[1] CSI spec ready_to_use section: https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter

[2] timeout handling in provisioner sidecar: https://github.com/kubernetes-csi/external-provisioner#csi-error-and-timeout-handling

@ShyamsundarR
Copy link
Author

The issue was faced when developing and testing the Ceph CSI implementation (issue details are captured here: ceph/ceph-csi#446 ).

snapshotter sidecar code reading also suggests that CreateSnapshot is a one time call, and the error returned is stashed in the snapshot object and hence the create never called again.

During the testing, introducing artificial delays was also performed to understand the behavior. If a reproducer is required please do let us know.

@Madhu-1
Copy link
Contributor

Madhu-1 commented Jul 2, 2019

cc @xing-yang

@msau42
Copy link
Collaborator

msau42 commented Jul 2, 2019

cc @jingxu97

@jingxu97
Copy link
Collaborator

jingxu97 commented Jul 2, 2019

@ShyamsundarR thank you very much for bringing this issue up. The following is the behavior I copied from create volume.

ControllerCreateVolume: The call might have timed out just before the driver provisioned a volume and was sending a response. From that reason, timeouts from ControllerCreateVolume is considered as "volume may be provisioned" or "volume is being provisioned in the background." The external-provisioner will retry calling ControllerCreateVolume after exponential backoff until it gets either successful response or final (non-timeout) error that the volume cannot be created.

As you mentioned, the reason CreateSnapshot does not retry is to avoid constantly trying to create new snapshot if the previous operations failed. In case of timeouts, as you pointed out, there is a potential issue that snapshot is being created. But the assumption here is that CreateSnapshot returns once snapshot is cut which should be very quick. The current timeout is set to 1 second and we assume that cutting the snapshot should not take this long, so that timeout means that an error occurred during snapshotting and there should no snapshots available. But if you have seen this does not cover some cases, we might need to change this behavior which probably adds quite some complexity.

For plugins that supports snapshot post processing such as uploading, CreateSnapshot SHOULD return 0 OK and ready_to_use SHOULD be set to false after the snapshot is cut but still being processed.

@msau42
Copy link
Collaborator

msau42 commented Jul 3, 2019

Problem is client side (sidecar) times out but server side (csi driver) operation continues. When the server side operation finally returns, the client side doesn't get the response because it gave up on the rpc.

Can we do something like repeatedly call DeleteSnapshot if CreateSnapshot failed until DeleteSnapshot returns success? Unsure how to handle snapshot handle that is returned by the driver

@xing-yang
Copy link
Collaborator

We can't call DeleteSnapshot before CreateSnapshot returns success because DeleteSnapshot needs the snapshot handle. If CreateSnapshot fails or times out, we won't get the handle to delete it.

@ShyamsundarR
Copy link
Author

The problem is as @msau42 and @xing-yang put it, the client times out, and also the client does not have the Snapshot ID to call DeleteSnapshot

@jingxu97 Current timeouts for CreateSnapshot seem to be set to 60 seconds, and not 1 second. Considering timeouts as errors does not help, or expecting snapshots to always complete successfully in certain time may not be true for all systems all the time.

(following is a very theoretical approach for the solution, I have not looked at the feasibility)
I would suggest marking the VolumeSnapshotContent object as bad/error on a timeout, but retry the CreateSnapshot (like CreateVolume) till an error or the snapshot ID is returned by the plugin, and then garbage collect the same based on the stored state in the VolumeSnapshotContent object. This should satisfy that the snapshot was taken in the required freeze duration and was successful, or otherwise it was not and the content object is marked stale/bad but ensure that the storage backend snapshot will be garbage collected.

@msau42
Copy link
Collaborator

msau42 commented Jul 3, 2019

cc @jsafrane for any ideas

@jsafrane
Copy link
Contributor

jsafrane commented Jul 3, 2019

IMO, proper solution without changing of CSI spec is the same as in volume provisioning. The snapshotter should retry until it gets a final response and decide what to do - either the snapshot is too old or user deleted VolumeSnapshot objects in the meantime and delete it or create VolumeSnapshotContent.

As you can immediately spot, if the snapshotter is restarted while the snapshot is being cut and after user deleted related VolumeSnapshot object, newly started snapshotter does not know that it should resume creating the snapshot. Volume provisioning has the same issue. We hope that volume taints could help here and we could create empty PV / VolumeSnapshotContent before knowing the real volume / snapshot ID as memento that there is some operation in progress on the storage backend.

@xing-yang xing-yang added the kind/bug Categorizes issue or PR as related to a bug. label Jul 8, 2019
@xing-yang
Copy link
Collaborator

@ggriffiths will be helping out with this bug fix. Thanks.

@ggriffiths
Copy link
Member

/assign ggriffiths

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2019
@xing-yang
Copy link
Collaborator

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2019
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2020
@xing-yang
Copy link
Collaborator

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2020
@xing-yang
Copy link
Collaborator

Hi @ShyamsundarR, have you tested this with the beta version of snapshot feature? I tested it and found that this problem is fixed. Can you give a try?

@ShyamsundarR
Copy link
Author

Hi @ShyamsundarR, have you tested this with the beta version of snapshot feature? I tested it and found that this problem is fixed. Can you give a try?

Will test and report in a day or two, thanks!

@ShyamsundarR
Copy link
Author

Tested the v2.0 csi-snapshotter on a kubernetes v1.17 setup, using a slightly modified ceph-csi, to ensure CreateSnapshotRequest times out.

Here are the tests and the results.

  • Test case 1:

    • Create a VolumeSnapshot request
    • Delay first N responses from CSI plugins CreateSnapshotRequest gRPC by X seconds (X greater than --timeout option to csi-snapshotter sidecar)
      • N was set to 2 for testing purposes in the modified ceph CSI plugin, X was 60 seconds, and --timeout was set to 30 seconds
    • Wait till one of the subsequent retries by the csi-snapshotter sidecar succeeds
    • Result: PASS
  • Test case 2:

    • Same as above, but ensure other errors (like ABORTED) behave similarly
    • Result: PASS
  • Test case 3:

    • Same as above, but restart CSI plugin pod (plugin and csi-snapshotter containers restart)
    • Result: PASS
  • Test case 4:

    • Same as above, but restart snapshot controller stateful set
    • Result: PASS
  • Test case 5:

    • Create a VolumeSnapshot request
    • Delay first response from the CSI plugins CreateSnapshotRequest by X seconds (X greater than --timeout option to csi-snapshotter sidecar)
      • N was set to 2 for testing purposes in the modified ceph CSI plugin, X was 45 seconds, and --timeout was set to 30 seconds
    • Post initial timeout error is received by the csi-snapshotter sidecar, delete the created VolumeSnapshot
    • Wait for CreateSnapshotRequest retries till success, and ensure DeleteSnapshot is invoked to cleanup
    • Result: FAIL
      • If VolumeSnapshot object is deleted post initial timeout errors, there are no further CreateSnapshotRequest calls or calls to ensure snapshot created exists, and hence needs to be deleted
      • NOTE: ceph CSI plugin does not support ListSnapshots capability (if that matters)
  • Test case 6:

    • Same as above, except delete VolumeSnapshot object before the initial call to CreateSnapshot times out for the csi-snapshotter
    • Result: FAIL
      • Reasons same as above
  • Further observations post test case 5 and 6:

    • The source PVC was stuck in terminating phase on deletion, this was, afaict, due to a the finalizer snapshot.storage.kubernetes.io/pvc-as-source-protection sill on the PVC
    • IOW, the source PVC and as a result its backing PV is not deleted

@xing-yang
Copy link
Collaborator

Thanks for the details tests. For test case 5 and 6, are you creating the snapshot dynamically? Do you mean that the VolumeSnapshot API object was deleted, but VolumeSnapshotContent was not deleted and physical snapshot resource was not created on the storage system?

For the finalizer on PVC, I'll take a look.

@ShyamsundarR
Copy link
Author

Thanks for the details tests. For test case 5 and 6, are you creating the snapshot dynamically?

I did not understand the "dynamic" part, I'll try to clarify my steps better below, hope that helps.

Do you mean that the VolumeSnapshot API object was deleted, but VolumeSnapshotContent was not deleted and physical snapshot resource was not created on the storage system?

So, VolumeSnapshot object and VolumeSnapshotContent were deleted, but the physical snapshot resource was created and not deleted on the storage system.

The test went something like so, (start time T)

  • T: Create VolumeSnapshot object
    • VolumeSnapshotContent gets created soon after
  • T+1: csi-snapshotter calls storage plugin CreateSnapshot
  • T+2: Delete VolumeSnapshot created @ T
    • At this point the VolumeSnapshotContent object is also deleted
  • T+3: CreateSnapshot times out in csi-snapshotter sidecar
  • T+4: The initial call to CreateSnapshot @ T+1 is completed successfully by the plugin
  • T+5: Plugin responds with success to the CreateVolume, to csi-snapshotter
    • But this response is not processed as the call has timed out @ T+3

End state of the above timeline is,

  • VolumeSnapshot and VolumeSnapshotContent objects are deleted
  • Storage system snapshot is created, and leaked
  • (and the last niggle, the source PVC is left with a finalizer)

Here is some data and logs from the test:
snapshot-leak-issue-134.txt

@xing-yang
Copy link
Collaborator

Between step T and T+2, do you see a "snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection" finalizer on the VolumeSnapshotContent? Also what finalizers are on VolumeSnapshot?

@ShyamsundarR
Copy link
Author

Between step T and T+2, do you see a "snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection" finalizer on the VolumeSnapshotContent? Also what finalizers are on VolumeSnapshot?

  • VolumeSnapshotContent finalizers between T and T+2
    • snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  • VolumeSnapshot finalizers between T and T+2
    • snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    • snapshot.storage.kubernetes.io/volumesnapshot-bound-protection

@ShyamsundarR
Copy link
Author

@xing-yang what I see as the issue, is in this snapshotter side car code snippet.

  • On a delete annotation that is set by the snapshot controller, shouldDelete would return true
  • As the SnapshotCreate request is still in flight (or timed out), content.Status would be nil
  • Hence, the code above would enter the removeContentFinalizer and remove the finalizer

This subsequently frees up the snapshot controller to delete the VolumeSnapshotContent and VolumeSnapshot objects, as the finalizer is removed

I believe we still need some more checks for in-flight requests to ensure we do not leak snapshots on the storage provider, as @jsafrane points out here.

@xing-yang
Copy link
Collaborator

Sure, I'll take a look.

@xing-yang
Copy link
Collaborator

@ShyamsundarR can you give this fix a try? #261

@ShyamsundarR
Copy link
Author

Repeated tests as in this comment.

Tests 1-4 passed, but the new annotation was removed in tests 3 and 4, due the Aborted error code, as noted in the review here.

Test 5 actually failed to delete the snapshot, I suspect due to the annotation snapshot.storage.kubernetes.io/volumesnapshot-being-created: "yes" still left on the VolumeSnapshotContent object.

I did change the code to not remove the annotation on Aborted error codes before testing to progress with the test cases. Hence, will relook at this change and debug it further before calling it a failure at present.

From the PR description, the VolumeSnapshot and VolumeSnapshotContent objects are still present, and ideally there is no snapshot leak, barring the delete issue above that needs further resolution.

@xing-yang
Copy link
Collaborator

xing-yang commented Mar 12, 2020

@ShyamsundarR Thanks for testing this. In Test 5, was the deletion timestamp added to VolumeSnapshot and VolumeSnapshotContent objects and was the snapshot.storage.kubernetes.io/volumesnapshot-being-deleted annotation added to the VolumeSnapshotContent?

Also did you go back to check it again after a couple of minutes? Just wonder if retries have happened.

@ShyamsundarR
Copy link
Author

@ShyamsundarR Thanks for testing this. In Test 5, was the deletion timestamp added to VolumeSnapshot and VolumeSnapshotContent objects and was the snapshot.storage.kubernetes.io/volumesnapshot-being-deleted annotation added to the VolumeSnapshotContent?

Yes, these were present.
(edited yaml output as below)

$ kubectl get Volumesnapshotcontent -o yaml
apiVersion: v1
items:
- apiVersion: snapshot.storage.k8s.io/v1beta1
  kind: VolumeSnapshotContent
  metadata:
    annotations:
      snapshot.storage.kubernetes.io/deletion-secret-name: rook-csi-rbd-provisioner
      snapshot.storage.kubernetes.io/deletion-secret-namespace: rook-ceph
      snapshot.storage.kubernetes.io/volumesnapshot-being-created: "yes"
      snapshot.storage.kubernetes.io/volumesnapshot-being-deleted: "yes"
    creationTimestamp: "2020-03-11T21:20:46Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2020-03-11T21:22:17Z"
    finalizers:
    - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection

Also did you go back to check it again after a couple of minutes? Just wonder if retries have happened.

I waited ~5+ minutes, checked again now, the sidecar is still looping on delete checking for the new annotation and not invoking DeleteSnapshot CSI call.

@xing-yang
Copy link
Collaborator

xing-yang commented Mar 12, 2020

So "ctrl.handler.CreateSnapshot()" never finished successfully and therefore this (https://github.com/kubernetes-csi/external-snapshotter/pull/261/files#diff-de5dbd65778c167a3d05cd17929d6851R368) is never called to remove the volumesnapshot-being-created annotation?

I understand that I need to add more error codes to check like here (https://github.com/kubernetes-csi/external-provisioner/blob/efcaee79e47446e38910a3c1a024824387fcf235/pkg/controller/controller.go#L1223-L1245). My expectation is that when create snapshot is finally finished on the storage system, we should get here (https://github.com/kubernetes-csi/external-snapshotter/pull/261/files#diff-de5dbd65778c167a3d05cd17929d6851R368), but it seems that it is not the case?

@ShyamsundarR
Copy link
Author

The issue seems to be that, on deletion of the snapshot prior to a successful creation, the control flow enters the else part of this check.

The code path here checkandUpdateContentStatus -> checkandUpdateContentStatusOperation -> calls CreateSnapshot which succeeds and this code path does not have the required annotation removal code and checks.

This results in the volumesnapshot-being-created annotation being left behind on the VolumeSnapshotContent, which in turn prevents the delete of the same.

@ShyamsundarR
Copy link
Author

The issue seems to be that, on deletion of the snapshot prior to a successful creation, the control flow enters the else part of this check.

Correction, this (taking the else branch) occurs even without deleting the in flight volume snapshot.

The condition content.Status is non-nil post the initial CreateVolume error (say DeadlineExceeded) is updated on the content object. Thus future retries of the same enters the checkandUpdateContentStatus call, which does not have the annotation addition/removal logic in it.

This leads to leaking the new volumesnapshot-being-created annotation.

I am not well versed with the code as yet, but adding the annotation addition/removal code in the code path checkandUpdateContentStatus -> checkandUpdateContentStatusOperation -> calls CreateSnapshot, should help fix this problem.

@xing-yang
Copy link
Collaborator

Thanks. Yeah, we call CreateSnapshot to check snapshot status after the initial call. Let me add the new annotation logic to this update status code path as well.

@ShyamsundarR
Copy link
Author

@xing-yang Tested latest PRs #261 and #283 running the test cases detailed here.

Comment is to report that there are no further leaks based on these tests, nor are there any finalizer leaks on the PVC. The patches look good based on these tests.

pohly added a commit to pohly/external-snapshotter that referenced this issue Mar 18, 2021
1748b16 Merge pull request kubernetes-csi#136 from pohly/go-1.16
ec844ea remove travis.yml, Go 1.16
df76aba Merge pull request kubernetes-csi#134 from andyzhangx/add-build-arg
e314a56 add build-arg ARCH for building multi-arch images, e.g. ARG ARCH FROM k8s.gcr.io/build-image/debian-base-${ARCH}:v2.1.3

git-subtree-dir: release-tools
git-subtree-split: 1748b16
ggriffiths pushed a commit to ggriffiths/external-snapshotter that referenced this issue Apr 15, 2021
bc0504a Merge pull request kubernetes-csi#140 from jsafrane/remove-unused-k8s-libs
5b1de1a go-get-kubernetes.sh: remove unused k8s libs
49b4269 Merge pull request kubernetes-csi#120 from pohly/add-kubernetes-release
a1e1127 Merge pull request kubernetes-csi#139 from pohly/kind-for-kubernetes-latest
1c0fb09 prow.sh: use KinD main for latest Kubernetes
1d77cfc Merge pull request kubernetes-csi#138 from pohly/kind-update-0.10
bff2fb7 prow.sh: KinD 0.10.0
95eac33 Merge pull request kubernetes-csi#137 from pohly/fix-go-version-check
437e431 verify-go-version.sh: fix check after removal of travis.yml
1748b16 Merge pull request kubernetes-csi#136 from pohly/go-1.16
ec844ea remove travis.yml, Go 1.16
df76aba Merge pull request kubernetes-csi#134 from andyzhangx/add-build-arg
e314a56 add build-arg ARCH for building multi-arch images, e.g. ARG ARCH FROM k8s.gcr.io/build-image/debian-base-${ARCH}:v2.1.3
7bc70e5 Merge pull request kubernetes-csi#129 from pohly/squash-documentation
e0b02e7 README.md: document usage of --squash
316cb95 Merge pull request kubernetes-csi#132 from yiyang5055/bugfix/boilerplate
26e2ab1 fix: default boilerplate path
1add8c1 Merge pull request kubernetes-csi#133 from pohly/kubernetes-1.20-tag
3e811d6 prow.sh: fix "on-master" prow jobs
1d60e77 Merge pull request kubernetes-csi#131 from pohly/kubernetes-1.20-tag
9f10459 prow.sh: support building Kubernetes for a specific version
f7e7ee4 docs: steps for adding testing against new Kubernetes release
fe1f284 Merge pull request kubernetes-csi#121 from kvaps/namespace-check
8fdf0f7 Merge pull request kubernetes-csi#128 from fengzixu/master
1c94220 fix: fix a bug of csi-sanity
a4c41e6 Merge pull request kubernetes-csi#127 from pohly/fix-boilerplate
ece0f50 check namespace for snapshot-controller
dbd8967 verify-boilerplate.sh: fix path to script
9289fd1 Merge pull request kubernetes-csi#125 from sachinkumarsingh092/optional-spelling-boilerplate-checks
ad29307 Make the spelling and boilerplate checks optional
5f06d02 Merge pull request kubernetes-csi#124 from sachinkumarsingh092/fix-spellcheck-boilerplate-tests
48186eb Fix spelling and boilerplate errors
71690af Merge pull request kubernetes-csi#122 from sachinkumarsingh092/include-spellcheck-boilerplate-tests
981be3f Adding spelling and boilerplate checks.
2bb7525 Merge pull request kubernetes-csi#117 from fengzixu/master
4ab8b15 use the tag to replace commit of csi-test
5d74e45 change the csi-test import path to v4
7dcd0a9 upgrade csi-test to v4.0.2

git-subtree-dir: release-tools
git-subtree-split: bc0504a
xing-yang pushed a commit to xing-yang/external-snapshotter that referenced this issue Jul 26, 2021
kaovilai pushed a commit to kaovilai/external-snapshotter that referenced this issue Feb 7, 2024
…ncy-openshift-4.15-ose-csi-external-snapshotter

OCPBUGS-25521: Updating ose-csi-external-snapshotter-container image to be consistent with ART
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
9 participants