Problems dealing with snapshot create requests timing out #134
Comments
The issue was encountered while developing and testing the Ceph CSI implementation (details are captured in ceph/ceph-csi#446). Reading the snapshotter sidecar code also suggests that CreateSnapshot is a one-time call: the error returned is stashed in the snapshot object, and hence create is never called again. During testing, we also introduced artificial delays to understand the behavior. If a reproducer is required, please let us know. |
cc @xing-yang |
cc @jingxu97 |
@ShyamsundarR thank you very much for bringing this issue up. The following is the behavior I copied from create volume.
As you mentioned, the reason CreateSnapshot does not retry is to avoid constantly trying to create a new snapshot if the previous operation failed. In the case of timeouts, as you pointed out, there is a potential issue: the snapshot may still be in the process of being created. But the assumption here is that CreateSnapshot returns once the snapshot is cut, which should be very quick. The current timeout is set to 1 second, and we assume that cutting the snapshot should not take this long, so a timeout means that an error occurred during snapshotting and there should be no snapshot available. If you have seen that this does not cover some cases, we might need to change this behavior, which would probably add quite some complexity.
|
The problem is that the client side (sidecar) times out but the server-side (CSI driver) operation continues. When the server-side operation finally returns, the client side doesn't get the response because it gave up on the RPC. Can we do something like repeatedly calling DeleteSnapshot if CreateSnapshot failed, until DeleteSnapshot returns success? I'm unsure how to handle the snapshot handle that is returned by the driver. |
We can't call DeleteSnapshot before CreateSnapshot returns success because DeleteSnapshot needs the snapshot handle. If CreateSnapshot fails or times out, we won't get the handle to delete it. |
The problem is as @msau42 and @xing-yang put it: the client times out, and the client also does not have the Snapshot ID to call DeleteSnapshot. @jingxu97 The current timeout for CreateSnapshot seems to be set to 60 seconds, not 1 second. Treating timeouts as errors does not help, and expecting snapshots to always complete successfully within a certain time may not hold for all systems all the time. (The following is a very theoretical approach to the solution; I have not looked at its feasibility.) |
cc @jsafrane for any ideas |
IMO, the proper solution without changing the CSI spec is the same as in volume provisioning. The snapshotter should retry until it gets a final response and then decide what to do: either delete the snapshot (because it is too old, or because the user deleted the VolumeSnapshot object in the meantime) or create the VolumeSnapshotContent. As you can immediately spot, if the snapshotter is restarted while the snapshot is being cut and after the user deleted the related VolumeSnapshot object, the newly started snapshotter does not know that it should resume creating the snapshot. Volume provisioning has the same issue. We hope that volume taints could help here, and that we could create an empty PV / VolumeSnapshotContent before knowing the real volume / snapshot ID, as a memento that some operation is in progress on the storage backend. |
@ggriffiths will be helping out with this bug fix. Thanks. |
/assign ggriffiths |
Hi @ShyamsundarR, have you tested this with the beta version of the snapshot feature? I tested it and found that this problem is fixed. Can you give it a try? |
Will test and report in a day or two, thanks! |
Tested the v2.0 csi-snapshotter on a Kubernetes v1.17 setup, using a slightly modified ceph-csi to ensure that the CreateSnapshot request times out. Here are the tests and the results.
|
Thanks for the detailed tests. For test cases 5 and 6, are you creating the snapshot dynamically? Do you mean that the VolumeSnapshot API object was deleted, but the VolumeSnapshotContent was not deleted and the physical snapshot resource was not created on the storage system? For the finalizer on the PVC, I'll take a look. |
I did not understand the "dynamic" part. I'll try to clarify my steps better below; I hope that helps.
So, the VolumeSnapshot object and the VolumeSnapshotContent were deleted, but the physical snapshot resource was created and not deleted on the storage system. The test went something like this (start time T):
End state of the above timeline is,
Here is some data and logs from the test: |
Between step T and T+2, do you see a "snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection" finalizer on the VolumeSnapshotContent? Also what finalizers are on VolumeSnapshot? |
|
@xing-yang what I see as the issue is in this snapshotter sidecar code snippet.
This subsequently frees up the snapshot controller to delete the VolumeSnapshotContent and VolumeSnapshot objects, as the finalizer is removed. I believe we still need some more checks for in-flight requests to ensure we do not leak snapshots on the storage provider, as @jsafrane points out here. |
Sure, I'll take a look. |
@ShyamsundarR can you give this fix a try? #261 |
Repeated the tests as in this comment. Tests 1-4 passed, but the new annotation was removed in tests 3 and 4, due to the Aborted error code, as noted in the review here. Test 5 actually failed to delete the snapshot, I suspect due to the annotation I did change the code to not remove the annotation on Aborted error codes before testing, to progress with the test cases. Hence, I will relook at this change and debug it further before calling it a failure at present. As stated in the PR description, the VolumeSnapshot and VolumeSnapshotContent objects are still present, and ideally there is no snapshot leak, barring the delete issue above that needs further resolution. |
@ShyamsundarR Thanks for testing this. In Test 5, was the deletion timestamp added to VolumeSnapshot and VolumeSnapshotContent objects and was the Also did you go back to check it again after a couple of minutes? Just wonder if retries have happened. |
Yes, these were present.
I waited 5+ minutes and checked again just now: the sidecar is still looping on delete, checking for the new annotation and not invoking the DeleteSnapshot CSI call. |
So "ctrl.handler.CreateSnapshot()" never finished successfully and therefore this (https://github.com/kubernetes-csi/external-snapshotter/pull/261/files#diff-de5dbd65778c167a3d05cd17929d6851R368) is never called to remove the I understand that I need to add more error codes to check like here (https://github.com/kubernetes-csi/external-provisioner/blob/efcaee79e47446e38910a3c1a024824387fcf235/pkg/controller/controller.go#L1223-L1245). My expectation is that when create snapshot is finally finished on the storage system, we should get here (https://github.com/kubernetes-csi/external-snapshotter/pull/261/files#diff-de5dbd65778c167a3d05cd17929d6851R368), but it seems that it is not the case? |
The issue seems to be that, on deletion of the snapshot prior to a successful creation, the control flow enters the else part of this check. The code path here This results in the |
Correction, this (taking the else branch) occurs even without deleting the in flight volume snapshot. The condition This leads to leaking the new I am not well versed with the code as yet, but adding the annotation addition/removal code in the code path |
Thanks. Yeah, we call CreateSnapshot to check snapshot status after the initial call. Let me add the new annotation logic to this update status code path as well. |
@xing-yang Tested the latest PRs #261 and #283 by running the test cases detailed here. This comment is to report that there are no further snapshot leaks based on these tests, nor are there any finalizer leaks on the PVC. The patches look good based on these tests. |
When a CSI plugin is passed a CreateSnapshot request and the caller (snapshotter sidecar) times out, the snapshotter sidecar marks this as an error and does not retry the snapshot. Further, as the call only timed out and did not fail, the storage provider may have actually created the said snapshot (although delayed).
When such a snapshot is deleted, no request is sent to the CSI plugin to delete it, because the sidecar cannot issue one without the SnapID. The end result is that the snapshot is leaked on the storage provider.
The question/issue hence is as follows,
Should the snapshot be retried on timeouts from the CreateSnapshot call?
Based on the ready_to_use parameter in the CSI spec [1] and the possibility of an application freeze while the snapshot is taken, I would assume this operation cannot be retried indefinitely. But, as per the spec, the behavior on timeout errors should be a retry, as implemented for volume create and delete operations in the provisioner sidecar [2].
So to fix the potential snapshot leak by the storage provider, should the snapshotter sidecar retry till it gets an error from the plugin or a success with a SnapID, but mark the snapshot as bad/unusable as it was not completed in time (to honor the application freeze times and such)?
[1] CSI spec ready_to_use section: https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter
[2] timeout handling in provisioner sidecar: https://github.com/kubernetes-csi/external-provisioner#csi-error-and-timeout-handling