Sidecar spins on CreateSnapshot() retries instead of backing off #463
Do you see this message in the logs?
I looked at the code. It should get added to the rate-limited queue if syncContent fails, so I want to see which code path it falls through.
I0129 20:42:04.692484 1 snapshot_controller.go:57] synchronizing VolumeSnapshotContent[snapcontent-087c1a00-d0f0-47b9-aa6c-e0aea482fe42]
This behavior can easily be reproduced by taking any CSI plugin that otherwise works and supports snapshots, and replacing its CreateSnapshot() implementation with a stub that always fails, as in the sketch below.
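The original snippet was cut off right after the function signature, so the body here is an assumption; any implementation that always returns a non-nil error (ABORTED is used purely for illustration) is enough to trigger the retry loop:

```go
package plugin

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// X is the hypothetical plugin type from the snippet above.
type X struct{}

// CreateSnapshot always fails, so every call from the sidecar ends in a retry.
func (x *X) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error) {
	// The choice of ABORTED is an assumption, not part of the original report;
	// it is one of the codes the CSI spec describes as retriable.
	return nil, status.Error(codes.Aborted, "injected failure for backoff testing")
}
```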
I also did some experiments by returning a failure at create-snapshot time. In my testing, I checked the logs about 30 minutes after issuing the first create snapshot call. Initially the retry happened instantly, but the wait time gradually increased. You can see that the wait time between two retries grew from less than 1 second to more than 20 minutes by the end of the logs I provided below. I saw one instance where the retry happened quicker than the normal exponential backoff pattern, but I think that could be caused by a different retry from an API update failure. So I think this works as expected, but you need to observe it for a longer period of time.
I wonder if the API update race is happening every time in my system, because I see it retrying very fast. I would have expected the exponential backoff on retries to increase rapidly, so that there were no more than 6-8 attempts in the first minute. In the API update race scenario, the sidecar should not be calling the CSI RPC again, right?
For any failure, including an API update failure, we add the object to the same queue for retries. In an earlier version of the code, we had local retries for some API update failures, but those were removed based on review comments. So the only place to retry is to add the failed object back to the queue, which means it will go through the same code path. The CSI RPC call is likely to happen because we also need it to query the snapshot status.
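For context, the retry mechanism being described is the standard client-go rate-limited workqueue. The sketch below is not the sidecar's actual code, and the base/max delays are assumptions, but it shows the pattern: every failure, whatever its cause, re-queues the object through AddRateLimited, which is where the exponential delay is supposed to come from.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: first retry after ~1s, doubling up to 5min.
	// The actual base/max values used by the sidecar are assumptions here.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute)
	queue := workqueue.NewNamedRateLimitingQueue(limiter, "snapshot-contents")

	// Stand-in for the sidecar's sync logic; it always fails for illustration.
	syncContent := func(key string) error {
		return fmt.Errorf("CreateSnapshot failed for %s", key)
	}

	queue.Add("snapcontent-example")
	for i := 0; i < 5; i++ {
		item, _ := queue.Get()
		key := item.(string)
		if err := syncContent(key); err != nil {
			// Any failure (CSI RPC or API update) re-queues the same object;
			// AddRateLimited is what makes each successive retry wait longer.
			queue.AddRateLimited(key)
		} else {
			// On success the backoff history for this item is reset.
			queue.Forget(key)
		}
		queue.Done(key)
	}
}
```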
We do have a WIP PR to use patch instead of update, which should reduce the API update failures.
So the provisioner sidecar has a tunable exponential backoff, configured with command-line parameters. Perhaps if we could similarly tune the exponential backoff for snapshot creation, individual vendors could pick the values that make sense for them.
That seems like a good one to have.
Adding the feature tag for a tunable exponential backoff option, similar to the option in external-provisioner.
I was just looking at the backoff intervals as well, on csi-snapshotter v3.0.3. Here are the times at which the errors were logged, extracted with:
$ grep 'CreateSnapshot for content snapcontent-2791ac74-c982-44b9-b8d8-688ba304d53e returned error' csi-snapshotter.log | awk '{print $2}'
21:55:39.589792
21:55:42.028352
21:55:44.417520
21:55:46.887035
21:55:49.226302
21:55:52.848028
21:55:55.126746
21:55:57.465546
21:55:59.911948
21:56:02.334527
21:56:04.905345
21:56:07.759896
21:56:10.189627
21:56:12.624762
21:56:14.916758
21:56:17.446945
21:56:19.906131
21:56:22.520878
21:56:25.091758
21:56:27.470534
21:56:30.011793
21:56:33.088768
21:56:35.488390
21:56:37.873061
21:56:40.428656
21:56:43.134805
21:56:45.527603
21:56:47.809527
21:56:50.240355
21:56:52.633075
21:56:55.051978
21:56:57.666444
21:56:59.981564
21:57:02.778037
21:57:05.222418
21:57:08.082219
21:57:10.696750
21:57:13.162859
21:57:15.745052
21:57:18.164811
21:57:20.673906
21:57:23.002543
21:57:25.429743
21:57:28.122906
21:57:30.459396
21:57:32.756425
21:57:35.091645
21:57:37.929807
21:57:40.241717
21:57:42.585739
21:57:45.055356
21:57:47.400831
21:57:49.810663
21:57:52.483982
21:57:54.911784
21:57:57.620382
21:58:00.098047
21:58:02.559989
21:58:04.895840
21:58:07.411678
21:58:09.891206
21:58:12.235630
21:58:14.648107
21:58:17.288319
21:58:20.013886
21:58:22.512694
21:58:24.884563
21:58:27.531291
21:58:30.290701
21:58:33.095274
21:58:35.497059
21:58:37.866017
21:58:40.367113
21:58:42.998914
21:58:45.520828
21:58:47.889714
21:58:50.311488
21:58:52.637957
21:58:55.094148
21:58:57.729583
21:59:00.327871
21:59:02.854171
21:59:05.599186
21:59:08.120658
21:59:10.454666
21:59:13.096934
21:59:15.888033
21:59:18.238658
21:59:20.744985
21:59:23.117121
21:59:25.437758
21:59:28.023577
21:59:30.507894
21:59:32.851363
21:59:35.599867
21:59:37.973700
21:59:40.261752
21:59:42.713521
21:59:45.249707
21:59:47.802625
21:59:50.105562
21:59:52.518262
21:59:54.992278
21:59:57.686250
22:00:00.282056
22:00:02.903912
22:00:05.388200
22:00:07.750759
22:00:10.286197
22:00:12.627039
22:00:15.166131
22:00:17.641594
22:00:20.077325
22:00:22.471386
22:00:24.852883
22:00:27.343572
22:00:29.792029
22:00:32.176672
22:00:34.591792
22:00:37.228799
22:00:39.795738
22:00:42.229760
22:00:44.611724
22:00:47.301011
22:00:49.792606
22:00:52.091762
22:00:54.496205
22:00:57.356536
22:00:59.756100
22:01:02.199763
22:01:04.521966
22:01:07.295616
22:01:09.677624
22:01:12.087630
22:01:14.392254
22:01:16.942539
22:01:19.354215
22:01:21.749776
22:01:24.445497
22:01:26.823309
22:01:29.307724
22:01:31.680076
22:01:34.099324
22:01:36.330232
22:01:38.677283
22:01:41.175093
22:01:43.471393
22:01:45.897096
22:01:48.289606
22:01:51.142338
22:01:53.620022
22:01:56.134705
22:01:58.790913
22:02:01.225088
22:02:03.575694
22:02:06.074191
22:02:08.670215
22:02:10.931605
22:02:13.271043
22:02:15.737657
22:02:18.471792
22:02:21.010482
22:02:23.375430
22:02:25.710244
22:02:28.219244
22:02:30.607653
22:02:33.144179
22:02:35.697718
22:02:38.049554
22:02:40.403767
22:02:43.000646
22:02:45.281615
22:02:47.708862
22:02:50.218188
22:02:52.525963
22:02:55.082989
22:02:57.557194
22:02:59.919528
22:03:02.250721
22:03:04.517812
22:03:06.917349
22:03:09.349974
22:03:11.718953
22:03:14.148610
22:03:16.634722
22:03:18.930114
22:03:21.564827
22:03:23.882118
22:03:26.709886
22:03:28.983457
22:03:31.395475
22:03:33.992849
22:03:36.330384
22:03:38.589034
22:03:41.037077
22:03:43.461391
22:03:45.927420
22:03:48.257964
22:03:50.897147
22:03:53.279978
22:03:55.662521
22:03:58.031044
22:04:00.758235
22:04:03.487800
22:04:06.114100
22:04:08.468415
22:04:11.311551
22:04:13.934978
22:04:16.699009
22:04:19.269702
22:04:21.690781
22:04:24.053183
22:04:26.819341
22:04:29.759985
22:04:32.208225
22:04:34.693535
22:04:37.133663
22:04:39.907979
22:04:42.514110
22:04:44.988434
22:04:47.503043
22:04:49.857438
22:04:52.438136
22:04:55.523408
22:04:58.113566
22:05:00.782505
22:05:03.577490
22:05:06.405897
22:05:08.933023
22:05:11.309751
22:05:13.714408
22:05:16.117977
The intervals are always around 2-3 seconds, with no significant increase. I also checked the code and couldn't tell why the backoff doesn't work. 🤔
I verified that the backoff interval is in fact set correctly (and increasing on each retry). However, what I also noticed while watching the VolumeSnapshotContents is that my object was updated at the same frequency that the snapshotter sidecar got triggered. Doing a diff between two updates, I see that the resource version is changing:
--- volumesnapshotcontent1.yaml 2021-02-08 00:35:40.000000000 +0100
+++ volumesnapshotcontent2.yaml 2021-02-08 00:35:51.000000000 +0100
@@ -43,9 +43,9 @@
f:snapshot.storage.kubernetes.io/volumesnapshot-being-created: {}
manager: csi-snapshotter
operation: Update
- time: "2021-02-07T23:35:38Z"
+ time: "2021-02-07T23:35:47Z"
name: snapcontent-4d9b8e56-7964-4f5f-bbc0-e9cd9c9da988
- resourceVersion: "25731"
+ resourceVersion: "25748"
selfLink: /apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotcontents/snapcontent-4d9b8e56-7964-4f5f-bbc0-e9cd9c9da988
uid: 67c4b8ad-ccdb-43df-b428-399e96d09247
spec:
Could there be something that is updating the VolumeSnapshotContent even when nothing really changes?
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
Here's a PR that makes the rate limiter tunable:
Two new timeout values (retryIntervalStart and retryIntervalMax) have been added to set the rate limiter for the volumesnapshotcontent queue. Fixes kubernetes-csi#463.

```release-note
`retry-interval-start` and `retry-interval-max` arguments are added to the csi-snapshotter sidecar, which control the retry interval of failed volume snapshot creation and deletion. These values set the rate limiter for the volumesnapshotcontent queue.
```

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
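For illustration, here is a rough sketch of how such flags could be wired into the content queue's rate limiter. The flag names come from the release note above; the defaults and the surrounding wiring are assumptions, not the PR's exact code:

```go
package main

import (
	"flag"
	"time"

	"k8s.io/client-go/util/workqueue"
)

var (
	// Default values below are illustrative, not necessarily what the PR chose.
	retryIntervalStart = flag.Duration("retry-interval-start", time.Second,
		"Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max.")
	retryIntervalMax = flag.Duration("retry-interval-max", 5*time.Minute,
		"Maximum retry interval of failed volume snapshot creation or deletion.")
)

// newContentQueue builds a volumesnapshotcontent workqueue whose rate limiter
// is driven by the two command-line flags.
func newContentQueue() workqueue.RateLimitingInterface {
	return workqueue.NewNamedRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(*retryIntervalStart, *retryIntervalMax),
		"csi-snapshotter-content",
	)
}

func main() {
	flag.Parse()
	_ = newContentQueue()
}
```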
According to the spec, if the plugin returns ABORTED from CreateSnapshot(), the sidecar should use exponential backoff for the retries, but it does not. It retries immediately, consuming 100% CPU for as long as the plugin keeps failing. Although the spec only mentions three error codes (ALREADY_EXISTS, ABORTED, and RESOURCE_EXHAUSTED), there are several other possible error codes, and for most of them exponential backoff would be appropriate.
The one we would like to use is UNSUPPORTED, because our plugin does support snapshots, just not for all volumes, so we need a way to fail snapshot requests for volumes that don't allow them while giving useful feedback to the end user. UNSUPPORTED seems like the clearest way to indicate this kind of failure mode, but the sidecar suffers from the same spinning problem when that error is returned.