Sidecar spins on CreateSnapshot() retries instead of backing off #463
Do you see this message in the logs?
I looked at the code. It should get added to the rate-limited queue if syncContent fails, so I want to see which code path it falls through.
I0129 20:42:04.692484 1 snapshot_controller.go:57] synchronizing VolumeSnapshotContent[snapcontent-087c1a00-d0f0-47b9-aa6c-e0aea482fe42]
This behavior can easily be reproduced by taking any CSI plugin that otherwise works and supports snapshots, and replacing its CreateSnapshot() implementation with a stub that always fails, as in the sketch below.
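The original snippet was cut off right after the function signature, so the body here is an assumption; any implementation that always returns a non-nil error (ABORTED is used purely for illustration) is enough to trigger the retry loop:

```go
package plugin

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// X is the hypothetical plugin type from the snippet above.
type X struct{}

// CreateSnapshot always fails, so every call from the sidecar ends in a retry.
func (x *X) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error) {
	// The choice of ABORTED is an assumption, not part of the original report;
	// it is one of the codes the CSI spec describes as retriable.
	return nil, status.Error(codes.Aborted, "injected failure for backoff testing")
}
```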
I also did some experiments by returning a failure at create-snapshot time. In my testing, I checked the logs about 30 minutes after issuing the first create snapshot call. Initially the retry happened instantly, but the wait time gradually increased. You can see that the wait time between two retries grew from less than 1 second to more than 20 minutes by the end of the logs I provided below. I saw one instance where the retry happened quicker than the normal exponential backoff pattern, but I think that could be caused by a different retry from an API update failure. So I think this works as expected, but you need to observe it for a longer period of time.
I wonder if the API update race is happening every time in my system, because I see it retrying very fast. I would have expected the exponential backoff on retries to increase rapidly, so that there were no more than 6-8 attempts in the first minute. In the API update race scenario, the sidecar should not be calling the CSI RPC again, right?
For any failure, including an API update failure, we add the object to the same queue for retries. In an earlier version of the code, we had local retries for some API update failures, but those were removed based on review comments. So the only place to retry is to add the failed object back to the queue, which means it will go through the same code path. The CSI RPC call is likely to happen because we also need it to query the snapshot status.
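For context, the retry mechanism being described is the standard client-go rate-limited workqueue. The sketch below is not the sidecar's actual code, and the base/max delays are assumptions, but it shows the pattern: every failure, whatever its cause, re-queues the object through AddRateLimited, which is where the exponential delay is supposed to come from.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: first retry after ~1s, doubling up to 5min.
	// The actual base/max values used by the sidecar are assumptions here.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute)
	queue := workqueue.NewNamedRateLimitingQueue(limiter, "snapshot-contents")

	// Stand-in for the sidecar's sync logic; it always fails for illustration.
	syncContent := func(key string) error {
		return fmt.Errorf("CreateSnapshot failed for %s", key)
	}

	queue.Add("snapcontent-example")
	for i := 0; i < 5; i++ {
		item, _ := queue.Get()
		key := item.(string)
		if err := syncContent(key); err != nil {
			// Any failure (CSI RPC or API update) re-queues the same object;
			// AddRateLimited is what makes each successive retry wait longer.
			queue.AddRateLimited(key)
		} else {
			// On success the backoff history for this item is reset.
			queue.Forget(key)
		}
		queue.Done(key)
	}
}
```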
We do have a WIP PR to use patch instead of update, which should reduce the API update failures.
So the provisioner sidecar has a tunable exponential backoff, configured with command-line parameters. Perhaps if we could similarly tune the exponential backoff for snapshot creation, individual vendors could pick the values that make sense for them.
That seems like a good one to have.
Adding the feature tag for a tunable exponential backoff option, similar to the option in external-provisioner.
I was just looking at the backoff intervals as well, on csi-snapshotter v3.0.3. Here are the times at which the errors were logged, extracted with:
$ grep 'CreateSnapshot for content snapcontent-2791ac74-c982-44b9-b8d8-688ba304d53e returned error' csi-snapshotter.log | awk '{print $2}'
21:55:39.589792
21:55:42.028352
21:55:44.417520
21:55:46.887035
21:55:49.226302
21:55:52.848028
21:55:55.126746
21:55:57.465546
21:55:59.911948
21:56:02.334527
21:56:04.905345
21:56:07.759896
21:56:10.189627
21:56:12.624762
21:56:14.916758
21:56:17.446945
21:56:19.906131
21:56:22.520878
21:56:25.091758
21:56:27.470534
21:56:30.011793
21:56:33.088768
21:56:35.488390
21:56:37.873061
21:56:40.428656
21:56:43.134805
21:56:45.527603
21:56:47.809527
21:56:50.240355
21:56:52.633075
21:56:55.051978
21:56:57.666444
21:56:59.981564
21:57:02.778037
21:57:05.222418
21:57:08.082219
21:57:10.696750
21:57:13.162859
21:57:15.745052
21:57:18.164811
21:57:20.673906
21:57:23.002543
21:57:25.429743
21:57:28.122906
21:57:30.459396
21:57:32.756425
21:57:35.091645
21:57:37.929807
21:57:40.241717
21:57:42.585739
21:57:45.055356
21:57:47.400831
21:57:49.810663
21:57:52.483982
21:57:54.911784
21:57:57.620382
21:58:00.098047
21:58:02.559989
21:58:04.895840
21:58:07.411678
21:58:09.891206
21:58:12.235630
21:58:14.648107
21:58:17.288319
21:58:20.013886
21:58:22.512694
21:58:24.884563
21:58:27.531291
21:58:30.290701
21:58:33.095274
21:58:35.497059
21:58:37.866017
21:58:40.367113
21:58:42.998914
21:58:45.520828
21:58:47.889714
21:58:50.311488
21:58:52.637957
21:58:55.094148
21:58:57.729583
21:59:00.327871
21:59:02.854171
21:59:05.599186
21:59:08.120658
21:59:10.454666
21:59:13.096934
21:59:15.888033
21:59:18.238658
21:59:20.744985
21:59:23.117121
21:59:25.437758
21:59:28.023577
21:59:30.507894
21:59:32.851363
21:59:35.599867
21:59:37.973700
21:59:40.261752
21:59:42.713521
21:59:45.249707
21:59:47.802625
21:59:50.105562
21:59:52.518262
21:59:54.992278
21:59:57.686250
22:00:00.282056
22:00:02.903912
22:00:05.388200
22:00:07.750759
22:00:10.286197
22:00:12.627039
22:00:15.166131
22:00:17.641594
22:00:20.077325
22:00:22.471386
22:00:24.852883
22:00:27.343572
22:00:29.792029
22:00:32.176672
22:00:34.591792
22:00:37.228799
22:00:39.795738
22:00:42.229760
22:00:44.611724
22:00:47.301011
22:00:49.792606
22:00:52.091762
22:00:54.496205
22:00:57.356536
22:00:59.756100
22:01:02.199763
22:01:04.521966
22:01:07.295616
22:01:09.677624
22:01:12.087630
22:01:14.392254
22:01:16.942539
22:01:19.354215
22:01:21.749776
22:01:24.445497
22:01:26.823309
22:01:29.307724
22:01:31.680076
22:01:34.099324
22:01:36.330232
22:01:38.677283
22:01:41.175093
22:01:43.471393
22:01:45.897096
22:01:48.289606
22:01:51.142338
22:01:53.620022
22:01:56.134705
22:01:58.790913
22:02:01.225088
22:02:03.575694
22:02:06.074191
22:02:08.670215
22:02:10.931605
22:02:13.271043
22:02:15.737657
22:02:18.471792
22:02:21.010482
22:02:23.375430
22:02:25.710244
22:02:28.219244
22:02:30.607653
22:02:33.144179
22:02:35.697718
22:02:38.049554
22:02:40.403767
22:02:43.000646
22:02:45.281615
22:02:47.708862
22:02:50.218188
22:02:52.525963
22:02:55.082989
22:02:57.557194
22:02:59.919528
22:03:02.250721
22:03:04.517812
22:03:06.917349
22:03:09.349974
22:03:11.718953
22:03:14.148610
22:03:16.634722
22:03:18.930114
22:03:21.564827
22:03:23.882118
22:03:26.709886
22:03:28.983457
22:03:31.395475
22:03:33.992849
22:03:36.330384
22:03:38.589034
22:03:41.037077
22:03:43.461391
22:03:45.927420
22:03:48.257964
22:03:50.897147
22:03:53.279978
22:03:55.662521
22:03:58.031044
22:04:00.758235
22:04:03.487800
22:04:06.114100
22:04:08.468415
22:04:11.311551
22:04:13.934978
22:04:16.699009
22:04:19.269702
22:04:21.690781
22:04:24.053183
22:04:26.819341
22:04:29.759985
22:04:32.208225
22:04:34.693535
22:04:37.133663
22:04:39.907979
22:04:42.514110
22:04:44.988434
22:04:47.503043
22:04:49.857438
22:04:52.438136
22:04:55.523408
22:04:58.113566
22:05:00.782505
22:05:03.577490
22:05:06.405897
22:05:08.933023
22:05:11.309751
22:05:13.714408
22:05:16.117977
The intervals are always around 2-3 seconds, with no significant increase. I also checked the code and couldn't tell why the backoff doesn't work. 🤔
I verified that the backoff interval is in fact set correctly (and increasing on each retry). However, what I also noticed while watching the VolumeSnapshotContents is that my object was updated at the same frequency that the snapshotter sidecar got triggered. Doing a diff between two updates, I see that the resource version is changing:
--- volumesnapshotcontent1.yaml 2021-02-08 00:35:40.000000000 +0100
+++ volumesnapshotcontent2.yaml 2021-02-08 00:35:51.000000000 +0100
@@ -43,9 +43,9 @@
f:snapshot.storage.kubernetes.io/volumesnapshot-being-created: {}
manager: csi-snapshotter
operation: Update
- time: "2021-02-07T23:35:38Z"
+ time: "2021-02-07T23:35:47Z"
name: snapcontent-4d9b8e56-7964-4f5f-bbc0-e9cd9c9da988
- resourceVersion: "25731"
+ resourceVersion: "25748"
selfLink: /apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotcontents/snapcontent-4d9b8e56-7964-4f5f-bbc0-e9cd9c9da988
uid: 67c4b8ad-ccdb-43df-b428-399e96d09247
spec:
Could there be something that is updating the VolumeSnapshotContent even when nothing really changes?
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
Here's a PR that makes the rate limiter tunable:
Two new timeout values (retryIntervalStart and retryIntervalMax) have been added to set the rate limiter for the volumesnapshotcontent queue. Fixes kubernetes-csi#463.

```release-note
`retry-interval-start` and `retry-interval-max` arguments are added to the csi-snapshotter sidecar, which control the retry interval of failed volume snapshot creation and deletion. These values set the rate limiter for the volumesnapshotcontent queue.
```

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
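For illustration, here is a rough sketch of how such flags could be wired into the content queue's rate limiter. The flag names come from the release note above; the defaults and the surrounding wiring are assumptions, not the PR's exact code:

```go
package main

import (
	"flag"
	"time"

	"k8s.io/client-go/util/workqueue"
)

var (
	// Default values below are illustrative, not necessarily what the PR chose.
	retryIntervalStart = flag.Duration("retry-interval-start", time.Second,
		"Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max.")
	retryIntervalMax = flag.Duration("retry-interval-max", 5*time.Minute,
		"Maximum retry interval of failed volume snapshot creation or deletion.")
)

// newContentQueue builds a volumesnapshotcontent workqueue whose rate limiter
// is driven by the two command-line flags.
func newContentQueue() workqueue.RateLimitingInterface {
	return workqueue.NewNamedRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(*retryIntervalStart, *retryIntervalMax),
		"csi-snapshotter-content",
	)
}

func main() {
	flag.Parse()
	_ = newContentQueue()
}
```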
According to the spec, if the plugin returns ABORTED from CreateSnapshot(), the sidecar should use exponential backoff for the retries, but it does not. It retries immediately, consuming 100% CPU for as long as the plugin keeps failing. Although the spec only mentions three error codes (ALREADY_EXISTS, ABORTED, and RESOURCE_EXHAUSTED), there are several other possible error codes, and for most of them exponential backoff would be appropriate.
The one we would like to use is UNSUPPORTED, because our plugin does support snapshots, just not for all volumes, so we need a way to fail snapshot requests for volumes that don't allow them while giving useful feedback to the end user. UNSUPPORTED seems like the clearest way to indicate this kind of failure mode, but the sidecar suffers from the same spinning problem when that error is returned.