Enable idempotency for Snapshot create request #1883

Closed
Yuggupta27 opened this issue Feb 19, 2021 · 15 comments · Fixed by #2260
Labels
bug (Something isn't working) · component/cephfs (Issues related to CephFS) · component/rbd (Issues related to RBD) · Priority-0 (highest priority issue)

Comments

@Yuggupta27
Contributor

Describe the bug

On multiple snapshot creation requests, if a request fails for some reason, a new snapshot ID is generated for the retry.
If the later request then succeeds, the external-snapshotter sets ReadyToUse to true on the old
snapshotHandle and not on the new one.

[ygupta@localhost rbd]$ klo csi-rbdplugin-provisioner-8954fb89b-rsh9j csi-rbdplugin | grep "snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110"
I0217 09:26:36.985087       1 utils.go:132] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC call: /csi.v1.Controller/CreateSnapshot
I0217 09:26:36.986320       1 utils.go:133] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC request: {"name":"snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110","parameters":{"clusterID":"myClusterID","pool":"test-pool"},"secrets":"***stripped***","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}
I0217 09:26:36.990708       1 omap.go:84] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="replicapool", namespace="", name="csi.volume.d22f078e-7101-11eb-9ab8-0242ac110010"): map[csi.imageid:251f48cd86b1 csi.imagename:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volname:pvc-208e495f-6031-48d7-94ae-011e6e52862e csi.volume.owner:default]
I0217 09:26:37.014807       1 omap.go:84] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snaps.default"): map[]
I0217 09:26:37.050289       1 omap.go:148] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 set omap keys (pool="test-pool", namespace="", name="csi.snaps.default"): map[csi.snap.snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110:3bb2fef2-7102-11eb-9ab8-0242ac110010])
I0217 09:26:37.058222       1 omap.go:148] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 set omap keys (pool="test-pool", namespace="", name="csi.snap.3bb2fef2-7102-11eb-9ab8-0242ac110010"): map[csi.imagename:csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 csi.snapname:snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 csi.source:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volume.owner:default])
I0217 09:26:37.058552       1 rbd_journal.go:377] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 generated Volume ID (0001-000b-myClusterID-0000000000000006-3bb2fef2-7102-11eb-9ab8-0242ac110010) and image name (csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010) for request name (snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110)
I0217 09:26:37.058820       1 rbd_util.go:887] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap create replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:38.089340       1 rbd_util.go:932] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: clone replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 test-pool/csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 (features: [deep-flatten layering]) using mon 10.108.44.90:6789
I0217 09:26:38.184061       1 rbd_util.go:899] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap rm replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:38.264739       1 rbd_util.go:887] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap create test-pool/csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010@csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:39.196653       1 omap.go:148] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 set omap keys (pool="test-pool", namespace="", name="csi.snap.3bb2fef2-7102-11eb-9ab8-0242ac110010"): map[csi.imageid:251fa313e3fb])
I0217 09:26:39.196726       1 rbd_util.go:332] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 executing [rbd task add flatten test-pool/csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=/tmp/csi/keys/keyfile-838421790 -m 10.108.44.90:6789] for image (csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010) using mon 10.108.44.90:6789, pool test-pool
I0217 09:26:39.992141       1 cephcmds.go:59] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 command succeeded: ceph [rbd task add flatten test-pool/csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=***stripped*** -m 10.108.44.90:6789]
I0217 09:26:39.997530       1 utils.go:138] ID: 139 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC response: {"snapshot":{"creation_time":{"nanos":126081631,"seconds":1613553998},"size_bytes":1073741824,"snapshot_id":"0001-000b-myClusterID-0000000000000006-3bb2fef2-7102-11eb-9ab8-0242ac110010","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}}
I0217 09:26:40.107741       1 utils.go:132] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC call: /csi.v1.Controller/CreateSnapshot
I0217 09:26:40.108136       1 utils.go:133] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC request: {"name":"snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110","parameters":{"clusterID":"myClusterID","pool":"test-pool"},"secrets":"***stripped***","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}
I0217 09:26:40.120611       1 omap.go:84] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="replicapool", namespace="", name="csi.volume.d22f078e-7101-11eb-9ab8-0242ac110010"): map[csi.imageid:251f48cd86b1 csi.imagename:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volname:pvc-208e495f-6031-48d7-94ae-011e6e52862e csi.volume.owner:default]
I0217 09:26:40.161953       1 omap.go:84] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snaps.default"): map[csi.snap.snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110:3bb2fef2-7102-11eb-9ab8-0242ac110010]
I0217 09:26:40.162684       1 omap.go:84] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snap.3bb2fef2-7102-11eb-9ab8-0242ac110010"): map[csi.imageid:251fa313e3fb csi.imagename:csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 csi.snapname:snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 csi.source:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volume.owner:default]
I0217 09:26:40.325935       1 rbd_util.go:332] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 executing [rbd task add flatten test-pool/csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=/tmp/csi/keys/keyfile-270013413 -m 10.108.44.90:6789] for image (csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010) using mon 10.108.44.90:6789, pool test-pool
I0217 09:26:41.070941       1 cephcmds.go:59] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 command succeeded: ceph [rbd task add flatten test-pool/csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=***stripped*** -m 10.108.44.90:6789]
E0217 09:26:41.071016       1 utils.go:136] ID: 141 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC error: rpc error: code = Internal desc = flatten in progress: flatten is in progress for image csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010
I0217 09:26:41.085672       1 utils.go:132] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC call: /csi.v1.Controller/CreateSnapshot
I0217 09:26:41.085759       1 utils.go:133] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC request: {"name":"snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110","parameters":{"clusterID":"myClusterID","pool":"test-pool"},"secrets":"***stripped***","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}
I0217 09:26:41.087452       1 omap.go:84] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="replicapool", namespace="", name="csi.volume.d22f078e-7101-11eb-9ab8-0242ac110010"): map[csi.imageid:251f48cd86b1 csi.imagename:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volname:pvc-208e495f-6031-48d7-94ae-011e6e52862e csi.volume.owner:default]
I0217 09:26:41.112204       1 omap.go:84] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snaps.default"): map[csi.snap.snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110:3bb2fef2-7102-11eb-9ab8-0242ac110010]
I0217 09:26:41.113539       1 omap.go:84] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snap.3bb2fef2-7102-11eb-9ab8-0242ac110010"): map[csi.imageid:251fa313e3fb csi.imagename:csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 csi.snapname:snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 csi.source:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volume.owner:default]
I0217 09:26:41.157947       1 rbd_util.go:899] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap rm replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:41.202514       1 rbd_util.go:899] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap rm replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:41.238417       1 rbd_util.go:361] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: delete csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789, pool test-pool
I0217 09:26:41.295164       1 rbd_util.go:332] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 executing [rbd task add trash remove test-pool/251fa313e3fb --id admin --keyfile=/tmp/csi/keys/keyfile-532455424 -m 10.108.44.90:6789] for image (csi-snap-3bb2fef2-7102-11eb-9ab8-0242ac110010) using mon 10.108.44.90:6789, pool test-pool
I0217 09:26:41.956085       1 cephcmds.go:59] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 command succeeded: ceph [rbd task add trash remove test-pool/251fa313e3fb --id admin --keyfile=***stripped*** -m 10.108.44.90:6789]
I0217 09:26:41.974708       1 omap.go:118] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 removed omap keys (pool="test-pool", namespace="", name="csi.snaps.default"): [csi.snap.snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110]
I0217 09:26:42.012765       1 omap.go:148] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 set omap keys (pool="test-pool", namespace="", name="csi.snaps.default"): map[csi.snap.snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110:3ea79a3b-7102-11eb-9ab8-0242ac110010])
I0217 09:26:42.025126       1 omap.go:148] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 set omap keys (pool="test-pool", namespace="", name="csi.snap.3ea79a3b-7102-11eb-9ab8-0242ac110010"): map[csi.imagename:csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 csi.snapname:snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 csi.source:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volume.owner:default])
I0217 09:26:42.025295       1 rbd_journal.go:377] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 generated Volume ID (0001-000b-myClusterID-0000000000000006-3ea79a3b-7102-11eb-9ab8-0242ac110010) and image name (csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010) for request name (snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110)
I0217 09:26:42.025494       1 rbd_util.go:887] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap create replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:42.180025       1 rbd_util.go:932] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: clone replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 test-pool/csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 (features: [layering deep-flatten]) using mon 10.108.44.90:6789
I0217 09:26:42.271802       1 rbd_util.go:899] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap rm replicapool/csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010@csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:42.321405       1 rbd_util.go:887] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 rbd: snap create test-pool/csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010@csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 using mon 10.108.44.90:6789
I0217 09:26:43.270822       1 omap.go:148] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 set omap keys (pool="test-pool", namespace="", name="csi.snap.3ea79a3b-7102-11eb-9ab8-0242ac110010"): map[csi.imageid:251fa313e3fb])
I0217 09:26:43.270923       1 rbd_util.go:332] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 executing [rbd task add flatten test-pool/csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=/tmp/csi/keys/keyfile-532455424 -m 10.108.44.90:6789] for image (csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010) using mon 10.108.44.90:6789, pool test-pool
I0217 09:26:45.058281       1 cephcmds.go:59] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 command succeeded: ceph [rbd task add flatten test-pool/csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=***stripped*** -m 10.108.44.90:6789]
I0217 09:26:45.058922       1 utils.go:138] ID: 143 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC response: {"snapshot":{"creation_time":{"nanos":224265991,"seconds":1613554002},"size_bytes":1073741824,"snapshot_id":"0001-000b-myClusterID-0000000000000006-3ea79a3b-7102-11eb-9ab8-0242ac110010","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}}
I0217 09:26:45.121447       1 utils.go:132] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC call: /csi.v1.Controller/CreateSnapshot
I0217 09:26:45.121535       1 utils.go:133] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC request: {"name":"snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110","parameters":{"clusterID":"myClusterID","pool":"test-pool"},"secrets":"***stripped***","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}
I0217 09:26:45.131987       1 omap.go:84] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="replicapool", namespace="", name="csi.volume.d22f078e-7101-11eb-9ab8-0242ac110010"): map[csi.imageid:251f48cd86b1 csi.imagename:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volname:pvc-208e495f-6031-48d7-94ae-011e6e52862e csi.volume.owner:default]
I0217 09:26:45.205455       1 omap.go:84] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snaps.default"): map[csi.snap.snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110:3ea79a3b-7102-11eb-9ab8-0242ac110010]
I0217 09:26:45.208197       1 omap.go:84] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snap.3ea79a3b-7102-11eb-9ab8-0242ac110010"): map[csi.imageid:251fa313e3fb csi.imagename:csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 csi.snapname:snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 csi.source:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volume.owner:default]
I0217 09:26:45.497198       1 rbd_util.go:332] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 executing [rbd task add flatten test-pool/csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=/tmp/csi/keys/keyfile-071172959 -m 10.108.44.90:6789] for image (csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010) using mon 10.108.44.90:6789, pool test-pool
I0217 09:26:46.322812       1 cephcmds.go:59] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 command succeeded: ceph [rbd task add flatten test-pool/csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 --id admin --keyfile=***stripped*** -m 10.108.44.90:6789]
E0217 09:26:46.322938       1 utils.go:136] ID: 145 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC error: rpc error: code = Internal desc = flatten in progress: flatten is in progress for image csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010
I0217 09:26:46.386389       1 utils.go:132] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC call: /csi.v1.Controller/CreateSnapshot
I0217 09:26:46.387343       1 utils.go:133] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC request: {"name":"snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110","parameters":{"clusterID":"myClusterID","pool":"test-pool"},"secrets":"***stripped***","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}
I0217 09:26:46.404758       1 omap.go:84] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="replicapool", namespace="", name="csi.volume.d22f078e-7101-11eb-9ab8-0242ac110010"): map[csi.imageid:251f48cd86b1 csi.imagename:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volname:pvc-208e495f-6031-48d7-94ae-011e6e52862e csi.volume.owner:default]
I0217 09:26:46.448164       1 omap.go:84] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snaps.default"): map[csi.snap.snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110:3ea79a3b-7102-11eb-9ab8-0242ac110010]
I0217 09:26:46.457778       1 omap.go:84] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 got omap values: (pool="test-pool", namespace="", name="csi.snap.3ea79a3b-7102-11eb-9ab8-0242ac110010"): map[csi.imageid:251fa313e3fb csi.imagename:csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010 csi.snapname:snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 csi.source:csi-vol-d22f078e-7101-11eb-9ab8-0242ac110010 csi.volume.owner:default]
I0217 09:26:46.568541       1 rbd_journal.go:229] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 found existing image (0001-000b-myClusterID-0000000000000006-3ea79a3b-7102-11eb-9ab8-0242ac110010) with name (csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010) for request (snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110)
I0217 09:26:46.602113       1 rbd_util.go:492] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 clone depth is (0), configured softlimit (4) and hardlimit (8) for test-pool/csi-snap-3ea79a3b-7102-11eb-9ab8-0242ac110010
I0217 09:26:46.602735       1 utils.go:138] ID: 147 Req-ID: snapshot-0c8f040b-df1c-4c50-8cfa-2df34089c110 GRPC response: {"snapshot":{"creation_time":{"nanos":224265991,"seconds":1613554002},"ready_to_use":true,"size_bytes":1073741824,"snapshot_id":"0001-000b-myClusterID-0000000000000006-3ea79a3b-7102-11eb-9ab8-0242ac110010","source_volume_id":"0001-000b-myClusterID-0000000000000002-d22f078e-7101-11eb-9ab8-0242ac110010"}}

Environment details

  • Image/version of Ceph CSI driver : canary
  • Kernel version :
  • Kubernetes cluster version : v1.19.2
  • Ceph cluster version : v14.2.10

Logs

csi-provisioner
csi-rbdplugin

@humblec
Collaborator

humblec commented Feb 19, 2021

@Yuggupta27 I am yet to look at the logs in detail, but in summary it looks like a bug in the external-snapshotter sidecar. Isn't it (or did I miss anything here)? If yes, can you please open an issue at https://github.com/kubernetes-csi/external-snapshotter/issues/? Please let me know if you need help.

@Madhu-1
Collaborator

Madhu-1 commented Feb 19, 2021

@humblec we got confirmation in the Kubernetes Slack channel: this is not a bug in the external-snapshotter, it is a bug in cephcsi, as cephcsi should not return a new snapshotID when it gets a second request with the same name.
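To illustrate the requirement described here, a minimal, hypothetical sketch (not ceph-csi's actual journal code): the driver keeps a request-name to snapshot-ID mapping, so a retried CreateSnapshot with the same name always resolves to the ID that was reserved first.

```go
package sketch

import (
	"sync"

	"github.com/google/uuid"
)

// snapJournal is a stand-in for ceph-csi's omap-backed journal: it maps a CSI
// request name to the snapshot ID that was reserved for it.
type snapJournal struct {
	mu    sync.Mutex
	byReq map[string]string
}

// reserve returns the previously reserved ID for reqName if one exists,
// otherwise it reserves and records a new one. Same request name, same ID.
func (j *snapJournal) reserve(reqName string) string {
	j.mu.Lock()
	defer j.mu.Unlock()
	if id, ok := j.byReq[reqName]; ok {
		return id // a retry must not mint a new snapshot ID
	}
	id := uuid.New().String()
	j.byReq[reqName] = id
	return id
}
```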

@Yuggupta27
Contributor Author

Yuggupta27 commented Feb 19, 2021

@Yuggupta27 I am yet to look at the logs in detail, but in summary it looks like a bug in the external-snapshotter sidecar. Isn't it (or did I miss anything here)? If yes, can you please open an issue at https://github.com/kubernetes-csi/external-snapshotter/issues/? Please let me know if you need help.

@Madhu-1 and I had a discussion with the Kubernetes team in their Slack, where they suggested that if a CSI driver receives the same snapshot request multiple times, it should create only one snapshot ID, not multiple, in order to handle idempotency.

@Madhu-1
Collaborator

Madhu-1 commented Feb 19, 2021

This is applicable to both CephFS and RBD snapshot operations.

cc @ShyamsundarR

@Madhu-1 Madhu-1 added the Priority-0 highest priority issue label Feb 19, 2021
@ShyamsundarR
Contributor

This is applicable to both CephFS and RBD snapshot operations.

cc @ShyamsundarR

Ack! Just as the same volume name generates the same volume ID/handle, the same snapshot name should generate the same snapshot ID. This is applicable to both providers, as stated above. Will watch out for the PR; curious to know what we missed, but will look at the PR for the same.

@Madhu-1
Collaborator

Madhu-1 commented Feb 19, 2021

@Yuggupta27 if you have logs, please attach the complete logs for further analysis.

Looking for a volunteer to take this up on priority.

@Madhu-1 Madhu-1 added component/cephfs Issues related to CephFS component/rbd Issues related to RBD bug Something isn't working labels Feb 19, 2021
@Madhu-1 Madhu-1 added this to the release-3.3.0 milestone Feb 19, 2021
@Yuggupta27
Contributor Author

@Yuggupta27 if you have logs, please attach the complete logs for further analysis.

Looking for a volunteer to take this up on priority.

Sure Madhu, I have attached the csi-provisioner and rbd-plugin logs in the issue. Let me know if I can provide any other logs for the same.

@agarwal-mudit
Contributor

I might be wrong, but I think this problem should only affect RBD and not CephFS:

RBD:
We just pass the snapshot name, and the go-ceph API does the same; it doesn't care about the ID, so the ID can change during the second request?

```go
type rbdSnapshot struct {
	// SourceVolumeID is the volume ID of RbdImageName, that is exchanged with CSI drivers
	// RbdImageName is the name of the RBD image, that is this rbdSnapshot's source image
	// RbdSnapName is the name of the RBD snapshot backing this rbdSnapshot
	// SnapID is the snapshot ID that is exchanged with CSI drivers, identifying this rbdSnapshot
	// RequestName is the CSI generated snapshot name for the rbdSnapshot
	// JournalPool is the ceph pool in which the CSI snapshot Journal is stored
	// Pool is where the image snapshot journal and snapshot is stored, and could be the same as `JournalPool`
	// ImageID contains the image id of cloned image
	SourceVolumeID string
	RbdImageName   string
	ReservedID     string
	NamePrefix     string
	RbdSnapName    string
	SnapID         string
	// ...
}
```

```go
func (rv *rbdVolume) createSnapshot(ctx context.Context, pOpts *rbdSnapshot) error {
	// ...
	_, err = image.CreateSnapshot(pOpts.RbdSnapName)
	return err
}
```

CephFS:

It takes the snap ID as an argument and passes it straight through to go-ceph:

```go
func (vo *volumeOptions) createSnapshot(ctx context.Context, snapID, volID volumeID) error {
	fsa, err := vo.conn.GetFSAdmin()
	if err != nil {
		util.ErrorLog(ctx, "could not get FSAdmin: %s", err)
		return err
	}
	// ...
}
```
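For context, a small go-ceph sketch of what the RBD side looks like (pool, image, and snapshot names are illustrative; this is not the ceph-csi code path): Ceph identifies a snapshot purely by image name plus snapshot name, so the CSI-side snapshot ID exists only in the driver's journal.

```go
package sketch

import (
	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// createRBDSnapshotByName creates an RBD snapshot identified only by its name.
// Creating the same snapshot name twice returns an "exists" error from Ceph,
// so the backend is naturally idempotent on the name, not on any CSI ID.
func createRBDSnapshotByName(conn *rados.Conn, pool, imageName, snapName string) error {
	ioctx, err := conn.OpenIOContext(pool)
	if err != nil {
		return err
	}
	defer ioctx.Destroy()

	img, err := rbd.OpenImage(ioctx, imageName, rbd.NoSnapshot)
	if err != nil {
		return err
	}
	defer img.Close()

	_, err = img.CreateSnapshot(snapName)
	return err
}
```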

@agarwal-mudit
Contributor

But if my assumption is correct, we should be seeing this issue more frequently.

@agarwal-mudit
Contributor

Please ignore my last comments; this shouldn't be the reason for the issue.

@Rakshith-R
Contributor

Rakshith-R commented Jul 16, 2021

Summarising findings & discussions.
(see https://kubernetes.slack.com/archives/C8EJ01Z46/p1626353172285500?thread_ts=1626353172.285500&cid=C8EJ01Z46)

From the external-snapshotter / k8s perspective:

  • On a snapshot request, the application using the parent PVC is frozen, then a CreateSnapshot call is sent to the CSI driver.
  • The CSI driver returns a snapshot-id & readyToUse: false after cutting the snapshot from the parent PVC, so the application using the parent PVC is thawed.
  • The external-snapshotter then polls continuously until the readyToUse: true state is reached.
  • The snapshot-id returned is expected to be idempotent in subsequent calls; it is not updated, or even checked, if the CSI driver returns a new id in subsequent requests (see the response sketch below).
  • It is expected that the readyToUse: true state is eventually reached and cannot be prevented by the CSI driver (all errors are just bubbled up to the snapshotcontent status).

If something goes wrong after the initial snapshot-id & readyToUse: false response and the CSI driver returns errors, it is the responsibility of the application/user to issue a delete snapshot request.
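A minimal sketch of the response shape these points describe, using the CSI spec Go bindings (the field values are placeholders): the driver returns the snapshot ID with ready_to_use set to false, and the external-snapshotter keeps calling CreateSnapshot with the same name, expecting the same ID back, until ready_to_use becomes true.

```go
package sketch

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// pendingSnapshotResponse is what the driver hands back while the snapshot is
// still being processed; the external-snapshotter polls with the same request
// name and never expects the SnapshotId to change between calls.
func pendingSnapshotResponse(snapID, sourceVolID string, sizeBytes int64) *csi.CreateSnapshotResponse {
	return &csi.CreateSnapshotResponse{
		Snapshot: &csi.Snapshot{
			SnapshotId:     snapID,
			SourceVolumeId: sourceVolID,
			SizeBytes:      sizeBytes,
			ReadyToUse:     false, // still cutting/flattening; snapshotter keeps polling
		},
	}
}
```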

Current RBD CSI driver behavior:

  • When the RBD CSI driver hits the rbd image clone depth hardlimit, or a forced flatten (the snapshot-to-a-different-pool case), it adds a flatten task and returns snapshot-id & readyToUse: false.
  • When an error occurs (indeed a very low chance) and the rbd tmp-snap/tmp-clone/snap is not recoverable, it calls undoSnapshotClone().

Note: Both of the above behaviors are very rare and their occurrences can be debated, but they have been spotted in the wild (here, search for snapshot-7f08e8b1-3691-49a8-a283-d0d0b9a55285).

Things that happen currently:

  • Clean up everything relating to the old snapshot-id: rbd tmp-snap/tmp-clone/snap and the omap entries.
  • Generate a new snapshot-id and create a new tmp-snap/tmp-clone/snap, then return the new snapshot-id to the external-snapshotter (this has no effect 😞, it still retains the old snapshot-id).
  • This new tmp-snap is taken on a PVC that may be in active use by an application!
  • Eventually a subsequent request succeeds, but the snapshotcontent still retains the original snapshot-id,
  • and PVC restore fails with
    E0715 11:13:56.841643 1 controllerserver.go:682] ID: 707 Req-ID: pvc-4e5cbe5f-1623-402e-982a-26a08b6da064 failed to get backend snapshot for 0001-0024-cd8b81d5-27be-44a3-8230-d3e1a2369def-0000000000000002-60e3ec6e-e55d-11eb-9cdd-5ef793f1116d: key not found: no snap source in omap for "csi.snap.60e3ec6e-e55d-11eb-9cdd-5ef793f1116d"

Possible solutions and consequences:

Soln 1:

Return an error after the initial failure, forever, until we get a delete snapshot request to clean up the omap (because the external-snapshotter just keeps on retrying).
Disadv: just a bad user experience.

Soln 2:

Return the snapshot-id & readyToUse: true only after everything is done (including flattening). Takes care of all leaks.
Disadv: the application is frozen for a prolonged time until flattening completes (the frequency of occurrence should be very low).

Soln 3:

Mark it as a known issue and suggest manual cleanup of stale rbd images and omap entries.

My preference would be to adopt Soln 2, given that flattening is rarely required and we avoid leaving any stale resources with this solution. A minimal sketch of that flow follows below.

Please read the Slack discussion in the first link regarding the behavior of the external-snapshotter and the expected behavior of the CSI driver, which led me to the above solutions.
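Here is a minimal sketch of the Soln 2 flow under stated assumptions: lookupReservedSnapID, reserveAndCloneSnapshot, and flattenInProgress are hypothetical stand-ins for ceph-csi's journal and RBD task handling, stubbed only so the control flow compiles. The point is that the same request name always resolves to the same reserved ID, and the driver keeps returning a gRPC error while flattening is still running instead of discarding the reservation and minting a new ID.

```go
package sketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Hypothetical stand-ins for the journal lookup, the reserve+clone step, and
// the flatten check; stubbed so the sketch is self-contained.
func lookupReservedSnapID(ctx context.Context, reqName string) (string, bool, error) {
	return "", false, nil
}
func reserveAndCloneSnapshot(ctx context.Context, reqName string) (string, error) {
	return "reserved-snap-id", nil
}
func flattenInProgress(ctx context.Context, snapID string) bool { return false }

// createSnapshot: the same request name always maps to the same reserved
// snapshot ID, and while flattening is in progress the driver returns an
// error (the external-snapshotter simply retries) rather than cleaning up
// and generating a new ID.
func createSnapshot(ctx context.Context, reqName string) (*csi.CreateSnapshotResponse, error) {
	snapID, found, err := lookupReservedSnapID(ctx, reqName)
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	if !found {
		if snapID, err = reserveAndCloneSnapshot(ctx, reqName); err != nil {
			return nil, status.Error(codes.Internal, err.Error())
		}
	}
	if flattenInProgress(ctx, snapID) {
		// Keep the reservation; just tell the caller to come back later.
		return nil, status.Errorf(codes.Aborted, "flatten in progress for snapshot %s", snapID)
	}
	return &csi.CreateSnapshotResponse{
		Snapshot: &csi.Snapshot{
			SnapshotId: snapID,
			ReadyToUse: true, // only once all backend work, including flattening, is done
		},
	}, nil
}
```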

@humblec
Collaborator

humblec commented Jul 19, 2021

Soln 2:

Return the snapshot-id & readyToUse: true only after everything is done (including flattening). Takes care of all leaks.
Disadv: the application is frozen for a prolonged time until flattening completes (the frequency of occurrence should be very low).

True. IMO, the right solution is for us (the CSI driver) to return the proper snapshot ID/handle only once we have it properly recorded in the storage system. Handing out a new handle after having already returned another one to the controller is not the correct approach. In short, we should return the handle only once we have it and have verified that we are good.
@Rakshith-R

@nixpanic
Member

I think that Soln 2 is the most appropriate.

There might be a twist/enhancement possible, in case flattening is needed. Can the snapshot-id be returned, with readyToUse: false, so that subsequent requests can validate if flattening is completed?

@Rakshith-R
Contributor

Rakshith-R commented Jul 20, 2021

Can the snapshot-id be returned, with readyToUse: false, so that subsequent requests can validate if flattening is completed?

@nixpanic, the problem is when there is a failure after such a response (snapshot-id returned with readyToUse: false).

Currently we just clean up, create a new snapshot-id (which is never updated by the external-snapshotter), and take another snapshot on a PVC that may have a running application actively using it.

So if we return the snapshot-id with readyToUse: false and there is a failure, we would have to fall back to Soln 1:

Return an error after the initial failure, forever, until we get a delete snapshot request to clean up the omap (because the external-snapshotter just keeps on retrying).
Disadv: just a bad user experience.

If there is no failure, we just set readyToUse: true after flattening is done (current behaviour).

@humblec
Collaborator

humblec commented Jul 28, 2021

As discussed, we have to continue the analysis and get this to completion in an upcoming release. Moving this out of the 3.4.0 release.

@humblec humblec removed this from the release-3.4.0 milestone Jul 28, 2021
Rakshith-R added a commit to Rakshith-R/ceph-csi that referenced this issue Jul 30, 2021
This commit fixes snapshot id idempotency issue by
always returning an error when flattening is in progress
and not using `readyToUse:false` response.

Updates: ceph#1883

Signed-off-by: Rakshith R <rar@redhat.com>
@mergify mergify bot closed this as completed in #2260 Aug 9, 2021