
Update spec doc for GroupController service #545

Open

wants to merge 1 commit into master from patch-001
Conversation

@carlory carlory commented Jun 14, 2023

What type of PR is this?

/kind document

Special notes for your reviewer:

Does this PR introduce an API-breaking change?:

none

@carlory carlory force-pushed the patch-001 branch 2 times, most recently from 152274d to a645f38 on June 15, 2023 17:08
@carlory carlory changed the title from "[WIP] Update spec doc for GroupController service" to "Update spec doc for GroupController service" on Jun 15, 2023
@carlory carlory force-pushed the patch-001 branch 2 times, most recently from d4c8f16 to 0d0b96a on June 15, 2023 17:12
@carlory (Author) commented Jun 15, 2023

/cc @xing-yang @jdef @saad-ali @bswartz

@carlory carlory force-pushed the patch-001 branch 2 times, most recently from a582f56 to a1f4eb1 on June 20, 2023 09:12
@carlory (Author) commented Jun 20, 2023

A new change is that an RPC Interactions section has been added.

cc @jdef @xing-yang

@carlory (Author) commented Jun 26, 2023

/cc @xing-yang

@carlory (Author) commented Jul 26, 2023

@xing-yang Updated. Please review again.

spec.md Outdated

A `CreateVolumeGroupSnapshot` operation SHOULD return with a `group_snapshot_id` when the group snapshot is cut successfully. If a `CreateVolumeGroupSnapshot` operation times out before the group snapshot is cut, leaving the CO without an ID with which to reference a group snapshot, and the CO also decides that it no longer needs/wants the group snapshot in question then the CO MAY choose one of the following paths:

1. Retry the `CreateVolumeGroupSnapshot` RPC to possibly obtain a group snapshot ID that may be used to execute a `DeleteVolumeGroupSnapshot` RPC; upon success execute `DeleteVolumeGroupSnapshot`. If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked.

Contributor

It's unclear what a "server-side gRPC error" is. Please be specific about which error indicates that no cleanup is necessary.

Contributor

Also rephrase the last part for grammar.

Author

If the implementation of the SP's group snapshot function is flawed, for example if two data volumes are snapshotted at the same time and one succeeds while the other always fails, then no matter how many times the CO retries the call, it will not get the expected result. This may lead to a deadlock in the CO. But in the spec, I can't find a suitable status code to tell the CO not to retry.

Author

If we can assume that the SP implementation can handle the above situation, then the following description is not needed.

If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked.

Contributor

If the implementation of the SP's group snapshot function is flawed, for example if two data volumes are snapshotted at the same time and one succeeds while the other always fails, then no matter how many times the CO retries the call, it will not get the expected result. This may lead to a deadlock in the CO. But in the spec, I can't find a suitable status code to tell the CO not to retry.

This is correct, and because there's nothing the CO can do to address cases like this, we have to put the burden on the SP to detect and fix this situation. I would propose that if the SP fails to snapshot any member of the group and there is no hope that a retry will succeed, then the SP must clean up any other snapshots related to that group snapshot and return a terminal gRPC error.

The only other solution I can see is to return a "successful" snapshot but mark it as broken or partial with some explicit boolean value in the response message. This would allow the CO to stop retrying and then decide whether to keep the broken group snapshot or clean it up the normal way.
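
To make the behavior proposed here concrete, below is a minimal Go sketch of an SP-side rollback. The helper names (`cutSnapshot`, `deleteSnapshot`) and types are invented for this example, and the exact gRPC code an SP should return is still the open question in this thread, so `codes.Internal` is only a placeholder, not spec guidance:

```go
package sp

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// memberSnapshot is a stand-in for an SP-internal record of one
// member snapshot that has already been cut for the group.
type memberSnapshot struct {
	ID       string
	VolumeID string
}

// createGroupSnapshot attempts to snapshot every volume in the group.
// If any member fails and a retry cannot succeed, it deletes the members
// that were already cut and returns a gRPC error so the CO knows nothing
// was leaked.
func createGroupSnapshot(ctx context.Context, volumeIDs []string) ([]memberSnapshot, error) {
	var done []memberSnapshot
	for _, vol := range volumeIDs {
		snap, err := cutSnapshot(ctx, vol) // SP-internal backend call, illustrative
		if err != nil {
			// Best-effort rollback of the members that already succeeded.
			for _, s := range done {
				if delErr := deleteSnapshot(ctx, s.ID); delErr != nil {
					// Surface the rollback failure; the SP remains responsible
					// for eventually removing the leftover snapshot.
					return nil, status.Errorf(codes.Internal,
						"snapshot of %q failed (%v) and cleanup of %q also failed: %v",
						vol, err, s.ID, delErr)
				}
			}
			return nil, status.Errorf(codes.Internal,
				"snapshot of volume %q failed, group rolled back: %v", vol, err)
		}
		done = append(done, snap)
	}
	return done, nil
}

// cutSnapshot and deleteSnapshot stand in for the storage backend calls.
func cutSnapshot(ctx context.Context, volumeID string) (memberSnapshot, error) {
	return memberSnapshot{ID: fmt.Sprintf("snap-%s", volumeID), VolumeID: volumeID}, nil
}

func deleteSnapshot(ctx context.Context, snapshotID string) error { return nil }
```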

Contributor

How do we mark a snapshot as "broken" or "partial"? Currently we don't have a way to do that and communicate back to CO.

@carlory (Author) Oct 13, 2023

@bswartz @xing-yang

I dropped the sentence "If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked." and opened issue #554 to track it.

spec.md Outdated

1. Retry the `CreateVolumeGroupSnapshot` RPC to possibly obtain a group snapshot ID that may be used to execute a `DeleteVolumeGroupSnapshot` RPC; upon success execute `DeleteVolumeGroupSnapshot`. If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked.

2. The CO takes no further action regarding the timed out RPC, a group snapshot is possibly leaked and the operator/user is expected to clean up. But this way isn't considered as a good practice.

Contributor

This is bad guidance, and we can't recommend this in the spec. We need to be clear about what the CO must do and what is optional, and for optional things, how the SP can ensure correct behavior under either option.

@carlory (Author) Oct 11, 2023

There is a similar expression in the spec.

The following content is excerpted from the RPC Interactions section of the Controller service:

2. The CO takes no further action regarding the timed out RPC, a snapshot is possibly leaked and the operator/user is expected to clean up.

Contributor

Thanks for pointing this out. We should probably clean up the language for volume snapshots too. It should be fine to duplicate the existing language from volume snapshots to group snapshots. I just want to make sure the language we're copying is up to date and accurate.

Contributor

Agreed that this sentence regarding volume snapshots should be updated. The SP should clean up the snapshot in this case.


Co-authored-by: jdef <2348332+jdef@users.noreply.github.com>
@bswartz (Contributor) commented Oct 13, 2023

How do we mark a snapshot as "broken" or "partial"? Currently we don't have a way to do that and communicate back to CO.

I don't necessarily recommend this, but we could add an additional boolean field in the response message to indicate it.

It's a unique problem for group snapshots because if you have multiple volumes and most of the snapshots were okay, but one cannot succeed, there are good arguments both to preserve the partial group snapshot and arguments to delete it and try over again. My personal inclination would be to delete it, in which case the SP could do that work with no RPC changes. But if we return success with a "partial" boolean set, then the CO could also delete it, while also having the option to not delete it.
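
For illustration only, here is a rough Go sketch of what the CO side of that idea could look like; the `Partial` field and the response struct are hypothetical, invented for this example, and are not part of the CSI spec or its generated API:

```go
package co

import "context"

// groupSnapshotResponse is a hypothetical response shape; the Partial
// field sketched here is NOT part of the CSI spec, it only illustrates
// the boolean being discussed above.
type groupSnapshotResponse struct {
	GroupSnapshotID string
	Partial         bool
}

// handleCreateResponse shows the two options the CO would have if the
// SP reported a partial/broken group snapshot: keep it, or delete it
// through the normal DeleteVolumeGroupSnapshot path.
func handleCreateResponse(ctx context.Context, resp groupSnapshotResponse, keepPartial bool,
	deleteGroupSnapshot func(ctx context.Context, id string) error) error {
	if !resp.Partial {
		return nil // complete group snapshot, nothing special to do
	}
	if keepPartial {
		// Operator chose to keep the broken/partial group snapshot.
		return nil
	}
	// Otherwise clean it up the normal way; the CO can stop retrying
	// CreateVolumeGroupSnapshot because it now has an ID to delete.
	return deleteGroupSnapshot(ctx, resp.GroupSnapshotID)
}
```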

@saad-ali (Member)
Confirmed @carlory signed CSI CLA

@saad-ali (Member)
How do we mark a snapshot as "broken" or "partial"? Currently we don't have a way to do that and communicate back to CO.

I don't necessarily recommend this, but we could add an additional boolean field in the response message to indicate it.

It's a unique problem for group snapshots because if you have multiple volumes and most of the snapshots were okay, but one cannot succeed, there are good arguments both to preserve the partial group snapshot and arguments to delete it and try over again. My personal inclination would be to delete it, in which case the SP could do that work with no RPC changes. But if we return success with a "partial" boolean set, then the CO could also delete it, while also having the option to not delete it.

Per @xing-yang: the operation should fail and the SP should clean up if even one snapshot in the group fails, because they have to be consistent.
Per @bswartz: if the SP gets 9 out of 10, it is responsible for error handling.

@xing-yang will take another look at this PR.


1. Retry the `CreateVolumeGroupSnapshot` RPC to possibly obtain a group snapshot ID that may be used to execute a `DeleteVolumeGroupSnapshot` RPC; upon success execute `DeleteVolumeGroupSnapshot`.

2. The CO takes no further action regarding the timed out RPC, a group snapshot is possibly leaked and the operator/user is expected to clean up.

Member

Per @bswartz and @xing-yang: it is not possible to tell k8s about a partial completion. If completion is partial, it is the SP's responsibility to clean up. It's all or nothing; if it's not all, the SP must clean up.
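
As a rough illustration of recovery path 1 quoted above (retry `CreateVolumeGroupSnapshot` until an ID is obtained, then issue `DeleteVolumeGroupSnapshot`), here is a minimal Go sketch; the `create`, `deleteGroupSnapshot`, and `isRetriable` callbacks stand in for the CO's gRPC calls and retry policy and are assumptions of this example, not the generated CSI client API:

```go
package co

import (
	"context"
	"time"
)

// cleanUpAfterTimeout implements recovery path 1: keep retrying
// CreateVolumeGroupSnapshot until the SP returns a group snapshot ID,
// then issue DeleteVolumeGroupSnapshot so nothing is leaked.
func cleanUpAfterTimeout(ctx context.Context,
	create func(ctx context.Context) (groupSnapshotID string, err error),
	deleteGroupSnapshot func(ctx context.Context, groupSnapshotID string) error,
	isRetriable func(error) bool) error {

	backoff := time.Second
	for {
		id, err := create(ctx)
		if err == nil {
			// The group snapshot is now addressable; delete it since the
			// CO decided it no longer wants it.
			return deleteGroupSnapshot(ctx, id)
		}
		if !isRetriable(err) {
			// Per the resolution above, a terminal error means the SP is
			// expected to have cleaned up, so the CO can stop retrying.
			return nil
		}
		select {
		case <-ctx.Done():
			// Give up; the operator/user may need to clean up manually.
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 32*time.Second {
			backoff *= 2 // exponential backoff between retries
		}
	}
}
```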
