
Update spec doc for GroupController service #545

Open

wants to merge 1 commit into master from patch-001
Conversation

@carlory carlory commented Jun 14, 2023

What type of PR is this?

/kind document

Special notes for your reviewer:

Does this PR introduce an API-breaking change?:

none

@carlory carlory force-pushed the patch-001 branch 2 times, most recently from 152274d to a645f38 on June 15, 2023 17:08
@carlory carlory changed the title from "[WIP] Update spec doc for GroupController service" to "Update spec doc for GroupController service" on Jun 15, 2023
@carlory carlory force-pushed the patch-001 branch 2 times, most recently from d4c8f16 to 0d0b96a on June 15, 2023 17:12
@carlory (Author) commented Jun 15, 2023

/cc @xing-yang @jdef @saad-ali @bswartz

@carlory carlory force-pushed the patch-001 branch 2 times, most recently from a582f56 to a1f4eb1 on June 20, 2023 09:12
@carlory (Author) commented Jun 20, 2023

A new change is that an RPC Interactions section has been added.

cc @jdef @xing-yang

@carlory (Author) commented Jun 26, 2023

/cc @xing-yang

@carlory (Author) commented Jul 26, 2023

@xing-yang Updated. Please review again.

spec.md Outdated

A `CreateVolumeGroupSnapshot` operation SHOULD return with a `group_snapshot_id` when the group snapshot is cut successfully. If a `CreateVolumeGroupSnapshot` operation times out before the group snapshot is cut, leaving the CO without an ID with which to reference a group snapshot, and the CO also decides that it no longer needs/wants the group snapshot in question then the CO MAY choose one of the following paths:

1. Retry the `CreateVolumeGroupSnapshot` RPC to possibly obtain a group snapshot ID that may be used to execute a `DeleteVolumeGroupSnapshot` RPC; upon success execute `DeleteVolumeGroupSnapshot`. If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked.

Contributor

It's unclear what a "server-side gRPC error" is. Please be specific about which error indicates that no cleanup is necessary.

Contributor

Also rephrase the last part for grammar.

Author

If the implementation of the SP's group snapshot function is flawed, for example if two data volumes are snapshotted at the same time and one succeeds while the other always fails, then no matter how many times the CO retries the call, it will not get the expected result. This may lead to a deadlock in the CO. But in the spec, I can't find a suitable status code to tell the CO not to retry.

Author

If we can assume that the SP implementation can handle the above situation, then the following description is not needed.

If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked.

Contributor

If the implementation of the SP's group snapshot function is flawed, for example if two data volumes are snapshotted at the same time and one succeeds while the other always fails, then no matter how many times the CO retries the call, it will not get the expected result. This may lead to a deadlock in the CO. But in the spec, I can't find a suitable status code to tell the CO not to retry.

This is correct, and because there's nothing the CO can do to address cases like this, we have to put the burden on the SP to detect and fix this situation. I would propose that if the SP fails to snapshot any member of the group and there is no hope that a retry will succeed, then the SP must clean up any other snapshots related to that group snapshot and return a terminal gRPC error.

The only other solution I can see is to return a "successful" snapshot but mark it as broken or partial with some explicit boolean value in the response message. This would allow the CO to stop retrying and then decide whether to keep the broken group snapshot or clean it up the normal way.
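
To make the behavior proposed here concrete, below is a minimal Go sketch of an SP-side rollback. The helper names (`cutSnapshot`, `deleteSnapshot`) and types are invented for this example, and the exact gRPC code an SP should return is still the open question in this thread, so `codes.Internal` is only a placeholder, not spec guidance:

```go
package sp

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// memberSnapshot is a stand-in for an SP-internal record of one
// member snapshot that has already been cut for the group.
type memberSnapshot struct {
	ID       string
	VolumeID string
}

// createGroupSnapshot attempts to snapshot every volume in the group.
// If any member fails and a retry cannot succeed, it deletes the members
// that were already cut and returns a gRPC error so the CO knows nothing
// was leaked.
func createGroupSnapshot(ctx context.Context, volumeIDs []string) ([]memberSnapshot, error) {
	var done []memberSnapshot
	for _, vol := range volumeIDs {
		snap, err := cutSnapshot(ctx, vol) // SP-internal backend call, illustrative
		if err != nil {
			// Best-effort rollback of the members that already succeeded.
			for _, s := range done {
				if delErr := deleteSnapshot(ctx, s.ID); delErr != nil {
					// Surface the rollback failure; the SP remains responsible
					// for eventually removing the leftover snapshot.
					return nil, status.Errorf(codes.Internal,
						"snapshot of %q failed (%v) and cleanup of %q also failed: %v",
						vol, err, s.ID, delErr)
				}
			}
			return nil, status.Errorf(codes.Internal,
				"snapshot of volume %q failed, group rolled back: %v", vol, err)
		}
		done = append(done, snap)
	}
	return done, nil
}

// cutSnapshot and deleteSnapshot stand in for the storage backend calls.
func cutSnapshot(ctx context.Context, volumeID string) (memberSnapshot, error) {
	return memberSnapshot{ID: fmt.Sprintf("snap-%s", volumeID), VolumeID: volumeID}, nil
}

func deleteSnapshot(ctx context.Context, snapshotID string) error { return nil }
```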

Contributor

How do we mark a snapshot as "broken" or "partial"? Currently we don't have a way to do that and communicate back to CO.

@carlory (Author) Oct 13, 2023

@bswartz @xing-yang

I dropped the sentence "If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked." and opened issue #554 to track it.

spec.md Outdated

1. Retry the `CreateVolumeGroupSnapshot` RPC to possibly obtain a group snapshot ID that may be used to execute a `DeleteVolumeGroupSnapshot` RPC; upon success execute `DeleteVolumeGroupSnapshot`. If the `CreateVolumeGroupSnapshot` RPC returns a server-side gRPC error, it means that SP do clean up and make sure no snapshots are leaked.

2. The CO takes no further action regarding the timed out RPC, a group snapshot is possibly leaked and the operator/user is expected to clean up. But this way isn't considered as a good practice.

Contributor

This is bad guidance, and we can't recommend this in the spec. We need to be clear about what the CO must do and what is optional, and for optional things, how the SP can ensure correct behavior under either option.

@carlory (Author) Oct 11, 2023

There is a similar expression in the spec.

The following content is excerpted from the RPC Interactions section of the Controller service:

2. The CO takes no further action regarding the timed out RPC, a snapshot is possibly leaked and the operator/user is expected to clean up.

Contributor

Thanks for pointing this out. We should probably clean up the language for volume snapshots too. It should be fine to duplicate the existing language from volume snapshots to group snapshots. I just want to make sure the language we're copying is up to date and accurate.

Contributor

Agreed that this sentence regarding volume snapshots should be updated. The SP should clean up the snapshot in this case.


Co-authored-by: jdef <2348332+jdef@users.noreply.github.com>
@bswartz (Contributor) commented Oct 13, 2023

How do we mark a snapshot as "broken" or "partial"? Currently we don't have a way to do that and communicate back to CO.

I don't necessarily recommend this, but we could add an additional boolean field in the response message to indicate it.

It's a unique problem for group snapshots because if you have multiple volumes and most of the snapshots were okay, but one cannot succeed, there are good arguments both to preserve the partial group snapshot and arguments to delete it and try over again. My personal inclination would be to delete it, in which case the SP could do that work with no RPC changes. But if we return success with a "partial" boolean set, then the CO could also delete it, while also having the option to not delete it.
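
For illustration only, here is a rough Go sketch of what the CO side of that idea could look like; the `Partial` field and the response struct are hypothetical, invented for this example, and are not part of the CSI spec or its generated API:

```go
package co

import "context"

// groupSnapshotResponse is a hypothetical response shape; the Partial
// field sketched here is NOT part of the CSI spec, it only illustrates
// the boolean being discussed above.
type groupSnapshotResponse struct {
	GroupSnapshotID string
	Partial         bool
}

// handleCreateResponse shows the two options the CO would have if the
// SP reported a partial/broken group snapshot: keep it, or delete it
// through the normal DeleteVolumeGroupSnapshot path.
func handleCreateResponse(ctx context.Context, resp groupSnapshotResponse, keepPartial bool,
	deleteGroupSnapshot func(ctx context.Context, id string) error) error {
	if !resp.Partial {
		return nil // complete group snapshot, nothing special to do
	}
	if keepPartial {
		// Operator chose to keep the broken/partial group snapshot.
		return nil
	}
	// Otherwise clean it up the normal way; the CO can stop retrying
	// CreateVolumeGroupSnapshot because it now has an ID to delete.
	return deleteGroupSnapshot(ctx, resp.GroupSnapshotID)
}
```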

@saad-ali (Member)
Confirmed @carlory signed CSI CLA

@saad-ali (Member)
How do we mark a snapshot as "broken" or "partial"? Currently we don't have a way to do that and communicate back to CO.

I don't necessarily recommend this, but we could add an additional boolean field in the response message to indicate it.

It's a unique problem for group snapshots because if you have multiple volumes and most of the snapshots were okay, but one cannot succeed, there are good arguments both to preserve the partial group snapshot and arguments to delete it and try over again. My personal inclination would be to delete it, in which case the SP could do that work with no RPC changes. But if we return success with a "partial" boolean set, then the CO could also delete it, while also having the option to not delete it.

Per @xing-yang: the operation should fail and the SP should clean up if even one snapshot in the group fails, because they have to be consistent.
Per @bswartz: if the SP gets 9 out of 10, it is responsible for error handling.

@xing-yang will take another look at this PR.


1. Retry the `CreateVolumeGroupSnapshot` RPC to possibly obtain a group snapshot ID that may be used to execute a `DeleteVolumeGroupSnapshot` RPC; upon success execute `DeleteVolumeGroupSnapshot`.

2. The CO takes no further action regarding the timed out RPC, a group snapshot is possibly leaked and the operator/user is expected to clean up.

Member

Per @bswartz and @xing-yang: it is not possible to tell k8s about a partial completion. If completion is partial, it is the SP's responsibility to clean up. It's all or nothing; if it's not all, the SP must clean up.
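
As a rough illustration of recovery path 1 quoted above (retry `CreateVolumeGroupSnapshot` until an ID is obtained, then issue `DeleteVolumeGroupSnapshot`), here is a minimal Go sketch; the `create`, `deleteGroupSnapshot`, and `isRetriable` callbacks stand in for the CO's gRPC calls and retry policy and are assumptions of this example, not the generated CSI client API:

```go
package co

import (
	"context"
	"time"
)

// cleanUpAfterTimeout implements recovery path 1: keep retrying
// CreateVolumeGroupSnapshot until the SP returns a group snapshot ID,
// then issue DeleteVolumeGroupSnapshot so nothing is leaked.
func cleanUpAfterTimeout(ctx context.Context,
	create func(ctx context.Context) (groupSnapshotID string, err error),
	deleteGroupSnapshot func(ctx context.Context, groupSnapshotID string) error,
	isRetriable func(error) bool) error {

	backoff := time.Second
	for {
		id, err := create(ctx)
		if err == nil {
			// The group snapshot is now addressable; delete it since the
			// CO decided it no longer wants it.
			return deleteGroupSnapshot(ctx, id)
		}
		if !isRetriable(err) {
			// Per the resolution above, a terminal error means the SP is
			// expected to have cleaned up, so the CO can stop retrying.
			return nil
		}
		select {
		case <-ctx.Done():
			// Give up; the operator/user may need to clean up manually.
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 32*time.Second {
			backoff *= 2 // exponential backoff between retries
		}
	}
}
```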
