
Retry on transient error when waiting for snapshot to be ready #2508

Merged (5 commits into master, Dec 8, 2023)

Conversation

@hairyhum (Contributor) commented Dec 1, 2023

Change Overview

The CSI snapshot controller may add errors to the snapshot status that it will later recover from.

Make snapshotter.WaitOnReadyToUse retry (up to 100 times) on those errors. With the backoff mechanism, 100 retries amounts to several minutes of waiting, which should hopefully be enough for most cases.

Unfortunately the CSI snapshotter reports the error reason as a free-form string and does not provide an error code or type, so for now we use a regexp to match transient errors. If the CSI snapshotter adopts a better error format in the future, we can change this accordingly.
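
For illustration, a minimal sketch of the approach in Go (the regexp pattern, package layout, and helper names here are assumptions for illustration; only the use of poll.WaitWithRetries with a 100-retry bound comes from this PR):

package snapshot

import (
	"context"
	"regexp"

	"github.com/kanisterio/kanister/pkg/poll"
)

// Assumed example pattern: the controller reports optimistic-concurrency conflicts
// as free-form text such as "the object has been modified; please apply your
// changes to the latest version and try again". The pattern used by the PR may differ.
var transientErrorRegexp = regexp.MustCompile(`the object has been modified; please apply your changes`)

// isTransientError reports whether an error recorded in the snapshot status looks
// like a condition the snapshot controller is expected to recover from on its own.
func isTransientError(err error) bool {
	return transientErrorRegexp.MatchString(err.Error())
}

// waitOnReadyToUse polls until the snapshot is ReadyToUse, retrying up to 100 times
// while only transient errors are observed; any other error aborts the wait.
func waitOnReadyToUse(ctx context.Context, isReady func(context.Context) (bool, error)) error {
	const retries = 100
	return poll.WaitWithRetries(ctx, retries, isTransientError, isReady)
}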

Pull request type

Please check the type of change your PR introduces:

  • 🚧 Work in Progress
  • 🌈 Refactoring (no functional changes, no api changes)
  • 🐹 Trivial/Minor
  • 🐛 Bugfix
  • 🌻 Feature
  • 🗺️ Documentation
  • 🤖 Test

Test Plan

  • 💪 Manual
  • ⚡ Unit test
  • 💚 E2E

Follow up

We need to run some integration tests and monitor future issues in case we need to adjust the backoff retry count.
Monitor kubernetes-csi/external-snapshotter#748, as those improvements could reduce the chance of errors (though they probably won't eliminate them completely).
Monitor kubernetes-csi/external-snapshotter#970 for improvements to the error format.

@infraq added this to In Progress in Kanister on Dec 1, 2023
pkg/kube/snapshot/snapshot_alpha.go (outdated review thread)
pkg/kube/snapshot/snapshot_stable.go (review thread)
pkg/kube/snapshot/snapshot.go (outdated review thread)
return false, err
}
// Error can be set while waiting for creation
if vs.Status.Error != nil {
Contributor:

Just wanted to confirm that the events are in VolumeSnapshot resource right, and not in the VolumeSnapshotContent resource?

Contributor Author (@hairyhum):

Yes. I don't think we read the VolumeSnapshotContent status anywhere in our code, but as I understand it, the snapshot_controller propagates status errors from the VolumeSnapshotContent to the VolumeSnapshot.

@viveksinghggits (Contributor):

@hairyhum do we know that this error eventually always goes away and VolumeSnapshot gets into the ReadyToUse state?

@hairyhum (Contributor Author) commented Dec 4, 2023:

@hairyhum do we know that this error eventually always goes away and VolumeSnapshot gets into the ReadyToUse state?

For that particular error, yes, it is eventually recoverable; other errors may not be. But there is also a retry limit in case the snapshot controller fails to recover from that one.

isReadyFunc func(*unstructured.Unstructured) (bool, error),
) error {
retries := 100
return poll.WaitWithRetries(ctx, retries, isTransientError, func(context.Context) (bool, error) {
Contributor:

Question: Is this the right poll to use here? Why is 100 retries the correct value?

It appears to me that there is currently only one other file where Kanister code uses poll.WaitWithRetries.

It is more common to use poll.Wait, sometimes with a timeout set on the ctx, or to call poll.WaitWithBackoff with a maximum time set.

Contributor Author (@hairyhum):

We could do that. The difference here is that we only retry on this specific type of error, while poll.WaitWithBackoff would not distinguish it.
The 100 value is pretty arbitrary, chosen just to avoid an infinite wait. We could use poll.WaitWithBackoffWithRetries to configure a maximum wait time as well, but I don't know what a good value would be in this context.

The current context timeout is effectively infinite because this is a long-running process. Introducing a new time limit here could create a new failure scenario, whereas capping the retries would, in the worst case, result in behaviour similar to what we have right now.

Contributor:

As @pavannd1 indicated, we can check this in as is and iterate on the polling mechanism later -- it already improves robustness.

We can retry only on a specific error when using poll.Wait or poll.WaitWithBackoff by making the check for that error within the function called by poll. See for example how an apierrors.IsNotFound error is handled in RepoServiceHandler.createService.

See, for example:

err = poll.WaitWithBackoff(ctx, backoff.Backoff{

Yes, there are some subtle differences between terminating a poll with a timeout or by retry count-- in particular whether ctx expiration can affect a call made within the poll loop. But I don't think those differences matter when the call being made is just a Get.

Ultimately the retry count is going to be picked based on some knowledge of how long it is reasonable to wait for the snapshot controller to resolve the transient error. In my mind it is clearer to express that as a time value instead of trying to calculate the retry count from the desired time and the backoff parameters.

See the many instances of poll.Wait(timeoutCtx,... in pkg/app where an app-specific timeout is known and used. Or pkg/kube.WaitForPodReady.
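
As an illustration of that suggestion, here is a minimal sketch in Go of retrying only on a specific error by classifying it inside the function passed to poll.Wait. The function name, the getStatus helper, and the isTransientError matcher are hypothetical; only poll.Wait and the overall pattern follow the existing Kanister usage referenced above.

package snapshot

import (
	"context"
	"time"

	"github.com/kanisterio/kanister/pkg/poll"
)

// waitOnReadyToUseBounded bounds the wait with a context timeout instead of a
// retry count. getStatus is an assumed helper wrapping the Get call; it separates
// the API error from the error recorded in the snapshot status.
func waitOnReadyToUseBounded(
	ctx context.Context,
	timeout time.Duration,
	getStatus func(context.Context) (ready bool, statusErr error, err error),
	isTransientError func(error) bool,
) error {
	timeoutCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return poll.Wait(timeoutCtx, func(ctx context.Context) (bool, error) {
		ready, statusErr, err := getStatus(ctx)
		if err != nil {
			// The API call itself failed; stop polling.
			return false, err
		}
		switch {
		case statusErr == nil:
			return ready, nil
		case isTransientError(statusErr):
			// Transient status error: keep polling; the ctx timeout bounds the total wait.
			return false, nil
		default:
			// Non-transient status error: fail immediately.
			return false, statusErr
		}
	})
}

As noted in the next comment, a timeout placed on the whole poll would also bound the initial wait before any error appears, which is the main drawback discussed below.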

Contributor:

Upon further reflection, I've realized that the code as implemented has the following behavior:

  • It will wait indefinitely for the controller to update the status to completion or to some form of error.
  • Once it encounters a transient error, it will poll a bounded number of times (with the default backoff parameters).
  • If a non-transient error is encountered, it will return an error immediately.

Importantly, setting a timeout on the context at the beginning of all polling would also affect the first wait -- the wait for the controller to make the snapshot ready before it sets any error, transient or not.

While I still think it would be better to have an explicit time limit on retrying transients, and we probably could work out a way to build one into a retry function closure, at that point it is no longer simple and clear.

The best solution is probably to keep the retry count on transients and add a comment noting the expected wall-clock time that count corresponds to with the default backoff.
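
For context, here is a rough back-of-envelope in Go for what a 100-retry bound means in wall-clock time. The backoff parameters below (100ms minimum, 10s cap, factor 2) are assumptions for illustration, not necessarily the poll package's defaults:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed exponential backoff parameters -- not necessarily the actual defaults.
	minDelay, maxDelay, factor := 100*time.Millisecond, 10*time.Second, 2.0
	var total time.Duration
	delay := minDelay
	for i := 0; i < 100; i++ {
		total += delay
		delay = time.Duration(float64(delay) * factor)
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	// The delay hits the 10s cap after about 7 attempts, so 100 retries adds up to
	// roughly 15-16 minutes, consistent with "100 retries is minutes" in the PR description.
	fmt.Println(total) // prints 15m42.7s with these assumed parameters
}

With numbers like these, the comment suggested above could read, for example, "100 retries corresponds to roughly 15 minutes with the default backoff", adjusted to the actual defaults.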

@pavannd1 (Contributor) left a comment:

Minor improvements suggested. We can merge and iterate on the polling mechanism.

// Helpers to work with volume snapshot status used in TestWaitOnReadyToUse
// ----------------------------------------------------------------------------

func waitOnReadyToUseInBackground(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) chan error {
Contributor:

Suggested change
func waitOnReadyToUseInBackground(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) chan error {
func waitOnReadyToUseInBackground(
c *C,
ctx context.Context,
fakeSs snapshot.Snapshotter,
snapshotName,
namespace string,
timeout time.Duration,
) chan error {

[optional] formatting for readability

return reply
}

func waitOnReadyToUseWithTimeout(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) error {
Contributor:

Suggested change
func waitOnReadyToUseWithTimeout(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) error {
func waitOnReadyToUseWithTimeout(
c *C,
ctx context.Context,
fakeSs snapshot.Snapshotter,
snapshotName,
namespace string,
timeout time.Duration,
) error {

return err
}

func setReadyStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string) {
Contributor:

Suggested change
func setReadyStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string) {
func setReadyStatus(
c *C,
dynCli *dynfake.FakeDynamicClient,
volumeSnapshotGVR schema.GroupVersionResource,
snapshotName,
namespace string,
) {

setVolumeSnapshotStatus(c, dynCli, volumeSnapshotGVR, snapshotName, namespace, status)
}

func setErrorStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, message string) {
Contributor:

Suggested change
func setErrorStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, message string) {
func setErrorStatus(
c *C,
dynCli *dynfake.FakeDynamicClient,
volumeSnapshotGVR schema.GroupVersionResource,
snapshotName,
namespace,
message string,
) {

setVolumeSnapshotStatus(c, dynCli, volumeSnapshotGVR, snapshotName, namespace, status)
}

func setVolumeSnapshotStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, status map[string]interface{}) {
Contributor:

Suggested change
func setVolumeSnapshotStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, status map[string]interface{}) {
func setVolumeSnapshotStatus(
c *C,
dynCli *dynfake.FakeDynamicClient,
volumeSnapshotGVR schema.GroupVersionResource,
snapshotName,
namespace string,
status map[string]interface{},
) {

Kanister automation moved this from In Progress to Reviewer approved Dec 6, 2023
@ewhamilton (Contributor) left a comment:

I'll approve as is-- it's better than what we have and correct if not ideal.

I do think switching to an explicit time limit on the poll will be a significant improvement in understandability and maintainability.

Note: see the note above -- any switch to a time limit should apply only to the retrying of transient errors, not to the wait for the controller's initial response.

@hairyhum added the kueue label on Dec 8, 2023
The mergify bot merged commit 4aefd25 into master on Dec 8, 2023
15 checks passed
Kanister automation moved this from Reviewer approved to Done on Dec 8, 2023
The mergify bot deleted the snapshot-transient-error branch on December 8, 2023 at 21:34