On restore, object can fail to create from AlreadyExists and fail to Get with IsNotFound without retry #6952

kaovilai · 2023-10-13T05:19:47Z

Describe the problem/challenge you have
A custom resource can take some time from create to become Get-able due to some processing inside an apiserver resulting in failed restore for the object and a log like so

time="2023-09-15T17:51:09Z" level=error msg="Error retrieving cluster version of isim-dev-restored/postgresql:13-el9: imagestreamtags.image.openshift.io \"postgresql:13-el9\" not found" logSource="/remote-source/velero/app/pkg/res
tore/restore.go:1285" restore=velero/1a885c89-d880-5472-ac25-04e823668177-2023-09-15-10-51-25-isim-dev-restored

Example case: openshift/openshift-velero-plugin#204

Problem analysis

// Get retrieves an image that has been tagged by stream and tag. `id` is of the format <stream name>:<tag>.
func (r *REST) Get(ctx context.Context, id string, options *metav1.GetOptions) (runtime.Object, error) {
...
	image, err := r.imageFor(ctx, tag, imageStream) // <-- this is returning an IsNotFoundError
	if err != nil {
		return nil, err
	}
...}

https://github.com/openshift/openshift-apiserver/blob/9573998170f3bb7ae7e946c11b7e9fc414120df4/pkg/image/apiserver/registry/imagestreamtag/rest.go#L127C1-L145C2

which calls

// imageFor retrieves the most recent image for a tag in a given imageStreem.
func (r *REST) imageFor(ctx context.Context, tag string, imageStream *imageapi.ImageStream) (*imageapi.Image, error) {
	event := internalimageutil.LatestTaggedImage(imageStream, tag)
	if event == nil || len(event.Image) == 0 {
		return nil, kapierrors.NewNotFound(imagegroup.Resource("imagestreamtags"), imageutil.JoinImageStreamTag(imageStream.Name, tag)) // <-- this is where error originated
	}

	return r.imageRegistry.GetImage(ctx, event.Image, &metav1.GetOptions{})
}

https://github.com/openshift/openshift-apiserver/blob/9573998170f3bb7ae7e946c11b7e9fc414120df4/pkg/image/apiserver/registry/imagestreamtag/rest.go#L439C1-L447C2

which is caused by TagEvent not yet existing due to network speeds etc in processing the imagestreamtag creation

// LatestTaggedImage returns the most recent TagEvent for the specified image
// repository and tag. Will resolve lookups for the empty tag. Returns nil
// if tag isn't present in stream.status.tags.

While this is an OpenShift apiserver specific example, the problem can be described and resolved generically and may apply in other environments.

Describe the solution you'd like

A few retries would resolve this issue.
Implementation: #6949

Anything else you would like to add:

Environment:

Velero version (use velero version):
Kubernetes version (use kubectl version):
Kubernetes installer & version:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

👍 for "The project would be better with this feature added"
👎 for "This feature will not enhance the project in a meaningful way"

The text was updated successfully, but these errors were encountered:

reasonerjt · 2023-10-16T06:28:42Z

While this is an OpenShift apiserver specific example, the problem can be described and resolved generically and may apply in other environments.

Is it really generic? Seems it's a flaw in the implementation of openshift? i.e. if a resource is created and returned a 2xx it should be get-able?

sseago · 2023-10-16T16:39:13Z

@reasonerjt It is generic in the sense that any controller that behaves this way will cause velero to fail like this. I have heard that there are some other resources that do this as well, but unfortunately I don't know exactly which ones, so I can't verify that.
In any case, in terms of the OpenShift ImageStreamTag, it's not so much a flaw as it is a consequence of implementation.

I think fixing it with retries is the cleanest approach, since it will also fix it for any other resources that behave this way. However, if there are objections to this approach, we could also fix it in a RIA plugin for that specific resource type, but that won't help if it turns up in other places as well.

sseago · 2023-10-18T18:10:51Z

FYI, we are also seeing this same issue with openshift role bindings.

sseago · 2023-10-18T18:13:26Z

@kaovilai @reasonerjt Another possibility here is we don't retry, but instead of logging this as an error, we just log a warning when this happens. The scenario is "can't create, already exists, but we also can't Get yet" -- I think a warning is appropriate, since the resource wouldn't be created anyway if Get succeeded. The only thing we lose vs. retry is that we won't be able to apply an "Update" existing resource policy, but for the particular types we've seen this with, "Update" doesn't really make much sense here anyway.

kaovilai · 2023-10-24T03:43:17Z

FYI PR to warn instead of error as an alternative approach to erroring out which would avoid the need to retry.
#7004

kaovilai mentioned this issue Oct 13, 2023

Retry Get on IsNotFound after restore Create error on AlreadyExists #6949

Closed

3 tasks

sseago assigned kaovilai Oct 13, 2023

kaovilai mentioned this issue Oct 23, 2023

restore: Use warning when Create IsAlreadyExist and Get error #7004

Merged

3 tasks

kaovilai mentioned this issue Oct 24, 2023

Add logic to fail DPA reconcile on wrong BSL secret key openshift/oadp-operator#1191

Merged

kaovilai mentioned this issue Nov 2, 2023

Go CSV Markers to populate specDescriptors for "owned" CRDs from another module operator-framework/operator-sdk#6616

Open

sseago closed this as completed in #7004 Nov 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

On restore, object can fail to create from AlreadyExists and fail to Get with IsNotFound without retry #6952

On restore, object can fail to create from AlreadyExists and fail to Get with IsNotFound without retry #6952

kaovilai commented Oct 13, 2023 •

edited by reasonerjt

Loading

reasonerjt commented Oct 16, 2023

sseago commented Oct 16, 2023

sseago commented Oct 18, 2023

sseago commented Oct 18, 2023

kaovilai commented Oct 24, 2023 •

edited

Loading

On restore, object can fail to create from AlreadyExists and fail to Get with IsNotFound without retry #6952

On restore, object can fail to create from AlreadyExists and fail to Get with IsNotFound without retry #6952

Comments

kaovilai commented Oct 13, 2023 • edited by reasonerjt Loading

reasonerjt commented Oct 16, 2023

sseago commented Oct 16, 2023

sseago commented Oct 18, 2023

sseago commented Oct 18, 2023

kaovilai commented Oct 24, 2023 • edited Loading

kaovilai commented Oct 13, 2023 •

edited by reasonerjt

Loading

kaovilai commented Oct 24, 2023 •

edited

Loading