Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On restore, object can fail to create from AlreadyExists and fail to Get with IsNotFound without retry #6952

Closed
kaovilai opened this issue Oct 13, 2023 · 5 comments · Fixed by #7004
Assignees

Comments

@kaovilai
Copy link
Contributor

kaovilai commented Oct 13, 2023

Describe the problem/challenge you have
A custom resource can take some time from create to become Get-able due to some processing inside an apiserver resulting in failed restore for the object and a log like so

time="2023-09-15T17:51:09Z" level=error msg="Error retrieving cluster version of isim-dev-restored/postgresql:13-el9: imagestreamtags.image.openshift.io \"postgresql:13-el9\" not found" logSource="/remote-source/velero/app/pkg/res
tore/restore.go:1285" restore=velero/1a885c89-d880-5472-ac25-04e823668177-2023-09-15-10-51-25-isim-dev-restored

Example case: openshift/openshift-velero-plugin#204

Problem analysis
// Get retrieves an image that has been tagged by stream and tag. `id` is of the format <stream name>:<tag>.
func (r *REST) Get(ctx context.Context, id string, options *metav1.GetOptions) (runtime.Object, error) {
...
	image, err := r.imageFor(ctx, tag, imageStream) // <-- this is returning an IsNotFoundError
	if err != nil {
		return nil, err
	}
...}

https://github.com/openshift/openshift-apiserver/blob/9573998170f3bb7ae7e946c11b7e9fc414120df4/pkg/image/apiserver/registry/imagestreamtag/rest.go#L127C1-L145C2

which calls

// imageFor retrieves the most recent image for a tag in a given imageStreem.
func (r *REST) imageFor(ctx context.Context, tag string, imageStream *imageapi.ImageStream) (*imageapi.Image, error) {
	event := internalimageutil.LatestTaggedImage(imageStream, tag)
	if event == nil || len(event.Image) == 0 {
		return nil, kapierrors.NewNotFound(imagegroup.Resource("imagestreamtags"), imageutil.JoinImageStreamTag(imageStream.Name, tag)) // <-- this is where error originated
	}

	return r.imageRegistry.GetImage(ctx, event.Image, &metav1.GetOptions{})
}

https://github.com/openshift/openshift-apiserver/blob/9573998170f3bb7ae7e946c11b7e9fc414120df4/pkg/image/apiserver/registry/imagestreamtag/rest.go#L439C1-L447C2

which is caused by TagEvent not yet existing due to network speeds etc in processing the imagestreamtag creation

// LatestTaggedImage returns the most recent TagEvent for the specified image
// repository and tag. Will resolve lookups for the empty tag. Returns nil
// if tag isn't present in stream.status.tags.

While this is an OpenShift apiserver specific example, the problem can be described and resolved generically and may apply in other environments.

Describe the solution you'd like

A few retries would resolve this issue.
Implementation: #6949

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@reasonerjt
Copy link
Contributor

While this is an OpenShift apiserver specific example, the problem can be described and resolved generically and may apply in other environments.

Is it really generic? Seems it's a flaw in the implementation of openshift? i.e. if a resource is created and returned a 2xx it should be get-able?

@sseago
Copy link
Collaborator

sseago commented Oct 16, 2023

@reasonerjt It is generic in the sense that any controller that behaves this way will cause velero to fail like this. I have heard that there are some other resources that do this as well, but unfortunately I don't know exactly which ones, so I can't verify that.
In any case, in terms of the OpenShift ImageStreamTag, it's not so much a flaw as it is a consequence of implementation.

I think fixing it with retries is the cleanest approach, since it will also fix it for any other resources that behave this way. However, if there are objections to this approach, we could also fix it in a RIA plugin for that specific resource type, but that won't help if it turns up in other places as well.

@sseago
Copy link
Collaborator

sseago commented Oct 18, 2023

FYI, we are also seeing this same issue with openshift role bindings.

@sseago
Copy link
Collaborator

sseago commented Oct 18, 2023

@kaovilai @reasonerjt Another possibility here is we don't retry, but instead of logging this as an error, we just log a warning when this happens. The scenario is "can't create, already exists, but we also can't Get yet" -- I think a warning is appropriate, since the resource wouldn't be created anyway if Get succeeded. The only thing we lose vs. retry is that we won't be able to apply an "Update" existing resource policy, but for the particular types we've seen this with, "Update" doesn't really make much sense here anyway.

@kaovilai
Copy link
Contributor Author

kaovilai commented Oct 24, 2023

FYI PR to warn instead of error as an alternative approach to erroring out which would avoid the need to retry.
#7004

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants