On restore, object can fail to create from AlreadyExists and fail to Get with IsNotFound without retry #6952
Comments
Is it really generic? It seems to be a flaw in the OpenShift implementation, i.e. if a resource is created and the request returns a 2xx, shouldn't it be Get-able?
@reasonerjt It is generic in the sense that any controller that behaves this way will cause Velero to fail like this. I have heard that some other resources do this as well, but unfortunately I don't know exactly which ones, so I can't verify that. I think fixing it with retries is the cleanest approach, since it will also fix it for any other resources that behave this way. However, if there are objections to this approach, we could also fix it in a RIA plugin for that specific resource type, but that won't help if it turns up in other places as well.
FYI, we are also seeing this same issue with OpenShift role bindings.
@kaovilai @reasonerjt Another possibility here is that we don't retry, but instead of logging this as an error, we just log a warning when it happens. The scenario is "can't create, already exists, but we also can't Get yet". I think a warning is appropriate, since the resource wouldn't be created anyway if Get succeeded. The only thing we lose vs. retry is that we won't be able to apply an "Update" existing resource policy, but for the particular types we've seen this with, "Update" doesn't really make much sense anyway.
FYI, there is a PR that logs a warning instead of an error, as an alternative approach that would avoid the need to retry.
Describe the problem/challenge you have
A custom resource can take some time after Create before it becomes Get-able, because of processing inside an apiserver. During a restore, Create fails with AlreadyExists and the follow-up Get fails with IsNotFound, so the restore of the object fails with a log like so:
Example case: openshift/openshift-velero-plugin#204
Problem analysis
https://github.com/openshift/openshift-apiserver/blob/9573998170f3bb7ae7e946c11b7e9fc414120df4/pkg/image/apiserver/registry/imagestreamtag/rest.go#L127C1-L145C2
which calls
https://github.com/openshift/openshift-apiserver/blob/9573998170f3bb7ae7e946c11b7e9fc414120df4/pkg/image/apiserver/registry/imagestreamtag/rest.go#L439C1-L447C2
which returns NotFound because the TagEvent does not yet exist while the imagestreamtag creation is still being processed (due to network speed, etc.).
While this is an OpenShift apiserver specific example, the problem can be described and resolved generically and may apply in other environments.
Describe the solution you'd like
Retrying the Get a few times when it returns IsNotFound after an AlreadyExists on Create would resolve this issue.
Implementation: #6949
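For illustration only, here is a minimal sketch of the retry idea. It is not the code from the linked PR. The names `getWithRetry`, `errAlreadyExists` and `errNotFound` are hypothetical stand-ins for the real client call and for the apiserver's AlreadyExists / NotFound errors, and the attempt count and delay are arbitrary assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for the apiserver's
// AlreadyExists and NotFound responses.
var (
	errAlreadyExists = errors.New("already exists")
	errNotFound      = errors.New("not found")
)

// getWithRetry calls get up to `attempts` times, sleeping `delay` between
// tries, for as long as get keeps returning errNotFound. Any other error
// is returned immediately without further retries.
func getWithRetry(get func() (string, error), attempts int, delay time.Duration) (string, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		obj, err := get()
		if err == nil {
			return obj, nil
		}
		if !errors.Is(err, errNotFound) {
			return "", err
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return "", fmt.Errorf("still not found after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Simulate the race: Create reports AlreadyExists, and the first two
	// Get calls report NotFound before the object becomes visible.
	createErr := errAlreadyExists
	calls := 0
	get := func() (string, error) {
		calls++
		if calls < 3 {
			return "", errNotFound
		}
		return "imagestreamtag/example", nil
	}
	if errors.Is(createErr, errAlreadyExists) {
		obj, err := getWithRetry(get, 5, 10*time.Millisecond)
		fmt.Println(obj, err, calls)
	}
}
```

Because the retry is bounded, a resource that never becomes Get-able still surfaces as an error (or could be downgraded to a warning by the caller) rather than blocking the restore indefinitely.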