
Retry backup/restore completion/finalizing status patching to unstuck inprogress backups/restores #7845

Closed

Conversation

@kaovilai (Contributor) commented May 30, 2024

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #7207

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@kaovilai (comment marked as resolved)

@kaovilai force-pushed the retry-restore-cr-status-update branch 3 times, most recently from 42d9dba to 3045d3d on May 31, 2024 01:52
@kaovilai marked this pull request as ready for review May 31, 2024 01:56

codecov bot commented May 31, 2024

Codecov Report

Attention: Patch coverage is 3.33333%, with 29 lines in your changes missing coverage. Please review.

Project coverage is 58.72%. Comparing base (33633d8) to head (1587186).
Report is 4 commits behind head on main.

Files                                           Patch %   Lines
pkg/client/retry.go                             0.00%     20 Missing ⚠️
pkg/util/kube/client.go                         0.00%     6 Missing ⚠️
pkg/controller/backup_controller.go             0.00%     0 Missing and 1 partial ⚠️
pkg/controller/backup_finalizer_controller.go   0.00%     0 Missing and 1 partial ⚠️
pkg/controller/restore_controller.go            0.00%     0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7845      +/-   ##
==========================================
- Coverage   58.79%   58.72%   -0.07%     
==========================================
  Files         345      345              
  Lines       28764    28785      +21     
==========================================
- Hits        16911    16905       -6     
- Misses      10425    10451      +26     
- Partials     1428     1429       +1     


// TODO: consider using a more specific error type to retry, for now, we retry on all errors
// specific errors:
// - connection refused: https://pkg.go.dev/syscall#:~:text=Errno(0x67)-,ECONNREFUSED,-%3D%20Errno(0x6f
return retry.OnError(retry.DefaultBackoff, func(err error) bool {
@kaovilai (Contributor, Author) commented on the diff above:
Double-check my math on whether this backoff gives enough time to solve the linked issue.
https://pkg.go.dev/k8s.io/client-go@v0.29.0/util/retry#pkg-variables

var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
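For a rough sense of scale: with those defaults, the sleeps between the four attempts are about 10 ms, 50 ms, and 250 ms, i.e. only a few hundred milliseconds in total, which is short relative to a real API server outage. A minimal sketch of a longer, connection-refused-specific retry might look like the following; the package name, backoff values, and helper names here are illustrative assumptions, not Velero code or code from this PR:

```go
// Illustrative only: a longer backoff and a narrower retry predicate than
// retry.DefaultBackoff. Names and values are assumptions, not Velero code.
package client

import (
	"errors"
	"syscall"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// statusPatchBackoff sleeps roughly 1s + 3s + 9s + 27s ≈ 40s in total across
// its five attempts, long enough to ride out a brief API server hiccup.
var statusPatchBackoff = wait.Backoff{
	Steps:    5,
	Duration: time.Second,
	Factor:   3.0,
	Jitter:   0.1,
}

// isConnectionRefused reports whether err wraps ECONNREFUSED, the error
// typically seen while the API server endpoint is briefly unreachable.
func isConnectionRefused(err error) bool {
	return errors.Is(err, syscall.ECONNREFUSED)
}

// PatchWithRetry retries patchFn only on connection-refused errors.
func PatchWithRetry(patchFn func() error) error {
	return retry.OnError(statusPatchBackoff, isConnectionRefused, patchFn)
}
```

Whether a window of roughly 40 seconds (or the default ~0.3 s) is enough is exactly the math the comment above asks to double-check; it depends on how long an API server blip is worth waiting out before giving up.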

@kaovilai force-pushed the retry-restore-cr-status-update branch from ae07397 to 044dbb7 on May 31, 2024 13:34
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai force-pushed the retry-restore-cr-status-update branch from 044dbb7 to 8ef8fcb on May 31, 2024 13:35
@kaovilai (Contributor, Author) commented:

Received feedback to expand the retry to other status patches, and potentially to backups too. Will ask for feedback in the issue.

@kaovilai marked this pull request as draft May 31, 2024 15:25
@kaovilai force-pushed the retry-restore-cr-status-update branch 4 times, most recently from eeb3856 to 8abf577 on May 31, 2024 21:21
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai force-pushed the retry-restore-cr-status-update branch 4 times, most recently from e0f5b03 to 07aaa6b on May 31, 2024 23:05
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai changed the title from "Retry restore completion status patch" to "Retry backup/restore completion/finalizing status patching to unstuck inprogress backups/restores" May 31, 2024
@kaovilai force-pushed the retry-restore-cr-status-update branch from 07aaa6b to 1587186 on May 31, 2024 23:08
@kaovilai marked this pull request as ready for review May 31, 2024 23:09
@github-actions bot requested a review from reasonerjt May 31, 2024 23:09
@Missxiaoguo commented:

We are working on making each k8s API client call retriable in case of internal server errors caused by temporary API outages. client-go offers the ability to implement a custom middleware transport that adds extra behavior or processing to the default HTTP transport (https://github.com/kubernetes/client-go/blob/master/transport/config.go#L68). We are using it to wrap the default HTTP request with retries for specific error types: https://github.com/openshift-kni/lifecycle-agent/pull/548/files#diff-99516c035075962960ff61611a293268a97b3a6535639e92580d0ef8de6eb8cf.
Adding retry logic as middleware enhances the resilience of every API client call. Not sure whether your team has considered this approach!
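For readers unfamiliar with that mechanism, a minimal sketch of the idea follows, assuming a client-go rest.Config is being built; the type name, retry count, backoff, and the restriction to body-less requests are illustrative assumptions, not the linked lifecycle-agent implementation:

```go
// Illustrative sketch of transport-level retry middleware; names and values
// are assumptions, not the linked lifecycle-agent code.
package kubeclient

import (
	"net/http"
	"time"

	"k8s.io/client-go/rest"
)

// retryRoundTripper retries requests that fail at the transport level,
// e.g. "connection refused" while the API server is briefly unreachable.
type retryRoundTripper struct {
	delegate http.RoundTripper
	retries  int
	backoff  time.Duration
}

func (r *retryRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	var err error
	for attempt := 0; ; attempt++ {
		resp, err = r.delegate.RoundTrip(req)
		// Retry only when no response came back at all and the request carries
		// no body to replay; a production version would also rewind bodies via
		// req.GetBody and limit retries to errors known to be transient.
		if err == nil || req.Body != nil || attempt >= r.retries {
			return resp, err
		}
		time.Sleep(r.backoff * time.Duration(attempt+1))
	}
}

// WithTransportRetries installs the middleware on a rest.Config so that every
// client built from the config goes through the retrying round tripper.
func WithTransportRetries(cfg *rest.Config) {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &retryRoundTripper{delegate: rt, retries: 3, backoff: 500 * time.Millisecond}
	}
}
```

The appeal of this design is that every client built from the wrapped config picks up the retry behavior, instead of sprinkling retry.OnError around individual status patches.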

@kaovilai (Contributor, Author) commented Jun 4, 2024

That's cool. Will check, thanks @Missxiaoguo

@sseago (Collaborator) commented Jun 4, 2024

I'm thinking that we want to be careful here. We absolutely want retries on the status transitions away from InProgress and toward terminal states, but if we retry for up to 2 minutes per item during a restore, a bad API server could cause a big Restore to hang for hours or days, when what we want to see there is: mark the error and move on.
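(As a purely illustrative back-of-the-envelope number: if a serial restore of, say, 5,000 items had each item wait out the full 2-minute retry window during a prolonged outage, that alone would be 5,000 × 2 min ≈ 7 days.)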

@Missxiaoguo commented Jun 4, 2024

> I'm thinking that we want to be careful here. We absolutely want retries on the status transitions away from InProgress and toward terminal states, but if we retry for up to 2 minutes per item during a restore, a bad API server could cause a big Restore to hang for hours or days, when what we want to see there is: mark the error and move on.

Yeah, my understanding is that the hang depends on the duration of the API server outage, regardless of whether the restore is small or big. The difference with a big restore is more underlying requests, driven by the number of resources. But it's very reasonable for you guys to be cautious!

@kaovilai (Contributor, Author) commented Jun 5, 2024

Trying Requeue + Fail without a velero pod restart in another PR; we can discuss whether we want this retry PR later.

@kaovilai (Contributor, Author) commented Jun 5, 2024

From the community meeting: this PR is currently considered too specific and not applicable to most users, and for the users it does apply to, taking backups during an API server outage is considered risky.

@kaovilai (Contributor, Author) commented Jun 5, 2024

Requeue approach at #7863

@kaovilai (Contributor, Author) commented Jul 3, 2024

Closing, as this won't get merged; we will discuss a better solution in the future. Downstream we favor #7863, as that would "retry" via requeue over a longer time period.

@kaovilai closed this Jul 3, 2024

Successfully merging this pull request may close these issues.

Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"
3 participants