
Retry backup/restore completion/finalizing status patching to unstuck inprogress backups/restores #7845

Closed

Conversation

@kaovilai (Contributor) commented May 30, 2024

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #7207

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@kaovilai (comment marked as resolved)

@kaovilai force-pushed the retry-restore-cr-status-update branch 3 times, most recently from 42d9dba to 3045d3d on May 31, 2024 01:52
@kaovilai marked this pull request as ready for review May 31, 2024 01:56

codecov bot commented May 31, 2024

Codecov Report

Attention: Patch coverage is 3.33333%, with 29 lines in your changes missing coverage. Please review.

Project coverage is 58.72%. Comparing base (33633d8) to head (1587186).
Report is 4 commits behind head on main.

Files                                           Patch %   Lines
pkg/client/retry.go                             0.00%     20 Missing ⚠️
pkg/util/kube/client.go                         0.00%     6 Missing ⚠️
pkg/controller/backup_controller.go             0.00%     0 Missing and 1 partial ⚠️
pkg/controller/backup_finalizer_controller.go   0.00%     0 Missing and 1 partial ⚠️
pkg/controller/restore_controller.go            0.00%     0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7845      +/-   ##
==========================================
- Coverage   58.79%   58.72%   -0.07%     
==========================================
  Files         345      345              
  Lines       28764    28785      +21     
==========================================
- Hits        16911    16905       -6     
- Misses      10425    10451      +26     
- Partials     1428     1429       +1     


// TODO: consider using a more specific error type to retry, for now, we retry on all errors
// specific errors:
// - connection refused: https://pkg.go.dev/syscall#:~:text=Errno(0x67)-,ECONNREFUSED,-%3D%20Errno(0x6f
return retry.OnError(retry.DefaultBackoff, func(err error) bool {
@kaovilai (Contributor, Author) commented on the diff above:
Double-check my math on whether this backoff gives enough time to solve the linked issue.
https://pkg.go.dev/k8s.io/client-go@v0.29.0/util/retry#pkg-variables

var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
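For a rough sense of scale: with those defaults, the sleeps between the four attempts are about 10 ms, 50 ms, and 250 ms, i.e. only a few hundred milliseconds in total, which is short relative to a real API server outage. A minimal sketch of a longer, connection-refused-specific retry might look like the following; the package name, backoff values, and helper names here are illustrative assumptions, not Velero code or code from this PR:

```go
// Illustrative only: a longer backoff and a narrower retry predicate than
// retry.DefaultBackoff. Names and values are assumptions, not Velero code.
package client

import (
	"errors"
	"syscall"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// statusPatchBackoff sleeps roughly 1s + 3s + 9s + 27s ≈ 40s in total across
// its five attempts, long enough to ride out a brief API server hiccup.
var statusPatchBackoff = wait.Backoff{
	Steps:    5,
	Duration: time.Second,
	Factor:   3.0,
	Jitter:   0.1,
}

// isConnectionRefused reports whether err wraps ECONNREFUSED, the error
// typically seen while the API server endpoint is briefly unreachable.
func isConnectionRefused(err error) bool {
	return errors.Is(err, syscall.ECONNREFUSED)
}

// PatchWithRetry retries patchFn only on connection-refused errors.
func PatchWithRetry(patchFn func() error) error {
	return retry.OnError(statusPatchBackoff, isConnectionRefused, patchFn)
}
```

Whether a window of roughly 40 seconds (or the default ~0.3 s) is enough is exactly the math the comment above asks to double-check; it depends on how long an API server blip is worth waiting out before giving up.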

@kaovilai force-pushed the retry-restore-cr-status-update branch from ae07397 to 044dbb7 on May 31, 2024 13:34
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai force-pushed the retry-restore-cr-status-update branch from 044dbb7 to 8ef8fcb on May 31, 2024 13:35
@kaovilai (Contributor, Author) commented:

Received feedback to expand the retry to other status patches, and potentially to backups too. Will ask for feedback in the issue.

@kaovilai marked this pull request as draft May 31, 2024 15:25
@kaovilai force-pushed the retry-restore-cr-status-update branch 4 times, most recently from eeb3856 to 8abf577 on May 31, 2024 21:21
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai force-pushed the retry-restore-cr-status-update branch 4 times, most recently from e0f5b03 to 07aaa6b on May 31, 2024 23:05
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai changed the title from "Retry restore completion status patch" to "Retry backup/restore completion/finalizing status patching to unstuck inprogress backups/restores" May 31, 2024
@kaovilai force-pushed the retry-restore-cr-status-update branch from 07aaa6b to 1587186 on May 31, 2024 23:08
@kaovilai marked this pull request as ready for review May 31, 2024 23:09
@github-actions bot requested a review from reasonerjt May 31, 2024 23:09
@Missxiaoguo commented:

We are working on making each k8s API client call retriable in case of internal server errors caused by temporary API outages. client-go offers the ability to implement a custom middleware transport that adds extra behavior or processing to the default HTTP transport (https://github.com/kubernetes/client-go/blob/master/transport/config.go#L68). We are using it to wrap the default HTTP request with retries for specific error types: https://github.com/openshift-kni/lifecycle-agent/pull/548/files#diff-99516c035075962960ff61611a293268a97b3a6535639e92580d0ef8de6eb8cf.
Adding retry logic as middleware enhances the resilience of every API client call. Not sure whether your team has considered this approach!
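For readers unfamiliar with that mechanism, a minimal sketch of the idea follows, assuming a client-go rest.Config is being built; the type name, retry count, backoff, and the restriction to body-less requests are illustrative assumptions, not the linked lifecycle-agent implementation:

```go
// Illustrative sketch of transport-level retry middleware; names and values
// are assumptions, not the linked lifecycle-agent code.
package kubeclient

import (
	"net/http"
	"time"

	"k8s.io/client-go/rest"
)

// retryRoundTripper retries requests that fail at the transport level,
// e.g. "connection refused" while the API server is briefly unreachable.
type retryRoundTripper struct {
	delegate http.RoundTripper
	retries  int
	backoff  time.Duration
}

func (r *retryRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	var err error
	for attempt := 0; ; attempt++ {
		resp, err = r.delegate.RoundTrip(req)
		// Retry only when no response came back at all and the request carries
		// no body to replay; a production version would also rewind bodies via
		// req.GetBody and limit retries to errors known to be transient.
		if err == nil || req.Body != nil || attempt >= r.retries {
			return resp, err
		}
		time.Sleep(r.backoff * time.Duration(attempt+1))
	}
}

// WithTransportRetries installs the middleware on a rest.Config so that every
// client built from the config goes through the retrying round tripper.
func WithTransportRetries(cfg *rest.Config) {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &retryRoundTripper{delegate: rt, retries: 3, backoff: 500 * time.Millisecond}
	}
}
```

The appeal of this design is that every client built from the wrapped config picks up the retry behavior, instead of sprinkling retry.OnError around individual status patches.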

@kaovilai (Contributor, Author) commented Jun 4, 2024

That's cool. Will check, thanks @Missxiaoguo

@sseago (Collaborator) commented Jun 4, 2024

I'm thinking that we want to be careful here. We absolutely want retries on the status transitions away from InProgress and toward terminal states, but if we retry for up to 2 minutes per item during a restore, a bad API server could cause a big Restore to hang for hours or days, when what we want to see there is: mark the error and move on.
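(As a purely illustrative back-of-the-envelope number: if a serial restore of, say, 5,000 items had each item wait out the full 2-minute retry window during a prolonged outage, that alone would be 5,000 × 2 min ≈ 7 days.)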

@Missxiaoguo commented Jun 4, 2024

> I'm thinking that we want to be careful here. We absolutely want retries on the status transitions away from InProgress and toward terminal states, but if we retry for up to 2 minutes per item during a restore, a bad API server could cause a big Restore to hang for hours or days, when what we want to see there is: mark the error and move on.

Yeah, my understanding is that the hang depends on the duration of the API server outage, regardless of whether the restore is small or big. The difference with a big restore is more underlying requests, driven by the number of resources. But it's very reasonable for you guys to be cautious!

@kaovilai (Contributor, Author) commented Jun 5, 2024

Trying Requeue + Fail without a velero pod restart in another PR; we can discuss whether we want this retry PR later.

@kaovilai (Contributor, Author) commented Jun 5, 2024

From the community meeting: this PR is currently considered too specific and not applicable to most users, and for the users it does apply to, taking backups during an API server outage is considered risky.

@kaovilai (Contributor, Author) commented Jun 5, 2024

Requeue approach at #7863

@kaovilai (Contributor, Author) commented Jul 3, 2024

Closing, as this won't get merged; we will discuss a better solution in the future. Downstream we favor #7863, as that would "retry" via requeue over a longer time period.

@kaovilai closed this Jul 3, 2024

Successfully merging this pull request may close these issues.

Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"
3 participants