
retry unpacking jobs on failure #3016

Merged: 4 commits into operator-framework:master, Oct 3, 2023

Conversation

@ankitathomas (Contributor) commented Aug 22, 2023

Description of the change:
Recreate failed bundle unpack jobs to allow for automatic retries on unpacking failure.

Motivation for the change:
Bundle unpack jobs can fail because of network or configuration issues in the cluster, some of which are transient and some of which need user intervention to resolve. Because unpack jobs have deterministic names derived from the bundle they correspond to, recovering from an unpack failure currently requires manually deleting the associated unpack job.

This PR automates recreation of failed unpack jobs indefinitely, with a guaranteed minimum interval between retries when one is specified via a new operatorGroup annotation.

openshift-ci bot requested review from njhale and tmshort August 22, 2023 14:35
		job, err = c.client.BatchV1().Jobs(fresh.GetNamespace()).Create(context.TODO(), fresh, metav1.CreateOptions{})
	}
	return
}
Contributor:

This appears to retry without limit?

@perdasilva (Collaborator) commented Aug 23, 2023

What's the retry cadence? Is it exponential backoff?
Maybe it's okay to retry forever as long as we're not hammering the apiserver?

@ankitathomas (Contributor, Author) commented Aug 25, 2023

This runs whenever OLM resolves a namespace - we use the default client-go workqueue, so we get exponential backoff up to ~15 min.

We do, however, reset this backoff each time a new unpack job begins, so this can shrink to a retry loop as short as 5 seconds if the unpack timeout is short enough.
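
For context, here is a minimal sketch of the backoff behavior being described, using the default client-go controller rate limiter. The standalone program and the item key are illustrative, not OLM's actual queue wiring; the 5 ms base and 1000 s cap are the client-go defaults behind the "~15 min" ceiling mentioned above.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Default controller rate limiter: per-item exponential backoff
	// (5ms base, 1000s cap) combined with an overall token bucket.
	rl := workqueue.DefaultControllerRateLimiter()

	item := "namespace/example-bundle" // hypothetical queue key

	// Each failed sync of the same item roughly doubles the delay.
	for i := 0; i < 6; i++ {
		fmt.Printf("requeue %d after %v\n", i+1, rl.When(item))
	}

	// A successful sync calls Forget, which resets the per-item backoff;
	// this is why starting a fresh unpack job restarts the retry clock.
	rl.Forget(item)
	fmt.Printf("after Forget: %v\n", rl.When(item))
}
```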

@@ -651,6 +651,14 @@ func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath

		return
	}
	// Cleanup old unpacking job and retry
@ankitathomas (Contributor, Author):

If we don't care about persisting the failed job at all, this can be simplified to deleting the job immediately after failure and waiting for the next resolver run.

Collaborator:

I think we should persist the failed job - we need a debug trail of some sort

@@ -651,6 +651,14 @@ func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath

		return
	}
	// Cleanup old unpacking job and retry
	if _, isFailed := getCondition(job, batchv1.JobFailed); isFailed {
		err = c.client.BatchV1().Jobs(job.GetNamespace()).Delete(context.TODO(), job.GetName(), metav1.DeleteOptions{})
@varshaprasad96 (Member) commented Aug 28, 2023

Why delete it manually and not set a TTL to garbage collect? (https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs)

I haven't looked into the entire code, but I'm assuming the controller should re-create a new Job in case it has not been unpacked.
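
For reference, the TTL mechanism linked above is just a field on the Job spec. A minimal sketch of what that would look like follows; the job name, image, and TTL value are placeholders, and, per the reply below, this is not the approach the PR takes, since completed jobs need to persist.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	ttl := int32(300) // keep the finished Job for 5 minutes before garbage collection

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-bundle-unpack"}, // placeholder name
		Spec: batchv1.JobSpec{
			// TTLSecondsAfterFinished lets the TTL controller delete the Job
			// automatically once it completes or fails.
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "unpack", Image: "registry.example.com/bundle:latest"}, // placeholder image
					},
				},
			},
		},
	}

	fmt.Printf("job %s will be garbage collected %d seconds after it finishes\n",
		job.Name, *job.Spec.TTLSecondsAfterFinished)
}
```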

@ankitathomas (Contributor, Author):

IIRC, the current implementation requires completed jobs to persist to indicate an unpacked bundle.
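
To illustrate that point: whether a bundle counts as unpacked hinges on the Job's conditions. Below is a small sketch using a hypothetical hasCondition helper standing in for the getCondition call seen in the diff (whose exact signature isn't shown here).

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasCondition is a hypothetical stand-in for getCondition: it reports whether
// the Job carries the given condition with status True.
func hasCondition(job *batchv1.Job, condType batchv1.JobConditionType) bool {
	for _, cond := range job.Status.Conditions {
		if cond.Type == condType && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	job := &batchv1.Job{
		Status: batchv1.JobStatus{
			Conditions: []batchv1.JobCondition{
				{Type: batchv1.JobComplete, Status: corev1.ConditionTrue},
			},
		},
	}

	// A persisted JobComplete condition is what marks the bundle as unpacked,
	// which is why completed jobs can't simply be garbage collected; a JobFailed
	// condition is what the PR now cleans up and retries.
	fmt.Println("unpacked:", hasCondition(job, batchv1.JobComplete))
	fmt.Println("failed:  ", hasCondition(job, batchv1.JobFailed))
}
```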

@ankitathomas changed the title from "retry unpacking jobs on failure" to "WIP: retry unpacking jobs on failure" Aug 29, 2023
openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Aug 29, 2023

// BundleUnpackRetryMinimumIntervalAnnotationKey sets a minimum interval to wait before
// attempting to recreate a failed unpack job for a bundle.
BundleUnpackRetryMinimumIntervalAnnotationKey = "operatorframework.io/bundle-unpack-min-retry-interval"
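
A minimal sketch of how a consumer might read this annotation off an operatorGroup's metadata; the helper name, the Go duration value format, and the fall-back behavior are assumptions for illustration, not necessarily what the PR implements.

```go
package main

import (
	"fmt"
	"time"
)

const bundleUnpackRetryMinimumIntervalAnnotationKey = "operatorframework.io/bundle-unpack-min-retry-interval"

// parseUnpackRetryInterval is a hypothetical helper: it reads the annotation
// value as a Go duration string and treats an unset annotation as "no minimum
// interval between retries".
func parseUnpackRetryInterval(annotations map[string]string) (time.Duration, error) {
	v, ok := annotations[bundleUnpackRetryMinimumIntervalAnnotationKey]
	if !ok {
		return 0, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return 0, fmt.Errorf("invalid value %q for %s: %w", v, bundleUnpackRetryMinimumIntervalAnnotationKey, err)
	}
	return d, nil
}

func main() {
	// Example: an operatorGroup annotated with a 5-minute minimum retry interval.
	annotations := map[string]string{
		bundleUnpackRetryMinimumIntervalAnnotationKey: "5m",
	}
	minInterval, err := parseUnpackRetryInterval(annotations)
	fmt.Println(minInterval, err) // 5m0s <nil>
}
```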
Contributor:

Will you have a follow-up PR to document how to use this field?

@ankitathomas (Contributor, Author):

I'll follow up with operator-framework/olm-docs#313 once this PR is merged.

Signed-off-by: Ankita Thomas <ankithom@redhat.com>
… unpack jobs

Signed-off-by: Ankita Thomas <ankithom@redhat.com>
@ankitathomas changed the title from "WIP: retry unpacking jobs on failure" to "retry unpacking jobs on failure" Sep 26, 2023
openshift-ci bot removed the do-not-merge/work-in-progress label Sep 26, 2023
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
@tmshort (Contributor) left a comment:

/lgtm

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Oct 2, 2023
openshift-ci bot commented Oct 2, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankitathomas, tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Oct 2, 2023
@ankitathomas added this pull request to the merge queue Oct 3, 2023
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 3, 2023
@tmshort added this pull request to the merge queue Oct 3, 2023
Merged via the queue into operator-framework:master with commit 4fc64d2 Oct 3, 2023
16 checks passed