Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Use error type rather than strings #878

Merged
merged 2 commits into from
Jun 4, 2024

Conversation

tmshort
Copy link
Contributor

@tmshort tmshort commented May 21, 2024

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@tmshort tmshort requested a review from a team as a code owner May 21, 2024 16:01
@tmshort tmshort changed the title Use error type rather than strings ✨ Use error type rather than strings May 21, 2024
Copy link

codecov bot commented May 21, 2024

Codecov Report

Attention: Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.

Project coverage is 77.93%. Comparing base (9f0c6a9) to head (928e1ba).

Files Patch % Lines
internal/controllers/common_controller.go 0.00% 8 Missing ⚠️
...nternal/controllers/clusterextension_controller.go 77.27% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #878      +/-   ##
==========================================
+ Coverage   76.21%   77.93%   +1.72%     
==========================================
  Files          17       17              
  Lines        1177     1142      -35     
==========================================
- Hits          897      890       -7     
+ Misses        202      175      -27     
+ Partials       78       77       -1     
Flag Coverage Δ
e2e 59.80% <46.66%> (+1.60%) ⬆️
unit 60.14% <53.33%> (+2.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tmshort tmshort force-pushed the error-types branch 2 times, most recently from ec163f2 to b6d0e4a Compare May 21, 2024 17:57
@@ -184,11 +213,11 @@ func (r *ClusterExtensionReconciler) handleResolutionErrors(ext *ocv1alpha1.Clus
if errors.As(err, &aggErrs) {
for _, err := range aggErrs.Errors() {
errorMessage := err.Error()
if strings.Contains(errorMessage, "no package") {
if errors.As(err, &NoPackageError{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to verify that this works as expected? IIRC, the last time I did something like this I discovered it did not work as expected and had to do it this way: https://github.com/operator-framework/catalogd/blob/fa4a29bae42a5725d932b1bf6c31381f185e99f7/pkg/controllers/core/clustercatalog_controller.go#L79-L82

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semi-off topic, but @everettraven, that recoverable error thing is a feature built directly into controller-runtime since 0.15: https://github.com/kubernetes-sigs/controller-runtime/blob/735b6073bb253c0449bfcf6641855dcf2118bb15/pkg/reconcile/reconcile.go#L105-L109

Copy link
Member

@m1kola m1kola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't contain the whole diff related to error handling and I can't comment on the relevant lines. But I'll try to do my best to explain why I think error handling should be changed.

TL;DR: Opaque errors should do the job here.

The need for custom error types seems to arise from handleResolutionErrors which for some reason attempts to distinguish between different errors. I do not think there is no need to handle errors separately:

  1. All branches in handleResolutionErrors seem to boil down to the same call:
    setResolvedStatusConditionFailed(&ext.Status.Conditions, errorMessage, ext.Generation)
    And some variation of the following (only the reason is different):
    ensureAllConditionsWithReason(ext, "ResolutionFailed", errorMessage)
  2. First two branches (NoPackageError and InvalidVersionRangeError) are the same and set reason to ResolutionFailed.
  3. Last two branches (one handling non-NoPackageError and non-InvalidVersionRangeError errors and one handling non-Aggregate errors) are also identical and set reason to InstallationStatusUnknown.
  4. I think InstallationStatusUnknown should not be used here since we are handling errors related to the resolution. I believe it is reasonable to always use ResolutionFailed.

Above leaves us with just the following:

setResolvedStatusConditionFailed(&ext.Status.Conditions, errorMessage, ext.Generation)
ensureAllConditionsWithReason(ext, ocv1alpha1.ReasonResolutionFailed, errorMessage)

If we do that we do not need custom error types.

I also have doubts about ensureAllConditionsWithReason: we only use it in handleResolutionErrors and do not seem to use this approach consistently (approach of setting all missing conditions to the same reason & message).

I think all of this can be done like this:

bundle, err := r.resolve(ctx, *ext)
if err != nil {
    setResolvedStatusConditionFailed(&ext.Status.Conditions, err.Error(), ext.Generation)
    // setInstalledStatusConditionFailed, and other relevant conditions
    // in the same way we do after r.resolve(ctx, *ext) call.
    return ctrl.Result{}, err
}

// Somewhere at the end of reconcile deal with "missing" conditions (instead of using ensureAllConditionsWithReason in place)

// Handle no package found errors, potentially setting status conditions
setResolvedStatusConditionFailed(&ext.Status.Conditions, errorMessage, ext.Generation)
ensureAllConditionsWithReason(ext, "ResolutionFailed", errorMessage)
} else if strings.Contains(errorMessage, "invalid version range") {
} else if errors.As(err, &InvalidVersionRangeError{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important to distinguish between different errors here? Why do we need custom error types?

Handling of NoPackageError and InvalidVersionRangeError is literally the same.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2024
@tmshort
Copy link
Contributor Author

tmshort commented May 29, 2024

@m1kola (github doesn't support comment threads), I do believe we need to distinguish between Resolution errors, which likely involve errors in the ClusterExtension resource, vs. other errors that can occur on the cluster. In the first case, the user may need to fix it, and can likely do it themselves. In the other case, it may be a temporal error on the cluster, or something else that the user cannot handle.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2024
@tmshort tmshort force-pushed the error-types branch 2 times, most recently from 26550da to 5cfe589 Compare May 29, 2024 18:46
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2024
@openshift-merge-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tmshort tmshort changed the base branch from helm-poc to main May 30, 2024 19:52
Copy link

netlify bot commented May 30, 2024

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 928e1ba
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/665f2fbc41879700082961cb
😎 Deploy Preview https://deploy-preview-878--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@tmshort tmshort removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2024
@m1kola
Copy link
Member

m1kola commented Jun 3, 2024

@m1kola (github doesn't support comment threads), I do believe we need to distinguish between Resolution errors, which likely involve errors in the ClusterExtension resource, vs. other errors that can occur on the cluster. In the first case, the user may need to fix it, and can likely do it themselves. In the other case, it may be a temporal error on the cluster, or something else that the user cannot handle.

@tmshort handleResolutionErrors only handles resolution errors (only errors returned from r.resolve(ctx, *ext)).

How I see it:

  • If something is wrong with ClusterExtension spec (e.g. we can't find a package) - we want to make sure that the error is very human readable and appears on the condition. Human can take an action based on the error message.
  • If it is something temporal (e.g. we timeout fetching bundles from catalogs) - then we should also put errors into the condition. Ideally adding some context for troubleshooting (e.g. not just "request timed out", but "error fetching bundles: request timed out", etc). Eventually temporal errors will be resolved during one of the attempts to reconcile.
  • If it is something permanent - same as temporal. I believe at the moment we do not have any permanent/terminal conditions. So every error which appears to be "permanent" is logically just a temporal issue which has not been resolved yet.

In all the these cases we can fail resolution condition and write error into the condition message. I don't think that setting reason to InstallationStatusUnknown aids UX.

There is a chance that I'm missing something (e.g. maybe you expect programmatic clients to take advantage of a distinct combination of reasons and take some action?). Happy to discuss it further here or jump on a quick call if you like.

@joelanford
Copy link
Member

@ankitathomas's PR is super relevant to the discussion of temporal/transient vs permanent. #842

I think the Progressing condition that we've discussed should make it much easier for us to tell users "we're going to try again" vs "we're not trying again, there is something you need to fix"

@tmshort
Copy link
Contributor Author

tmshort commented Jun 3, 2024

@m1kola

handleResolutionErrors only handles resolution errors (only errors returned from r.resolve(ctx, *ext)).

Non-resolution (e.g. timeout) errors can occur in r.resolve(), and thus any kind of errors are handled by handleResolutionErrors. Would it make more sense to remove the function (since it's a lot smaller now), and put the functionality into the main function?

This is just trying to clean up error processing. It's not meant to clean up the status conditions (but it has to do something with status conditions). It looks as though #842 changed title, so it's doing something different now, issue #880 is meant to clean up the status conditions. The hope is that we can have smaller changes (which is why I'm trying to keep this small).

@tmshort
Copy link
Contributor Author

tmshort commented Jun 3, 2024

This PR is focused on removing the string checks for error types, and all the aggregated errors that were being used - issues noted in the big Helm PR #846. It was not meant to fix all the problems with status conditions.

Separate out resolution errors from generic errors.

Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Copy link
Member

@m1kola m1kola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love removing code. Looks good to me.

@tmshort tmshort added this pull request to the merge queue Jun 4, 2024
Merged via the queue into operator-framework:main with commit 841db34 Jun 4, 2024
16 of 17 checks passed
@tmshort tmshort deleted the error-types branch June 4, 2024 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants