Prevent OLM from creating InstallPlans when bundle unpack fails #2942

@m1kola m1kola commented Mar 22, 2023

Description of the change:

This PR changes when OLM creates InstallPlans: a bundle must be unpacked successfully before an InstallPlan is created for it.

Motivation for the change:

Today OLM creates InstallPlans even if bundle images are not available for some reason. For example, a catalog source may contain references to an unreachable registry.

Consider this scenario:

  • A user installs an operator and enables automatic upgrades
  • The catalog source receives an update which references unreachable bundle images (e.g. a pre-release from a private registry)
  • OLM creates an InstallPlan for an upgrade which features the unreachable bundle images
  • At this point the upgrade is stuck, since the cluster cannot pull the bundle images
  • The catalog source receives a corrective update which contains only reachable bundle images

Workaround: to resolve this issue users have to manually remove a failed InstallPlan. After that OLM is expected to create a new InstallPlan based on the latest state of catalog sources.

The goal: We would like to find a solution which prevents InstallPlan from being created with unreachable images in the first place. This eliminates the need to manually delete InstallPlan in scenarios like the one described above.

Architectural changes:

Before the change:

  • On a Namespace reconciliation (syncResolvingNamespace) OLM creates an InstallPlan if required (e.g. a new Subscription is created, or a new update in a catalog is available for an existing Subscription)
  • On InstallPlan reconciliation (syncInstallPlans) OLM creates a Job to unpack the relevant bundle.
    • If the unpack job fails, OLM reports the failure as InstallPlan conditions

After this change:

  • On a Namespace reconciliation (syncResolvingNamespace):
    • OLM first creates a Job to unpack the relevant bundle and waits for the job to complete successfully
      • If the unpack job fails, OLM reports the failure as Subscription conditions (set on all Subscriptions in the namespace)
    • Only after successful completion does OLM create an InstallPlan

Testing remarks:

To reproduce:

  • Create a test cluster and install OLM (make run-local from this repo can be helpful)
  • Create a catalog source:
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: alex-operators
      namespace: olm
    spec:
      image: quay.io/agreene/index:unreachable-bundle-upgrade
      displayName: Alex Operators
      priority: -100
      publisher: Alex
      sourceType: grpc
      updateStrategy:
        registryPoll:
          interval: 10m0s
  • And create a subscription:
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: quay
      namespace: operators
    spec:
      channel: unavailable-image-upgrade
      installPlanApproval: Automatic
      name: quay
      source: alex-operators
      sourceNamespace: olm
      startingCSV: quay-operator.v3.8.3

In master you will see two InstallPlans: one for quay-operator.v3.8.3 (successful install) and one for quay-operator.v3.8.4, which will eventually (after 10 minutes by default) report a failure to unpack in its bundle lookup conditions.

In this branch you will see only one InstallPlan, for quay-operator.v3.8.3 (successful install). Eventually the Subscription will get a condition reporting the failure to unpack (SubscriptionBundleUnpackFailed). While unpacking is in progress it should report UnpackingInProgress.
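
When verifying the result programmatically, the check boils down to looking up a condition by type on the Subscription. Below is a minimal sketch with a simplified, hypothetical `condition` struct rather than the real OLM API types. The condition type string `"BundleUnpackFailed"` is an assumption about what the `SubscriptionBundleUnpackFailed` constant holds, and the data in `main` is illustrative, not captured from a cluster.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for a Subscription status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// findCondition returns the first condition of the given type, if present.
func findCondition(conds []condition, condType string) (condition, bool) {
	for _, c := range conds {
		if c.Type == condType {
			return c, true
		}
	}
	return condition{}, false
}

func main() {
	// Illustrative data only.
	conds := []condition{
		{Type: "SomeOtherCondition", Status: "False"},
		{Type: "BundleUnpackFailed", Status: "True", Message: "illustrative message"},
	}
	// The type string is an assumption; check the API package for the real value.
	if c, ok := findCondition(conds, "BundleUnpackFailed"); ok && c.Status == "True" {
		fmt.Println("unpack failed:", c.Message)
	}
}
```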

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 22, 2023

openshift-ci bot commented Mar 22, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@m1kola m1kola force-pushed the unpack_before_creating_InstallPlan branch 5 times, most recently from fcbdbc8 to b91a5ee Compare March 27, 2023 19:47
@m1kola m1kola force-pushed the unpack_before_creating_InstallPlan branch 6 times, most recently from 4a773e2 to c165ece Compare March 31, 2023 15:36
@m1kola m1kola force-pushed the unpack_before_creating_InstallPlan branch 3 times, most recently from ab0ee98 to 61ceac6 Compare April 12, 2023 14:36
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2023
@m1kola m1kola force-pushed the unpack_before_creating_InstallPlan branch from 61ceac6 to a2ec578 Compare April 26, 2023 11:06
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2023
@m1kola m1kola force-pushed the unpack_before_creating_InstallPlan branch from a2ec578 to 3251b11 Compare April 26, 2023 15:29
@@ -39,7 +39,7 @@ const (
BundleLookupFailed operatorsv1alpha1.BundleLookupConditionType = "BundleLookupFailed"

@m1kola m1kola Apr 27, 2023

This is a big PR and I don't think I can split it any further into smaller mergeable PRs.

It mostly consists of moving code around with minimal changes (e.g. from syncInstallPlans into syncResolvingNamespace) and updating unit & e2e tests.

I would suggest going commit by commit starting with "Changes how InstallPlans are being created". And reviewing removed code first will hopefully help to understand what was moved & modified.

Description of the PR provides some context and steps on how to reproduce, so make sure to read it as well.

Contributor

The actual code is not that large; it's the tests...

@m1kola m1kola marked this pull request as ready for review April 27, 2023 14:03
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2023
@openshift-ci openshift-ci bot requested review from asmacdo and njhale April 27, 2023 14:03
m1kola commented Apr 27, 2023

/cc @awgreene @perdasilva @ankitathomas

tmshort commented May 22, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 22, 2023
Signed-off-by: Mikalai Radchuk <mradchuk@redhat.com>
Changes required to account for a new flow where we
prevent `InstallPlan` from being created when unpack
job fails

Signed-off-by: Mikalai Radchuk <mradchuk@redhat.com>
@m1kola m1kola force-pushed the unpack_before_creating_InstallPlan branch from 67b8fc6 to 9ed3d83 Compare May 26, 2023 09:17
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 26, 2023
m1kola commented May 26, 2023

I updated to address feedback.

Here is a short summary of the diff since the last push (the diff can be found here). It is arguably still easier to compare the modified files against master.

  • Updated test/e2e/fail_forward_e2e_test.go as per the discussion here: reverted (with the necessary changes) to having a failed InstallPlan.
  • Updated test/e2e/subscription_e2e_test.go to add extra coverage showing how OLM auto-recovers, as per this. Initially I wanted it to be a separate PR (Extra E2E coverage for #2942 #2961), but to address fail forward I needed an extra test data file which I already had in that separate PR. So I squashed them together here.
  • Rearranged test data in test/e2e/testdata/fail-forward/ for the above two to work.

No changes to the controller itself. Only tests.

I hate large PRs like this, but the changes in test/e2e/fail_forward_e2e_test.go were necessary to ensure that we do not merge something which fails to catch a regression in the fail forward feature. And it dragged other things in.

@tmshort @dtfranz @pgodowski @ankitathomas @perdasilva please take another look.

@perdasilva perdasilva left a comment

lgtm - excellent work!! users will be pleased ^^

}

// Check BundleLookup status conditions to see if the BundleLookupFailed condtion is true
// which means bundle lookup has failed and subscriptions need to be updated
Collaborator

Suggested change
// which means bundle lookup has failed and subscriptions need to be updated
// which means bundle lookup has failed and subscriptions needs to be updated


openshift-ci bot commented May 26, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: m1kola, perdasilva, tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2023
@perdasilva
Collaborator

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2023
@openshift-merge-robot openshift-merge-robot merged commit c29863b into operator-framework:master May 26, 2023
@m1kola m1kola deleted the unpack_before_creating_InstallPlan branch May 26, 2023 12:30
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jan 31, 2024
The func `removeSubsCond` takes in a list of pointers to Subscription objects, modifies the
objects that the pointers point to, but return a new list of those pointers. A [PR](operator-framework#2942) included in
the v0.25.0 release [changed the way the output of that function was being used](https://github.com/operator-framework/operator-lifecycle-manager/pull/2942/files#diff-a1760d9b7ac1e93734eea675d8d8938c96a50e995434b163c6f77c91bace9990R1146-R1155) leading to a regression. This PR fixes the `removeSubsCond` function,
fixing the regression as a result.

Closes operator-framework#3162

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
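
The class of bug described in this commit message can be illustrated with a small sketch. It is not the actual `removeSubsCond` implementation: the `sub` type and `removeCond` function are hypothetical, and the assumption that the returned list contains only the modified objects is made purely to show how a caller can be misled when a function both mutates through pointers and returns a different list.

```go
package main

import "fmt"

// sub is a simplified, hypothetical model of a Subscription.
type sub struct {
	Name       string
	Conditions []string
}

// removeCond mutates each object through its pointer, but returns only the
// objects it actually changed (an assumption made for this illustration).
func removeCond(subs []*sub, cond string) []*sub {
	var changed []*sub
	for _, s := range subs {
		var kept []string
		for _, c := range s.Conditions {
			if c != cond {
				kept = append(kept, c)
			}
		}
		if len(kept) != len(s.Conditions) {
			s.Conditions = kept // visible to every holder of the pointer
			changed = append(changed, s)
		}
	}
	return changed
}

func main() {
	subs := []*sub{
		{Name: "a", Conditions: []string{"Failed"}},
		{Name: "b"},
	}
	// Pitfall: treating the return value as "the full, updated list".
	subs = removeCond(subs, "Failed")
	fmt.Println(len(subs)) // "b" is no longer in the list the caller works with
}
```

A caller that overwrites its working list with the return value silently drops every object that did not need modification, even though the mutation itself already reached all of them through the shared pointers.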
github-merge-queue bot pushed a commit that referenced this pull request Feb 2, 2024
…3166)

The func `removeSubsCond` takes in a list of pointers to Subscription objects, modifies the
objects that the pointers point to, but return a new list of those pointers. A [PR](#2942) included in
the v0.25.0 release [changed the way the output of that function was being used](https://github.com/operator-framework/operator-lifecycle-manager/pull/2942/files#diff-a1760d9b7ac1e93734eea675d8d8938c96a50e995434b163c6f77c91bace9990R1146-R1155) leading to a regression. This PR fixes the `removeSubsCond` function,
fixing the regression as a result.

Closes #3162

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
openshift-bot pushed a commit to openshift-bot/operator-framework-olm that referenced this pull request Feb 2, 2024
…3166)

The func `removeSubsCond` takes in a list of pointers to Subscription objects, modifies the
objects that the pointers point to, but return a new list of those pointers. A [PR](operator-framework/operator-lifecycle-manager#2942) included in
the v0.25.0 release [changed the way the output of that function was being used](https://github.com/operator-framework/operator-lifecycle-manager/pull/2942/files#diff-a1760d9b7ac1e93734eea675d8d8938c96a50e995434b163c6f77c91bace9990R1146-R1155) leading to a regression. This PR fixes the `removeSubsCond` function,
fixing the regression as a result.

Closes #3162

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
Upstream-repository: operator-lifecycle-manager
Upstream-commit: 54da66a9996632315827ba37e14823de6405b4d9
anik120 added a commit to anik120/operator-framework-olm that referenced this pull request Feb 6, 2024
…3166)

The func `removeSubsCond` takes in a list of pointers to Subscription objects, modifies the
objects that the pointers point to, but return a new list of those pointers. A [PR](operator-framework/operator-lifecycle-manager#2942) included in
the v0.25.0 release [changed the way the output of that function was being used](https://github.com/operator-framework/operator-lifecycle-manager/pull/2942/files#diff-a1760d9b7ac1e93734eea675d8d8938c96a50e995434b163c6f77c91bace9990R1146-R1155) leading to a regression. This PR fixes the `removeSubsCond` function,
fixing the regression as a result.

Closes #3162

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
Upstream-repository: operator-lifecycle-manager
Upstream-commit: 54da66a9996632315827ba37e14823de6405b4d9