
Unable to bundle-upgrade using operator-sdk cli tool #6204

Closed
talsharon48 opened this issue Nov 27, 2022 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/support Indicates an issue that is a support question.
Comments

@talsharon48

Bug Report

What did you do?

Hello,
I am using operator-sdk v1.24.0 to create bundles and use them in an OpenShift cluster via OLM.
I wrote a demo operator, made a bundle from it, and created a catalog image using make catalog-build and catalog-push.
I managed to create a CatalogSource from it, saw it in the UI, and installed the operator. When I tried upgrading the operator, I created a new bundle version using make bundle VERSION=0.0.2, bundle-build and bundle-push, and used operator-sdk run bundle-upgrade to upgrade it.
I encountered an error saying "Failed to run bundle upgrade: install plan is not available for subscription : timed out waiting for condition", although when browsing the catalog I can see the latest version is v0.0.2 and not v0.0.1.
When I uninstalled and reinstalled the operator manually, the desired v0.0.2 was deployed.
NOTE: I am working in an on-premise environment, so I can't upload any logs or code snippets.
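
Roughly, the upgrade attempt came down to these commands, typed out here manually (the image name is a placeholder for my internal registry path):

make bundle VERSION=0.0.2
make bundle-build bundle-push BUNDLE_IMG=<registry>/<project>/repo-bundle:v0.0.2
operator-sdk run bundle-upgrade <registry>/<project>/repo-bundle:v0.0.2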

What did you expect to see?

Running the operator-sdk run bundle-upgrade command should deploy the new version (CSV) of my operator.

What did you see instead? Under which circumstances?

After running the command, a registry pod came up and served the new version,
but nothing else happened and the error mentioned above was raised.

Environment

Operator type:

/language go

Kubernetes cluster type:

openshift v4.6.15

$ operator-sdk version

operator-sdk version: "v1.24.0", commit: "de6a14d03de3c36dcc9de3891af788b49d15f0f3", kubernetes version: "1.24.2", go version: "go1.18.6", GOOS: "linux", GOARCH: "amd64"

$ go version (if language is Go)

go version go1.18.6 linux/amd64

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"8df677dc147fe8297d90c4757154469a931bdb90", GitTreeState:"clean", BuildDate:"2022-11-04T15:44:27Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}

@varshaprasad96
Member

Can you check by running the same command with an increased timeout using the --timeout flag? In some cases we have observed that the default timeout is too short for OpenShift clusters.
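
For example, something like this (the bundle image name is a placeholder for yours):

operator-sdk run bundle-upgrade <your-registry>/<project>/repo-bundle:v0.0.2 --timeout 10m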

@varshaprasad96 varshaprasad96 added the triage/support Indicates an issue that is a support question. label Nov 28, 2022
@varshaprasad96 varshaprasad96 added this to the Backlog milestone Nov 28, 2022
@talsharon48
Author

Hello,
I have tried the --timeout=15m0s flag, giving it 15 minutes to complete, and got the same result:
FATA[0900] Failed to run bundle upgrade: install plan is not available for the subscription olm-test: timed out waiting for the condition

@everettraven
Contributor

NOTE: I am working in an on-premise environment, so I can't upload any logs or code snippets.

@talsharon48 just to make sure - you can't share any logs at all? Without some logs or more information I'm not quite sure we will be able to help very much. That said, I'll share a brief overview of what I would check (the corresponding commands are sketched after the list):

  1. Check the logs of the registry pod that is created by the operator-sdk run bundle-upgrade command. If this pod is failing to start properly, OLM may not have the information needed to create the InstallPlan resource.
  2. Check the status of the CatalogSource resource used by operator-sdk run bundle-upgrade. If there is a problem it should be present in the status. You should be able to get detailed information with kubectl describe catalogsource <catalogsource-name>
  3. Check the status of the Subscription resource used by operator-sdk run bundle-upgrade. If there is a problem it should be present in the status. You should be able to get detailed information with kubectl describe subscription <subscription-name>
  4. Check the CSVs and see if one was created for upgrading to v0.0.2 with kubectl get clusterserviceversion. If one was created, the operator should have been upgraded successfully but for some reason reused the same InstallPlan resource for approval as when it was installed. I believe operator-sdk run bundle-upgrade expects a new one to be used, and since that new one doesn't exist it failed with that error (I haven't seen this be the case recently so I'm not sure what may be causing it).
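
Roughly, the commands for those checks (resource names and the namespace are placeholders for whatever run bundle-upgrade created in your cluster):

kubectl logs <registry-pod-name> -n <namespace>
kubectl describe catalogsource <catalogsource-name> -n <namespace>
kubectl describe subscription <subscription-name> -n <namespace>
kubectl get clusterserviceversion -n <namespace>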

I hope this helps a bit!

@talsharon48
Author

talsharon48 commented Nov 30, 2022

@everettraven I can investigate the logs from any pod; I'll just have to type any findings here manually for you to see.

  1. The logs from the pod raised by the run bundle-upgrade command look good, ending with serving the registry (when I
    choose the operator in OperatorHub I can see v0.0.2, the new one, available)
  2. The CatalogSource also looks good, with a status saying the ConnectionState is READY and reachable
  3. The Subscription looks good, with a status showing the details of the current version v0.0.1, the InstallPlan of v0.0.1, and CatalogHealth which is true
  4. There is no v0.0.2 CSV available, only v0.0.1

Any other suggestions, or logs you need?

@everettraven
Contributor

@talsharon48 The only other thing I can think to check is whether the Subscription has the field installPlanApproval: Manual.

If it does not, that means OLM is automatically approving the upgrade - I'm not sure what impact this has on the actual generation of a new InstallPlan. I believe the operator-sdk run bundle-upgrade command expects it to be set to manual approval so that OLM doesn't automatically perform any upgrades.
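
One quick way to check (the subscription name and namespace are placeholders):

kubectl get subscription <subscription-name> -n <namespace> -o jsonpath='{.spec.installPlanApproval}'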

If that doesn't seem to be causing the problem, I can't really think of any other reason why this issue is happening.

Is it possible for you to provide a way that this problem can be replicated? If I can replicate the problem I can try to dig a bit further.

@talsharon48
Author

talsharon48 commented Dec 1, 2022

@everettraven I would be thankful if you could help me investigate the problem. I'll explain the steps I did so you can replicate it:

  1. Initialize a new project using operator-sdk init --domain=<domain>.io
  2. Generate a new API using operator-sdk create api --group olm --version v1alpha1 --kind UpgradePOC --controller --resource
  3. Build the controller image. I am using a Harbor registry to store the images: export IMG=<harbor-fqdn/project/repo:v0.0.1>
    make docker-build docker-push
  4. Generate the bundle using make bundle VERSION=0.0.1
  5. Build the bundle image using: export BUNDLE_IMG=<harbor-fqdn/project/repo-bundle:v0.0.1>
    make bundle-build bundle-push
  6. I had a problem where the registry pod raised by the run bundle-upgrade command gets permission denied creating the cache dir, as a result of the opm registry add command it runs. So I made a workaround: a new binary image for make catalog-build, which uses the opm index add command (see the Makefile). I created a Dockerfile that looks like this:
    FROM quay.io/operator-framework/opm:latest
    USER 0
    which forces the user to be root to avoid the permission denial.
    I built the image and pushed it to the registry with: docker build -t <harbor-fqdn/project/opm:tag> . && docker push <harbor-fqdn/project/opm:tag>
  7. To let the make catalog-build command use my own binary image, I slightly edited the Makefile with the following snippet:
ifneq ($(origin BINARY_IMG), undefined)
BINARY_IMG_OPT := --binary-image $(BINARY_IMG)
endif

Also changed the catalog-build target from:

.PHONY: catalog-build
catalog-build: opm ## Build a catalog image.
    $(OPM) index add --container-tool docker --mode semver --tag $(CATALOG_IMG) --bundles $(BUNDLE_IMGS) $(FROM_INDEX_OPT)

To:

.PHONY: catalog-build
catalog-build: opm ## Build a catalog image.
    $(OPM) index add --container-tool docker --mode semver --tag $(CATALOG_IMG) --bundles $(BUNDLE_IMGS) $(FROM_INDEX_OPT) $(BINARY_IMG_OPT)
  8. Export the binary image and the catalog image names: export CATALOG_IMG=<harbor-fqdn/project/index-catalog:v0.0.1>
    export BINARY_IMG=<harbor-fqdn/project/opm:tag> and build the catalog image using: make catalog-build catalog-push
  9. Create a CatalogSource CR using the catalog image we just built, with the following YAML:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-catalog
  namespace: openshift-operators
spec:
  image: <harbor-fqdn/project/index-catalog:v0.0.1>
  displayName: Test_Catalog
  sourceType: grpc
  10. Install the operator by creating a Subscription or via the OpenShift UI (a rough sketch of such a Subscription is shown right after this list)
  11. Build v0.0.2 of the controller image: export IMG=<harbor-fqdn/project/repo:v0.0.2>
    make docker-build docker-push
  12. Generate the new bundle version using make bundle VERSION=0.0.2
  13. Build the new bundle image using: export BUNDLE_IMG=<harbor-fqdn/project/repo-bundle:v0.0.2>
    make bundle-build bundle-push
  14. Upgrade the bundle: operator-sdk run bundle-upgrade <harbor-fqdn/project/repo-bundle:v0.0.2> --skip-tls-verify --skip-tls --timeout 15m0s
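
A rough sketch of the Subscription from step 10, applied with kubectl (the package name and channel are placeholders and depend on what make bundle generated for the operator):

cat <<EOF | kubectl apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: upgradepoc-subscription        # placeholder name
  namespace: openshift-operators
spec:
  channel: alpha                        # placeholder: the channel defined in the bundle
  name: <operator-package-name>         # placeholder: package name from the bundle annotations
  source: test-catalog                  # the CatalogSource created in step 9
  sourceNamespace: openshift-operators
  installPlanApproval: Automatic
EOF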

I hope this helps you reproduce my situation, looking forward to your response!
Thank you.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2023
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 1, 2023
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this as completed May 2, 2023
@openshift-ci

openshift-ci bot commented May 2, 2023

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
