
Update rook to latest master build. #1141

Merged
merged 9 commits into red-hat-storage:master on Apr 19, 2021

Conversation

@obnoxxx (Contributor) commented Mar 30, 2021

For 4.8 preparation, we need a newer rook image.

@obnoxxx commented Mar 30, 2021

rook update pulls in an update of the controller-runtime dependency from 0.6.3 to 0.8.0.
It seems we need to adapt our code to that.

@obnoxxx force-pushed the update-rook branch 2 times, most recently from 73dd135 to 60fe1aa on March 30, 2021 10:58
@obnoxxx commented Mar 30, 2021

Weeding out the build errors one by one...
I am left with two:

$ make
Running deps-update
go mod tidy && go mod vendor
go: creating new go.mod: module tmp
go get: added sigs.k8s.io/controller-tools v0.4.1
Updating generated code
/Users/madam/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
Building the ocs-operator binary
hack/go-build.sh
# github.com/openshift/ocs-operator/controllers/storagecluster
controllers/storagecluster/external_resources.go:182:4: cannot use "60s" (type string) as type *"k8s.io/apimachinery/pkg/apis/meta/v1".Duration in field value
controllers/storagecluster/kms_resources.go:79:38: cannot use runtimeObj (type "k8s.io/apimachinery/pkg/runtime".Object) as type client.Object in argument to r.Client.Delete:
	"k8s.io/apimachinery/pkg/runtime".Object does not implement client.Object (missing GetAnnotations method)
make: *** [build-go] Error 2

@obnoxxx commented Mar 30, 2021

Remaining error:

# github.com/openshift/ocs-operator/controllers/storagecluster
controllers/storagecluster/kms_resources.go:79:38: cannot use runtimeObj (type "k8s.io/apimachinery/pkg/runtime".Object) as type client.Object in argument to r.Client.Delete:
	"k8s.io/apimachinery/pkg/runtime".Object does not implement client.Object (missing GetAnnotations method)

@obnoxxx commented Mar 30, 2021

Locally, I am getting this error now:

"error loading manifests from directory: error checking provided apis in bundle ocs-operator.v4.8.0: couldn't find ceph.rook.io/v1/CephNFS (cephnfses) in bundle. found: map[noobaa.io/v1alpha1/BackingStore (backingstores):{} noobaa.io/v1alpha1/BucketClass (bucketclasses):{} noobaa.io/v1alpha1/NamespaceStore (namespacestores):{} noobaa.io/v1alpha1/NooBaa (noobaas):{} objectbucket.io/v1alpha1/ObjectBucket (objectbuckets):{} objectbucket.io/v1alpha1/ObjectBucketClaim (objectbucketclaims):{} ocs.openshift.io/v1/OCSInitialization (ocsinitializations):{} ocs.openshift.io/v1/StorageCluster (storageclusters):{}]"

@@ -176,10 +176,11 @@ func (r *StorageClusterReconciler) newExternalCephObjectStoreInstances(
return nil, err
}
// enable bucket healthcheck
time60s, _ := time.ParseDuration("60s")
Contributor commented:

You don't need to parse time anymore, just use 1 * time.Minute

@obnoxxx replied:

Thanks, fixed (yet to push). Actually time.Minute is sufficient.

@leseb (Contributor) left a comment:

Why did all the CRDs get deleted in c428b8b?

@obnoxxx commented Mar 30, 2021

Why did all the CRDs get deleted in c428b8b?

Right that is weird... Investigating.

@obnoxxx commented Mar 30, 2021

Updated with the current state. Some lint and test compile errors are still left, and then we need to clarify why the CRDs got removed after updating the rook version.

@obnoxxx commented Mar 30, 2021

We are making progress. Now rook's image is fixed upstream via rook/rook#7512 and the image target starts to pass. Still some things to fix from the controller-runtime update...

@obnoxxx commented Apr 1, 2021

@jarrpa - I incorporated your patches on top of my PR and rebased. Looks a lot better! 😄 Only waiting for the e2e tests to complete now - everything else passed!

@obnoxxx commented Apr 1, 2021

From the test output:

...
time="2021-04-01T09:38:40Z" level=info msg="Pod ocs-operator-bundle-e2e-aws-ipi-install-install succeeded after 34m10s"
time="2021-04-01T09:38:40Z" level=info msg="Executing pod \"ocs-operator-bundle-e2e-aws-optional-operators-ci-subscribe\" running image \"stable:cli\""
time="2021-04-01T09:38:50Z" level=info msg="Container place-entrypoint in pod ocs-operator-bundle-e2e-aws-optional-operators-ci-subscribe completed successfully"
time="2021-04-01T09:38:50Z" level=info msg="Container cp-entrypoint-wrapper in pod ocs-operator-bundle-e2e-aws-optional-operators-ci-subscribe completed successfully"
time="2021-04-01T09:49:40Z" level=info msg="Container sidecar in pod ocs-operator-bundle-e2e-aws-optional-operators-ci-subscribe completed successfully"
Error from server (NotFound): namespaces "openshift-storage" not found
OO_INSTALL_NAMESPACE is 'openshift-storage' which does not exist: creating
Installing "ocs-operator" in namespace "openshift-storage"
OO_TARGET_NAMESPACES is '!install': targeting operator installation namespace (openshift-storage)
OperatorGroup does not exist: creating it
OperatorGroup name is "oo-z6rqx"
Creating CatalogSource
CatalogSource name is "oo-dzq4l"
Creating Subscription
Subscription name is "oo-qjvv5"
Waiting for installPlan to be created
installplan.operators.coreos.com/install-vz8f9 patched
Install Plan approved
Waiting for ClusterServiceVersion to become ready...
Timed out waiting for csv to become ready
Dumping Namespace openshift-storage as /logs/artifacts/ns-openshift-storage.yaml
Dumping OperatorGroup oo-z6rqx as /logs/artifacts/og-oo-z6rqx.yaml
Dumping CatalogSource oo-dzq4l as /logs/artifacts/cs-oo-dzq4l.yaml
Dumping Subscription oo-qjvv5 as /logs/artifacts/sub-oo-qjvv5.yaml
  Subscription oo-qjvv5 status state: UpgradePending
ClusterServiceVersion was never created
Dumping all ClusterServiceVersions in namespace openshift-storage to /logs/artifacts/openshift-storage-all-csvs.yaml
Dumping all installPlans in namespace openshift-storage as /logs/artifacts/installPlans.yaml
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2021-04-01T09:49:38Z"}
error: failed to execute wrapped command: exit status 1
time="2021-04-01T09:49:40Z" level=info msg="Container test in pod ocs-operator-bundle-e2e-aws-optional-operators-ci-subscribe failed, exit code 1, reason Error"
time="2021-04-01T09:49:40Z" level=info msg="Executing pod \"ocs-operator-bundle-e2e-aws-gather-aws-console\" running image \"pipeline:ocp-4.5-upi-installer\""
time="2021-04-01T09:49:51Z" level=info msg="Container place-entrypoint in pod ocs-operator-bundle-e2e-aws-gather-aws-console completed successfully"
time="2021-04-01T09:49:51Z" level=info msg="Container cp-entrypoint-wrapper in pod ocs-operator-bundle-e2e-aws-gather-aws-console completed successfully"
...

Not sure if this is a flake or real.

FWIW, I see pipeline:ocp-4.5-upi-installer - I think we shouldn't be running OCP 4.5 for testing OCS 4.8 / master, right?

@umangachapagain (Contributor) replied:

> From the test output: [log elided, quoted in full above]
> Not sure if this is a flake or real.
> FWIW, I see pipeline:ocp-4.5-upi-installer - I think we shouldn't be running OCP 4.5 for testing OCS 4.8 / master, right?

We're using OCP 4.7.
The CSV was never created, so it doesn't look like a flake to me.

@umangachapagain (Contributor):

/retest

@jarrpa (Member) commented Apr 1, 2021

Okay, the e2e failures may indeed be a flake. It seems it failed to even install the OCS CSV. Looking around, I found that the CatalogSource they generate to hold the CSV was in TRANSIENT_FAILURE:

time="2021-04-01T12:41:49Z" level=info msg="state.Key.Namespace=openshift-storage state.Key.Name=oo-bfs6t state.State=TRANSIENT_FAILURE"
time="2021-04-01T12:41:49Z" level=error msg="failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.74.197:50051: connect: no route to host\"" catalog="{oo-bfs6t openshift-storage}"

So, as is often the case, we are probably dealing with network issues and/or API limits.

/test ocs-operator-bundle-e2e-aws

@jarrpa commented Apr 1, 2021

/retest

@obnoxxx commented Apr 2, 2021

Created PR #1144 for only the CSV part of the rook upgrade.
Interestingly, it also fails e2e tests, similarly to this one.

@obnoxxx force-pushed the update-rook branch 2 times, most recently from e3d0638 to 52a1e92 on April 9, 2021 08:02
@obnoxxx commented Apr 19, 2021

@raghavendra-talur yay, your patch fixes it!
I'm going to polish this PR and then we can move forward.

@obnoxxx commented Apr 19, 2021

When working on polishing / squashing the patches, I came across the test code that @jarrpa commented out. Removing the comments makes the unit tests fail like this:

=== RUN   TestCreateSCCs
    ocsinitialization_controller_test.go:234:
                Error Trace:    ocsinitialization_controller_test.go:234
                Error:          Received unexpected error:
                                Operation cannot be fulfilled on ocsinitializations.ocs.openshift.io "ocsinit": object was modified
                Test:           TestCreateSCCs
                Messages:       [Case 1]: failed to update ocsInit status
--- FAIL: TestCreateSCCs (0.00s)

Not really sure how to fix this. But ideally, we shouldn't be leaving such lines commented out...
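One common pattern for "object was modified" (409 Conflict) failures is to retry the status update with a freshly fetched copy of the object, e.g. via retry.RetryOnConflict from k8s.io/client-go/util/retry. Whether that is the right fix for the fake client used in this test is unverified; here is a self-contained sketch of the pattern, with a local stand-in for the conflict error and a toy retry helper (both hypothetical names):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the Kubernetes "object was modified"
// (409 Conflict) error seen in the failing test.
var errConflict = errors.New(`Operation cannot be fulfilled on ocsinitializations.ocs.openshift.io "ocsinit": object was modified`)

// retryOnConflict is a tiny local sketch of the pattern provided by
// k8s.io/client-go/util/retry.RetryOnConflict: when the update fails with
// a conflict, refresh the object and try again a bounded number of times.
func retryOnConflict(attempts int, update func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = update(); err == nil || !errors.Is(err, errConflict) {
			return err
		}
		// In real code: re-Get the object here to refresh its
		// resourceVersion before the next update attempt.
	}
	return err
}

func main() {
	calls := 0
	err := retryOnConflict(3, func() error {
		calls++
		if calls < 2 {
			return errConflict // first attempt hits a stale resourceVersion
		}
		return nil
	})
	fmt.Println(calls, err) // 2 <nil>
}
```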

@obnoxxx commented Apr 19, 2021

@jarrpa , @raghavendra-talur , @umangachapagain - any idea?

@jarrpa commented Apr 19, 2021

> When working on polishing / squashing the patches, I came across the test code that @jarrpa commented out. Removing the comments lets the unit tests fail this way: [TestCreateSCCs failure quoted in full above]

No idea. I wasn't able to figure anything further out, which is why I had to leave it like that just to make progress. It has something to do with how the fake reconciler is implemented, but that is still fairly opaque to me.

@obnoxxx commented Apr 19, 2021

@jarrpa - do you suggest to move forward with the commented-out lines?

(It just feels odd, since the lines seemed to have been there for a purpose... ;-)

@jarrpa commented Apr 19, 2021

For the sake of expediency, yes. There should at least be a TODO to figure out why they were suddenly a problem.

obnoxxx and others added 9 commits April 19, 2021 18:22
Latest rook master image.

Signed-off-by: Michael Adam <obnox@redhat.com>
Signed-off-by: Michael Adam <obnox@redhat.com>
Latest master version with crds (re)addition to image.
Picking up (among other things) crd updates, including
volumereplication support.

This is committing the result of: `go get github.com/rook/rook@1b77b9e`

Signed-off-by: Michael Adam <obnox@redhat.com>
Signed-off-by: Michael Adam <obnox@redhat.com>
Why is there a diff now?...

Signed-off-by: Michael Adam <obnox@redhat.com>
Quite a few signatures have changed. Most importantly:

- reconciler.Reconcile() now requires a context argument
- fix use of interval to the new Duration type
- UpdateEvent has changed
- NewFakeClientWithScheme is deprecated, using NewClientBuilder instead.

Signed-off-by: Michael Adam <obnox@redhat.com>
Signed-off-by: Jose A. Rivera <jarrpa@redhat.com>
Prevents us from creating a second operatorgroup in the namespace.

Co-authored-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Michael Adam <obnox@redhat.com>
The controller runtime now uses leases for its leader election mechanism by
default.

It looks like the RBAC configuration in kubebuilder does not add this
permission by default though. We add it manually here.

Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
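The manually added permission described above would take the form of a kubebuilder RBAC marker on the reconciler; the exact verb list in the PR may differ, so treat this as a sketch:

```go
// Grant the operator access to coordination.k8s.io Leases, which
// controller-runtime now uses for leader election by default.
// +kubebuilder:rbac:groups=coordination.k8s.io,resources=leases,verbs=get;list;watch;create;update;patch;delete
```

Running `make manifests` (controller-gen) after adding such a marker regenerates the Role/ClusterRole YAML with the lease rule.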
@obnoxxx commented Apr 19, 2021

Updated to the latest and greatest rook image rook/ceph:v1.6.0.95.gf4cfc7a and polished the patches.

The only weird thing left is the commented-out test code lines... I added a TODO comment.

@jarrpa (Member) left a comment:

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Apr 19, 2021
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jarrpa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Apr 19, 2021
@jarrpa commented Apr 19, 2021

/override ci/prow/ci-index-dev-master-dependencies

@openshift-ci-robot:

@jarrpa: Overrode contexts on behalf of jarrpa: ci/prow/ci-index-dev-master-dependencies

In response to this:

/override ci/prow/ci-index-dev-master-dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@obnoxxx removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Apr 19, 2021
@openshift-merge-robot merged commit dee571f into red-hat-storage:master on Apr 19, 2021
@raghavendra-talur mentioned this pull request Apr 19, 2021