Fix tests related to probe specs for controller-manager pods - excepting cnf-app-mac-operator which is already fixed #32

Merged
merged 6 commits into from
Jan 10, 2024


ramperher
Collaborator

@ramperher ramperher commented Dec 19, 2023

build-depends: rh-nfv-int/nfv-example-cnf-deploy#44

This change intends to fix the following tnf tests for all controller-manager pods:

lifecycle-liveness-probe
lifecycle-readiness-probe
lifecycle-startup-probe

It adds the missing probe definitions, using tnf_test_example as the reference.

For the moment, we're only testing the Ansible operators based on operator-sdk. They natively implement only liveness and readiness probes, so for the startup probe we currently reuse the liveness probe endpoint: https://github.com/operator-framework/operator-sdk/pull/4326/files. The final fix is to implement a startup probe in operator-sdk, but it may take some time until a new operator-sdk release includes it, so this is the interim approach.
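As a rough illustration of the interim approach described above, a container fragment could look like the sketch below. The port and paths (`6789`, `/healthz`, `/readyz`) are assumed from the default operator-sdk Ansible operator scaffolding, and the timing values are illustrative; none of them are taken from this PR's manifests.

```yaml
# Illustrative container fragment for a *-controller-manager Deployment.
# Assumption: the operator serves health endpoints on port 6789
# (default operator-sdk Ansible scaffolding); check the real manifest.
containers:
  - name: manager
    livenessProbe:
      httpGet:
        path: /healthz
        port: 6789
      initialDelaySeconds: 15
      periodSeconds: 20
    readinessProbe:
      httpGet:
        path: /readyz
        port: 6789
      initialDelaySeconds: 5
      periodSeconds: 10
    # Interim workaround: there is no dedicated startup endpoint yet,
    # so the startup probe points at the liveness endpoint.
    startupProbe:
      httpGet:
        path: /healthz
        port: 6789
      periodSeconds: 10      # illustrative value
      failureThreshold: 30   # illustrative value
```

While the startup probe has not yet succeeded, Kubernetes holds back the liveness and readiness probes, so reusing the liveness endpoint mainly changes how long the container is given to come up.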

The other pods under test run commands or scripts directly. For them, a web server running in the background will be developed. That is not included in this change.

@ramperher
Collaborator Author

check dallas ocp-4.14-vanilla example-cnf

testpmd-operator/testpmd-allinone.yaml (review thread: outdated, resolved)
testpmd-operator/testpmd-allinone.yaml (review thread: outdated, resolved)
cnf-app-mac-operator/config/manager/manager.yaml (review thread: outdated, resolved)
trex-operator/trex-allinone.yaml (review thread: outdated, resolved)
@ramperher
Collaborator Author

What worked was the teardown of example-cnf. I'll try a fresh installation, because the deployment of the operators doesn't seem to be working properly, but I don't know whether that's caused by my change or by the teardown and reinstallation process.

@ramperher
Collaborator Author

check dallas ocp-4.14-vanilla example-cnf

@ramperher
Collaborator Author

check dallas ocp-4.13-vanilla example-cnf

@ramperher
Collaborator Author

I'm trying to test this by redeploying example-cnf in a running cluster, but it's not working at all.
I could see the following message in the cnf-app-mac-operator subscription, and this makes me think that there's an issue with the mirrored-redhat-operators/openshift-marketplace source?

$ oc get Subscription -n example-cnf -o json | less
...
                "conditions": [
                    {
                        "lastTransitionTime": "2023-12-20T15:41:52Z",
                        "message": "all available catalogsources are healthy",
                        "reason": "AllCatalogSourcesHealthy",
                        "status": "False",
                        "type": "CatalogSourcesUnhealthy"
                    },
                    {
                        "message": "failed to populate resolver cache from source mirrored-redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.162.209:50051: connect: connection refused\"",
                        "reason": "ErrorPreventedResolution",
                        "status": "True",
                        "type": "ResolutionFailed"
                    }
                ],
...


The only way to check this properly is to redeploy in a fresh cluster, but all previous attempts failed during the OCP installation... I'll have to retry.

@ramperher
Collaborator Author

check dallas ocp-4.13-vanilla example-cnf

@tonyskapunk
Collaborator

> I could see the following message in the cnf-app-mac-operator subscription, and this makes me think that there's an issue with the mirrored-redhat-operators/openshift-marketplace source?

It's possible there was an issue with that catalog, but the example-cnf operators come from their own catalog: https://www.distributed-ci.io/jobs/e726fa50-fc4a-4687-96ee-7a810b68958f/jobStates?sort=date&task=a9ca950c-8aff-4023-8324-86f1d40d69cd

{
  "displayName": "NFV Example CNF Catalog",
  "image": "quay.io/rh-nfv-int/nfv-example-cnf-catalog@sha256:8d131fc02f53bf67fd40d510ba265adaf66703d6efad3f91750e8c430f2c7ddb",
  "publisher": "Red Hat",
  "sourceType": "grpc",
  "updateStrategy": {
    "registryPoll": {
      "interval": "30m"
    }
  }
}
--- snip ---
{
  "connectionState": {
    "address": "nfv-example-cnf-catalog.openshift-marketplace.svc:50051",
    "lastConnect": "2023-12-21T15:07:38Z",
    "lastObservedState": "READY"
  },
  "registryService": {
    "createdAt": "2023-12-21T15:07:14Z",
    "port": "50051",
    "protocol": "grpc",
    "serviceName": "nfv-example-cnf-catalog",
    "serviceNamespace": "openshift-marketplace"
  }
}
--- snip ---
{
  "creationTimestamp": "2023-12-21T15:07:14Z",
  "generation": 1,
  "name": "nfv-example-cnf-catalog",
  "namespace": "openshift-marketplace",
  "resourceVersion": "2674961",
  "uid": "442dc75a-a31e-48bb-a0ff-211c5b0fb324"
}

The catalog image quay.io/rh-nfv-int/nfv-example-cnf-catalog@sha256:8d131fc02f53bf67fd40d510ba265adaf66703d6efad3f91750e8c430f2c7ddb matches the catalog built in this PR: https://github.com/openshift-kni/example-cnf/actions/runs/7277032812/job/19828159075#step:4:928. So at least we know the correct catalog is being used. 😸

Let's give it another try.

@ramperher changed the title from "Fix tests related to lifecycle and probe specs for pods under test" to "Fix tests related to probe specs for pods under test" on Dec 22, 2023
@ramperher
Collaborator Author

I think I know why this was failing, and it's not the catalog source. I finally got the CSV to deploy the pod, but it ended up in CrashLoopBackOff status. In short: the first pod created, the controller-manager for cnf-app-mac-operator, didn't have /bin/sh in the $PATH, so the hook command failed, and because the lifecycle postStart hook failed, the pod could not move to Running status.

Since this needs more investigation and the liveness/readiness probes are quick to check, I'll limit this change to the probes and validate them, and handle the lifecycle hooks in a separate change. Testing it now.
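To make the lifecycle hook failure described above concrete, here is a minimal sketch of the kind of hook that breaks when the image ships no shell. The hook command is hypothetical and is not taken from this PR. Kubernetes defines postStart and preStop hooks, and a failing postStart handler causes the container to be killed, which matches the CrashLoopBackOff behaviour reported.

```yaml
# Illustrative fragment only; the hook command below is hypothetical.
containers:
  - name: manager
    lifecycle:
      postStart:
        exec:
          # Requires a shell inside the image. If /bin/sh is missing,
          # the handler fails, the kubelet kills the container, and the
          # pod cycles through restarts (CrashLoopBackOff).
          command: ["/bin/sh", "-c", "echo started"]
```

For an image without a shell, one option would be an exec handler that calls a binary the image does contain, or an httpGet handler, which needs no shell at all.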

@ramperher
Collaborator Author

I've retried and the pods still don't come up; they're stuck in CrashLoopBackOff... I'll need more time to look into this, so I'll mark this as WIP and continue next year.

@ramperher changed the title from "Fix tests related to probe specs for pods under test" to "[WIP] Fix tests related to probe specs for pods under test" on Dec 22, 2023
@ramperher changed the title from "[WIP] Fix tests related to probe specs for pods under test" to "[WIP] Fix tests related to probe specs for pods under test - excepting cnf-app-mac-operator which is already fixed" on Jan 5, 2024
@ramperher
Collaborator Author

check dallas ocp-4.14-vanilla example-cnf

@ramperher
Collaborator Author

check dallas ocp-4.14-vanilla example-cnf

@ramperher
Collaborator Author

In this job, the readiness/liveness/startup probe tests pass for all *-controller-manager pods, so we only need to fix the remaining pods. Those need a web server to serve the probe endpoints.

@ramperher changed the title from "[WIP] Fix tests related to probe specs for pods under test - excepting cnf-app-mac-operator which is already fixed" to "Fix tests related to probe specs for controller-manager pods - excepting cnf-app-mac-operator which is already fixed" on Jan 10, 2024
@ramperher
Collaborator Author

I decided to move this change to "ready for review" for the following reasons:

  • operator-sdk lacks a startup probe, which has to be added as a new feature. It has been requested in operator-framework/operator-sdk#6659 ("Support startup probes") and in kubernetes-sigs/controller-runtime#2644 ("Support startup probes"), the latter being the library used by operator-sdk. Once that is fixed, the controller-manager pods will be updated to use the correct startup probe endpoint.
  • For the other pods, a web server has to be developed to provide the required endpoints if we want to use HTTP probes. (We could use commands to make these tests pass, but I think it's better to go through the exercise of modifying the image and see what happens.) Before that, we need to address CILAB-1376, because the images used by the deployments, replicasets, etc. are hardcoded in the code, so we're not using the latest images created for example-cnf.

Consequently, I prefer to merge this change, which we know works, and tackle the next steps in new PRs.
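For reference, the command-based alternative mentioned in the second bullet could be sketched as below. The container name, the marker file `/tmp/healthy`, and the timing values are hypothetical, chosen only to show the shape of exec probes; this is not what the PR implements.

```yaml
# Illustrative exec-based probes for a pod that runs a script or command.
# Assumptions: the container is named "app" and its entrypoint creates
# /tmp/healthy once it is up (both hypothetical).
containers:
  - name: app
    startupProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      periodSeconds: 10
      failureThreshold: 30
    livenessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      periodSeconds: 10
    readinessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      periodSeconds: 5
```

A probe like this is satisfied whenever the file exists, so it says little about whether the workload is actually healthy, which is one reason to prefer real HTTP endpoints served by the application.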

@ramperher merged commit e39f0c1 into main on Jan 10, 2024
1 check passed