
e2e: Move tests to gh action using azure workers #260

Merged
merged 3 commits into confidential-containers:main from gh-action on Dec 11, 2023

Conversation

ldoktor
Contributor

@ldoktor ldoktor commented Sep 22, 2023

use the azure runners provided by "confidential-containers/infra" to run the kata-clh and kata-qemu workflows.

@ldoktor ldoktor changed the title e2e: Move tests to gh action using azure workers WiP e2e: Move tests to gh action using azure workers Sep 22, 2023
@wainersm
Member

The pipeline apparently passed because @ldoktor disabled the status report; in reality it failed at:

namespace/kube-flannel created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created
Error from server (NotFound): nodes "garm-ui5NQSNBjt" not found
Error: Process completed with exit code 1.

@wainersm
Member

@ldoktor I suspect the error is at https://github.com/confidential-containers/operator/blob/main/tests/e2e/cluster/up.sh#L60 . The script assumes that the assigned node name (see https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#setting-the-node-name) equals $(hostname), which might not be true.

Try printing the node names (kubectl get nodes) and add other debug messages.

Ah, there are other places where $(hostname) is used.
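(For reference, a minimal sketch of the check being suggested here. Kubernetes registers node names in lower case, so comparing against a literal $(hostname) can fail on hosts with mixed-case hostnames; the 180s timeout below is illustrative, not taken from up.sh.)

    # Print the registered node names to compare against the local hostname.
    kubectl get nodes -o name

    # Normalize the hostname to lower case (as later confirmed in this thread)
    # before waiting on the node.
    node_name="$(hostname | tr '[:upper:]' '[:lower:]')"
    kubectl wait --for=condition=Ready "node/${node_name}" --timeout=180s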

@ldoktor ldoktor force-pushed the gh-action branch 6 times, most recently from d6f27a1 to fd00a61 Compare November 30, 2023 16:13
@wainersm
Member

@ldoktor interesting that lower case fixes the issue. Now some tests are failing because they expect secrets to be exported as environment variables. However, we won't run those tests; they are removed in #299.

I think we are at the point of discussing little details (job names, etc.).

@@ -0,0 +1,34 @@
name: azure e2e tests
Member

We already have an enclave-cc e2e tests workflow. What this workflow will be testing is the ccruntime implementation, so perhaps we can name it ccruntime e2e tests instead of azure e2e tests. Or, as we discussed the other day, name it ccruntime functional tests. The file should be renamed accordingly too.

Contributor Author

ack, since the Jenkins jobs will be gone it makes sense to avoid specifying az :-)


jobs:
e2e:
name: operator azure e2e tests
Member

Once #299 is merged, we will run only operator tests (install; install and uninstall; etc.), so I suggest just calling it operator.

Contributor Author

I see

jobs:
e2e:
name: operator azure e2e tests
runs-on: az-ubuntu-2204
Member

I'm about to add an Ubuntu 20.04 runner, so the runner name should be part of the matrix below, i.e. two variations of the same job running on Ubuntu 22.04 and 20.04.

Member

I just added a new runner to serve Ubuntu 20.04. You can use the label az-ubuntu-2004. I didn't test that it works, though :)

Contributor Author

Are they that different? Wouldn't one suffice? Anyway, I'll add that; just asking to save some costs...

Member

@ldoktor good question. Ubuntu 20.04 comes with containerd 1.6; by using it we test an operator feature, the installation of containerd 1.7. Whereas on Ubuntu 22.04, containerd 1.7 is already installed.
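(A minimal sketch of the matrix shape being discussed, assuming the az-ubuntu-2204 and az-ubuntu-2004 runner labels mentioned above; the "instance" key name is illustrative, not the final workflow.)

    jobs:
      e2e:
        name: operator tests
        # Same job on both runners: Ubuntu 20.04 ships containerd 1.6, so the
        # operator's containerd 1.7 installation path gets exercised there,
        # while 22.04 already comes with containerd 1.7.
        runs-on: ${{ matrix.instance }}
        strategy:
          matrix:
            instance: ["az-ubuntu-2204", "az-ubuntu-2004"]
            runtimeclass: ["kata-qemu", "kata-clh"]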


strategy:
matrix:
runtimeclass: ["kata-qemu", "kata-clh"]
Member

I'm not sure if it makes sense to run the operator test for each runtimeClass, but let's leave it as is for now.

Contributor Author

Perhaps it'd make sense to include the developers. Unless the code paths differ much, we should probably just choose one. Do you know whom to ping for that?

runtimeclass: ["kata-qemu", "kata-clh"]

steps:
- uses: actions/checkout@v3
Member

v4 is already available.

Contributor Author

ack
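(For reference, the bumped step is simply:)

      - uses: actions/checkout@v4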

export PATH="$PATH:/usr/local/bin"
./run-local.sh -r "$RUNTIME_CLASS" -u
env:
RUNTIME_CLASS: ${{ matrix.runtimeclass }}
Member

I don't think it should export RUNTIME_CLASS; the -r parameter to run-local.sh should account for that.

Contributor Author

It's just a name clash; I wanted to have it available in case we reuse it multiple times. But let me hardcode it.
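(A minimal sketch of what the step looks like with the env export dropped and the matrix value passed straight to -r; the step name is illustrative.)

      - name: Run operator tests
        run: |
          export PATH="$PATH:/usr/local/bin"
          # Hardcode the runtime class instead of exporting RUNTIME_CLASS.
          ./run-local.sh -r "${{ matrix.runtimeclass }}" -u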

@ldoktor ldoktor force-pushed the gh-action branch 2 times, most recently from 4d5465e to d099345 Compare December 1, 2023 09:45
@ldoktor
Contributor Author

ldoktor commented Dec 1, 2023

@wainersm I addressed all the issues and additionally modified the hostname usage. Let me know whether you prefer it this way or not. It seems to be working well now, and once the tests that require credentials are removed it should start passing.

@wainersm
Member

wainersm commented Dec 4, 2023

@wainersm I addressed all the issues and additionally modified the hostname usage. Let me know whether you prefer it this way or not. It seems to be working well now, and once the tests that require credentials are removed it should start passing.

@ldoktor it looks great! The tests that require credentials were removed; could you rebase so we run this again? Ah, I introduced one more hostname in operator.sh, could you replace that occurrence too?

@ldoktor
Contributor Author

ldoktor commented Dec 4, 2023

Rebased & treated 2 new occurrences of $(hostname) in tests/e2e/operator.sh

@wainersm
Member

wainersm commented Dec 4, 2023

ccruntime e2e tests / operator tests (kata-clh, az-ubuntu-2004) (pull_request) failed the uninstall test. There is a timeout of 3 min set to uninstall the ccruntime in https://github.com/confidential-containers/operator/blob/main/tests/e2e/operator.sh#L172 . Perhaps we didn't give the uninstall enough time to finish the operation, so increasing the timeout might fix it.
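(For context, a minimal sketch of the kind of bounded wait involved, assuming the confidential-containers-system namespace seen in the logs later in this thread; this is illustrative, not the actual operator.sh code. 180s mirrors the 3-minute timeout under discussion.)

    # Poll until no pods remain in the namespace, or give up after the deadline.
    timeout=180
    interval=10
    elapsed=0
    while [ -n "$(kubectl get pods -n confidential-containers-system --no-headers 2>/dev/null)" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "ERROR: there are ccruntime pods still running" >&2
            exit 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done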

@ldoktor
Contributor Author

ldoktor commented Dec 4, 2023

Yep, looks like that. Also, looking at the age of the containers, I'm wondering whether the operator is really ready (I mean the pods are ready, but I'm wondering whether the init is completed by the time the uninstall happens, which might slow the removal...). Let me try doubling the deadlines...

@ldoktor
Contributor Author

ldoktor commented Dec 4, 2023

@wainersm it passed with the 4x deadline, but after the testing the output shows ERROR: there are ccruntime pods still running. I don't think this is stable yet, and there might be some issues with uninstalling the operator right after installing it... I'll take a second look tomorrow if I manage to reproduce it locally.

@wainersm
Member

wainersm commented Dec 6, 2023

@wainersm it passed with the 4x deadline, but after the testing the output shows ERROR: there are ccruntime pods still running. I don't think this is stable yet, and there might be some issues with uninstalling the operator right after installing it... I'll take a second look tomorrow if I manage to reproduce it locally.

720 seconds to uninstall the operator seems like too much time. The fact that it sometimes cannot finish within that window may indicate a legit bug.

I noticed the operator uninstall reached the timeout after the tests executed, i.e., when the workflow tried to revert the system to its pre-testing state.

INFO: Run tests
INFO: Running operator tests for kata-qemu
1..2
ok 1 [cc][operator] Test can uninstall the operator
ok 2 [cc][operator] Test can reinstall the operator
INFO: Uninstall the operator
ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
ERROR: there are ccruntime pods still running
Describe pods from confidential-containers-system namespace

@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

720 seconds to uninstall the operator seems like too much time. The fact that it sometimes cannot finish within that window may indicate a legit bug.

Well, trying it on my system (kcli ubuntu VM on a T14s laptop), it usually takes 4.5m to uninstall and 1m to install. So in an unstable cloud environment the 6m seems legit, and allowing up to double that time in case of an overloaded cloud does not sound all that bad. Perhaps there really isn't a bug (or there is, but it can recover). Let me run a loop to better examine the timing.

@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

@wainersm it seems to be stable: uninstall takes 4.5m and reinstall 50-70s. I think the new deadlines are reasonable, and they finish early if the condition is reached. I think it's ready to be merged, what do you think?

@wainersm
Member

wainersm commented Dec 7, 2023

@wainersm it seems to be stable: uninstall takes 4.5m and reinstall 50-70s. I think the new deadlines are reasonable, and they finish early if the condition is reached. I think it's ready to be merged, what do you think?

I really appreciated the analysis you did! Yes, I think it is ready to be merged.

Member

@wainersm wainersm left a comment

LGTM. Thanks @ldoktor !

Member

@stevenhorsman stevenhorsman left a comment

LGTM - Thanks @ldoktor

@stevenhorsman
Member

For the last time we can run the clh Jenkins tests. After this is merged I'll disable the project and remove it from required, and after a grace period we should enable the GHA workflow tests as required.

@stevenhorsman
Member

/test

1 similar comment
@stevenhorsman
Member

/test

@stevenhorsman
Member

Hey @ldoktor - I tried to update this branch after another PR got merged, but now the tests are failing, so I'm not sure if the auto-merge had issues? It might be worth doing a rebase and force-pushing to remove the extra merge commit, then we can re-try the tests.

use the azure runners provided by "confidential-containers/infra" to run
the kata-clh and kata-qemu workflows.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
the uninstall timeouts seem to be too low for the azure runners.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

Rebased, no changes.

@stevenhorsman
Member

/test

@stevenhorsman
Member

We are getting:

# ERROR: there are ccruntime pods still running

on the uninstall test, and I'm not sure whether that means we need a longer timeout/sleep, or if there is something else going on that I'm missing from the debug?

@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

We are getting:

# ERROR: there are ccruntime pods still running

on the uninstall test, and I'm not sure whether that means we need a longer timeout/sleep, or if there is something else going on that I'm missing from the debug?

I think the timeout is really generous now, so this might be an actual issue. I haven't hit this problem locally; I'll try to dig deeper tomorrow.

@ldoktor
Contributor Author

ldoktor commented Dec 8, 2023

Still not reproduced, but I noticed in GH that the manager's restart count is 4, while on my machine the restart count is 0. I'll try stressing my machine; perhaps it's related to that.
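(A quick way to see the restart counts being compared here, assuming the confidential-containers-system namespace from the earlier logs:)

    kubectl get pods -n confidential-containers-system \
      -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount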

@ldoktor
Contributor Author

ldoktor commented Dec 8, 2023

@wainersm @stevenhorsman what would you say about something like this? On Azure the manager pod (and others) is restarted several times before stabilizing, which is likely causing the issues on operator uninstall.

Especially on azure workers we are seeing several pod restarts right
after CoCo deployment, let's wait for 3x21s which should be enough to
detect instabilities as the liveness probe is 15+20s.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
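(A rough sketch of the stabilization wait described in the commit message above: the pod set must show no new restarts for three consecutive 21s checks. This is illustrative, not the exact patch; the namespace is the one seen earlier in the thread.)

    stable_checks=0
    prev=""
    while [ "$stable_checks" -lt 3 ]; do
        sleep 21
        # Snapshot pod names and restart counts; any change resets the counter.
        cur=$(kubectl get pods -n confidential-containers-system \
              -o jsonpath='{range .items[*]}{.metadata.name}={.status.containerStatuses[*].restartCount} {end}')
        if [ "$cur" = "$prev" ]; then
            stable_checks=$((stable_checks + 1))
        else
            stable_checks=0
        fi
        prev="$cur"
    done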
@wainersm
Member

/test

@wainersm wainersm merged commit acf7c9b into confidential-containers:main Dec 11, 2023
11 checks passed
@ldoktor ldoktor changed the title WiP e2e: Move tests to gh action using azure workers e2e: Move tests to gh action using azure workers Dec 14, 2023