
e2e: Move tests to gh action using azure workers #260

Merged
merged 3 commits into confidential-containers:main from gh-action on Dec 11, 2023

Conversation

ldoktor
Contributor

@ldoktor ldoktor commented Sep 22, 2023

use the azure runners provided by "confidential-containers/infra" to run the kata-clh and kata-qemu workflows.

@ldoktor ldoktor changed the title e2e: Move tests to gh action using azure workers WiP e2e: Move tests to gh action using azure workers Sep 22, 2023
@wainersm
Member

The pipeline apparently passed because @ldoktor disabled the status report; in reality it failed at:

namespace/kube-flannel created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created
Error from server (NotFound): nodes "garm-ui5NQSNBjt" not found
Error: Process completed with exit code 1.

@wainersm
Member

@ldoktor I suspect the error is at https://github.com/confidential-containers/operator/blob/main/tests/e2e/cluster/up.sh#L60 . The script assumes that the assigned node name (see https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#setting-the-node-name) equals $(hostname), which might not be true.

Try printing the node names (kubectl get nodes) and add other debug messages.

Ah, there are other places where $(hostname) is used.
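(For reference, a minimal sketch of the check being suggested here. Kubernetes registers node names in lower case, so comparing against a literal $(hostname) can fail on hosts with mixed-case hostnames; the 180s timeout below is illustrative, not taken from up.sh.)

    # Print the registered node names to compare against the local hostname.
    kubectl get nodes -o name

    # Normalize the hostname to lower case (as later confirmed in this thread)
    # before waiting on the node.
    node_name="$(hostname | tr '[:upper:]' '[:lower:]')"
    kubectl wait --for=condition=Ready "node/${node_name}" --timeout=180s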

@ldoktor ldoktor force-pushed the gh-action branch 6 times, most recently from d6f27a1 to fd00a61 Compare November 30, 2023 16:13
@wainersm
Member

@ldoktor interesting that lower case fixes the issue. Now some tests are failing because they expect secrets to be exported as environment variables. However, we won't run those tests; they are removed in #299.

I think we are at the point of discussing little details (job names, etc.).

@@ -0,0 +1,34 @@
name: azure e2e tests
Member

We already have an enclave-cc e2e tests workflow. What this workflow will be testing is the ccruntime implementation, so perhaps we can name it ccruntime e2e tests instead of azure e2e tests. Or, as we discussed the other day, name it ccruntime functional tests. The file should be renamed accordingly too.

Contributor Author

ack, since the Jenkins jobs will be gone it makes sense to avoid specifying az :-)


jobs:
e2e:
name: operator azure e2e tests
Member

Once #299 is merged, we will run only operator tests (install; install and uninstall; etc.), so I suggest just calling it operator.

Contributor Author

I see

jobs:
e2e:
name: operator azure e2e tests
runs-on: az-ubuntu-2204
Member

I'm about to add an Ubuntu 20.04 runner, so the runner name should be part of the matrix below, i.e. two variations of the same job running on Ubuntu 22.04 and 20.04.

Member

I just added a new runner to serve Ubuntu 20.04. You can use the label az-ubuntu-2004. I didn't test that it works, though :)

Contributor Author

Are they that different? Wouldn't one suffice? Anyway, I'll add that; just asking to save some costs...

Member

@ldoktor good question. Ubuntu 20.04 comes with containerd 1.6; by using it we test an operator feature, the installation of containerd 1.7. Whereas on Ubuntu 22.04, containerd 1.7 is already installed.
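(A minimal sketch of the matrix shape being discussed, assuming the az-ubuntu-2204 and az-ubuntu-2004 runner labels mentioned above; the "instance" key name is illustrative, not the final workflow.)

    jobs:
      e2e:
        name: operator tests
        # Same job on both runners: Ubuntu 20.04 ships containerd 1.6, so the
        # operator's containerd 1.7 installation path gets exercised there,
        # while 22.04 already comes with containerd 1.7.
        runs-on: ${{ matrix.instance }}
        strategy:
          matrix:
            instance: ["az-ubuntu-2204", "az-ubuntu-2004"]
            runtimeclass: ["kata-qemu", "kata-clh"]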


strategy:
matrix:
runtimeclass: ["kata-qemu", "kata-clh"]
Member

I'm not sure if it makes sense to run the operator test for each runtimeClass, but let's leave it as is for now.

Contributor Author

Perhaps it'd make sense to include the developers. Unless the code paths differ much, we should probably just choose one. Do you know whom to ping for that?

runtimeclass: ["kata-qemu", "kata-clh"]

steps:
- uses: actions/checkout@v3
Member

v4 is already available.

Contributor Author

ack
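(For reference, the bumped step is simply:)

      - uses: actions/checkout@v4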

export PATH="$PATH:/usr/local/bin"
./run-local.sh -r "$RUNTIME_CLASS" -u
env:
RUNTIME_CLASS: ${{ matrix.runtimeclass }}
Member

I don't think it should export RUNTIME_CLASS; the -r parameter to run-local.sh should account for that.

Contributor Author

It's just a name clash; I wanted to have it available in case we reuse it multiple times. But let me hardcode it.
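(A minimal sketch of what the step looks like with the env export dropped and the matrix value passed straight to -r; the step name is illustrative.)

      - name: Run operator tests
        run: |
          export PATH="$PATH:/usr/local/bin"
          # Hardcode the runtime class instead of exporting RUNTIME_CLASS.
          ./run-local.sh -r "${{ matrix.runtimeclass }}" -u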

@ldoktor ldoktor force-pushed the gh-action branch 2 times, most recently from 4d5465e to d099345 Compare December 1, 2023 09:45
@ldoktor
Contributor Author

ldoktor commented Dec 1, 2023

@wainersm I addressed all the issues and additionally modified the hostname usage. Let me know whether you prefer it this way or not. It seems to be working well now, and once the tests that require credentials are removed it should start passing.

@wainersm
Member

wainersm commented Dec 4, 2023

@wainersm I addressed all the issues and additionally modified the hostname usage. Let me know whether you prefer it this way or not. It seems to be working well now, and once the tests that require credentials are removed it should start passing.

@ldoktor it looks great! The tests that require credentials were removed; could you rebase so we run this again? Ah, I introduced one more hostname in operator.sh, could you replace that occurrence too?

@ldoktor
Contributor Author

ldoktor commented Dec 4, 2023

Rebased & treated 2 new occurrences of $(hostname) in tests/e2e/operator.sh

@wainersm
Member

wainersm commented Dec 4, 2023

ccruntime e2e tests / operator tests (kata-clh, az-ubuntu-2004) (pull_request) failed the uninstall test. There is a timeout of 3 min set to uninstall the ccruntime in https://github.com/confidential-containers/operator/blob/main/tests/e2e/operator.sh#L172 . Perhaps we didn't give the uninstall enough time to finish the operation, so increasing the timeout might fix it.
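(For context, a minimal sketch of the kind of bounded wait involved, assuming the confidential-containers-system namespace seen in the logs later in this thread; this is illustrative, not the actual operator.sh code. 180s mirrors the 3-minute timeout under discussion.)

    # Poll until no pods remain in the namespace, or give up after the deadline.
    timeout=180
    interval=10
    elapsed=0
    while [ -n "$(kubectl get pods -n confidential-containers-system --no-headers 2>/dev/null)" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "ERROR: there are ccruntime pods still running" >&2
            exit 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done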

@ldoktor
Contributor Author

ldoktor commented Dec 4, 2023

Yep, looks like that. Also, looking at the age of the containers, I'm wondering whether the operator is really ready (I mean the pods are ready, but I'm wondering whether the init is completed by the time the uninstall happens, which might slow the removal...). Let me try doubling the deadlines...

@ldoktor
Contributor Author

ldoktor commented Dec 4, 2023

@wainersm it passed with the 4x deadline, but after the testing the output shows ERROR: there are ccruntime pods still running. I don't think this is stable yet, and there might be some issues with uninstalling the operator right after installing it... I'll take a second look tomorrow if I manage to reproduce it locally.

@wainersm
Member

wainersm commented Dec 6, 2023

@wainersm it passed with the 4x deadline, but after the testing the output shows ERROR: there are ccruntime pods still running. I don't think this is stable yet, and there might be some issues with uninstalling the operator right after installing it... I'll take a second look tomorrow if I manage to reproduce it locally.

720 seconds to uninstall the operator seems like too much time. The fact that it sometimes cannot finish within that window may indicate a legit bug.

I noticed the operator uninstall reached the timeout after the tests executed, i.e., when the workflow tried to revert the system to its pre-testing state.

INFO: Run tests
INFO: Running operator tests for kata-qemu
1..2
ok 1 [cc][operator] Test can uninstall the operator
ok 2 [cc][operator] Test can reinstall the operator
INFO: Uninstall the operator
ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
ERROR: there are ccruntime pods still running
Describe pods from confidential-containers-system namespace

@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

720 seconds to uninstall the operator seems like too much time. The fact that it sometimes cannot finish within that window may indicate a legit bug.

Well, trying it on my system (kcli ubuntu VM on a T14s laptop), it usually takes 4.5m to uninstall and 1m to install. So in an unstable cloud environment the 6m seems legit, and allowing up to double that time in case of an overloaded cloud does not sound all that bad. Perhaps there really isn't a bug (or there is, but it can recover). Let me run a loop to better examine the timing.

@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

@wainersm it seems to be stable: uninstall takes 4.5m and reinstall 50-70s. I think the new deadlines are reasonable, and they finish early if the condition is reached. I think it's ready to be merged, what do you think?

@wainersm
Member

wainersm commented Dec 7, 2023

@wainersm it seems to be stable: uninstall takes 4.5m and reinstall 50-70s. I think the new deadlines are reasonable, and they finish early if the condition is reached. I think it's ready to be merged, what do you think?

I really appreciated the analysis you did! Yes, I think it is ready to be merged.

Member

@wainersm wainersm left a comment

LGTM. Thanks @ldoktor !

Member

@stevenhorsman stevenhorsman left a comment

LGTM - Thanks @ldoktor

@stevenhorsman
Member

For the last time we can run the clh Jenkins tests. After this is merged I'll disable the project and remove it from required, and after a grace period we should enable the GHA workflow tests as required.

@stevenhorsman
Member

/test

1 similar comment
@stevenhorsman
Member

/test

@stevenhorsman
Member

Hey @ldoktor - I tried to update this branch after another PR got merged, but now the tests are failing, so I'm not sure if the auto-merge had issues? It might be worth doing a rebase and force-pushing to remove the extra merge commit, then we can re-try the tests.

use the azure runners provided by "confidential-containers/infra" to run
the kata-clh and kata-qemu workflows.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
the uninstall timeouts seem to be too low for the azure runners.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

Rebased, no changes.

@stevenhorsman
Member

/test

@stevenhorsman
Member

We are getting:

# ERROR: there are ccruntime pods still running

on the uninstall test, and I'm not sure whether that means we need a longer timeout/sleep, or if there is something else going on that I'm missing from the debug?

@ldoktor
Contributor Author

ldoktor commented Dec 7, 2023

We are getting:

# ERROR: there are ccruntime pods still running

on the uninstall test, and I'm not sure whether that means we need a longer timeout/sleep, or if there is something else going on that I'm missing from the debug?

I think the timeout is really generous now, so this might be an actual issue. I haven't hit this problem locally; I'll try to dig deeper tomorrow.

@ldoktor
Contributor Author

ldoktor commented Dec 8, 2023

Still not reproduced, but I noticed in GH that the manager's restart count is 4, while on my machine the restart count is 0. I'll try stressing my machine; perhaps it's related to that.
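(A quick way to see the restart counts being compared here, assuming the confidential-containers-system namespace from the earlier logs:)

    kubectl get pods -n confidential-containers-system \
      -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount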

@ldoktor
Contributor Author

ldoktor commented Dec 8, 2023

@wainersm @stevenhorsman what would you say about something like this? On Azure the manager pod (and others) is restarted several times before stabilizing, which is likely causing the issues on operator uninstall.

Especially on azure workers we are seeing several pod restarts right
after CoCo deployment, let's wait for 3x21s which should be enough to
detect instabilities as the liveness probe is 15+20s.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
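(A rough sketch of the stabilization wait described in the commit message above: the pod set must show no new restarts for three consecutive 21s checks. This is illustrative, not the exact patch; the namespace is the one seen earlier in the thread.)

    stable_checks=0
    prev=""
    while [ "$stable_checks" -lt 3 ]; do
        sleep 21
        # Snapshot pod names and restart counts; any change resets the counter.
        cur=$(kubectl get pods -n confidential-containers-system \
              -o jsonpath='{range .items[*]}{.metadata.name}={.status.containerStatuses[*].restartCount} {end}')
        if [ "$cur" = "$prev" ]; then
            stable_checks=$((stable_checks + 1))
        else
            stable_checks=0
        fi
        prev="$cur"
    done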
@wainersm
Member

/test

@wainersm wainersm merged commit acf7c9b into confidential-containers:main Dec 11, 2023
11 checks passed
@ldoktor ldoktor changed the title WiP e2e: Move tests to gh action using azure workers e2e: Move tests to gh action using azure workers Dec 14, 2023