ClusterClass self-hosted cluster tests are very flaky #9522

Closed
killianmuldoon opened this issue Oct 4, 2023 · 19 comments
Assignees
willie-yao

Labels
area/clusterclass: Issues or PRs related to clusterclass
area/machinepool: Issues or PRs related to machinepools
kind/flake: Categorizes issue or PR as related to a flaky test
kind/release-blocking: Issues or PRs that need to be closed before the next CAPI release
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now
triage/accepted: Indicates an issue or PR is ready to be actively worked on

Comments

@killianmuldoon
Contributor

killianmuldoon commented Oct 4, 2023

The ClusterClass self-hosted tests have become very flaky, to the extent that they regularly fail four times in a row. They fail across all flavors of our e2e test jobs (full, dualstack, and mink8s).

These failures only seem to happen on the main branch, so we're looking through the diff between main and the release branches for changes that could be responsible.

These tests fail with three different errors, as described below. Although there are three different errors, it's likely that some change on the main branch caused either self-hosted clusters, or the self-hosted tests themselves, to behave differently in a way that exposed these issues. We should investigate both the cause of each specific error and the underlying change that made them more prevalent on the main branch.

The errors are:

  1. etcdImageTagCondition failures
  2. old nodes remain failures
  3. error adding delete-for-move annotation failures

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 4, 2023
@killianmuldoon
Contributor Author

/triage accepted

@kubernetes-sigs/cluster-api-release-team

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 4, 2023
@nawazkh
Member

nawazkh commented Oct 4, 2023

Taking a look.

@nawazkh
Member

nawazkh commented Oct 5, 2023

This is the thread where I have been posting my analysis so far.

@fabriziopandini
Member

/assign @willie-yao

@killianmuldoon killianmuldoon added the kind/release-blocking Issues or PRs that need to be closed before the next CAPI release label Oct 19, 2023
@killianmuldoon killianmuldoon changed the title HA self-hosted cluster with test is failing Self-hosted cluster tests are failing Oct 19, 2023
@killianmuldoon
Contributor Author

killianmuldoon commented Oct 19, 2023

These tests fail with the three different errors listed above.

These failures only seem to happen on the main branch, and only on the ClusterClass versions of the tests. We're looking through the diff between main and the release branches for changes that could be responsible.

@killianmuldoon killianmuldoon changed the title Self-hosted cluster tests are failing ClusterClass Self-hosted cluster tests are very flakiy Oct 20, 2023
@killianmuldoon killianmuldoon changed the title ClusterClass Self-hosted cluster tests are very flakiy ClusterClass Self-hosted cluster tests are very flaky Oct 20, 2023
@killianmuldoon
Contributor Author

Based on the first few days of results after merging #9570, it looks like problem 1 and problem 2, i.e. the etcdImageTagCondition and old nodes remain failures, have been fixed.

If we don't see any more of those failures by the end of the week, we should consider them solved. IMO that should be enough to remove release-blocking from this issue, but let's revisit that once we have more data in a few days.

The third failure, error adding delete-for-move annotation, is still occurring, and we should look to fix it, though it's not the highest-priority flake right now.

@nawazkh
Member

nawazkh commented Oct 23, 2023

I see one flake with old nodes remain on capi-mink8s-e2e-main: reference run 1716485778235723776.

Moreover, there are almost consistent flakes on the When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster test. I am taking a look at the diff between a green run and a flake.

Are we sure we want to drop the "release-blocking" tag from this issue?

@killianmuldoon
Contributor Author

@nawazkh you're right - I think the state is still pretty unstable so we have to keep release-blocking for now. We have some additional old nodes remain failures as well as a cluster of error adding delete-for-move annotation failures.

Let's keep release blocking on this until those are in a better state.

@killianmuldoon
Contributor Author

Current state:

@killianmuldoon killianmuldoon added the kind/flake Categorizes issue or PR as related to a flaky test. label Oct 27, 2023
@killianmuldoon
Contributor Author

killianmuldoon commented Oct 30, 2023

After spending some time on it, I've come up with a theory for what's causing the old nodes remain failures.

  1. CAPI components are often deployed on the MachinePool instance in the self-hosted cluster.
  2. It is the responsibility of the CAPI MachinePool controller to clean up the Nodes of a MachinePool when they are no longer listed in Spec.ProviderIDList (see the sketch after this list).
  3. When an upgrade happens, the DockerMachinePool deletes the container running a MachinePool instance. This ungracefully stops the node and all of its running pods.
  4. The capi-controller-manager is dead at this point, so it cannot clean up the old Nodes of the MachinePool. From a Kubernetes perspective the Node is NotReady, but its pods are not cleaned up.
  5. This happens after all control plane nodes are upgraded but before all old nodes are drained. Once the CAPI controller manager goes down it no longer reconciles, and the last old-version control plane node is not removed in time.
  6. Eventually the nodeEvictionTimeout causes the CAPI pods to be evicted from the dead node and rescheduled elsewhere in the cluster.
  7. Once the CAPI controllers are running again, the MachinePool Nodes are eventually deleted and the rest of the cluster is eventually upgraded.
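As an illustration of the cleanup responsibility in step 2, here is a minimal sketch of deleting workload-cluster Nodes whose provider IDs have been dropped from Spec.ProviderIDList. This is not the actual CAPI controller code; the deleteOrphanedNodes helper and its wiring are hypothetical.

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/sets"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteOrphanedNodes is a hypothetical helper sketching the responsibility described
// in step 2: remove workload-cluster Nodes that are no longer backed by an entry in
// the MachinePool's Spec.ProviderIDList.
func deleteOrphanedNodes(ctx context.Context, workloadClient client.Client, mp *expv1.MachinePool) error {
	wanted := sets.New(mp.Spec.ProviderIDList...)

	nodes := &corev1.NodeList{}
	if err := workloadClient.List(ctx, nodes); err != nil {
		return err
	}

	for i := range nodes.Items {
		node := &nodes.Items[i]
		// A Node whose providerID has been dropped from Spec.ProviderIDList belongs to a
		// replaced MachinePool instance and should be removed from the workload cluster.
		if node.Spec.ProviderID != "" && !wanted.Has(node.Spec.ProviderID) {
			if err := workloadClient.Delete(ctx, node); err != nil && !apierrors.IsNotFound(err) {
				return err
			}
		}
	}
	return nil
}
```

The important point for this issue is that this cleanup only happens while the CAPI controllers are actually running, which is exactly what breaks down in steps 3-6.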

The additional time spent waiting for CAPI to recover from the ungraceful shutdown of the MachinePool is the root cause of this issue AFAICT.

I think this is a bug that can be fixed by finding a safer way to remove the Docker containers that the MachinePool instances are running on. We could try to only delete the infrastructure after the Node has already been deleted.
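As a rough sketch of that ordering (not the actual DockerMachinePool implementation; both helpers below are hypothetical placeholders):

```go
import "context"

// deleteNodeFromWorkloadCluster would drain and delete the workload-cluster Node backing
// a MachinePool instance. Hypothetical placeholder.
func deleteNodeFromWorkloadCluster(ctx context.Context, providerID string) error { return nil }

// deleteMachineContainer would remove the Docker container backing that instance.
// Hypothetical placeholder.
func deleteMachineContainer(ctx context.Context, providerID string) error { return nil }

// deleteMachinePoolInstance removes the Node first, so pods such as the CAPI controllers
// in a self-hosted cluster can be rescheduled gracefully, and only then deletes the
// infrastructure (the Docker container) underneath it.
func deleteMachinePoolInstance(ctx context.Context, providerID string) error {
	if err := deleteNodeFromWorkloadCluster(ctx, providerID); err != nil {
		return err
	}
	return deleteMachineContainer(ctx, providerID)
}
```

Reordering like this would avoid the window where the CAPI controllers are killed together with their node.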

We should consider removing MachinePools from this test after the RC next week to unblock the release. We can duplicate the jobs to keep some coverage so folks can work on fixing those tests.

I'm not sure whether this also explains the error adding delete-for-move annotation bug. I'm going to add some logging to try to figure out if something similar is going on.

@nawazkh @kubernetes-sigs/cluster-api-release-team

@willie-yao (seeing as you implemented the MP e2e testing): WDYT about removing MPs from this test for this release?

@killianmuldoon killianmuldoon changed the title ClusterClass Self-hosted cluster tests are very flaky ClusterClass self-hosted cluster tests are very flaky Oct 31, 2023
@killianmuldoon killianmuldoon added area/clusterclass Issues or PRs related to clusterclass area/machinepool Issues or PRs related to machinepools labels Nov 1, 2023
@vincepri
Member

vincepri commented Nov 8, 2023

-1 on removing tests, or disabling them. If we need to, we'll delay the release indefinitely until either the tests are fixed, or the related code (not the tests) is reverted/removed.

@vincepri
Member

vincepri commented Nov 8, 2023

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 8, 2023
@willie-yao
Contributor

I'm +1 on removing this as a last resort, but I think ideally we should fix this ahead of the release. Just to be clear: is this only happening on the ClusterClass self-hosted test? And is it because of ClusterClass, or because of the existence of MachinePools in general? I think the regular self-hosted test without ClusterClass doesn't include MachinePools. If the bug is due only to MachinePools and not ClusterClass, I'm more inclined to remove the test temporarily, since it is a pre-existing bug.

@CecileRobertMichon
Contributor

#8842 will fix this

@CecileRobertMichon
Contributor

This should now be fixed. Let's observe the CI signal until next Tuesday (11/21) to confirm we no longer see the flakes.

@furkatgofurov7
Member

This should now be fixed. Let's observe the CI signal until next Tuesday (11/21) to confirm we no longer see the flakes.

Looks like it is green after the fix, but I will leave it up to the CI team lead and members to confirm.

cc @nawazkh

@adilGhaffarDev
Contributor

This is fixed on main. I don't see any extraordinary flakes related to that.
I still see this flake: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1725069624287956992 , https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#c6ca02bf53df8d27d1c7
It happens rarely. We will look into it.

@furkatgofurov7
Member

This is fixed on main. I don't see any extraordinary flakes related to that. I still see this flake: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1725069624287956992 , https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#c6ca02bf53df8d27d1c7 It happens rarely. We will look into it.

Thanks. Let's create a separate issue for the new flake (which is different from the ones described here) if needed, and close the current one. Feel free to re-open in the future if you encounter the same flakes.

/close

@k8s-ci-robot
Contributor

@furkatgofurov7: Closing this issue.

In response to this:

This is fixed on main. I don't see any extraordinary flakes related to that. I still see this flake: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1725069624287956992 , https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#c6ca02bf53df8d27d1c7 It happens rarely. We will look into it.

Thanks. Let's create a separate issue for the new flake (which is different from the ones described here) if needed, and close the current one. Feel free to re-open in the future if you encounter the same flakes.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
