
Test flaking with Timed out waiting for 1 nodes to be created for MachineDeployment #8786

Closed
adilGhaffarDev opened this issue Jun 5, 2023 · 32 comments · Fixed by #9032
Labels: help wanted · kind/flake · triage/accepted

Comments

@adilGhaffarDev (Contributor) commented Jun 5, 2023

Which jobs are flaking?

e.g. https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1665614080297144320

Which tests are flaking?

periodic-cluster-api-e2e-main
periodic-cluster-api-e2e-ipv6-main
periodic-cluster-api-e2e-dualstack-ipv6-main

All failing at "Should create a workload cluster"

Since when has it been flaking?

01-06-2023
https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#295d4f44852a6339ba54

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

To be analyzed.

Anything else we need to know?

No response

Label(s) to be applied

/kind flake

@k8s-ci-robot added the kind/flake and needs-triage labels on Jun 5, 2023
@adilGhaffarDev (Contributor, Author)

/assign

@killianmuldoon (Contributor)

/triage accepted

Thanks for opening this @adilGhaffarDev!

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jun 5, 2023
@killianmuldoon (Contributor)

@ykakarap - you might be able to help out on this one. Could it possibly be related to changes in the MachineSet controller? The flake appears to have been introduced recently.

@sbueringer (Member)

Please add a link to the concrete job instance that failed under "Which jobs are flaking?".

@ykakarap (Contributor) commented Jun 5, 2023

I will take a look.

@ykakarap (Contributor) commented Jun 6, 2023

Initial analysis for the failure:

  • MachineDeployment fails to get all the worker nodes because the MachineSet is not scaling up
  • The MachineSet is not scaling up because it is waiting for the ControlPlane to finish provisioning (ControlPlaneIsStable preflight check)
  • The ControlPlane is not considered provisioned by the MachineSet, even though in reality it is, because the KCP controller fails to update the KCP object's status (the MachineSet looks at the KCP status to determine whether the CP is provisioned)
  • The KCP status update fails because creating a remote cluster client failed (KCP needs it to calculate readyReplicas)
  • Creating the remote cluster client fails because the client is being rate-limited (this part of the code is in controller-runtime).
    • Relevant log message from KCP logs:
failed to create remote cluster client: failed to create cluster accessor: error adding index for field "spec.providerID" to cache for remote cluster "quick-start-yafvln/quick-start-9qh7w5": failed to get API group resources: unable to retrieve the complete list of server APIs: v1: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: EOF

So I would say the problem is not actually the preflight checks but something different; the preflight checks merely surfaced it. Of course the underlying problem still needs to be fixed.

Line of code that fails: here

@sbueringer Any recommendations on how to address the rate-limiting issue?
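
For illustration, here is a minimal sketch of the dependency chain described above, using hypothetical, simplified types (this is not the actual CAPI preflight-check code): the MachineSet only scales up once the control plane's status reports the expected version, and KCP can only fill in that status after it has successfully created a remote cluster client.

```go
// Hypothetical, simplified sketch of the ControlPlaneIsStable dependency
// described above; the types and fields are stand-ins, not the real CAPI API.
package main

import "fmt"

type ControlPlane struct {
	SpecVersion   string
	StatusVersion string // only filled in once KCP can update status, which needs a remote cluster client
}

// controlPlaneIsStable mirrors the idea of the preflight check: the MachineSet
// refuses to scale up until the control plane's observed version matches the
// desired one.
func controlPlaneIsStable(cp ControlPlane) bool {
	return cp.StatusVersion != "" && cp.StatusVersion == cp.SpecVersion
}

func main() {
	// If the KCP status update fails (e.g. because remote client creation was
	// rate-limited), StatusVersion stays empty and worker scale-up is blocked.
	cp := ControlPlane{SpecVersion: "v1.27.1"}
	fmt.Println("scale-up allowed:", controlPlaneIsStable(cp)) // false
}
```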

@sbueringer (Member) commented Jun 6, 2023

I'm not sure whether the problem is the rate limiter itself or whether the rate limiter just reports the error it hit a few seconds before, which seems to be: "error from a previous attempt: EOF". What is the requeue behavior in that error case?

EDIT: I'll take a closer look now

@sbueringer (Member) commented Jun 6, 2023

In my opinion the issue is the following:

  • The worker machine is not created because CP is not stable (preflight check blocks)
  • The CP is not stable because KCP.status.version is not set
  • KCP.status.version is not set because there is no Machine with APIServerPodHealthy = true

There are two possible root causes:

  1. apiserver never gets ready
  2. apiserver gets ready but KCP doesn't see it because it doesn't reconcile after the apiserver got ready

I'll open a PR to also dump kube-system Pods at the end of tests. This should help us to figure out if it's 1. or 2.
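
A minimal sketch of what such a dump could look like with client-go is below (the function name and how it is wired into the e2e framework are assumptions, not the actual PR):

```go
// Hypothetical sketch: dump name, phase and conditions of all kube-system Pods
// of the workload cluster at the end of a test. Wiring into the e2e framework
// is omitted; only the client-go calls are shown.
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func dumpKubeSystemPods(ctx context.Context, cs kubernetes.Interface) error {
	pods, err := cs.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing kube-system Pods: %w", err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s phase=%s\n", p.Name, p.Status.Phase)
		for _, c := range p.Status.Conditions {
			fmt.Printf("  %s=%s lastTransitionTime=%s\n", c.Type, c.Status, c.LastTransitionTime)
		}
	}
	return nil
}
```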

Independent of that, the ClusterCacheTracker log is not ideal, but I think it's not our issue here.

Some data:
Job: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1665917119549673472

KCP logs:
[screenshot: KCP log excerpt]

The apiserver pod starts up around 05:14:19 (link)

So it could be that the Pod gets ready after KCP did its last reconcile before deletion at 05:14:27

@sbueringer (Member)

I'll open a PR to also dump kube-system Pods at the end of tests. This should help us to figure out if it's 1. or 2.

The PR is merged now. Let's take a closer look at the next occurrence.

@sbueringer (Member)

PR to improve the CCT error: #8801

@sbueringer (Member) commented Jun 9, 2023

We have a new occurrence now.

Following results:

It's unclear to me at this point why KCP didn't see that the apiserver Pod was ready.

I think we can further triage this by collecting more data, e.g. by adding logs to KCP to surface the Pod object it is seeing.

A potential solution would be to requeue more often in KCP if KCP is not entirely healthy, but it would be good to first figure out what exactly is going on.

@adilGhaffarDev If you have time to work on this, feel free to go ahead. I unfortunately won't have time to work on this the next few weeks.

@adilGhaffarDev (Contributor, Author)

@adilGhaffarDev If you have time to work on this, feel free to go ahead. I unfortunately won't have time to work on this the next few weeks.

Yes I can take a look.

@sbueringer (Member)

@adilGhaffarDev Did you have time to investigate the issue?

@adilGhaffarDev (Contributor, Author)

@adilGhaffarDev Did you have time to investigate the issue?

Sorry, I didn't get time to look into it. I won't be able to work on it soon as I am on vacation right now. If someone has time, please check it.

@killianmuldoon (Contributor)

/help

@k8s-ci-robot (Contributor)

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label on Jul 12, 2023
@chrischdi (Member) commented Jul 21, 2023

We have a new occurrence now.

Following results:

It's unclear to me at this point why KCP didn't see that the apiserver Pod was ready.

I think we can further triage this by collecting more data, e.g. by adding logs to KCP to surface the Pod object it is seeing.

A potential solution would be to requeue more often in KCP if KCP is not entirely healthy, but it would be good to first figure out what exactly is going on.

@adilGhaffarDev If you have time to work on this, feel free to go ahead. I unfortunately won't have time to work on this the next few weeks.

Taking another look at the above:

| Timestamp | Object | What | Time Source |
| --- | --- | --- | --- |
| 00:14:07Z | 👷‍♀️ Test | Test starts waiting for the ControlPlane Node to exist | Build Log |
| 00:14:32Z | 🛳️ APIServer Pod | Pod got created | CreationTimestamp |
| 00:14:37Z | 👷‍♀️ Test | Test starts waiting for the ControlPlane Node to be ready | Build Log |
| 00:14:41Z | 🛳️ APIServer Pod | Pod got ready | Status.Conditions |
| 00:14:48Z | 🎮 CAPK controller | Entering reconcile() for the KCP | CAPK logs |
| 00:14:49Z | 🎮 CAPK controller | Entering reconcile() for the KCP | CAPK logs |
| 00:14:51Z | 🎮 CAPK controller | Entering reconcile() for the KCP | CAPK logs |
| 00:14:52Z | 🛳️ APIServer Pod | Pod's status got updated | ManagedFields |
| 00:14:57Z | 👷‍♀️ Test | Test waits for workload nodes | Build Log |
| 00:19:57Z | 👷‍♀️ Test | Test exits / failed | Build Log |

From reading the objects again and some trial and error:

  • The information that the kube-apiserver Pod got ready (according to .status.conditions) was patched onto the Pod object after the last reconcile of the CAPK controller.
  • At the time the CAPK controller reconciled the KCP object, it could have done a no-op and returned ctrl.Result{}, nil, while still having ControlPlaneComponentsUnhealthy set to false in .status.conditions.

The only sources for reconciliations for the KCP object are:

  • Changes to the KCP object itself
  • Changes to the Cluster object (via watch)
  • The sync period (periodic resync)

For this edge case I can see a few possible solutions:

  • Do a requeue if the ControlPlaneComponentsUnhealthy condition is set to false, to ensure the KCP object gets reconciled again (similar to what @sbueringer proposed above)
  • The direct alternative to this would be a watch on Machines
  • A watch on the Pods inside the workload cluster

cc @sbueringer / @killianmuldoon

@chrischdi (Member)

We also have a similar case where we already do a requeue:

```go
// TODO: remove this as soon as we have a proper remote cluster cache in place.
// Make KCP to requeue in case status is not ready, so we can check for node status without waiting for a full resync (by default 10 minutes).
// Only requeue if we are not going in exponential backoff due to error, or if we are not already re-queueing, or if the object has a deletion timestamp.
if reterr == nil && !res.Requeue && res.RequeueAfter <= 0 && kcp.ObjectMeta.DeletionTimestamp.IsZero() {
	if !kcp.Status.Ready {
		res = ctrl.Result{RequeueAfter: 20 * time.Second}
	}
}
```

But in this new case, kcp.Status.Ready is true.

I'll go forward and create a PR which would requeue in this new case too.
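
As a rough sketch of that direction (the actual change is in #9032; the condition name and exact guard below are assumptions modeled on the snippet above, not the merged code), the requeue could additionally trigger while a control plane component condition is still not healthy even though kcp.Status.Ready is already true:

```go
// Sketch only: extend the existing requeue so we also requeue while the
// control plane components are not (yet) reported healthy, even if
// kcp.Status.Ready is already true. Condition name and guard are assumptions.
if reterr == nil && !res.Requeue && res.RequeueAfter <= 0 && kcp.ObjectMeta.DeletionTimestamp.IsZero() {
	if !kcp.Status.Ready || !conditions.IsTrue(kcp, controlplanev1.ControlPlaneComponentsHealthyCondition) {
		res = ctrl.Result{RequeueAfter: 20 * time.Second}
	}
}
```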

@killianmuldoon (Contributor) commented Jul 21, 2023

The direct alternative to this would be a watch on Machines

The KCP controller also watches Machines, AFAIU through the Owns function in its builder. KCP is reconciled in response to changes on Machines.

@killianmuldoon (Contributor)

// TODO: remove this as soon as we have a proper remote cluster cache in place.

I'm not sure replicating this would be non-controversial; it sounds like it shouldn't be needed now that there is a proper remote cluster cache.

@chrischdi (Member) commented Jul 21, 2023

The direct alternative to this would be a watch on Machines

The KCP controller also watches Machines, AFAIU through the Owns function in its builder. KCP is reconciled in response to changes on Machines. I wonder why this reconcile is being missed in this case 🤔

Ah right. The point is not a Machine watch, but a watch on the Pods inside the workload cluster instead. (KCP itself updates the conditions for the Pods on the Machine.)

@killianmuldoon (Contributor)

The point is not a Machine watch, but a watch to the Pods inside the workload cluster instead.

I don't think this is a good idea - we'd end up caching too many pods. I'm fine with adding something like:

```go
res = ctrl.Result{RequeueAfter: 20 * time.Second}
```

But we should comment clearly that it's to avoid a watch on the pods, and maybe reassess the case in #8786 (comment) to see if it's actually needed anymore and update the comment to reflect the current state.

Thanks for taking a look at this one!

@chrischdi (Member)

I totally agree 👍. Created #9032 as a proposal to solve this case.

@sbueringer (Member)

I think a requeue is probably the less risky option. We could use a metadata-only watch for Pods, so we wouldn't have a memory problem, but I'm more than a bit concerned about the number of Pod events we would get (even if we filter down to only the most important Pods) and the corresponding reconciles.
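
For reference, a metadata-only watch with controller-runtime would look roughly like the sketch below. This is illustrative only: the enqueue mapping (podToKubeadmControlPlane) is a hypothetical helper, and a real implementation would have to watch Pods in the workload cluster via the remote cluster cache, which is not shown here.

```go
// Illustrative sketch of a metadata-only Pod watch; not the proposed change.
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
)

func (r *KubeadmControlPlaneReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&controlplanev1.KubeadmControlPlane{}).
		// builder.OnlyMetadata caches only ObjectMeta for Pods, so memory stays
		// bounded; the remaining concern is the volume of Pod events and the
		// reconciles they trigger.
		Watches(&corev1.Pod{},
			handler.EnqueueRequestsFromMapFunc(r.podToKubeadmControlPlane),
			builder.OnlyMetadata,
		).
		Complete(r)
}
```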

@sbueringer (Member)

/reopen

to wait for confirmation via CI?

@k8s-ci-robot (Contributor)

@sbueringer: Reopened this issue.

In response to this:

/reopen

to wait for confirmation via CI?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Jul 21, 2023
@chrischdi (Member) commented Jul 21, 2023

More flexible triage link to confirm later on:

Link

@chrischdi (Member)

That looks pretty good since merging :-)
[screenshot: k8s-triage results since the fix merged]

@chrischdi (Member)

This is fixed :-)

/close

@k8s-ci-robot (Contributor)

@chrischdi: Closing this issue.

In response to this:

This is fixed :-)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon (Contributor)

This is fixed :-)

Awesome work!

@sbueringer (Member)

Nice work on a tricky issue.

I especially like the k8s-triage links :D !!
