
Test flaking with Timed out waiting for 1 nodes to be created for MachineDeployment #8786

Closed
adilGhaffarDev opened this issue Jun 5, 2023 · 32 comments · Fixed by #9032
Labels: help wanted · kind/flake · triage/accepted

Comments

@adilGhaffarDev (Contributor) commented Jun 5, 2023

Which jobs are flaking?

e.g. https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1665614080297144320

Which tests are flaking?

periodic-cluster-api-e2e-main
periodic-cluster-api-e2e-ipv6-main
periodic-cluster-api-e2e-dualstack-ipv6-main

All failing at "Should create a workload cluster"

Since when has it been flaking?

01-06-2023
https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#295d4f44852a6339ba54

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

To be analyzed.

Anything else we need to know?

No response

Label(s) to be applied

/kind flake

@k8s-ci-robot added the kind/flake and needs-triage labels on Jun 5, 2023
@adilGhaffarDev (Contributor, Author)

/assign

@killianmuldoon (Contributor)

/triage accepted

Thanks for opening this @adilGhaffarDev!

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jun 5, 2023
@killianmuldoon (Contributor)

@ykakarap - you might be able to help out on this one. Could it possibly be related to changes in the MachineSet controller? The flake appears to have been introduced recently.

@sbueringer (Member)

Please add a link to the concrete job instance that failed under "Which jobs are flaking?".

@ykakarap (Contributor) commented Jun 5, 2023

I will take a look.

@ykakarap (Contributor) commented Jun 6, 2023

Initial analysis for the failure:

  • MachineDeployment fails to get all the worker nodes because the MachineSet is not scaling up
  • The MachineSet is not scaling up because it is waiting for the ControlPlane to finish provisioning (ControlPlaneIsStable preflight check)
  • The ControlPlane is not considered provisioned by the MachineSet, even though in reality it is, because the KCP controller fails to update the KCP object's status (the MachineSet looks at the KCP status to determine whether the CP is provisioned)
  • The KCP status update fails because creating a remote cluster client failed (KCP needs it to calculate readyReplicas)
  • Creating the remote cluster client fails because the client is being rate-limited (this part of the code is in controller-runtime).
    • Relevant log message from KCP logs:
failed to create remote cluster client: failed to create cluster accessor: error adding index for field "spec.providerID" to cache for remote cluster "quick-start-yafvln/quick-start-9qh7w5": failed to get API group resources: unable to retrieve the complete list of server APIs: v1: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: EOF

So I would say the problem is not actually the preflight checks but something different; the preflight checks merely surfaced it. Of course the underlying problem still needs to be fixed.

Line of code that fails: here

@sbueringer Any recommendations on how to address the rate-limiting issue?
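
For illustration, here is a minimal sketch of the dependency chain described above, using hypothetical, simplified types (this is not the actual CAPI preflight-check code): the MachineSet only scales up once the control plane's status reports the expected version, and KCP can only fill in that status after it has successfully created a remote cluster client.

```go
// Hypothetical, simplified sketch of the ControlPlaneIsStable dependency
// described above; the types and fields are stand-ins, not the real CAPI API.
package main

import "fmt"

type ControlPlane struct {
	SpecVersion   string
	StatusVersion string // only filled in once KCP can update status, which needs a remote cluster client
}

// controlPlaneIsStable mirrors the idea of the preflight check: the MachineSet
// refuses to scale up until the control plane's observed version matches the
// desired one.
func controlPlaneIsStable(cp ControlPlane) bool {
	return cp.StatusVersion != "" && cp.StatusVersion == cp.SpecVersion
}

func main() {
	// If the KCP status update fails (e.g. because remote client creation was
	// rate-limited), StatusVersion stays empty and worker scale-up is blocked.
	cp := ControlPlane{SpecVersion: "v1.27.1"}
	fmt.Println("scale-up allowed:", controlPlaneIsStable(cp)) // false
}
```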

@sbueringer (Member) commented Jun 6, 2023

I'm not sure whether the problem is the rate limiter itself or whether the rate limiter just reports the error it hit a few seconds before, which seems to be: "error from a previous attempt: EOF". What is the requeue behavior in that error case?

EDIT: I'll take a closer look now

@sbueringer (Member) commented Jun 6, 2023

In my opinion the issue is the following:

  • The worker machine is not created because CP is not stable (preflight check blocks)
  • The CP is not stable because KCP.status.version is not set
  • KCP.status.version is not set because there is no Machine with APIServerPodHealthy = true

There are two possible root causes:

  1. apiserver never gets ready
  2. apiserver gets ready but KCP doesn't see it because it doesn't reconcile after the apiserver got ready

I'll open a PR to also dump kube-system Pods at the end of tests. This should help us to figure out if it's 1. or 2.
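
A minimal sketch of what such a dump could look like with client-go is below (the function name and how it is wired into the e2e framework are assumptions, not the actual PR):

```go
// Hypothetical sketch: dump name, phase and conditions of all kube-system Pods
// of the workload cluster at the end of a test. Wiring into the e2e framework
// is omitted; only the client-go calls are shown.
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func dumpKubeSystemPods(ctx context.Context, cs kubernetes.Interface) error {
	pods, err := cs.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing kube-system Pods: %w", err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s phase=%s\n", p.Name, p.Status.Phase)
		for _, c := range p.Status.Conditions {
			fmt.Printf("  %s=%s lastTransitionTime=%s\n", c.Type, c.Status, c.LastTransitionTime)
		}
	}
	return nil
}
```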

Independent of that, the ClusterCacheTracker log is not ideal, but I think it's not our issue here.

Some data:
Job: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1665917119549673472

KCP logs:
[screenshot: KCP log excerpt]

The apiserver pod starts up around 05:14:19 (link)

So it could be that the Pod gets ready after KCP did its last reconcile before deletion at 05:14:27

@sbueringer (Member)

I'll open a PR to also dump kube-system Pods at the end of tests. This should help us to figure out if it's 1. or 2.

The PR is merged now. Let's take a closer look at the next occurrence.

@sbueringer (Member)

PR to improve the CCT error: #8801

@sbueringer (Member) commented Jun 9, 2023

We have a new occurrence now.

Following results:

It's unclear to me at this point why KCP didn't see that the apiserver Pod was ready.

I think we can further triage this by collecting more data, e.g. by adding logs to KCP to surface the Pod object it is seeing.

A potential solution would be to requeue more often in KCP if KCP is not entirely healthy, but it would be good to first figure out what exactly is going on.

@adilGhaffarDev If you have time to work on this, feel free to go ahead. I unfortunately won't have time to work on this the next few weeks.

@adilGhaffarDev (Contributor, Author)

@adilGhaffarDev If you have time to work on this, feel free to go ahead. I unfortunately won't have time to work on this the next few weeks.

Yes I can take a look.

@sbueringer (Member)

@adilGhaffarDev Did you have time to investigate the issue?

@adilGhaffarDev (Contributor, Author)

@adilGhaffarDev Did you have time to investigate the issue?

Sorry, I didn't get time to look into it. I won't be able to work on it soon as I am on vacation right now. If someone has time, please check it.

@killianmuldoon (Contributor)

/help

@k8s-ci-robot (Contributor)

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label on Jul 12, 2023
@chrischdi (Member) commented Jul 21, 2023

We have a new occurrence now.

Following results:

It's unclear to me at this point why KCP didn't see that the apiserver Pod was ready.

I think we can further triage this by collecting more data, e.g. by adding logs to KCP to surface the Pod object it is seeing.

A potential solution would be to requeue more often in KCP if KCP is not entirely healthy, but it would be good to first figure out what exactly is going on.

@adilGhaffarDev If you have time to work on this, feel free to go ahead. I unfortunately won't have time to work on this the next few weeks.

Taking another look at the above:

| Timestamp | Object | What | Time Source |
| --- | --- | --- | --- |
| 00:14:07Z | 👷‍♀️ Test | Test starts waiting for the ControlPlane Node to exist | Build Log |
| 00:14:32Z | 🛳️ APIServer Pod | Pod got created | CreationTimestamp |
| 00:14:37Z | 👷‍♀️ Test | Test starts waiting for the ControlPlane Node to be ready | Build Log |
| 00:14:41Z | 🛳️ APIServer Pod | Pod got ready | Status.Conditions |
| 00:14:48Z | 🎮 CAPK controller | Entering reconcile() for the KCP | CAPK logs |
| 00:14:49Z | 🎮 CAPK controller | Entering reconcile() for the KCP | CAPK logs |
| 00:14:51Z | 🎮 CAPK controller | Entering reconcile() for the KCP | CAPK logs |
| 00:14:52Z | 🛳️ APIServer Pod | Pod's status got updated | ManagedFields |
| 00:14:57Z | 👷‍♀️ Test | Test waits for workload nodes | Build Log |
| 00:19:57Z | 👷‍♀️ Test | Test exits / failed | Build Log |

From reading the objects again and some trial and error:

  • The information that the kube-apiserver Pod got ready (according to .status.conditions) was patched onto the Pod object after the last reconcile of the CAPK controller.
  • At the time the CAPK controller reconciled the KCP object, it could have done a no-op and returned ctrl.Result{}, nil, while still having ControlPlaneComponentsUnhealthy set to false in .status.conditions.

The only sources for reconciliations for the KCP object are:

  • Changes to the KCP object itself
  • Changes to the Cluster object (via watch)
  • The sync period (periodic resync)

For this edge case I can see a few possible solutions:

  • Do a requeue if the ControlPlaneComponentsUnhealthy condition is set to false, to ensure the KCP object gets reconciled again (similar to what @sbueringer proposed above)
  • The direct alternative to this would be a watch on Machines
  • A watch on the Pods inside the workload cluster

cc @sbueringer / @killianmuldoon

@chrischdi (Member)

We also have a similar case where we already do a requeue:

```go
// TODO: remove this as soon as we have a proper remote cluster cache in place.
// Make KCP to requeue in case status is not ready, so we can check for node status without waiting for a full resync (by default 10 minutes).
// Only requeue if we are not going in exponential backoff due to error, or if we are not already re-queueing, or if the object has a deletion timestamp.
if reterr == nil && !res.Requeue && res.RequeueAfter <= 0 && kcp.ObjectMeta.DeletionTimestamp.IsZero() {
	if !kcp.Status.Ready {
		res = ctrl.Result{RequeueAfter: 20 * time.Second}
	}
}
```

But in this new case, kcp.Status.Ready is true.

I'll go forward and create a PR which would requeue in this new case too.
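
As a rough sketch of that direction (the actual change is in #9032; the condition name and exact guard below are assumptions modeled on the snippet above, not the merged code), the requeue could additionally trigger while a control plane component condition is still not healthy even though kcp.Status.Ready is already true:

```go
// Sketch only: extend the existing requeue so we also requeue while the
// control plane components are not (yet) reported healthy, even if
// kcp.Status.Ready is already true. Condition name and guard are assumptions.
if reterr == nil && !res.Requeue && res.RequeueAfter <= 0 && kcp.ObjectMeta.DeletionTimestamp.IsZero() {
	if !kcp.Status.Ready || !conditions.IsTrue(kcp, controlplanev1.ControlPlaneComponentsHealthyCondition) {
		res = ctrl.Result{RequeueAfter: 20 * time.Second}
	}
}
```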

@killianmuldoon (Contributor) commented Jul 21, 2023

The direct alternative to this would be a watch on Machines

The KCP controller also watches Machines, AFAIU through the Owns function in its builder. KCP is reconciled in response to changes on Machines.

@killianmuldoon (Contributor)

// TODO: remove this as soon as we have a proper remote cluster cache in place.

I'm not sure replicating this would be non-controversial; it sounds like it shouldn't be needed now that there is a proper remote cluster cache.

@chrischdi (Member) commented Jul 21, 2023

The direct alternative to this would be a watch on Machines

The KCP controller also watches Machines, AFAIU through the Owns function in its builder. KCP is reconciled in response to changes on Machines. I wonder why this reconcile is being missed in this case 🤔

Ah right. The point is not a Machine watch, but a watch on the Pods inside the workload cluster instead. (KCP itself updates the conditions for the Pods on the Machine.)

@killianmuldoon (Contributor)

The point is not a Machine watch, but a watch to the Pods inside the workload cluster instead.

I don't think this is a good idea - we'd end up caching too many pods. I'm fine with adding something like:

```go
res = ctrl.Result{RequeueAfter: 20 * time.Second}
```

But we should comment clearly that it's to avoid a watch on the pods, and maybe reassess the case in #8786 (comment) to see if it's actually needed anymore and update the comment to reflect the current state.

Thanks for taking a look at this one!

@chrischdi (Member)

I totally agree 👍. Created #9032 as a proposal to solve this case.

@sbueringer (Member)

I think a requeue is probably the less risky option. We could use a metadata-only watch for Pods, so we wouldn't have a memory problem, but I'm more than a bit concerned about the number of Pod events we would get (even if we filter down to only the most important Pods) and the corresponding reconciles.
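
For reference, a metadata-only watch with controller-runtime would look roughly like the sketch below. This is illustrative only: the enqueue mapping (podToKubeadmControlPlane) is a hypothetical helper, and a real implementation would have to watch Pods in the workload cluster via the remote cluster cache, which is not shown here.

```go
// Illustrative sketch of a metadata-only Pod watch; not the proposed change.
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
)

func (r *KubeadmControlPlaneReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&controlplanev1.KubeadmControlPlane{}).
		// builder.OnlyMetadata caches only ObjectMeta for Pods, so memory stays
		// bounded; the remaining concern is the volume of Pod events and the
		// reconciles they trigger.
		Watches(&corev1.Pod{},
			handler.EnqueueRequestsFromMapFunc(r.podToKubeadmControlPlane),
			builder.OnlyMetadata,
		).
		Complete(r)
}
```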

@sbueringer (Member)

/reopen

to wait for confirmation via CI?

@k8s-ci-robot (Contributor)

@sbueringer: Reopened this issue.

In response to this:

/reopen

to wait for confirmation via CI?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Jul 21, 2023
@chrischdi (Member) commented Jul 21, 2023

More flexible triage link to confirm later on:

Link

@chrischdi (Member)

That looks pretty good since merging :-)
[screenshot: k8s-triage results since the fix merged]

@chrischdi (Member)

This is fixed :-)

/close

@k8s-ci-robot (Contributor)

@chrischdi: Closing this issue.

In response to this:

This is fixed :-)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon (Contributor)

This is fixed :-)

Awesome work!

@sbueringer (Member)

Nice work on a tricky issue.

I especially like the k8s-triage links :D !!
