CABPK manifest is missing metadata #10165

Closed
pierreozoux opened this issue Feb 16, 2024 · 6 comments · Fixed by #10208
@pierreozoux

What steps did you take and what happened?

I updated Cluster API from 1.4.9 to 1.5.5.

In capi-kubeadm-bootstrap-controller-manager-59984b8784-q78pk, the only info/warning/error log line is:

I0216 09:59:20.071707       1 cluster_cache_tracker.go:161] "remote/ClusterCacheTracker: Couldn't find controller pod metadata, the ClusterCacheTracker will always access clusters using the regular apiserver endpoint"

I don't see errors in the other controllers.

But in capi-controller-manager-58fbc78955-g4njq, there is this error:

E0216 10:00:13.382563       1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="prod/prod-md-0-sm6mt" namespace="prod" name="prod-md-0-sm6mt" reconcileID=xxx MachineDeployment="prod/prod-md-0" Cluster="prod/prod" Machine="prod/prod-md-0-sm6mt-2mz62" node=""

I see this error twice for each worker, and I have 3 workers.

If I downgrade to 1.4.9, the error goes away. I see the same error on 1.5.5 and on 1.6.1.

What did you expect to happen?

No errors in the logs.

Cluster API version

1.5.5

Kubernetes version

1.27.9

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

@k8s-ci-robot added the kind/bug and needs-triage labels on Feb 16, 2024
@chrischdi (Member) commented on Feb 19, 2024:

Potentially related / improved / already fixed on main:

It was not cherry-picked back (only for CRS, to fix CI-related flakes) because we were not able to reproduce it for the other controllers in a real environment.

It would be awesome if someone could check whether this also resolves the issue. That would need more details on how exactly to reproduce it, though.

@chrischdi (Member):

Question: is the issue persistent for you, or does it resolve itself after ~17 minutes?

@fabriziopandini (Member) commented on Feb 27, 2024:

/triage accepted

remote/ClusterCacheTracker: Couldn't find controller pod metadata, the ClusterCacheTracker will always access clusters using the regular apiserver endpoint

This is because we are missing https://github.com/fabriziopandini/cluster-api/blob/0f47a19e038ee6b0d3b1e7675a62cdaf84face8c/controlplane/kubeadm/config/manager/manager.yaml#L28-L40 in the CABPK manifest (probably a leftover from a PR that started using ClusterCacheTracker in CABPK).

This should be fixed in main and backported as far as possible.

NOTE: This doesn't cause any functional problems for users, only degraded performance, since we end up using an additional client + client cache for self-hosted clusters.
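
For context, the linked KCP manager.yaml block injects the controller pod's own metadata into the manager container via the Kubernetes downward API, which is what lets the ClusterCacheTracker detect a self-hosted setup instead of always going through the regular apiserver endpoint. A minimal sketch of what the CABPK deployment manifest would need, assuming the same three variables as the KCP manifest (the actual change shipped in #10208 and may differ in detail):

```yaml
# Fragment of a CABPK manager Deployment spec -- a sketch modeled on the
# linked KCP manager.yaml, not copied from the final fix in #10208.
spec:
  template:
    spec:
      containers:
      - name: manager
        env:
        # Downward-API metadata the ClusterCacheTracker reads to identify
        # the pod it is running in:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
```

Without these variables, looking up the pod metadata fails at startup, which is exactly the "Couldn't find controller pod metadata" warning in the report above.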

Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already

This is not an issue per se. It is a transient condition that happens when two reconcile loops running in parallel try to get a client from the ClusterCacheTracker while the client does not exist yet: the reconcile loop creating the cluster accessor holds a per-cluster lock, and any other loop that needs the same cluster fails fast with the lock error instead of blocking.
The error can show up many times if the underlying connection is not stable and the reconcile loops/ClusterCacheTracker continuously try to recreate connections; each attempt takes 10s before timing out, and during those 10s the other reconcile loops get the lock error.

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Feb 27, 2024
@fabriziopandini (Member):

/good-first-issue

@k8s-ci-robot (Contributor):

@fabriziopandini:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have a zero-to-low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the good first issue and help wanted labels on Feb 27, 2024
@fabriziopandini changed the title from "failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" to "CABPK manifest is missing metadata" on Feb 27, 2024
@nikParasyr (Contributor):

/assign
