CABPK manifest is missing metadata #10165

Closed
pierreozoux opened this issue Feb 16, 2024 · 6 comments · Fixed by #10208
@pierreozoux

What steps did you take and what happened?

I updated Cluster API from 1.4.9 to 1.5.5.

In capi-kubeadm-bootstrap-controller-manager-59984b8784-q78pk, the only info/warning/error log line is:

I0216 09:59:20.071707       1 cluster_cache_tracker.go:161] "remote/ClusterCacheTracker: Couldn't find controller pod metadata, the ClusterCacheTracker will always access clusters using the regular apiserver endpoint"

I don't see errors in the other controllers.

But in capi-controller-manager-58fbc78955-g4njq, there is this error:

E0216 10:00:13.382563       1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="prod/prod-md-0-sm6mt" namespace="prod" name="prod-md-0-sm6mt" reconcileID=xxx MachineDeployment="prod/prod-md-0" Cluster="prod/prod" Machine="prod/prod-md-0-sm6mt-2mz62" node=""

I see this error twice for each worker, and I have 3 workers.

If I downgrade to 1.4.9, the error goes away. I see the same error on 1.5.5 and on 1.6.1.

What did you expect to happen?

No errors in the logs.

Cluster API version

1.5.5

Kubernetes version

1.27.9

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

@k8s-ci-robot added the kind/bug and needs-triage labels on Feb 16, 2024
@chrischdi (Member) commented on Feb 19, 2024:

Potentially related / improved / already fixed on main:

It was not cherry-picked back (only for CRS, to fix CI-related flakes) because we were not able to reproduce it for the other controllers in a real environment.

It would be awesome if someone could check whether this also resolves the issue. That would need more details on how exactly to reproduce it, though.

@chrischdi (Member):

Question: is the issue persistent for you, or does it resolve itself after ~17 minutes?

@fabriziopandini (Member) commented on Feb 27, 2024:

/triage accepted

remote/ClusterCacheTracker: Couldn't find controller pod metadata, the ClusterCacheTracker will always access clusters using the regular apiserver endpoint

This is because we are missing https://github.com/fabriziopandini/cluster-api/blob/0f47a19e038ee6b0d3b1e7675a62cdaf84face8c/controlplane/kubeadm/config/manager/manager.yaml#L28-L40 in the CABPK manifest (probably a leftover from a PR that started using ClusterCacheTracker in CABPK).

This should be fixed in main and backported as far as possible.

NOTE: This doesn't cause any functional problems for users, only degraded performance, since we end up using an additional client + client cache for self-hosted clusters.
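
For context, the linked KCP manager.yaml block injects the controller pod's own metadata into the manager container via the Kubernetes downward API, which is what lets the ClusterCacheTracker detect a self-hosted setup instead of always going through the regular apiserver endpoint. A minimal sketch of what the CABPK deployment manifest would need, assuming the same three variables as the KCP manifest (the actual change shipped in #10208 and may differ in detail):

```yaml
# Fragment of a CABPK manager Deployment spec -- a sketch modeled on the
# linked KCP manager.yaml, not copied from the final fix in #10208.
spec:
  template:
    spec:
      containers:
      - name: manager
        env:
        # Downward-API metadata the ClusterCacheTracker reads to identify
        # the pod it is running in:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
```

Without these variables, looking up the pod metadata fails at startup, which is exactly the "Couldn't find controller pod metadata" warning in the report above.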

Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already

This is not an issue per se. It is a transient condition that happens when two reconcile loops running in parallel try to get a client from the ClusterCacheTracker while the client does not exist yet: the reconcile loop creating the cluster accessor holds a per-cluster lock, and any other loop that needs the same cluster fails fast with the lock error instead of blocking.
The error can show up many times if the underlying connection is not stable and the reconcile loops/ClusterCacheTracker continuously try to recreate connections; each attempt takes 10s before timing out, and during those 10s the other reconcile loops get the lock error.

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Feb 27, 2024
@fabriziopandini (Member):

/good-first-issue

@k8s-ci-robot (Contributor):

@fabriziopandini:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have a zero-to-low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the good first issue and help wanted labels on Feb 27, 2024
@fabriziopandini changed the title from "failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" to "CABPK manifest is missing metadata" on Feb 27, 2024
@nikParasyr (Contributor):

/assign
