
ClusterCacheTracker: Memory leak when kubeconfig is rotated too often, and CAPI is running on the workload cluster #9542

Closed
ionutbalutoiu opened this issue Oct 12, 2023 · 2 comments · Fixed by #9543
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

ionutbalutoiu (Contributor) commented Oct 12, 2023

What steps did you take and what happened?

Steps to reproduce the error:

  • Deploy cluster-api with AWS provider.
  • Create an EKS cluster (more info here).
  • Move cluster-api components to the new workload cluster via clusterctl move command.
  • Add some MachinePool and AWSManagedMachinePool resources.
  • After some running time, observe capi-controller-manager logs being spammed, and pod memory increasing.

We have a v1.27.4 EKS cluster deployed via cluster-api, and we recently upgraded cluster-api from v1.3.7 to v1.5.2.

Everything went fine with the upgrade. However, after a week of running, I noticed the capi-controller-manager pod logs being spammed (about 300 log lines per minute) with:

W1010 12:56:12.276785       1 reflector.go:533] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: failed to list *v1.Node: Unauthorized
E1010 12:56:12.276815       1 reflector.go:148] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: Failed to watch *v1.Node: failed to list *v1.Node: Unauthorized
W1010 12:56:12.713455       1 reflector.go:533] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: failed to list *v1.Node: Unauthorized
E1010 12:56:12.713488       1 reflector.go:148] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: Failed to watch *v1.Node: failed to list *v1.Node: Unauthorized

and pod memory usage being high.

After restarting the pod, everything is fine again. But slowly, over time, the same issue reproduces.

What did you expect to happen?

Pod logs not being spammed, and pod memory not increasing over time.

Cluster API version

v1.5.2

Kubernetes version

v1.27.4-eks-2d98532

Anything else you would like to add?

I already investigated and fixed this issue, but I want to provide more context first.

When deploying EKS Kubernetes clusters via the cluster-api AWS provider (CAPA), two kubeconfig secrets are created:

  • <CLUSTER_NAME>-kubeconfig
  • <CLUSTER_NAME>-user-kubeconfig

The <CLUSTER_NAME>-user-kubeconfig is meant to be used by users, while <CLUSTER_NAME>-kubeconfig is used internally by CAPI and is rotated every sync-period, since it contains an expiring token for the EKS API. More details here.
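
One way to confirm this rotation is to watch the internal kubeconfig secret and note that it keeps changing while the user kubeconfig stays put. A minimal client-go sketch (the namespace default and cluster name my-cluster are placeholders, not from this setup):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
    // Build a client from the local kubeconfig (or in-cluster config).
    cfg, err := ctrl.GetConfig()
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // "default" and "my-cluster" are placeholders for the actual namespace and cluster name.
    secret, err := cs.CoreV1().Secrets("default").Get(context.TODO(), "my-cluster-kubeconfig", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    // The resourceVersion changes every time CAPA rotates the embedded token;
    // re-running this periodically shows the rotation cadence.
    fmt.Println("resourceVersion:", secret.ResourceVersion)
}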

When running cluster-api in the workload cluster, this code path causes a memory leak when <CLUSTER_NAME>-kubeconfig is rotated frequently:

// Create a client and a cache for the cluster.
c, uncachedClient, cache, err := t.createClient(ctx, config, cluster, indexes)
if err != nil {
    return nil, err
}

// Detect if the controller is running on the workload cluster.
// This function uses an uncached client to ensure pods aren't cached by the long-lived client.
runningOnCluster, err := t.runningOnWorkloadCluster(ctx, uncachedClient, cluster)
if err != nil {
    return nil, err
}

// If the controller runs on the workload cluster, access the apiserver directly by using the
// CA and Host from the in-cluster configuration.
if runningOnCluster {
    inClusterConfig, err := ctrl.GetConfig()
    if err != nil {
        return nil, errors.Wrapf(err, "error creating client for self-hosted cluster %q", cluster.String())
    }

    // Use CA and Host from in-cluster config.
    config.CAData = nil
    config.CAFile = inClusterConfig.CAFile
    config.Host = inClusterConfig.Host

    // Create a new client and overwrite the previously created client.
    c, _, cache, err = t.createClient(ctx, config, cluster, indexes)
    if err != nil {
        return nil, errors.Wrap(err, "error creating client for self-hosted cluster")
    }
    log.Info(fmt.Sprintf("Creating cluster accessor for cluster %q with in-cluster service %q", cluster.String(), config.Host))
} else {

Notice that in this part of the code, when CAPI is running on the workload cluster, t.createClient is called twice:

  1. // Create a client and a cache for the cluster.
    c, uncachedClient, cache, err := t.createClient(ctx, config, cluster, indexes)
  2. // Create a new client and overwrite the previously created client.
    c, _, cache, err = t.createClient(ctx, config, cluster, indexes)

However, the function t.createClient also starts a goroutine with the clusterAccessor cache:

// Start the cache!!!
go cache.Start(cacheCtx) //nolint:errcheck
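
To illustrate the pattern with a small, self-contained example (startCache and the surrounding names are illustrative, not CAPI code): a goroutine tied to a context that is never cancelled keeps running even after the object that "owns" it is overwritten.

package main

import (
    "context"
    "fmt"
    "runtime"
    "time"
)

// startCache stands in for t.createClient: it launches a background goroutine
// that only exits once its context is cancelled.
func startCache(ctx context.Context, name string) {
    go func() {
        <-ctx.Done()
        fmt.Println(name, "stopped")
    }()
}

func main() {
    // First "cache": its context is never cancelled, so its goroutine leaks.
    startCache(context.Background(), "first cache")

    // Second "cache" overwrites the first; only this one is ever stopped.
    ctx, cancel := context.WithCancel(context.Background())
    startCache(ctx, "second cache")
    cancel()

    time.Sleep(100 * time.Millisecond)
    // Only "second cache stopped" is printed; the first goroutine is still
    // blocked on <-ctx.Done() and will never exit.
    fmt.Println("goroutines still running:", runtime.NumGoroutine())
}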

Using more verbose logging, I noticed that immediately after the cache goroutine starts, a cluster nodes watcher is registered:

I1010 22:05:29.890187       1 reflector.go:287] Starting reflector *v1.Node (10h52m8.919938079s) from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231
I1010 22:05:29.902920       1 reflector.go:323] Listing and watching *v1.Node from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231

I believe this is happening because we have the following watcher already registered on the ClusterCacheTracker:

return r.Tracker.Watch(ctx, remote.WatchInput{
    Name:         "machinepool-watchNodes",
    Cluster:      util.ObjectKey(cluster),
    Watcher:      r.controller,
    Kind:         &corev1.Node{},
    EventHandler: handler.EnqueueRequestsFromMapFunc(r.nodeToMachinePool),
})

However, looking at the ClusterCacheTracker code, this watcher is supposed to be registered just once:

if accessor.watches.Has(input.Name) {
    log := ctrl.LoggerFrom(ctx)
    log.V(6).Info("Watch already exists", "Cluster", klog.KRef(input.Cluster.Namespace, input.Cluster.Name), "name", input.Name)
    return nil
}

But when I restart the capi-controller-manager, I see that the watcher registers twice for the same cluster. This is confirmed by seeing the following twice:

I1010 22:05:29.890187       1 reflector.go:287] Starting reflector *v1.Node (10h52m8.919938079s) from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231
I1010 22:05:29.902920       1 reflector.go:323] Listing and watching *v1.Node from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231

So, one of the goroutines is leaked.

This is happening because, in the code path mentioned above, we start the cache twice. Before overwriting the cached client with the in-cluster configuration, we never stop the initial cache, so it keeps running in the background.

I noticed that stopping the cache would also stop the goroutine that watches the nodes continuously.
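
A sketch of the fix direction (the actual change is in #9543; the cache.Stop() call below is my assumption about the tracker's stoppable cache API), placed in the runningOnCluster branch right before the second t.createClient call:

// Stop the cache created for the initial client before overwriting it, so its
// goroutines (e.g. the Node reflector) are shut down instead of leaking.
cache.Stop()

// Create a new client and overwrite the previously created client.
c, _, cache, err = t.createClient(ctx, config, cluster, indexes)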

This wouldn't be a memory leak if the kubeconfig were not rotated frequently, since the clusterAccessor would never be re-created. However, when the loaded kubeconfig expires, healthCheckCluster fails and the clusterAccessor is deleted:

err := wait.PollUntilContextCancel(ctx, in.interval, true, runHealthCheckWithThreshold)
// An error returned implies the health check has failed a sufficient number of times for the cluster
// to be considered unhealthy or the cache was stopped and thus the cache context canceled (we pass the
// cache context into wait.PollUntilContextCancel).
// NB. Log all errors that occurred even if this error might just be from a cancel of the cache context
// when the cache is stopped. Logging an error in this case is not a problem and makes debugging easier.
if err != nil {
    t.log.Error(err, "Error health checking cluster", "Cluster", klog.KRef(in.cluster.Namespace, in.cluster.Name))
}

// Ensure in any case that the accessor is deleted (even if it is a no-op).
// NB. It is crucial to ensure the accessor was deleted, so it can be later recreated when the
// cluster is reachable again
t.deleteAccessor(ctx, in.cluster)

This causes the clusterAccessor to be recreated on the next Reconcile loop, and the troublesome code path leaks another goroutine with the cluster nodes watcher. This happens every time the loaded EKS kubeconfig expires and the clusterAccessor is recreated.
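
To put rough numbers on the compounding effect (assuming the kubeconfig is rotated roughly every 10 minutes, i.e. the default sync-period, which is an assumption on my part): one cache, with its Node reflector, leaks per accessor recreation, so a week of uptime accumulates on the order of 7 × 24 × 6 ≈ 1000 leaked reflectors, each holding its own informer state and retrying List/Watch with an expired token. That is consistent with memory usage and log volume that keep growing with uptime.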

The log spam comes from the fact that, after a long running time, there are many leaked goroutines trying to list cluster nodes with an expired kubeconfig.

Label(s) to be applied

/kind bug
/area clustercachetracker

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 12, 2023
killianmuldoon (Contributor) commented:

/triage accepted

Thanks for catching this and submitting a fix. I really appreciate the detailed description of the problem. Let's continue discussion primarily on the PR.

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 12, 2023
sbueringer (Member) commented Jan 25, 2024

Thx for the great issue description, really really nice!

Reviewing the PR now
