
ClusterCacheTracker: Memory leak when kubeconfig is rotated too often, and CAPI is running on the workload cluster #9542

Closed
ionutbalutoiu opened this issue Oct 12, 2023 · 2 comments · Fixed by #9543
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

ionutbalutoiu (Contributor) commented Oct 12, 2023

What steps did you take and what happened?

Steps to reproduce the error:

  • Deploy cluster-api with AWS provider.
  • Create an EKS cluster (more info here).
  • Move cluster-api components to the new workload cluster via clusterctl move command.
  • Add some MachinePool and AWSManagedMachinePool resources.
  • After some running time, observe capi-controller-manager logs being spammed, and pod memory increasing.

We have a v1.27.4 EKS cluster deployed via cluster-api, and we recently upgraded cluster-api from v1.3.7 to v1.5.2.

Everything went fine with the upgrade. However, after a week of running, I noticed the capi-controller-manager pod logs being spammed (about 300 log lines per minute) with:

W1010 12:56:12.276785       1 reflector.go:533] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: failed to list *v1.Node: Unauthorized
E1010 12:56:12.276815       1 reflector.go:148] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: Failed to watch *v1.Node: failed to list *v1.Node: Unauthorized
W1010 12:56:12.713455       1 reflector.go:533] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: failed to list *v1.Node: Unauthorized
E1010 12:56:12.713488       1 reflector.go:148] k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231: Failed to watch *v1.Node: failed to list *v1.Node: Unauthorized

and pod memory usage being high.

After restarting the pod, everything is fine again. But slowly, over time, the same issue reproduces.

What did you expect to happen?

Pod logs not being spammed, and pod memory not increasing over time.

Cluster API version

v1.5.2

Kubernetes version

v1.27.4-eks-2d98532

Anything else you would like to add?

I already investigated and fixed this issue, but I want to provide more context first.

When deploying EKS Kubernetes clusters via the cluster-api AWS provider (CAPA), two kubeconfig secrets are created:

  • <CLUSTER_NAME>-kubeconfig
  • <CLUSTER_NAME>-user-kubeconfig

The <CLUSTER_NAME>-user-kubeconfig is meant to be used by users, while <CLUSTER_NAME>-kubeconfig is used internally by CAPI and is rotated every sync-period, since it contains an expiring token for the EKS API. More details here.
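
One way to confirm this rotation is to watch the internal kubeconfig secret and note that it keeps changing while the user kubeconfig stays put. A minimal client-go sketch (the namespace default and cluster name my-cluster are placeholders, not from this setup):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
    // Build a client from the local kubeconfig (or in-cluster config).
    cfg, err := ctrl.GetConfig()
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // "default" and "my-cluster" are placeholders for the actual namespace and cluster name.
    secret, err := cs.CoreV1().Secrets("default").Get(context.TODO(), "my-cluster-kubeconfig", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    // The resourceVersion changes every time CAPA rotates the embedded token;
    // re-running this periodically shows the rotation cadence.
    fmt.Println("resourceVersion:", secret.ResourceVersion)
}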

When running cluster-api in the workload cluster, this code path causes a memory leak when <CLUSTER_NAME>-kubeconfig is rotated frequently:

// Create a client and a cache for the cluster.
c, uncachedClient, cache, err := t.createClient(ctx, config, cluster, indexes)
if err != nil {
    return nil, err
}

// Detect if the controller is running on the workload cluster.
// This function uses an uncached client to ensure pods aren't cached by the long-lived client.
runningOnCluster, err := t.runningOnWorkloadCluster(ctx, uncachedClient, cluster)
if err != nil {
    return nil, err
}

// If the controller runs on the workload cluster, access the apiserver directly by using the
// CA and Host from the in-cluster configuration.
if runningOnCluster {
    inClusterConfig, err := ctrl.GetConfig()
    if err != nil {
        return nil, errors.Wrapf(err, "error creating client for self-hosted cluster %q", cluster.String())
    }

    // Use CA and Host from in-cluster config.
    config.CAData = nil
    config.CAFile = inClusterConfig.CAFile
    config.Host = inClusterConfig.Host

    // Create a new client and overwrite the previously created client.
    c, _, cache, err = t.createClient(ctx, config, cluster, indexes)
    if err != nil {
        return nil, errors.Wrap(err, "error creating client for self-hosted cluster")
    }
    log.Info(fmt.Sprintf("Creating cluster accessor for cluster %q with in-cluster service %q", cluster.String(), config.Host))
} else {

Notice that in this part of the code, when CAPI is running on the workload cluster, t.createClient is called twice:

  1. // Create a client and a cache for the cluster.
    c, uncachedClient, cache, err := t.createClient(ctx, config, cluster, indexes)
  2. // Create a new client and overwrite the previously created client.
    c, _, cache, err = t.createClient(ctx, config, cluster, indexes)

However, the function t.createClient also starts a goroutine with the clusterAccessor cache:

// Start the cache!!!
go cache.Start(cacheCtx) //nolint:errcheck
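
To illustrate the pattern with a small, self-contained example (startCache and the surrounding names are illustrative, not CAPI code): a goroutine tied to a context that is never cancelled keeps running even after the object that "owns" it is overwritten.

package main

import (
    "context"
    "fmt"
    "runtime"
    "time"
)

// startCache stands in for t.createClient: it launches a background goroutine
// that only exits once its context is cancelled.
func startCache(ctx context.Context, name string) {
    go func() {
        <-ctx.Done()
        fmt.Println(name, "stopped")
    }()
}

func main() {
    // First "cache": its context is never cancelled, so its goroutine leaks.
    startCache(context.Background(), "first cache")

    // Second "cache" overwrites the first; only this one is ever stopped.
    ctx, cancel := context.WithCancel(context.Background())
    startCache(ctx, "second cache")
    cancel()

    time.Sleep(100 * time.Millisecond)
    // Only "second cache stopped" is printed; the first goroutine is still
    // blocked on <-ctx.Done() and will never exit.
    fmt.Println("goroutines still running:", runtime.NumGoroutine())
}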

Using more verbose logging, I noticed that immediately after the cache goroutine starts, a cluster nodes watcher is registered:

I1010 22:05:29.890187       1 reflector.go:287] Starting reflector *v1.Node (10h52m8.919938079s) from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231
I1010 22:05:29.902920       1 reflector.go:323] Listing and watching *v1.Node from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231

I believe this is happening because we have the following watcher already registered on the ClusterCacheTracker:

return r.Tracker.Watch(ctx, remote.WatchInput{
    Name:         "machinepool-watchNodes",
    Cluster:      util.ObjectKey(cluster),
    Watcher:      r.controller,
    Kind:         &corev1.Node{},
    EventHandler: handler.EnqueueRequestsFromMapFunc(r.nodeToMachinePool),
})

However, looking at the ClusterCacheTracker code, this watcher is supposed to be registered just once:

if accessor.watches.Has(input.Name) {
    log := ctrl.LoggerFrom(ctx)
    log.V(6).Info("Watch already exists", "Cluster", klog.KRef(input.Cluster.Namespace, input.Cluster.Name), "name", input.Name)
    return nil
}

But when I restart the capi-controller-manager, I see that the watcher registers twice for the same cluster. This is confirmed by seeing the following twice:

I1010 22:05:29.890187       1 reflector.go:287] Starting reflector *v1.Node (10h52m8.919938079s) from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231
I1010 22:05:29.902920       1 reflector.go:323] Listing and watching *v1.Node from k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231

So, one of the goroutines is leaked.

This is happening because, in the code path mentioned above, we start the cache twice. Before overwriting the cached client with the in-cluster configuration, we never stop the initial cache, so it keeps running in the background.

I noticed that stopping the cache would also stop the goroutine that watches the nodes continuously.
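
A sketch of the fix direction (the actual change is in #9543; the cache.Stop() call below is my assumption about the tracker's stoppable cache API), placed in the runningOnCluster branch right before the second t.createClient call:

// Stop the cache created for the initial client before overwriting it, so its
// goroutines (e.g. the Node reflector) are shut down instead of leaking.
cache.Stop()

// Create a new client and overwrite the previously created client.
c, _, cache, err = t.createClient(ctx, config, cluster, indexes)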

This wouldn't be a memory leak if the kubeconfig were not rotated frequently, since the clusterAccessor would never be re-created. However, when the loaded kubeconfig expires, healthCheckCluster fails and the clusterAccessor is deleted:

err := wait.PollUntilContextCancel(ctx, in.interval, true, runHealthCheckWithThreshold)
// An error returned implies the health check has failed a sufficient number of times for the cluster
// to be considered unhealthy or the cache was stopped and thus the cache context canceled (we pass the
// cache context into wait.PollUntilContextCancel).
// NB. Log all errors that occurred even if this error might just be from a cancel of the cache context
// when the cache is stopped. Logging an error in this case is not a problem and makes debugging easier.
if err != nil {
    t.log.Error(err, "Error health checking cluster", "Cluster", klog.KRef(in.cluster.Namespace, in.cluster.Name))
}

// Ensure in any case that the accessor is deleted (even if it is a no-op).
// NB. It is crucial to ensure the accessor was deleted, so it can be later recreated when the
// cluster is reachable again
t.deleteAccessor(ctx, in.cluster)

This causes the clusterAccessor to be recreated on the next Reconcile loop, and the troublesome code path leaks another goroutine with the cluster nodes watcher. This happens every time the loaded EKS kubeconfig expires and the clusterAccessor is recreated.
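
To put rough numbers on the compounding effect (assuming the kubeconfig is rotated roughly every 10 minutes, i.e. the default sync-period, which is an assumption on my part): one cache, with its Node reflector, leaks per accessor recreation, so a week of uptime accumulates on the order of 7 × 24 × 6 ≈ 1000 leaked reflectors, each holding its own informer state and retrying List/Watch with an expired token. That is consistent with memory usage and log volume that keep growing with uptime.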

The log spam comes from the fact that, after a long running time, there are many leaked goroutines trying to list cluster nodes with an expired kubeconfig.

Label(s) to be applied

/kind bug
/area clustercachetracker

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 12, 2023
killianmuldoon (Contributor) commented:

/triage accepted

Thanks for catching this and submitting a fix. I really appreciate the detailed description of the problem. Let's continue discussion primarily on the PR.

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 12, 2023
sbueringer (Member) commented Jan 25, 2024

Thx for the great issue description, really really nice!

Reviewing the PR now
