
KeyVault 429 TooManyRequests led to infinite loop in reconciler workqueue #1483

Open
JasonXD-CS opened this issue Apr 1, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@JasonXD-CS

JasonXD-CS commented Apr 1, 2024

What steps did you take and what happened:
We have an Azure Key Vault whose average request rate is below the Key Vault throttling limit, and we recently ran into an outage after we started using the CSI Driver with auto rotation at a 3-hour interval.
A regular scale-up at peak hours triggered Key Vault throttling, and the throttling continued for hours until we disabled auto rotation.
As a result, none of our services could be created because the pods were stuck in the ContainerCreating state, and we had to revert to KeyVaultAgent.

After investigation we discovered a few issues with the reconciler implementation:

  1. The current auto rotation design is inefficient and does not scale because it rotates secrets per pod, which makes a lot of unnecessary requests. If a deployment runs 1000 replicas that each download 10 secrets, rotation makes 10,000 requests instead of the 10 it would make if it rotated per deployment.

  2. According to the post "Understanding how many calls are made to KeyVault?":

"In case of rotation, the list of all existing pods is generated and the contents in the mount are rotated linearly. A workqueue is created in the driver, all 10k pods added to queue and then the queue is processed"

However, the workqueue does not handle 429 responses and applies no exponential backoff.

```go
newObjectVersions, errorReason, err := secretsstore.MountContent(ctx, providerClient, string(paramsJSON), string(secretsJSON), spcps.Status.TargetPath, string(permissionJSON), oldObjectVersions)
if err != nil {
	r.generateEvent(pod, corev1.EventTypeWarning, mountRotationFailedReason, fmt.Sprintf("provider mount err: %+v", err))
	return fmt.Errorf("failed to rotate objects for pod %s/%s, err: %w", spcps.Namespace, spcps.Status.PodName, err)
}
```

So each task in the queue knows nothing about 429 throttling, and the workqueue just keeps processing requests without backing off, giving Key Vault no time to recover.
This is amplified when there are thousands of nodes pulling from the same Key Vault.

  3. The failed request is requeued to the workqueue after a 10-second delay with no maxRequeueLimit, which creates an infinite loop if Key Vault cannot recover from throttling (see the sketch below).
    https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/bf86dbf98ad3e32a0f55e52d6a411abd6784f7fb/pkg/rotation/reconciler.go#L242C1-L243C1
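
A minimal sketch of what proper backoff could look like, assuming the rotation workqueue is a standard client-go rate-limited workqueue keyed per SecretProviderClassPodStatus; the rate limiter settings, the maxRequeues cap, and the rotate helper below are hypothetical, not the driver's actual code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

const maxRequeues = 5 // hypothetical cap so a throttled item is eventually dropped

func main() {
	// Per-item exponential backoff: 10s, 20s, 40s, ... capped at 5 minutes.
	rateLimiter := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 5*time.Minute)
	queue := workqueue.NewNamedRateLimitingQueue(rateLimiter, "rotation")

	queue.Add("default/example-spcps") // hypothetical SPCPS key

	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		func() {
			defer queue.Done(key)

			if err := rotate(key.(string)); err != nil {
				if queue.NumRequeues(key) < maxRequeues {
					// Requeue with an exponentially increasing delay instead of a fixed 10s.
					queue.AddRateLimited(key)
					return
				}
				// Give up after maxRequeues attempts so a throttled Key Vault
				// cannot keep the item looping forever.
				fmt.Printf("dropping %v after %d failed attempts\n", key, maxRequeues)
			}
			queue.Forget(key) // reset the backoff counter on success (or when giving up)
		}()
	}
}

// rotate stands in for the per-SPCPS provider mount/rotation call (hypothetical).
func rotate(key string) error { return nil }
```

With NewItemExponentialFailureRateLimiter, each failing key backs off 10s, 20s, 40s, ... up to the cap, and Forget resets the counter, so a throttled Key Vault gets progressively more time to recover instead of being hit every 10 seconds.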

What did you expect to happen:

  1. Key Vault throttling (429 TooManyRequests) should be properly handled in the reconciler.
  2. Exponential backoff when processing rotation requests.
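
For the first point, a hedged sketch of classifying the provider error before requeueing; whether the Azure provider surfaces throttling as gRPC ResourceExhausted or as an error string containing "429"/"TooManyRequests" is an assumption here, not verified provider behavior:

```go
package main

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isThrottled reports whether a provider mount error looks like Key Vault
// throttling. Mapping gRPC ResourceExhausted and matching "429"/"TooManyRequests"
// in the message are assumptions about how the provider surfaces the error.
func isThrottled(err error) bool {
	if err == nil {
		return false
	}
	if status.Code(err) == codes.ResourceExhausted {
		return true
	}
	msg := err.Error()
	return strings.Contains(msg, "429") || strings.Contains(msg, "TooManyRequests")
}
```

The reconciler could then apply the rate-limited requeue (or honor a Retry-After hint) only for throttling errors and keep the existing handling for everything else.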

Anything else you would like to add:
Regarding item 1, the current design is not scalable.
Would love to hear from the team what the plan is for optimizing this going forward.

Also, does the polling interval start counting once all rotation requests are processed, or as soon as the SecretProviderClassPodStatus list is invoked?
If it starts as soon as the list is invoked and the tasks are added to the workqueue, does that mean a new iteration will add more rotation tasks to the workqueue even though the previous iteration hasn't finished processing them, so the queue just keeps piling up?
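
If the rotation workqueue is a standard client-go workqueue keyed per SecretProviderClassPodStatus (an assumption on my part), identical keys that are still waiting in the queue are coalesced, so re-listing the same objects should not by itself grow the queue without bound. A minimal illustration of that coalescing:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewNamed("rotation-demo")

	// Adding the same key twice while it is still waiting in the queue
	// does not create a second entry.
	q.Add("default/example-spcps")
	q.Add("default/example-spcps")

	fmt.Println(q.Len()) // prints 1: duplicate keys are coalesced
}
```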

Which provider are you using:
Azure Key Vault.

Environment:

  • Secrets Store CSI Driver version: (use the image tag): v1.4.1
  • Kubernetes version: (use kubectl version): v1.27.9
@JasonXD-CS JasonXD-CS added the kind/bug Categorizes issue or PR as related to a bug. label Apr 1, 2024
@JasonXD-CS JasonXD-CS changed the title KeyVault 429 TooManyRequests led to infinite loop in workqueue without exponential backoff KeyVault 429 TooManyRequests led to infinite loop in workqueue Apr 1, 2024
@JasonXD-CS JasonXD-CS changed the title KeyVault 429 TooManyRequests led to infinite loop in workqueue KeyVault 429 TooManyRequests led to infinite loop in reconciler workqueue Apr 3, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2024
@Gil-Tohar-Forter

Can this be prioritized? It doesn't make sense that a request is made to the vault for each pod; it should really be one per deployment, since all the pods for a given deployment are using the same secret.

With large-scale services, secret rotation becomes an unusable feature for fear of 429s from the secret provider, or of simply crashing the secret provider.

@aramase
Member

aramase commented Jul 25, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2024