Failing to mount secrets when a new node is scaled up #759
Here's some more detail as I watch another deployment:
I see that some pods are in ContainerCreating (over 120s!)
Thanks for reporting the issue! Unfortunately, this behavior isn't specific to the secrets-store-csi-driver implementation but rather a consequence of how workloads are scheduled in Kubernetes. The CSI driver needs to be running on the node for the volume mount request to be processed, but during a scale-up event there is no way to ensure all system pods (CSI driver, kube-proxy, and other system pods) are running before workload pods are scheduled. There was an enhancement proposal centered around this, kubernetes/enhancements#1003, but it was closed. Once the driver and provider pods are running on the new node, the pods waiting on the volume mount will eventually start.
Typically the images for some of these components are baked into the VHD image. If they're not present in the VHD, the image needs to be pulled, which is probably what you're seeing in the describe output.
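A quick way to check whether the driver and provider pods have actually come up on a freshly added node is sketched below; the node name is a placeholder, and the namespace depends on how the charts were installed:

```sh
# List the pods scheduled on the newly added node. The secrets-store-csi-driver
# and provider DaemonSet pods should be Running here before workload pods on
# that node can mount their secret volumes.
# Replace the node name with the node that was just scaled up, and adjust the
# namespace if the charts were installed outside kube-system.
kubectl get pods -n kube-system -o wide \
  --field-selector spec.nodeName=aks-nodepool1-00000000-vmss000003
```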
@aramase thank you for the quick response! I assume there's no known workaround for this? The behaviour we've observed over ~10 rollouts is that the pods never reach Running without manual intervention.
@sjdweb, do we know if the driver and provider pods were running at the time of the manual intervention?
@nilekhc yes, all the other pods relating to the secret provider, identity, etc. were running before the manual intervention.
Currently there is no workaround for this. Eventually the pod volume mounts will succeed once the drivers are running, because of the retries in kubelet.
@aramase unfortunately the pod volume mounts never succeed in our case. Today, for example, the containers were stuck in ContainerCreating for 35m.
Yeah, I agree that's not a great experience.
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 21 days with no activity. Feel free to re-open if you are experiencing the issue again.
@sjdweb we're running into this issue as well, where the pod volume mounts never succeed. Did you manage to find a workaround? It'd be great if the kubelet would actually retry and eventually succeed, but kicking the pods manually sucks, especially when scale-ups happen automatically with our cluster-autoscaler.
We're experiencing the same issue, and the only workaround we found is manual intervention by deleting the pod. When a new pod is scheduled, it is able to successfully mount the secrets. Does anyone have a workaround that avoids the need for manual intervention? This is important for workloads that are very dynamic (autoscaling).
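For what it's worth, a rough sketch of that manual workaround, assuming the stuck pods are the only Pending pods in the namespace (the namespace name is a placeholder; review the list before deleting anything):

```sh
# Pods stuck in ContainerCreating report phase=Pending. Once the driver and
# provider pods are up on the new node, deleting the stuck pods lets their
# replacements mount the secret volume successfully.
kubectl get pods -n my-namespace --field-selector=status.phase=Pending   # review first
kubectl delete pods -n my-namespace --field-selector=status.phase=Pending
```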
I've managed to fix this situation. In our case it was because the default Helm chart values do not match what our cluster expects. Examples:
@aramase shouldn't the documentation point out that, if you install through the Helm chart, you should set these values accordingly?
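As an illustration only (the specific values that fixed it above are not spelled out in this thread), overriding chart defaults at install time looks roughly like the following; the release, chart, and value names are assumptions, and the relevant settings depend on your cluster:

```sh
# Hypothetical example of aligning Helm chart values with the cluster.
# linux.kubeletRootDir is shown as one value that must match the node's
# kubelet path; check the chart's values.yaml for the settings that actually
# differ in your environment.
helm upgrade --install csi-secrets-store secrets-store-csi-driver/secrets-store-csi-driver \
  --namespace kube-system \
  --set linux.kubeletRootDir=/var/lib/kubelet
```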
What steps did you take and what happened:
Every time we deploy this Helm chart, we hit this problem.
Because this Helm chart creates around 200 pods across a number of deployments, a node scale-up is triggered on AKS.
Once pods are assigned to the new node(s), we see that they are unable to mount the secret volume.
Errors:
On pod:
In MIC:
What did you expect to happen:
The pod should run as expected on the new node.
Anything else you would like to add:
If I babysit the deployment and kill the pods after I know the nodes are healthy, the newly created pods are fine.
The problem here is that the pods that fail to mount (because the CSI driver is not found yet, or the mount times out) get stuck in a ContainerCreating state.
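A minimal way to confirm what the stuck pods are waiting on (pod and namespace names are placeholders):

```sh
# The Events section of describe shows the FailedMount / timeout messages for
# the pod, and filtering events by reason surfaces them across the namespace.
kubectl describe pod my-app-pod-abc123 -n my-namespace
kubectl get events -n my-namespace --field-selector reason=FailedMount
```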
Which access mode did you use to access the Azure Key Vault instance:
Pod Identity
Environment:
Kubernetes version (use `kubectl version` and `kubectl get nodes -o wide`): Azure AKS