Support node start-up taint to avoid race conditions #1069

Closed
RyanStan opened this issue Jul 17, 2023 · 7 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@RyanStan
Contributor

RyanStan commented Jul 17, 2023

Is your feature request related to a problem? Please describe.
When new nodes frequently join the cluster, workload (application) Pods that require EFS volumes can be scheduled onto a new EC2 node before the efs-csi-node Pod on that node has finished initializing and become ready. This race between the workload Pod and the efs-csi-node Pod causes the workload Pod to fail to mount the PVC.

We should apply a startup taint to prevent this race condition. The efs-plugin container of the efs-csi-node DaemonSet would apply this taint to the node during driver initialization and remove it once the driver is ready.
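
To make the intended effect concrete, here is a minimal sketch of how a taint keeps a workload Pod off a node while a Pod that tolerates it can still land there. It is a simplified model of Kubernetes taint/toleration matching, not scheduler or driver code. The taint key, the `NoExecute` effect, and the assumption that the efs-csi-node Pod tolerates every taint are all illustrative and not taken from this issue.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed taint key, for illustration only; not taken from this issue.
STARTUP_TAINT_KEY = "efs.csi.aws.com/agent-not-ready"


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: Optional[str] = None  # None plus operator "Exists" matches every key
    operator: str = "Equal"    # "Equal" or "Exists"
    value: str = ""
    effect: str = ""           # empty string matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        return toleration.key is None or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(tolerations: List[Toleration], taints: List[Taint]) -> bool:
    """A Pod fits only if every hard taint on the node is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


new_node = [Taint(STARTUP_TAINT_KEY, "NoExecute")]  # driver not ready yet
workload_pod: List[Toleration] = []                 # ordinary application Pod
csi_node_pod = [Toleration(operator="Exists")]      # assumed: tolerates everything

print(can_schedule(workload_pod, new_node))  # False: blocked while tainted
print(can_schedule(csi_node_pod, new_node))  # True: the driver Pod can start
print(can_schedule(workload_pod, []))        # True: taint removed, race avoided
```

The workload Pod only becomes schedulable once the taint is gone, which is exactly the ordering the race condition needs.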

Describe the solution you'd like in detail
See the following PR on aws-ebs-csi-driver which implements this feature: kubernetes-sigs/aws-ebs-csi-driver#1581

From the overview of that PR:

This PR added support for start-up taint removal for csi-node daemonset pods. This feature allows cluster admin to set taints on nodes, blocking workload pod to be scheduled before the driver start up and be ready. By automatically remove the taints marks the node ready for any CSI functionalities and workload start to be scheduled to the node. This feature can be configured in driver options and in Helm values.
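
The removal step described in that overview can be sketched roughly as follows. This is only an illustration of the idea, working on a Node object represented as a plain dict; it is not the code from either driver. The taint key and the helper name `remove_startup_taint` are hypothetical. A real implementation would send the resulting taint list back to the API server, which is left out here.

```python
from typing import Any, Dict, List, Tuple

# Assumed taint key, for illustration only; not taken from this issue.
AGENT_NOT_READY_KEY = "efs.csi.aws.com/agent-not-ready"


def remove_startup_taint(
    node: Dict[str, Any], key: str = AGENT_NOT_READY_KEY
) -> Tuple[List[Dict[str, Any]], bool]:
    """Return the node's taints without `key`, and whether anything changed.

    The input node is not modified. The boolean lets a caller skip the
    update entirely when the taint was never set.
    """
    taints = (node.get("spec") or {}).get("taints") or []
    kept = [t for t in taints if t.get("key") != key]
    return kept, len(kept) != len(taints)


node = {
    "metadata": {"name": "example-node"},
    "spec": {
        "taints": [
            {"key": AGENT_NOT_READY_KEY, "effect": "NoExecute"},
            {"key": "dedicated", "value": "batch", "effect": "NoSchedule"},
        ]
    },
}

kept, changed = remove_startup_taint(node)
print(changed)                    # True
print([t["key"] for t in kept])   # ['dedicated']
```

Other taints on the node are left untouched, so admins can keep using their own taints alongside the start-up one.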

@RyanStan RyanStan changed the title Support node start-up taint Support node start-up taint to avoid race conditions Jul 17, 2023
@RyanStan
Contributor Author

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 21, 2023
@allamand

allamand commented Oct 6, 2023

The related fix on EBS is: kubernetes-sigs/aws-ebs-csi-driver#1588

@jumasa

jumasa commented Oct 20, 2023

I believe I'm encountering a similar issue. I can consistently reproduce it when the Horizontal Pod Autoscaler initiates a scale-up due to load metrics (that is, while the pods are receiving traffic).
However, when I manually scale up the deployment without any incoming traffic to the pods, and wait for all pods to become operational before sending traffic to them, the issue doesn't manifest.

@OverStruck

team please look into this

@mskanth972
Contributor

mskanth972 commented Nov 14, 2023

Our team has acknowledged this issue and it's currently on our list. We are actively working on it and aim to have it resolved and released by the end of this month.

@mskanth972
Contributor

PR for this feature. Will merge this and release in the coming version, most probably by end of this month.

@mskanth972
Contributor

The PR addressing this feature request (node start-up taint support to avoid race conditions) has been merged. The feature is included in the EFS CSI Driver as of version v1.7.2.


6 participants