Startup Taint is sometimes not removed #1320

Closed
dpiddock opened this issue Apr 19, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dpiddock

/kind bug

What happened?
We use Karpenter as the cluster node autoscaler. We recently implemented the startup taint for the EFS CSI driver. Since then, we have occasionally seen nodes stuck with unschedulable pods. Upon investigation, the cause is the efs.csi.aws.com/agent-not-ready:NoExecute taint still being present on the node even though the efs-csi-node pod is running correctly.

The efs-plugin container has this log line:

E0419 06:03:32.149207       1 driver.go:134] "Unexpected failure when attempting to remove node taint(s)" err="the server rejected our request due to an error in our request"

What you expected to happen?
The efs-csi-node pod should successfully remove the node taint so that other pods can be scheduled.

How to reproduce it (as minimally and precisely as possible)?
We don't currently know. It's a rare and intermittent problem.

Anything else we need to know?:
I managed to find the failed request in the audit logs. At the same time, node-controller was processing a request to add the node.kubernetes.io/not-ready:NoExecute taint. This could be a classic race condition between the driver and other parts of the system, although this is a sample size of 1.

I attach the two entries:
efs-csi.json
node-controller.json

Environment

  • Kubernetes version (use kubectl version): v1.29.1-eks-b9c9ed7
  • EKS 1.29
  • Driver version: v1.7.6
  • EKS add-on: v1.7.6-eksbuild.2

Please also attach debug logs to help us better diagnose

  • Instructions to gather debug logs can be found here

results.tgz

@seanzatzdev-amazon
Contributor

Hi @dpiddock, thank you for bringing this to our attention. We are working with the author of the following PR to address this issue: #1287

Please let us know if you have any further questions or concerns.

@mteodori

mteodori commented Jun 5, 2024

Is this the same as #1273?

@dpiddock
Author

dpiddock commented Jun 5, 2024

This is the opposite of #1273. That issue reports that the taint is removed too early, before the service is really ready. This issue is about the startup taint sometimes not being removed because of a race condition.

@seanzatzdev-amazon
Contributor

I've merged #1287 into mainline to address this.
