Max Node Provision Time + Priority Expander + Node Unavailability #3490
Comments
I encountered the same issue.
@daganida88 No, unfortunately. But thanks for confirming it’s not just me. Maybe now this will get some attention.
Any status on this? I'm facing the same issue.
Well, for me it worked. See the sample logs.
And then after 5 mins:
It then starts a node from the next group in the priority expander.
Now my Pod is scheduled.
@4pits what version of EKS and cluster-autoscaler are you using? How did you get the scale-up backoff to be 5 minutes? I believe it is 15 minutes by default. Did you change it?
Scale-up backoff always starts at 5 minutes and grows exponentially up to 30 minutes [1]. Regarding `i-placeholder` instances: CA operates on individual nodes, not a number of nodes. When it scales up, it tracks the status and timeouts individually for each node, and placeholder instances are needed to allow that. You can find more details in the discussions on #2008.
I don't see any configuration flags that would allow us to customize these values. How would I accomplish this: I have two priority groups. If the first priority fails to scale up within 5 minutes, give up on it, mark it as failed, replace it with the next highest priority, and don't attempt it again for 2 hours. Will I have to edit the source and build manually?
I think so; as you say, those values are consts and can't currently be changed via flags.
Yes @nitrag
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@nitrag I faced the same issue. Were you also facing a spot interruption in your fleet?
We haven't seen this since 1.19+ and even before then it was rare.
@nitrag I am using v1.18.2 on Kubernetes 1.18. It was a spot interruption, which I confirmed with AWS Support. Shouldn't cluster-autoscaler avoid using an ASG which it has marked unhealthy? In my case it was still using it.
Created a separate bug for the same: #4900
Heyyy just happened again on the same cluster as OP on 1.28 🎉
I have an issue where we did not scale up because spot instances were not available.
- AWS EKS: 1.15
- Cluster-Autoscaler: 1.15.7
- Expander: priority (ConfigMap sketch below)
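For context, the priority expander is driven by a ConfigMap read by cluster-autoscaler; below is a minimal sketch of that layout (ConfigMap `cluster-autoscaler-priority-expander` in `kube-system`, higher numeric key = higher priority). The ASG name patterns here are illustrative, not my actual groups.

```yaml
# Minimal priority-expander ConfigMap sketch.
# Higher numeric keys are preferred; each key maps to a list of
# regexes matched against node group (ASG) names.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*-spot-.*        # hypothetical: prefer spot node groups
    10:
      - .*-on-demand-.*   # hypothetical: fall back to on-demand groups
```

With a layout like this, when the spot groups are unavailable or backed off, the expander should fall through to the on-demand groups, which is the behavior discussed in this issue.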
I have configured `--max-node-provision-time="5m0s"`. The expected behavior would be that after 5 minutes the node group is marked as unavailable and the next highest priority group is chosen, but the logs below show the setting is being ignored. The default value for `--max-node-provision-time` is 15 minutes, yet I see cluster-autoscaler recognizing the node group as unhealthy at 10 minutes. Also, why does the priority expander still choose the spot node group later on when it has been marked unhealthy? And what is the purpose of the `i-placeholder` instances? Here are the logs:
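For reference, the flag above is set on the cluster-autoscaler container; a minimal sketch of the relevant part of the Deployment spec follows. Only `--expander=priority` and `--max-node-provision-time=5m0s` are my real settings; the image reference and remaining args are illustrative assumptions.

```yaml
# Sketch of the relevant container args in a cluster-autoscaler Deployment.
# Only --expander and --max-node-provision-time are taken from this issue;
# the image reference and other flags are illustrative assumptions.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.15.7  # illustrative image path
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --expander=priority              # use the priority-expander ConfigMap above
      - --max-node-provision-time=5m0s   # give up on a node that has not registered within 5 minutes
```

As discussed in the comments, the scale-up backoff applied after a failed provision is separate from this flag and cannot be changed via flags in this version.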