[cluster-autoscaler] More quickly mark spot ASG in AWS as unavailable if InsufficientInstanceCapacity #3241
Comments
Having the same issue here.
I think the title of this issue should be amended to include other holding states; for example, I'm running into a similar issue with a different one.
It's not just spot. Another example: you can hit your account limit on the number of instances of a specific instance type; that also isn't likely to change in the next 15 minutes, and it's best to try another ASG. A general understanding of failure states that are unlikely to change could be very helpful.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Super important!
Looking at the AWS API, it seems like there is no reliable way to find out that scaling out a particular ASG has failed.
I wouldn't expect anything that ties back to a single scale-up request.
Maybe look at the last activity (rather than all of them); if it's recent (for some definition of recent), assume the capacity isn't able to change right now and quickly fail over any scaling operation.
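As a rough illustration of what "look at the last activity" could mean (the ASG name here is a placeholder), the most recent scaling activity and its failure reason can be pulled with the AWS CLI:

```sh
# Fetch only the most recent scaling activity for the group; a failed launch
# shows up in StatusCode/StatusMessage (e.g. InsufficientInstanceCapacity) with its Cause.
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-spot-asg \
  --max-items 1 \
  --query 'Activities[0].{Status:StatusCode,Message:StatusMessage,Cause:Cause}'
```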
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Super important!
This is important for us too, same use case as OP.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
Any updates regarding this? It's super important for us and I'm sure for many others.
I think the 15-minute magic number is set by "--max-node-provision-time".
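For context, that is a real cluster-autoscaler flag whose default is 15 minutes; shortening it (the 5m below is only an illustrative value) makes the fallback to another ASG happen sooner, but it does not detect the capacity error itself:

```sh
# Shorten how long cluster-autoscaler waits for a requested node before it
# treats the scale-up as failed and backs off from that node group.
cluster-autoscaler \
  --cloud-provider=aws \
  --max-node-provision-time=5m   # default is 15m
```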
What if we improve detection of the "ASG can't be scaled up" case by sending notifications: `aws autoscaling put-notification-configuration --auto-scaling-group-name <value> --topic-arn <value> --notification-types "autoscaling:EC2_INSTANCE_LAUNCH_ERROR"`. Then we can subscribe an SQS queue to this topic and consume the launch-error events from there. As this approach requires some configuration effort, it should be disabled by default, but for use cases where fast detection of capacity errors matters it could be worth it.
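For reference, a rough sketch of the wiring described above using plain AWS CLI commands (topic, queue, and ASG names are placeholders, and how cluster-autoscaler would actually consume the queue is left out):

```sh
# Create an SNS topic and an SQS queue for ASG launch-error notifications.
TOPIC_ARN=$(aws sns create-topic --name asg-launch-errors --query TopicArn --output text)
QUEUE_URL=$(aws sqs create-queue --queue-name asg-launch-errors --query QueueUrl --output text)
QUEUE_ARN=$(aws sqs get-queue-attributes --queue-url "$QUEUE_URL" \
  --attribute-names QueueArn --query Attributes.QueueArn --output text)

# Deliver topic messages to the queue (the queue's access policy must also
# allow sns.amazonaws.com to SendMessage to it).
aws sns subscribe --topic-arn "$TOPIC_ARN" --protocol sqs --notification-endpoint "$QUEUE_ARN"

# Ask the ASG to publish launch errors (e.g. InsufficientInstanceCapacity) to the topic.
aws autoscaling put-notification-configuration \
  --auto-scaling-group-name my-spot-asg \
  --topic-arn "$TOPIC_ARN" \
  --notification-types "autoscaling:EC2_INSTANCE_LAUNCH_ERROR"
```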
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
We are using the "priority" expander in our autoscaler config, which doesn't solve this case.
Any updates on the fix for this case?
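For reference, a minimal sketch of what a priority-expander configuration like the one mentioned above typically looks like (the group-name regexes and namespace are assumptions); it only controls which ASG is tried first and, as noted, does not make the autoscaler give up on the spot group any faster:

```sh
# Hypothetical priority rules: prefer node groups matching ".*spot.*",
# fall back to ".*on-demand.*" when the higher-priority group can't be used.
kubectl apply -n kube-system -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*
    10:
      - .*on-demand.*
EOF
```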
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle rotten
Or at least a workaround? I can also verify it's not just spot. We're getting the same issue with a k8s cluster running on regular EC2 instances. We currently have three autoscaling groups in us-east-2a, us-east-2b, and us-east-2c that are stuck bouncing back and forth between max and max-1 because zone rebalancing failed due to capacity in that zone.
Was this not fixed by #4489, which has already been released?
/remove-lifecycle stale
There is also another related PR open: #5756.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
cc @drmorr0 @gjtempleton, can you confirm this can be closed?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
Yes, I believe this can be closed; that PR should resolve this.
/close
@drmorr0: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I have two ASGs: a spot ASG and an on-demand ASG. They are GPU nodes, so spot instances frequently aren't available. AWS tells us very quickly that a spot instance is unavailable: we can see "Could not launch Spot Instances. InsufficientInstanceCapacity - There is no Spot capacity available that matches your request. Launching EC2 instance failed" in the ASG logs.
The current behavior is that the autoscaler tries to use the spot ASG for 15 minutes (my current timeout) before it gives up and tries a non-spot ASG. Ideally, it would notice that the reason the ASG did not scale up, InsufficientInstanceCapacity, is unlikely to go away in the next 15 minutes, mark that group as unable to scale up, and fall back to the on-demand ASG instead.