
[cluster-autoscaler] More quickly mark spot ASG in AWS as unavailable if InsufficientInstanceCapacity #3241

Closed
cep21 opened this issue Jun 24, 2020 · 42 comments
Labels
area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

cep21 commented Jun 24, 2020

I have two ASGs: a spot ASG and an on-demand ASG. They are GPU nodes, so spot instances frequently aren't available. AWS tells us very quickly that a spot instance is unavailable: we can see "Could not launch Spot Instances. InsufficientInstanceCapacity - There is no Spot capacity available that matches your request. Launching EC2 instance failed" in the ASG logs.

The current behavior is that the autoscaler tries to use the spot ASG for 15 minutes (my current timeout) before it gives up and tries a non-spot ASG. Ideally, it would notice that the reason the ASG did not scale up, InsufficientInstanceCapacity, is unlikely to go away in the next 15 minutes, mark that group as unable to scale up, and fall back to the on-demand ASG.

qqshfox (Contributor) commented Aug 6, 2020

Having the same issue here.

_, err := m.service.SetDesiredCapacity(params)

SetDesiredCapacity will not return any error related to InsufficientInstanceCapacity, according to its documentation. We might need to check the scaling activities by calling DescribeScalingActivities instead.

{
    "Activities": [
        {
            "ActivityId": "ee05cf07-241b-2f28-2be4-3b60f77a76e9",
            "AutoScalingGroupName": "nodes-gpu-spot-cn-north-1a.aws-cn-north-1.prod-1.k8s.local",
            "Description": "Launching a new EC2 instance.  Status Reason: There is no Spot capacity available that matches your request. Launching EC2 instance failed.",
            "Cause": "At 2020-08-06T03:20:39Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
            "StartTime": "2020-08-06T03:20:43.979Z",
            "EndTime": "2020-08-06T03:20:43Z",
            "StatusCode": "Failed",
            "StatusMessage": "There is no Spot capacity available that matches your request. Launching EC2 instance failed.",
            "Progress": 100,
            "Details": "{\"Subnet ID\":\"subnet-5d6fb339\",\"Availability Zone\":\"cn-north-1a\"}"
        },
        ...
    ]
}
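
The check described above could be sketched roughly as follows. This is a minimal sketch, not cluster-autoscaler code: the `Activity` struct here mirrors only the relevant fields of the `autoscaling.Activity` type rather than importing the SDK, and the failure marker strings are assumptions based on the log messages quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// Activity mirrors the fields of an Auto Scaling activity record (as returned
// by DescribeScalingActivities) that matter for this check.
type Activity struct {
	StatusCode    string
	StatusMessage string
}

// permanentFailure reports whether a scaling activity failed for a reason that
// is unlikely to resolve within the scale-up timeout. The marker strings are
// assumptions taken from the error messages seen in this issue.
func permanentFailure(a Activity) bool {
	if a.StatusCode != "Failed" {
		return false
	}
	for _, marker := range []string{
		"InsufficientInstanceCapacity",
		"no Spot capacity available",
		"price-too-low",
	} {
		if strings.Contains(a.StatusMessage, marker) {
			return true
		}
	}
	return false
}

func main() {
	a := Activity{
		StatusCode:    "Failed",
		StatusMessage: "There is no Spot capacity available that matches your request. Launching EC2 instance failed.",
	}
	fmt.Println(permanentFailure(a)) // true
}
```

In practice this classification would run over the activities returned for the ASG after a scale-up attempt, and a match would mark the group as unable to scale up.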

@JacobHenner

I think the title of this issue should be amended to include other holding states. For example, I'm running into a similar issue with price-too-low. If the maximum spot price for my ASGs is below the current spot prices, cluster-autoscaler waits quite a while before it attempts to use a non-spot ASG.

cep21 (Author) commented Sep 18, 2020

It's not just spot. Another example: you can hit your account limit on the number of instances of a specific instance type. That is also unlikely to change in the next 15 minutes, and it's best to try another ASG.

A general understanding of failure states that are unlikely to change could be very helpful.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2020
cep21 (Author) commented Dec 18, 2020

Super important!
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2020
@klebediev

Looking at the AWS API, it seems there is no reliable way to find out that the scale-out for a particular SetDesiredCapacity call has failed. If SetDesiredCapacity returned an ActivityId for the scaling activity, that would work.
Otherwise, personally I can't come up with anything better than parsing autoscaling activities "younger" than my SetDesiredCapacity API call. That approach doesn't feel production-ready.
Any better ideas?

cep21 (Author) commented Dec 22, 2020

I wouldn't expect anything that ties back to a single SetDesiredCapacity call, since it's async and there could be multiple calls.

parsing autoscaling activities "younger" than my SetDesiredCapacity API call

Maybe look only at the last activity (rather than all of them): if it's recent (for some definition of recent), assume the capacity isn't able to change right now and quickly fail over any scaling operation.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 22, 2021
cep21 (Author) commented Mar 22, 2021

Super important!
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 22, 2021
itssimon commented May 3, 2021

This is important for us too, same use case as OP.

@k8s-triage-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2021
@azhurbilo

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2021
orsher commented Aug 4, 2021

Any updates regarding this? It's super important for us, and I'm sure for many others.
Also, where is this magic number of 15 minutes set? Is it configurable?

atze234 commented Nov 1, 2021

I think the 15-minute magic number is set by "--max-node-provision-time".
It would certainly be better, and a nice feature, to scan the scaling events and instantly mark the ASG as dead for the next x minutes.

@klebediev

What if we improve detection of the "ASG can't be scaled up" activity by sending "fails to launch" notifications to an SNS topic, like:

 $ aws autoscaling put-notification-configuration --auto-scaling-group-name <value> --topic-arn <value> --notification-types "autoscaling:EC2_INSTANCE_LAUNCH_ERROR"

Then we can subscribe an SQS queue to this topic, and cluster-autoscaler can start polling this SQS queue after initiating a scale-up activity.

As this approach requires some configuration effort, it should be disabled by default. But for use cases where fast detection of launch failures is valuable, as with spot ASGs, users can configure the corresponding infrastructure (SNS, SQS, ASG notifications) and enable this "fail fast" detection method.
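
The consumer side of that queue would mostly be parsing the notification payload. A sketch of that parsing step; the JSON field names follow the documented Auto Scaling notification payload, but treat them as assumptions to verify against a real notification:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// asgNotification holds the fields of an Auto Scaling SNS notification that
// matter for fail-fast detection.
type asgNotification struct {
	Event                string `json:"Event"`
	AutoScalingGroupName string `json:"AutoScalingGroupName"`
	StatusMessage        string `json:"StatusMessage"`
}

// isLaunchError reports whether a raw notification message signals a failed
// instance launch.
func isLaunchError(raw []byte) (bool, error) {
	var n asgNotification
	if err := json.Unmarshal(raw, &n); err != nil {
		return false, err
	}
	return n.Event == "autoscaling:EC2_INSTANCE_LAUNCH_ERROR", nil
}

func main() {
	msg := []byte(`{"Event":"autoscaling:EC2_INSTANCE_LAUNCH_ERROR","AutoScalingGroupName":"gpu-spot","StatusMessage":"There is no Spot capacity available that matches your request."}`)
	failed, err := isLaunchError(msg)
	fmt.Println(failed, err) // true <nil>
}
```

On a match, the autoscaler would mark the named group as unable to scale up and fall over immediately, rather than waiting out --max-node-provision-time.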

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 13, 2022
theintz commented Feb 2, 2023

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 2, 2023
@decipher27

We are using the "priority" expander in our autoscaler config, which doesn't solve this case.
If there is a rebalance recommendation on an ASG (spanning two AZs), sometimes spot is unavailable in one AZ, but the autoscaler doesn't fall back to the on-demand node group. Is there a way we can make the fallback to on-demand happen?

@decipher27

Any updates on the fix for this case?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2023
@RamazanKara

/remove-lifecycle rotten

ntkach commented Jun 28, 2023

Or at least a workaround? I can also verify it's not just spot. We're getting the same issue with a k8s cluster running on regular EC2 instances. We currently have 3 autoscaling groups using us-east-2a, us-east-2b, and us-east-2c that are stuck bouncing back and forth between max and max-1 because a zone rebalancing failed based on capacity in that zone.

@ddelange

Was this not fixed by #4489, released as of cluster-autoscaler 1.24.0?

@Shubham82 (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2023
@ddelange

There is also another related PR open: #5756.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2024
@towca towca added the area/provider/aws Issues or PRs related to aws provider label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 20, 2024
@Shubham82 (Contributor)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2024
ddelange commented Aug 5, 2024

was this not fixed by #4489 released as of cluster-autoscaler-1.24.0?

cc @drmorr0 @gjtempleton can you confirm this can be closed?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 4, 2024
drmorr0 (Contributor) commented Sep 4, 2024

Yes, I believe this can be closed, that PR should resolve this.

drmorr0 (Contributor) commented Sep 4, 2024

/close

@k8s-ci-robot (Contributor)

@drmorr0: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
