
"Could not schedule pods, all available instance types exceed limits" during EC2 API outage #5937

Closed
joshuabaird opened this issue Mar 26, 2024 · 16 comments
Assignees
Labels
bug Something isn't working

Comments

@joshuabaird

joshuabaird commented Mar 26, 2024

Description

Observed Behavior:
During a recent EC2 API outage (on 3/26/24), we observed the following errors:

Could not schedule pod, all available instance types exceed limits for nodepool: "use1-production-01"

Our NodePools do have available resources. I believe this is a misleading error caused by the EC2 API being unavailable.

We also see expected errors such as:

listing instance types for use1-production-01, fetching instance types using ec2.DescribeInstanceTypes, InternalError: An internal error has occurred
	status code: 500

Expected Behavior:
A better error message should be displayed.

Versions:

  • Chart Version: 0.35.2
  • Kubernetes Version (kubectl version): 1.29/EKS
@joshuabaird joshuabaird added bug Something isn't working needs-triage Issues that need to be triaged labels Mar 26, 2024
@gvillafanetapia

same here 😞

It would also be great to have some contingency for this. Maybe Karpenter could cache the instance types and work around an outage? 🤔

@joshuabaird
Author

same here 😞

It would also be great to have some contingency for this. Maybe Karpenter could cache the instance types and work around an outage? 🤔

Yeah - in the case of this outage which seems to be affecting only Describe calls, I think some sort of caching mechanism on DescribeInstanceTypes could be helpful.

@bala151187

Any insight on this issue would be helpful! Thanks.

@GnatorX

GnatorX commented Mar 27, 2024

This was related to an AWS outage in us-east-1 for DescribeInstanceTypes and DescribeInstanceTypeOfferings. I haven't looked into what Karpenter needs this call for, or why. Does this information need to be updated continuously?

@engedaam
Contributor

engedaam commented Apr 3, 2024

kubernetes-sigs/karpenter#1165 should fix this erroneous log message

@jonathan-innis
Contributor

It seems like this issue is larger than just fixing the error message. It might be worth fixing this error message and then opening a larger issue for improvements in caching the instance types if one isn't open already.

@GnatorX

GnatorX commented Apr 3, 2024

This doesn't feel like a wrong error message. It seems like the cache isn't working, or we are periodically listing instance types.

@GnatorX

GnatorX commented Apr 3, 2024

I think it's because of this: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/provisioning/provisioner.go#L221. The scheduler performs GetInstanceTypes each time we create a new scheduler, which we do every time we evaluate scheduling.

This cache expires after 5 minutes: https://github.com/aws/karpenter-provider-aws/blob/main/pkg/cache/cache.go#L31, which IMO is way too short. I don't think instance types change that often.

@njtran
Contributor

njtran commented Apr 5, 2024

Our instance type cache does more than keep track of the names of the instance types: it keeps track of the allocatable resources of each of the offerings for a given set of inputs, and caches the outputs for each such set. This is because bits and pieces of other APIs modify how Karpenter thinks of instance types.

Realistically, we can probably increase this timeout, but how long did this issue impact you? Would something like 15 minutes be enough? 30 minutes? 2 hours?

@GnatorX

GnatorX commented Apr 5, 2024

The problem here isn't how long the cache TTL should be, but rather that some parts of the cache don't need to be refreshed at all, which would prevent issues like zonal outages from AWS. In this case, after 5 minutes Karpenter will start failing to perform any actions(?) even though the data it relies on is already there and won't change (instance types and instance type offerings).

It's true that the whole instanceTypeProvider cache does more than just cache instance types, and it actually contains multiple "sub"-caches, but not every part of the cache needs refreshing.

Breaking out the caches where some things are static is more of an availability improvement to Karpenter than a performance thing.

@JacobHenner

Realistically, we can probably increase this timeout, but how long did this issue impact you? Would something like 15 minutes be enough? 30 minutes? 2 hours?

When we observed this issue on 2024-03-26 it lasted for approximately 2h. See graphs below (time in EDT).

[two graphs attached showing the ~2h impact window]

@engedaam
Contributor

engedaam commented Apr 22, 2024

The problem here isn't about how long a cache should be rather some parts of the cache doesn't need be refreshed to prevent issues like zonal outages from AWS. In this case after 5 minutes Karpenter will start failing to perform any actions(?) even though the data it relies on to calculate is already there and won't change (instance type and instance type offering).

Karpenter now maintains a controller to hydrate the instance type data asynchronously. The instance type controller will be responsible for querying DescribeInstanceTypes and DescribeInstanceTypeOfferings and caching the data. The controller will attempt to refresh the data every 12 hours, and the cached data will only be updated upon a successful response. This will allow Karpenter to be more generally resilient. #6045

@GnatorX

GnatorX commented May 2, 2024

Nice, I think that addresses this issue.

@yaroslav-nakonechnikov

yaroslav-nakonechnikov commented May 3, 2024

My 5 cents for others who may face this error with another root cause: on 0.32.9 (at least), if a limit is defined like cpu: 0, it really means 0, and nothing will be scheduled, with the error "Could not schedule pods, all available instance types exceed limits".

For me it was a bit of a surprise, as in most other places that would mean unlimited.
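A minimal NodePool sketch of that pitfall, assuming the v1beta1 API used by these chart versions (names and values illustrative):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: use1-production-01
spec:
  limits:
    cpu: "0"  # means zero CPU may ever be provisioned, NOT unlimited;
              # omit the limit entirely to allow unbounded capacity
  template:
    spec:
      nodeClassRef:
        name: default
```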

@engedaam
Contributor

engedaam commented May 11, 2024

@yaroslav-nakonechnikov The current error message is misleading when Karpenter can't resolve any instance types from the NodePools. Just an FYI, here is the list of PRs that aim to address this issue

@engedaam
Contributor

Closing as all the tracked PRs are merged.

This issue was closed.

9 participants