
"Could not schedule pods, all available instance types exceed limits" during EC2 API outage #5937

Closed
joshuabaird opened this issue Mar 26, 2024 · 16 comments
Assignees
Labels
bug Something isn't working

Comments

@joshuabaird

joshuabaird commented Mar 26, 2024

Description

Observed Behavior:
During a recent EC2 API outage (on 3/26/24), we observed the following errors:

Could not schedule pod, all available instance types exceed limits for nodepool: "use1-production-01"

Our NodePools do have available resources. I believe this is a misleading error caused by the EC2 API being unavailable.

We also see expected errors such as:

listing instance types for use1-production-01, fetching instance types using ec2.DescribeInstanceTypes, InternalError: An internal error has occurred
	status code: 500

Expected Behavior:
A better error message should be displayed.

Versions:

  • Chart Version: 0.35.2
  • Kubernetes Version (kubectl version): 1.29/EKS
@joshuabaird joshuabaird added bug Something isn't working needs-triage Issues that need to be triaged labels Mar 26, 2024
@gvillafanetapia

same here 😞

It would also be great to have some contingency for this. Maybe Karpenter could cache the instance types and work around an outage? 🤔

@joshuabaird
Author

same here 😞

It would also be great to have some contingency for this. Maybe Karpenter could cache the instance types and work around an outage? 🤔

Yeah - in the case of this outage which seems to be affecting only Describe calls, I think some sort of caching mechanism on DescribeInstanceTypes could be helpful.

@bala151187

Any insight on this issue would be helpful! Thanks.

@GnatorX

GnatorX commented Mar 27, 2024

This was related to an AWS outage in us-east-1 for DescribeInstanceTypes and DescribeInstanceTypeOfferings. I haven't looked into what Karpenter needs this call for, or why. Does this information need to be updated continuously?

@engedaam
Contributor

engedaam commented Apr 3, 2024

kubernetes-sigs/karpenter#1165 should fix this erroneous log message

@jonathan-innis
Contributor

It seems like this issue is larger than just fixing the error message. It might be worth fixing this error message and then opening a larger issue for improvements in caching the instance types if one isn't open already.

@GnatorX

GnatorX commented Apr 3, 2024

This doesn't feel like a wrong error message. It seems like the cache isn't working, or we are periodically listing instance types.

@GnatorX

GnatorX commented Apr 3, 2024

I think it's because of this: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/provisioning/provisioner.go#L221. The scheduler performs GetInstanceTypes each time we create a new scheduler, which we do every time we evaluate scheduling.

This cache expires after 5 minutes: https://github.com/aws/karpenter-provider-aws/blob/main/pkg/cache/cache.go#L31, which IMO is way too short. I don't think instance types change that often.

@njtran
Contributor

njtran commented Apr 5, 2024

Our instance type cache does more than keep track of the names of the instance types: it keeps track of the allocatable resources of each of the offerings for a given set of inputs, and caches the outputs for each such set. This is because bits and pieces of other APIs modify how Karpenter thinks of instance types.

Realistically, we can probably increase this timeout, but how long did this issue impact you? Would something like 15 minutes be enough? 30 minutes? 2 hours?

@GnatorX

GnatorX commented Apr 5, 2024

The problem here isn't how long the cache TTL should be, but rather that some parts of the cache don't need to be refreshed at all, which would prevent issues like zonal outages from AWS. In this case, after 5 minutes Karpenter will start failing to perform any actions(?) even though the data it relies on is already there and won't change (instance types and instance type offerings).

It's true that the whole instanceTypeProvider cache does more than just cache instance types, and it actually contains multiple "sub"-caches, but not every part of the cache needs refreshing.

Breaking out the caches where some things are static is more of an availability improvement to Karpenter than a performance thing.

@JacobHenner

Realistically, we can probably increase this timeout, but how long did this issue impact you? Would something like 15 minutes be enough? 30 minutes? 2 hours?

When we observed this issue on 2024-03-26 it lasted for approximately 2h. See graphs below (time in EDT).

[two graphs attached showing the ~2h impact window]

@engedaam
Contributor

engedaam commented Apr 22, 2024

The problem here isn't about how long a cache should be rather some parts of the cache doesn't need be refreshed to prevent issues like zonal outages from AWS. In this case after 5 minutes Karpenter will start failing to perform any actions(?) even though the data it relies on to calculate is already there and won't change (instance type and instance type offering).

Karpenter now maintains a controller to hydrate the instance type data asynchronously. The instance type controller will be responsible for querying DescribeInstanceTypes and DescribeInstanceTypeOfferings and caching the data. The controller will attempt to refresh the data every 12 hours, and the cached data will only be updated upon a successful response. This will allow Karpenter to be more generally resilient. #6045

@GnatorX

GnatorX commented May 2, 2024

Nice, I think that addresses this issue.

@yaroslav-nakonechnikov

yaroslav-nakonechnikov commented May 3, 2024

My 5 cents for others who may face this error with another root cause: on 0.32.9 (at least), if a limit is defined like cpu: 0, it really means 0, and nothing will be scheduled, with the error "Could not schedule pods, all available instance types exceed limits".

For me it was a bit of a surprise, as in most other places that would mean unlimited.
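A minimal NodePool sketch of that pitfall, assuming the v1beta1 API used by these chart versions (names and values illustrative):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: use1-production-01
spec:
  limits:
    cpu: "0"  # means zero CPU may ever be provisioned, NOT unlimited;
              # omit the limit entirely to allow unbounded capacity
  template:
    spec:
      nodeClassRef:
        name: default
```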

@engedaam
Contributor

engedaam commented May 11, 2024

@yaroslav-nakonechnikov The current error message is misleading when Karpenter can't resolve any instance types from the NodePools. Just an FYI, here is the list of PRs that aim to address this issue

@engedaam
Contributor

Closing as all the tracked PRs are merged.

This issue was closed.

9 participants