"Could not schedule pods, all available instance types exceed limits" during EC2 API outage #5937
Comments
Same here 😞. It would also be great to have some contingency for this; maybe Karpenter could cache the instance types and work around the outage?
Yeah, in the case of this outage, which seems to be affecting only ...
Any insight on this issue would be helpful! Thanks.
This was related to an AWS outage in us-east-1 for DescribeInstanceTypes and DescribeInstanceTypeOfferings. I haven't looked into what this call is for and why Karpenter needs to make it. Does this information need to be updated continuously?
kubernetes-sigs/karpenter#1165 should fix this erroneous log message.
It seems like this issue is larger than just fixing the error message. It might be worth fixing the error message and then opening a larger issue for improvements in caching the instance types, if one isn't open already.
This doesn't feel like just a wrong error message. It seems like the cache isn't working, or we are periodically listing instance types.
I think it's because of this: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/provisioning/provisioner.go#L221. The scheduler performs this GetInstanceTypes call each time we create a new scheduler, which we do every time we evaluate scheduling. The cache expires after 5 minutes (https://github.com/aws/karpenter-provider-aws/blob/main/pkg/cache/cache.go#L31), which IMO is way too short. I don't think instance types change that often.
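For reference, a minimal sketch of the TTL-cache pattern being discussed, assuming a library like `patrickmn/go-cache`; the key name, TTL values, and cached data here are illustrative, not Karpenter's actual ones:

```go
package main

import (
	"fmt"
	"time"

	cache "github.com/patrickmn/go-cache"
)

func main() {
	// Entries expire 5 minutes after being written; expired entries are
	// purged every 10 minutes. After expiry, the next lookup misses and
	// forces a fresh DescribeInstanceTypes call, which fails during an outage.
	c := cache.New(5*time.Minute, 10*time.Minute)

	// Hypothetical key/value: cache the instance type list under a fixed key.
	c.SetDefault("instance-types", []string{"m5.large", "c5.xlarge"})

	if v, ok := c.Get("instance-types"); ok {
		fmt.Println("cache hit:", v)
	} else {
		fmt.Println("cache miss: would call DescribeInstanceTypes here")
	}
}
```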
Our instance type cache does more than keep track of the names of the instance types: it tracks the allocatable of each of the offerings for a given set of inputs, and caches the outputs for each such set. This is because bits and pieces of other APIs modify how Karpenter thinks about instance types. Realistically, we can probably increase this timeout, but how long did this issue impact you? Would something like 15 minutes be enough? 30 minutes? 2 hours?
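A rough sketch of why such a cache ends up keyed on more than the instance type name; the struct fields below are invented for illustration, standing in for the kind of inputs (kubelet configuration, AMI family, etc.) that change the computed allocatable:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Hypothetical inputs that affect how allocatable is computed for an
// instance type offering. Each distinct combination gets its own cache entry.
type instanceTypeKey struct {
	Name          string
	Zone          string
	AMIFamily     string
	KubeletConfig string // serialized kubelet overrides
}

func (k instanceTypeKey) hash() string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%s|%s", k.Name, k.Zone, k.AMIFamily, k.KubeletConfig)))
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	// The same instance type yields different cache entries for different
	// input sets, so a longer TTL applies across all of these keys at once.
	a := instanceTypeKey{"m5.large", "us-east-1a", "AL2", "maxPods=110"}
	b := instanceTypeKey{"m5.large", "us-east-1a", "Bottlerocket", "maxPods=110"}
	fmt.Println(a.hash(), b.hash())
}
```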
The problem here isn't how long the cache TTL should be, but rather that some parts of the cache don't need to be refreshed at all, which would prevent issues like zonal AWS outages from breaking Karpenter. In this case, after 5 minutes Karpenter will start failing to perform any actions(?), even though the data it relies on for its calculations is already there and won't change (instance types and instance type offerings). It's true that the whole instanceTypeProvider cache does more than just cache instance types, and it actually contains multiple "sub"-caches, but not every part of it needs refreshing. Breaking out the caches where the data is static is more of an availability improvement to Karpenter than a performance one.
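A sketch of the split being proposed, again assuming a go-cache-style library; the TTLs, field names, and keys are illustrative, not Karpenter's actual values:

```go
package main

import (
	"fmt"
	"time"

	cache "github.com/patrickmn/go-cache"
)

// Hypothetical split: near-static data (which instance types exist and in
// which zones they are offered) is kept far longer than derived data that
// depends on mutable configuration.
type instanceTypeProvider struct {
	// Refreshed rarely; a stale copy is still correct during an EC2 API
	// outage, so scheduling can continue from it.
	staticCache *cache.Cache
	// Computed allocatable per input set; cheap to recompute, safe to expire.
	derivedCache *cache.Cache
}

func newProvider() *instanceTypeProvider {
	return &instanceTypeProvider{
		staticCache:  cache.New(12*time.Hour, time.Hour),
		derivedCache: cache.New(5*time.Minute, 10*time.Minute),
	}
}

func main() {
	p := newProvider()
	p.staticCache.SetDefault("offerings/us-east-1a", []string{"m5.large"})
	if v, ok := p.staticCache.Get("offerings/us-east-1a"); ok {
		fmt.Println("still usable during an outage:", v)
	}
}
```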
Karpenter now maintains a controller to hydrate the instance type data asynchronously. The instance type controller is responsible for querying DescribeInstanceTypes and DescribeInstanceTypeOfferings and caching the data, and it will attempt to refresh the data every 12 hours. The instance type data is only updated upon a successful response. This will allow Karpenter to be more generally resilient. #6045
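A minimal sketch of that hydration pattern (periodic refresh, swap only on success); the function names and interval here are placeholders, not the actual controller code:

```go
package main

import (
	"context"
	"log"
	"sync/atomic"
	"time"
)

// fetchInstanceTypes stands in for the DescribeInstanceTypes /
// DescribeInstanceTypeOfferings calls made by the real controller.
func fetchInstanceTypes(ctx context.Context) ([]string, error) {
	return []string{"m5.large", "c5.xlarge"}, nil // placeholder data
}

// hydrate refreshes the snapshot on a fixed interval, replacing it only
// when the API call succeeds, so an outage leaves the last good copy intact.
func hydrate(ctx context.Context, snapshot *atomic.Value, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if types, err := fetchInstanceTypes(ctx); err != nil {
			log.Printf("refresh failed, keeping stale instance types: %v", err)
		} else {
			snapshot.Store(types)
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	var snapshot atomic.Value
	snapshot.Store([]string{}) // readers always see a consistent slice

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	go hydrate(ctx, &snapshot, 12*time.Hour)

	time.Sleep(100 * time.Millisecond) // let the first refresh run in this demo
	log.Println("current instance types:", snapshot.Load().([]string))
}
```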
Nice, I think that addresses this issue.
Five cents for others who may face this issue but with a different root cause: for me it was a bit of a surprise, as in most other places that would mean unlimited.
@yaroslav-nakonechnikov The current error message is misleading when Karpenter can't resolve any instance types from the NodePools. Just an FYI, here is a list of PRs that aim to address this issue:
Closing as all the tracked PRs are merged. |
Description
Observed Behavior:
During a recent EC2 API outage (on 3/26/24), we observed the following errors:
Our NodePools do have available resources. I believe this may be a case of a bad/incorrect error message caused by the EC2 API being unavailable.
We also see expected errors such as:
Expected Behavior:
A better error message should be displayed.
Versions:
- Kubernetes Version (`kubectl version`): 1.29/EKS