AWS PV Storage Causing API Timeouts. #39526
Comments
@saad-ali can you assign this to me and a member of the storage team? Also, this is a P0; sorry, but you know that if I write that, it actually is a P0. We have two other reports of this already, and we have multiple clusters with this behavior. |
We have around 50 EBS volumes. Ever since we upgraded from 1.3.8 to 1.4.7, the controller-manager has been making DescribeInstances calls at a rate of 60k/hour. |
Now mysteriously the flood of DescribeInstances calls stopped 14 hours after the upgrade. We are now seeing only about 1.5k calls/hour and no RequestLimitExceededs. |
@guoshimin ours continues to flood. We have multiple clusters in a Region, and it is bad. Once the rate limits start causing timeouts, the controller starts to retry, and it gets worse. Talking with @justinsb, we should only be seeing a lot of API use around volume creation, not a ton when we are at steady state. |
Taking a look. Just a thought: We have 1.5.2 going out on Tuesday. As a quick and dirty fix we could disable the volumes attached check for AWS until we have a more robust fix out for AWS. I'll sync with @jingxu97 and folks offline and then circle back with short and long term recommendations. |
Hey all. Trying to get some logs with timeouts and of course I cannot:
This is the 1.4.7 code base. It appears that we are doing 6 DescribeInstances calls in parallel, which matches the cluster size. Then we are reconciling all of the volumes. You can hit the API limit for DescribeInstances fast. |
See https://gist.github.com/chrislovecnm/9c8482b6e82fec13a21bee5cd9a19a60 for a longer snippet |
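To make the scaling concrete, here is a rough sketch, not the actual reconciler code; the node names, the describeInstance placeholder, and the 5-second period are illustrative assumptions. It shows one cloud call per node fired in parallel on every sync tick:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4", "node-5", "node-6"}
	syncPeriod := 5 * time.Second // the shorter this is, the more API calls per hour

	describeInstance := func(node string) {
		// Placeholder for a cloud API call such as EC2 DescribeInstances.
		fmt.Println("DescribeInstances for", node)
	}

	for tick := 0; tick < 2; tick++ { // two sync ticks, just for demonstration
		var wg sync.WaitGroup
		for _, n := range nodes {
			wg.Add(1)
			go func(n string) { // one call per node, all launched in parallel
				defer wg.Done()
				describeInstance(n)
			}(n)
		}
		wg.Wait()
		time.Sleep(syncPeriod)
	}

	// 6 nodes polled every 5 seconds is 6 * 720 = 4320 calls per hour from a
	// single controller, before counting retries caused by throttling.
	fmt.Println("calls per hour:", len(nodes)*int(time.Hour/syncPeriod))
}
```

With this shape, adding nodes or shortening the sync period multiplies the call rate directly.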
@saad-ali we need to talk about the consequences of that option, I think @gnufied may have a couple of ideas. Should all of us hop on a zoom call this afternoon real quick? I know that @kris-nova is probably going to want to help looking a long term fix. Also, can we get the patch backported to 1.4.8? |
Also, a restart of the controller may sort of fix this ... I am wondering if we have a race condition. I am not seeing timeouts on the cluster that I am pulling logs from. It previously had timeouts. |
+1 for a call. I'm very much interested in helping out with this issue, ideally coding some or all of it. |
I did some investigation, and as part of that I am thinking we should make the SyncInterval user configurable. Also, since we allow parallel executions of that check, it is possible to flood EC2 with API requests while earlier ones haven't even completed. :( |
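A minimal sketch of one way to avoid the overlap, assuming a simple atomic in-flight flag; this is an illustration, not the project's eventual fix. If the previous verification is still running when the next tick fires, the tick is skipped instead of stacking more API requests:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var inFlight int32
	syncPeriod := 5 * time.Second

	verifyVolumesAttached := func() {
		// Placeholder for the expensive cloud check (e.g. DescribeInstances per node).
		time.Sleep(8 * time.Second) // deliberately longer than the sync period
	}

	ticker := time.NewTicker(syncPeriod)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		// Only start a new check if the previous one has finished.
		if !atomic.CompareAndSwapInt32(&inFlight, 0, 1) {
			fmt.Println("previous check still running; skipping this tick")
			continue
		}
		go func() {
			defer atomic.StoreInt32(&inFlight, 0)
			verifyVolumesAttached()
		}()
	}
}
```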
A quick workaround is to restart your controller-manager; that will reduce those requests. :D I guess that is what might have reduced the API calls for @guoshimin |
Chris let's do a call from 2:30 to 3:30 PM PT |
I did restart the controller-manager a number of times. The first couple of times didn't seem to have any effect on the request rate though. How exactly does restarting help reduce the request rate? |
To summarize, we suspect that the check introduced in #34859 is causing AWS to quickly hit quota limits. The current plan is to make two changes:
@gnufied is working on making these changes, hopefully within the next two hours. Once these changes are merged, 1.5.2 is planned to be cut on January 10. |
@saad-ali quick update: @chrislovecnm has volunteered to do the patch instead of me. I will review, etc. |
Ack, let's get the changes in ASAP and cherrypicked so that 1) we have time to verify the fix, and 2) we have time for the fix to bake over the weekend before the release. 1.5.2 with these changes should go out on Tuesday, Jan 10. Speaking with @justinsb it sounds like a better long term fix is to batch requests to the AWS API--we don't have enough time to get that in for the immediate releases, but hopefully can get this in subsequent releases. |
@saad-ali - batching is not an acceptable solution--given that these calls will still incur the same 'cost' in terms of rate limiting on the AWS side of things, all this will do is concentrate pain. We need a better solution overall and a more intelligent means of accomplishing the goals related to the underlying need to query AWS (or maybe any underlying infrastructure) in 'panic mode' constantly. |
AWS should really be a fallback, shouldn't it? What's stopping us from asking a local source of truth, i.e. the OS? We should be able to figure out what the attachment point is if we trust the kernel... So couldn't we check fdisk -l, and only if that poll hits some threshold, query AWS? If the volume is gone, relaunch the container, alert someone, or emit some kind of event. The point is to reduce the number of calls instead of inflicting them in a batch... that would seem to avoid, or at least reduce, the AWS dependency. We can then do a long-term poll that makes a far less frequent call to get the state of things for reconciliation. This would seem to have a twofold benefit: |
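A minimal sketch of the local-first idea, assuming the kernel's device nodes as the source of truth; the device path /dev/xvdf and the checkWithAWS fallback are hypothetical:

```go
package main

import (
	"fmt"
	"os"
)

// locallyAttached reports whether the block device node exists, using the
// kernel as the local source of truth.
func locallyAttached(devicePath string) bool {
	_, err := os.Stat(devicePath)
	return err == nil
}

// verifyAttached checks the local device first and only falls back to the
// cloud provider (e.g. DescribeVolumes) when the local check fails.
func verifyAttached(devicePath string, checkWithAWS func() (bool, error)) (bool, error) {
	if locallyAttached(devicePath) {
		return true, nil // no AWS call needed
	}
	return checkWithAWS()
}

func main() {
	attached, err := verifyAttached("/dev/xvdf", func() (bool, error) {
		// Hypothetical fallback; a real implementation would call the EC2 API here.
		return false, nil
	})
	fmt.Println(attached, err)
}
```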
@saad-ali is this code AWS-specific, or shared across cloud providers? I believe it is part of the shared code base. |
Also seeing this and we only have a handful of volumes. Makes volumes unusable atm. |
I really don't trust any optimisations in the AWS code anymore, as long as calls to AWS remain synchronously triggered from non-AWS code. Regressions like this one keep happening, making AWS usage of k8s unstable. I think the AWS code should move to updating its state of the VPC periodically and asynchronously, then respond to calls with cached data. Then we can strictly control how often AWS calls are made and defend against these sorts of problems. Batching requests in AWS only counts as a single request; @justinsb is correct. |
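A sketch of that cache-and-refresh shape, with made-up types (attachmentState and cloudStateCache are not real Kubernetes types): callers only read cached data, and a single background loop is the only code path that talks to AWS, so caller traffic can never translate directly into API calls:

```go
package main

import (
	"sync"
	"time"
)

type attachmentState map[string][]string // node -> attached volume IDs (assumed shape)

type cloudStateCache struct {
	mu    sync.RWMutex
	state attachmentState
}

// Get returns the last snapshot without touching the cloud provider.
func (c *cloudStateCache) Get() attachmentState {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.state
}

// refreshLoop is the only place that calls the cloud provider.
func (c *cloudStateCache) refreshLoop(fetch func() attachmentState, every time.Duration) {
	for {
		fresh := fetch() // e.g. one batched describe call per cycle
		c.mu.Lock()
		c.state = fresh
		c.mu.Unlock()
		time.Sleep(every)
	}
}

func main() {
	cache := &cloudStateCache{state: attachmentState{}}
	go cache.refreshLoop(func() attachmentState {
		// Placeholder for the real AWS query.
		return attachmentState{"node-1": {"vol-123"}}
	}, time.Minute)

	_ = cache.Get() // callers never trigger AWS calls directly
	time.Sleep(10 * time.Millisecond)
}
```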
Can someone label this issue with sig/aws? |
We just held a meeting to discuss the short-term fixes for this issue. The options discussed were summarized here. To summarize the meeting:
I'd like to hear ASAP from those of you who weren't able to make it (@justinsb), if there are any disagreements. CC @smarterclayton @chrislovecnm @liggitt @jingxu97 @gnufied @kris-nova @matchstick |
Are we just adding a single flag?? |
Let's keep both: enable/disable, and duration. Add a warning to both indicating that if the check is disabled altogether, or if the duration is set too high, some volume types, like AWS, may end up mounting the wrong volume in some cases. |
Added warnings on both. |
Plan sounds great to me! |
Automatic merge from submit-queue (batch tested with PRs 39628, 39551, 38746, 38352, 39607) Increasing times on reconciling volumes fixing impact to AWS. **What this PR does / why we need it**: We are currently blocked by API timeouts with PV volumes. See #39526. This is a workaround, not a fix. **Special notes for your reviewer**: A second PR will be dropped with CLI cobra options in it, but we are starting with increasing the reconciliation periods. I am dropping this without major testing and will test on our AWS account. Will be marked WIP until I run smoke tests. **Release note**: ```release-note Provide kubernetes-controller-manager flags to control volume attach/detach reconciler sync. The duration of the syncs can be controlled, and the syncs can be shut off as well. ```
Can we get the P0 tag removed? |
I am moving the priority to P1 as it looks appropriate right now. |
Automatic merge from submit-queue (batch tested with PRs 39625, 39842) AWS: Remove duplicate calls to DescribeInstance during volume operations This change removes all duplicate calls to describeInstance from aws volume code path. **What this PR does / why we need it**: This PR removes the duplicate calls present in disk check code paths in AWS. I can confirm that `getAWSInstance` actually returns all instance information already and hence there is no need of making separate `describeInstance` call. Related to - #39526 cc @justinsb @jsafrane
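For illustration only, with simplified, hypothetical types rather than the actual PR diff: the change amounts to reading attachment information from the instance description that was already fetched, instead of issuing a second describe call for the same node.

```go
package main

import "fmt"

type instance struct {
	ID      string
	Volumes []string // block device mappings already returned by the first describe call
}

// diskIsAttached reuses the instance we already have instead of re-querying AWS.
func diskIsAttached(inst *instance, volumeID string) bool {
	for _, v := range inst.Volumes {
		if v == volumeID {
			return true
		}
	}
	return false
}

func main() {
	inst := &instance{ID: "i-0abc", Volumes: []string{"vol-123"}}
	fmt.Println(diskIsAttached(inst, "vol-123")) // true, with zero extra API calls
}
```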
Any way we can get this merged to v1.5.3? |
We need to hold this open for re-architecture. This issue will cause us to have a theoretical limit on node sizes and the number of PVs. @jingxu97 can we determine a design pattern for this? |
Just a quick note to say this is not specific to AWS. Running in vSphere, after around ~40 PVs the sheer number of reconciliation API requests killed the vSphere API and maxed out the CPU on both the vSphere API host and the kube master node running the controller manager. |
This might be caused by syncing state every 5 seconds through the cloud provider in the volume controller. PR #41363 changed the default value to 1 minute. It was also cherry-picked to 1.5.3. Please give it a try after upgrading and kindly let us know whether there is still any issue. Thanks!
|
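Back-of-the-envelope arithmetic, assuming roughly one cloud API call per node per sync tick; the 40-node figure is only borrowed from the vSphere report above:

```go
package main

import (
	"fmt"
	"time"
)

// callsPerHour assumes roughly one cloud API call per node per sync tick.
func callsPerHour(nodes int, syncPeriod time.Duration) int {
	return nodes * int(time.Hour/syncPeriod)
}

func main() {
	fmt.Println("5s sync:", callsPerHour(40, 5*time.Second), "calls/hour") // 28800
	fmt.Println("1m sync:", callsPerHour(40, time.Minute), "calls/hour")   // 2400
}
```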
Closing |
@jakexks btw, if polling for individual nodes is killing the vSphere server, the vSphere volume plugin should implement the BulkPolling API introduced in https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L177 cc @divyenpatel |
Kubernetes version
Above 1.3.8. It has been reported that this bug is in the 1.4.0 branch and above. I have validated on 1.4.7 and 1.4.6. We have not validated the 1.5.x branch, and an exponential backoff patch (#38766) was just merged, but we do not believe that it addresses the root problem.
Environment
Kernel (uname -a): Custom kernel maintained by the AWS team: Linux 4.4.26-k8s #1 SMP Fri Oct 21 05:21:13 UTC 2016 x86_64 GNU/Linux
What happened:
As you add more PV storage to a cluster, the cluster starts to spam the API heavily. It has been reported that as few as 20 attached PVs will start causing EC2 API timeouts. This is a showstopper issue that is making PV attached storage unusable in AWS. As the AWS account nears its limit on API calls this problem cascades, to the point that retries flood the API. One of our accounts is at around 24k calls per hour.
The controller starts to make API calls at such a high rate that it starts to retry, and then you just spam the API. The controller is making far too many API calls to validate that a node exists and that a volume is attached to a node. The specific call that is timing out the most is `DescribeInstances`. The cluster is at steady state. We do not have volume churn, i.e. we are not adding and removing volumes.
What you expected to happen:
Being able to have 500 PVs attached to a cluster without killing the EC2 API.
How to reproduce it (as minimally and precisely as possible):
Set log verbosity to -v 11 and restart the controller. You will see the timeouts showing up in the logs of the controller. I have only tested in HA, but will validate in a single-master setup shortly.
Anything else do we need to know:
This is a bad enough issue that it is crashing controllers.
AWS rate-limits by API call and by failed API calls. Some API calls are limited per region, and some are account-wide.
The code that is getting called is from here:
https://github.com/kubernetes/kubernetes/blob/d97f125ddf222634948f9d2448e5316180b736dd/pkg/controller/volume/attachdetach/reconciler/reconciler.go
TL;DR:
Anyone that is using 1.4.0+ in AWS will exceed their rate limits with as few as 20 PVs attached to a cluster. We have one account that is at about 24k API calls per hour because of timeouts. This makes PVs unusable on AWS.