Request throttling for AWS #31858

Closed · anioool opened this issue Sep 1, 2016 · 28 comments
Labels: priority/backlog

@anioool commented Sep 1, 2016

Hi,

We are facing a huge problem with request throttling to AWS, especially for the Describe methods. Our cluster has more than five hundred volumes attached, and depending on the load we are attaching/detaching volumes as well. In the controller-manager logs I see a lot of delay entries like this:

W0901 08:31:51.487165 1 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeVolumes)
W0901 08:31:51.532471 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:51.580238 1 request.go:347] Field selector: v1 - serviceaccounts - metadata.name - default: need to check if this is versioned correctly.
W0901 08:31:51.682693 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.000878 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.156797 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:52.197507 1 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeVolumes)
I0901 08:31:52.214532 1 reconciler.go:115] Started DetachVolume for volume "kubernetes.io/aws-ebs/vol-faa7bd79" from node "ip-172-28-0-137.eu-west-1.compute.internal"
I0901 08:31:52.216702 1 operation_executor.go:619] Verified volume is safe to detach for volume "kubernetes.io/aws-ebs/vol-faa7bd79" (spec.Name: "postgresdata") from node "ip-172-28-0-137.eu-west-1.compute.internal".
W0901 08:31:52.216837 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.228492 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.297681 1 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeVolumes)
W0901 08:31:52.333899 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.406311 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.433745 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.451718 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.470854 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.544053 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.564880 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.641716 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.670373 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.759217 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.788739 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.890559 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.894025 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.918555 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 7s
W0901 08:31:52.968347 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:52.978393 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:52.999720 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.001144 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.043901 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.051308 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.070164 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.094603 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.124363 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.307330 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:53.510714 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:54.168228 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:54.339202 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:54.504001 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:55.854036 1 reflector.go:334] pkg/controller/service/servicecontroller.go:150: watch of *api.Service ended with: couldn't queue object: couldn't create key for object &{lastState:0xc8297d0340 appliedState:0xc8297d0340 mu:{state:0 sema:0} lastRetryDelay:0}: object has no meta: struct service.cachedService lacks embedded TypeMeta type
W0901 08:31:56.032130 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:56.081788 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 6s
W0901 08:31:56.426938 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s
W0901 08:31:56.498465 1 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeVolumes) to avoid RequestLimitExceeded: 6s

As you can see, the controller manager postpones requests, but requests made in the same minute are postponed by exactly the same delay (the same number of seconds).

From the AWS account perspective, we are unable to run anything else unrelated to the kubernetes cluster, because we constantly receive RequestLimitExceeded errors.

Is there a possibility to let the controller manager batch the Describe calls rather than making a separate call per volume/instance?
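For illustration, a batched call of the kind I mean might look roughly like this (a sketch only, using the aws-sdk-go EC2 client; the region and volume IDs are placeholders, not our actual values):

// Sketch only: one batched DescribeVolumes call instead of one call per volume.
// Assumes aws-sdk-go v1; the region and volume IDs are placeholders.
package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
    svc := ec2.New(sess)

    // With per-volume polling, N volumes means N DescribeVolumes calls,
    // which is what exhausts the request limit.
    volumeIDs := []string{"vol-0123456789abcdef0", "vol-0123456789abcdef1"}

    // Batched: a single call returns the state and attachments of every listed volume.
    out, err := svc.DescribeVolumes(&ec2.DescribeVolumesInput{
        VolumeIds: aws.StringSlice(volumeIDs),
    })
    if err != nil {
        log.Fatal(err)
    }
    for _, v := range out.Volumes {
        fmt.Printf("%s state=%s attachments=%d\n",
            aws.StringValue(v.VolumeId), aws.StringValue(v.State), len(v.Attachments))
    }
}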

Our version:
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:13:23Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6+coreos.0", GitCommit:"f6f0055b8e503cbe5fb7b6f1a2ee37d0f160c1cd", GitTreeState:"clean", BuildDate:"2016-08-29T17:01:01Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Thanks in advance.
Adam

@anioool (Author) commented Sep 5, 2016

Hi,

Could someone please investigate? This issue completely blocks our use of kubernetes.

@anioool (Author) commented Sep 9, 2016

Hi,
 
Thank you for the reply. Apparently I cannot see your comment in the issue I raised. Could you give me a link where I can raise the visibility of this issue? It is a critical blocker for us.
 
Thanks,
Adam
 
On 2016-09-08 19:07:56, James Ravn notifications@github.com wrote:
@anioool Yes, this is a limitation of the aws plugin in kubernetes. It backs off if it detects rate limiting, but it really needs a global rate limiter to prevent spamming AWS in the first place. I suggest petitioning the sig-aws channel in the Kubernetes slack to get this prioritised.
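A global rate limiter of the sort suggested here would sit in front of every EC2 call and bound the request rate before AWS has to throttle anything. A minimal sketch of the idea, assuming golang.org/x/time/rate and purely illustrative limits (this is not the actual cloud-provider code):

// Sketch only: a shared client-side limiter consulted before every EC2 request,
// so the process as a whole stays under a chosen request rate.
package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
    "golang.org/x/time/rate"
)

type limitedEC2 struct {
    svc     *ec2.EC2
    limiter *rate.Limiter // shared by every caller in the process
}

func (l *limitedEC2) DescribeVolumes(ctx context.Context, in *ec2.DescribeVolumesInput) (*ec2.DescribeVolumesOutput, error) {
    // Wait for a token before issuing the request; this bounds the aggregate rate.
    if err := l.limiter.Wait(ctx); err != nil {
        return nil, err
    }
    return l.svc.DescribeVolumes(in)
}

func main() {
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
    client := &limitedEC2{
        svc:     ec2.New(sess),
        limiter: rate.NewLimiter(rate.Limit(8), 16), // 8 requests/s, burst 16 (illustrative numbers)
    }
    if _, err := client.DescribeVolumes(context.Background(), &ec2.DescribeVolumesInput{}); err != nil {
        log.Fatal(err)
    }
}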


@justinsb (Member) commented Sep 9, 2016

@anioool we do have a global back-off, but this only prevents k8s from totally DoS-ing your AWS account; it doesn't really keep k8s working smoothly when it is hitting a limit.

We need to figure out where these calls are coming from, and then figure out a way to make them less expensive. Do you have any ideas what is causing it? Are you rapidly attaching and detaching volumes?

I suspect 500 volumes is just more than we've tried before - I'll poke around and see whether I can see what is happening.

@anioool (Author) commented Sep 9, 2016

Hi @justinsb,

Thank you for your reply. Let me describe how our process works. The process is split into two phases, called base and replication. In the first phase (base) we automatically create approx. 500 volumes from previously prepared snapshots; the Deployments for these volumes are created in one batch. The second phase is replication: creating new volumes from snapshots and creating new Deployments. During the second phase detaching may occur. There is a chance that we will scale up to 5k volumes.

I checked your code, and it is true that you have a global back-off, but it is not very useful: from my observation the controller manager is able to make up to 10 AWS DescribeVolumes requests per minute, and when throttling appears the back-off sets exactly the same delay for all of them. That is one of the reasons we are facing DoS-ing of the EC2 service on our AWS account. The second problem is that the controller manager makes a separate call for each volume, while it could use batches.
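To illustrate the behaviour I am describing: the identical 7s/6s delays look like a single back-off deadline shared by every outgoing request. A rough sketch of such a pattern (only an illustration, not the actual retry_handler.go code):

// Sketch only: one shared back-off deadline for the whole process. Every request
// issued before the deadline waits out the same remaining delay, which would
// produce the identical per-request delays seen in the log above.
package main

import (
    "fmt"
    "sync"
    "time"
)

type crossRequestDelay struct {
    mu    sync.Mutex
    until time.Time // deadline shared by all requests
}

// noteThrottle extends the shared deadline when AWS returns RequestLimitExceeded.
func (c *crossRequestDelay) noteThrottle(backoff time.Duration) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if t := time.Now().Add(backoff); t.After(c.until) {
        c.until = t
    }
}

// beforeRequest blocks until the shared deadline has passed.
func (c *crossRequestDelay) beforeRequest() {
    c.mu.Lock()
    remaining := time.Until(c.until)
    c.mu.Unlock()
    if remaining > 0 {
        time.Sleep(remaining)
    }
}

func main() {
    var d crossRequestDelay
    d.noteThrottle(2 * time.Second) // pretend AWS just throttled us
    start := time.Now()
    d.beforeRequest() // every concurrent caller would wait roughly the same 2s here
    fmt.Println("waited", time.Since(start).Round(time.Millisecond))
}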

For our account, during the first phase 95% of calls were DescribeVolumes and only 5% were DescribeInstances.

Let me know if you need some additional information.

@anioool (Author) commented Sep 19, 2016

Hi @justinsb,

Any news on this topic?

@dims (Member) commented Nov 17, 2016

This needs to be triaged as a release-blocker or not for 1.5

@dims (Member) commented Nov 18, 2016

@justinsb all issues must be labeled either release-blocking or non-release-blocking by end of day 18 November 2016 PST (or please move it to 1.6). cc @kubernetes/sig-aws

@calebamiles (Contributor) commented:

@justinsb, how would you feel about moving this out of the 1.5 milestone? I doubt anything will get done on this issue in the next week or so.

cc: @kubernetes/sig-aws, @saad-ali, @dims

justinsb modified the milestones: v1.6, v1.5 (Nov 30, 2016)
@sebbonnet commented:

We're seeing a similar issue, but with DescribeInstances.

Taking a quick look at the code, the main place we call DescribeInstances is when checking that volumes are attached. We recently added a few volumes to the cluster, but only a handful (< 10 per cluster).

All our controller managers are now spamming the AWS API and are all being rate limited.

E.g.

W0106 15:47:52.686624       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:47:52.688305       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:47:52.716789       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:47:52.734486       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:47:54.895165       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 30s
W0106 15:47:54.895251       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 30s
W0106 15:47:56.166042       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
E0106 15:47:56.166071       8 attacher.go:100] Error checking if volumes ([vol-06819e79111056741]) is already attached to current node ("ip-10-50-195-252.eu-west-1.compute.internal"). err=error listing AWS instances: RequestLimitExceeded: Request limit exceeded.
	status code: 503, request id: 
E0106 15:47:56.166086       8 operation_executor.go:557] VolumesAreAttached failed for checking on node "ip-10-50-195-252.eu-west-1.compute.internal" with: error listing AWS instances: RequestLimitExceeded: Request limit exceeded.
	status code: 503, request id: 
W0106 15:47:56.317152       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 30s
W0106 15:47:58.571930       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:47:58.581973       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:47:58.762198       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:47:58.764225       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:47:59.144994       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
E0106 15:47:59.145023       8 attacher.go:100] Error checking if volumes ([vol-11af4ca1]) is already attached to current node ("ip-10-50-192-198.eu-west-1.compute.internal"). err=error listing AWS instances: RequestLimitExceeded: Request limit exceeded.
	status code: 503, request id: 
E0106 15:47:59.145037       8 operation_executor.go:557] VolumesAreAttached failed for checking on node "ip-10-50-192-198.eu-west-1.compute.internal" with: error listing AWS instances: RequestLimitExceeded: Request limit exceeded.
	status code: 503, request id: 
W0106 15:47:59.545155       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:47:59.655337       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 32s
W0106 15:47:59.708423       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:47:59.752609       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 32s
W0106 15:47:59.969483       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 32s
W0106 15:47:59.969629       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:48:00.708048       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:48:00.708083       8 retry_handler.go:87] Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
W0106 15:48:00.766404       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 32s
W0106 15:48:00.800357       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 32s
W0106 15:48:04.979564       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 32s
W0106 15:48:04.979646       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:48:05.747533       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:48:05.834688       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:48:05.852701       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 31s
W0106 15:48:07.468487       8 retry_handler.go:55] Inserting delay before AWS request (ec2::DescribeInstances) to avoid RequestLimitExceeded: 30s

Any idea what could be causing the requests, @justinsb? We're on 1.4.7.

@sstarcher commented:

We have seen similar issues with 10-20 EBS volumes across a few clusters, on both 1.4.6 and 1.4.7.

@chrislovecnm (Contributor) commented:

Unless someone beats me to it, I am going to open an issue about the DescribeInstances calls this morning.

@chrislovecnm (Contributor) commented:

@sebbonnet what version are you on?

@chbatey (Contributor) commented Jan 6, 2017

@chrislovecnm (I work with @sebbonnet) 1.4.7

@chrislovecnm (Contributor) commented:

cc @kubernetes/sig-aws-misc @kubernetes/sig-storage-misc

@brandoncole commented:

This is affecting us as well.

@chrislovecnm (Contributor) commented:

I think we have a 1.3.x spam problem at scale, which is this issue. We also have another issue, #39526, that shows up in 1.4.x+ even when not at scale.

@vaijab commented Jan 17, 2017

We are affected by this issue as well. We have around 30 EBS volumes used by petsets. When a controller-manager hits the AWS API limits, things start to go crazy: our pods with volumes start moving around constantly.

@vaijab commented Jan 17, 2017

Correct me if I am wrong.

After looking at the code, it looks like, given 10 EBS volumes and one node, kubernetes will make 11 AWS API calls: one call to describe each individual volume separately, and another call to describe the instance.

This does not sound very efficient. What if, instead, it made one call to describe a list of volumes? Right now the issue gets worse the more volumes you have.

/cc @justinsb
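Batching the DescribeVolumes calls is one option; another is to flip the check around and ask about the node, since a single DescribeInstances call already lists every EBS volume attached to that instance in its block device mappings. A rough sketch of that approach (aws-sdk-go, placeholder region and instance ID; not the actual controller code):

// Sketch only: check which volumes are attached to a node with one
// DescribeInstances call instead of one DescribeVolumes call per volume.
package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
    svc := ec2.New(sess)

    instanceID := "i-0123456789abcdef0" // placeholder node instance ID

    out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
        InstanceIds: []*string{aws.String(instanceID)},
    })
    if err != nil {
        log.Fatal(err)
    }

    // Every attached EBS volume appears in the instance's BlockDeviceMappings,
    // so one call answers "is volume X attached to this node?" for all volumes.
    attached := map[string]bool{}
    for _, r := range out.Reservations {
        for _, inst := range r.Instances {
            for _, bdm := range inst.BlockDeviceMappings {
                if bdm.Ebs != nil {
                    attached[aws.StringValue(bdm.Ebs.VolumeId)] = true
                }
            }
        }
    }
    fmt.Println("attached volumes:", attached)
}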

@justinsb (Member) commented:

It will do DescribeVolume polling during attach and detach operations only, I believe. But that should be self-correcting, because we will throttle if there are too many operations in progress. Can you link to the code or provide a log if you are seeing something different?

The problem is the "one node" call I believe, because we call that for nodes regardless of "current" volume operations, currently every 10 seconds. That is fixed in #39564

@vaijab commented Jan 17, 2017

@justinsb that's good to know. We're running v1.4.7 and constantly being hit by API rate limits.

I guess the question is how the volume reconciler behaves when the detachRequestedTime timer does not get reset. When it cannot get volume state from AWS because of rate limits, the detach timer never gets reset. What happens then?

@chrislovecnm (Contributor) commented:

@vaijab there is a new flag on the controller manager in 1.4.8 that will help.

See #39551: set the new duration flag on the controller manager and your API requests will decrease.

@ethernetdan (Contributor) commented:

@justinsb can this be closed?

@ethernetdan (Contributor) commented:

This is too late for v1.6, moving to v1.7 milestone. If this is incorrect please correct.

/cc @kubernetes/release-team

ethernetdan modified the milestones: v1.7, v1.6 (Mar 14, 2017)
@spiffxp (Member) commented May 31, 2017

@kubernetes/sig-aws-misc do you want this in v1.7? Is this the right SIG to own this issue?

@gnufied (Member) commented May 31, 2017

Now that bulk volume polling has been implemented (#41306), this bug should be a thing of the past. IMO we can close it.

@chrislovecnm (Contributor) commented May 31, 2017

@gnufied do we have an issue open about route53 DNS? Also, I do not know if we understand our limits yet. Maybe separate issues.

Storage is much better

@marun (Contributor) commented Jun 14, 2017

@justinsb Can this issue be closed?

@justinsb (Member) commented:

I think we should close this one and open a focused one for any issues remaining in 1.7, e.g. Route53.
