
AWS PV Storage Causing API Timeouts. #39526

Closed
chrislovecnm opened this issue Jan 6, 2017 · 44 comments
Assignees
Labels
sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@chrislovecnm
Contributor

chrislovecnm commented Jan 6, 2017

Kubernetes version
Above 1.3.8 (see below)

It has been reported that this bug is present in the 1.4.0 branch and above. I have validated it on 1.4.7 and 1.4.6. We have not validated the 1.5.x branch. An exponential back-off patch was just merged (#38766), but we do not believe it addresses the root problem.
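For context on what back-off does and does not buy, here is a generic sketch (my illustration, not the code in #38766) of exponential back-off with jitter. It spreads retries out after throttling errors, but it does not reduce the steady-state call volume, which is why it would not address the root problem described below.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callWithBackoff retries op with exponentially growing, jittered delays.
// It spaces retries out after throttling errors, but does nothing about how
// often op is called in the first place.
func callWithBackoff(op func() error, maxRetries int, base time.Duration) error {
	delay := base
	for attempt := 0; ; attempt++ {
		err := op()
		if err == nil || attempt == maxRetries {
			return err
		}
		jitter := time.Duration(rand.Int63n(int64(delay)))
		time.Sleep(delay + jitter)
		delay *= 2
	}
}

func main() {
	calls := 0
	err := callWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("RequestLimitExceeded") // simulated throttling
		}
		return nil
	}, 5, 100*time.Millisecond)
	fmt.Printf("succeeded after %d calls, err=%v\n", calls, err)
}
```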

Environment

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian
  • Kernel (e.g. uname -a): Custom kernel maintained by the AWS team: Linux 4.4.26-k8s #1 SMP Fri Oct 21 05:21:13 UTC 2016 x86_64 GNU/Linux
  • Install tools: kops
  • Others:

What happened:

As you add more PV storage to a cluster, the controller starts to spam the EC2 API heavily. It has been reported that as few as 20 attached PVs will start causing EC2 API timeouts. This is a showstopper that makes PV-attached storage unusable on AWS. As the AWS account nears its limit on API calls the problem cascades, to the point that retries flood the API. One of our accounts is at around 24k calls per hour.

The controller makes API calls at such a high rate that requests time out and are retried, which only adds to the flood. It is making far too many calls to validate that a node exists and that a volume is attached to a node. The specific call that is timing out the most is DescribeInstances.

The cluster is at steady state. We have no volume churn, i.e. we are not adding or removing volumes.

What you expected to happen:

To be able to attach 500 PVs to a cluster without overwhelming the EC2 API.

How to reproduce it (as minimally and precisely as possible):

Attach a few dozen PVs to a cluster running 1.4.x on AWS. You will see the following in the controller logs:

I0106 03:00:38.501732       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 03:00:38.502157       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 03:00:38.566153       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 03:00:38.593830       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2c/vol-0860881328c9f524c"/Node "ip-172-20-15-41.us-west-2.compute.internal" is attached--touching.
I0106 03:00:38.593869       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2a/vol-0a5c43d0a3ba5501e"/Node "ip-172-20-5-129.us-west-2.compute.internal" is attached--touching.
I0106 03:00:38.593882       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-08f2d5e73460c5f1f"/Node "ip-172-20-9-218.us-west-2.compute.internal" is attached--touching.
I0106 03:00:38.593895       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-0d8875331a33b8223"/Node "ip-172-20-9-218.us-west-2.compute.internal" is attached--touching.

I have only tested in an HA setup, but will validate in a single-master setup shortly.

Anything else we need to know:

This issue is bad enough that it is crashing controllers.

AWS rate-limits both by API call and by failed API calls. Some limits apply per region, and some are account-wide.
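For illustration only (not the controller's actual code path): one way to stay under an account-level limit is to put a client-side token bucket in front of every EC2 call, so bursts of reconciler activity cannot exceed a configured request rate. A minimal sketch using aws-sdk-go and golang.org/x/time/rate; the rate numbers are assumptions and would need tuning against real AWS quotas.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"golang.org/x/time/rate"
)

// throttledEC2 wraps an EC2 client with a client-side token bucket so that
// retries and reconciler fan-out cannot exceed a configured request rate.
type throttledEC2 struct {
	client  *ec2.EC2
	limiter *rate.Limiter
}

func (t *throttledEC2) describeInstances(ctx context.Context, ids []string) (*ec2.DescribeInstancesOutput, error) {
	// Wait blocks until a token is available (or the context is cancelled),
	// spreading calls out instead of letting them pile up.
	if err := t.limiter.Wait(ctx); err != nil {
		return nil, err
	}
	return t.client.DescribeInstancesWithContext(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: aws.StringSlice(ids),
	})
}

func main() {
	t := &throttledEC2{
		client: ec2.New(session.Must(session.NewSession())),
		// Hypothetical budget: at most 5 DescribeInstances per second, burst of 10.
		limiter: rate.NewLimiter(rate.Limit(5), 10),
	}
	out, err := t.describeInstances(context.Background(), []string{"i-0123456789abcdef0"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("reservations: %d", len(out.Reservations))
}
```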

The code that is getting called is from here:

https://github.com/kubernetes/kubernetes/blob/d97f125ddf222634948f9d2448e5316180b736dd/pkg/controller/volume/attachdetach/reconciler/reconciler.go

TLDR;

Anyone who is using 1.4.0+ on AWS will exceed their rate limits with as few as 20 PVs attached to a cluster. We have one account that is at about 24k API calls per hour because of timeouts. This makes PVs unusable on AWS.

@chrislovecnm chrislovecnm changed the title PV AWS PV Storage Causing API Timeouts. Jan 6, 2017
@chrislovecnm
Contributor Author

cc: @kubernetes/sig-aws-misc @kubernetes/sig-storage-misc @saad-ali @jsafrane @jingxu97

@chrislovecnm
Contributor Author

@saad-ali can you assign this to me and a member of the storage team? Also, this is a P0. Sorry, but you know that if I call something a P0, it actually is one. We already have two other reports of this, and we have multiple clusters showing this behavior.

@guoshimin
Contributor

We have around 50 EBS volumes. Ever since we upgraded from 1.3.8 to 1.4.7, the controller-manager has been making DescribeInstances calls at a rate of 60k/hour.

@guoshimin
Contributor

Now, mysteriously, the flood of DescribeInstances calls stopped 14 hours after the upgrade. We are now seeing only about 1.5k calls/hour and no RequestLimitExceeded errors.

@chrislovecnm
Contributor Author

@guoshimin ours continues to flood. We have multiple clusters in a region, and it is bad. Once requests start hitting the rate limit and timing out, the controller retries, which makes things worse.

Talking with @justinsb, we should only be seeing heavy API use around volume creation, not at steady state.

@matchstick
Contributor

@saad-ali @jingxu97 Can you PTAL?

@saad-ali added priority/P0 and sig/storage labels Jan 6, 2017
@saad-ali
Member

saad-ali commented Jan 6, 2017

Taking a look.

Just a thought: we have 1.5.2 going out on Tuesday. As a quick and dirty fix we could disable the volumes-attached check for AWS until a more robust fix is ready.

I'll sync with @jingxu97 and folks offline and then circle back with short and long term recommendations.

@chrislovecnm
Contributor Author

Hey all. Trying to get some logs with timeouts and of course I cannot:

I0106 20:08:58.760844       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 20:08:58.760845       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 20:08:58.761344       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 20:08:58.822266       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 20:08:58.830069       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 20:08:58.832931       7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 20:08:58.843964       7 attach_detach_controller.go:526] processVolumesInUse for node "ip-172-20-14-116.us-west-2.compute.internal"
I0106 20:08:58.844010       7 actual_state_of_world.go:339] SetVolumeMountedByNode volume kubernetes.io/aws-ebs/aws://us-west-2c/vol-04627143777eaf0f3 to the node "ip-172-20-14-116.us-west-2.compute.internal" mounted true
I0106 20:08:58.844028       7 actual_state_of_world.go:339] SetVolumeMountedByNode volume kubernetes.io/aws-ebs/aws://us-west-2c/vol-0ac2bcb95ff0777e8 to the node "ip-172-20-14-116.us-west-2.compute.internal" mounted true
I0106 20:08:58.844038       7 actual_state_of_world.go:339] SetVolumeMountedByNode volume kubernetes.io/aws-ebs/aws://us-west-2c/vol-03f9fbdcb672be6bd to the node "ip-172-20-14-116.us-west-2.compute.internal" mounted true
I0106 20:08:58.844701       7 garbagecollector.go:368] Propagator process object: v1/Node, namespace , name ip-172-20-14-116.us-west-2.compute.internal, event type %!s(garbagecollector.eventType=1)
I0106 20:08:58.844724       7 garbagecollector.go:402] The updateEvent &garbagecollector.event{eventType:1, obj:(*metaonly.MetadataOnlyObject)(0xc821c14800), oldObj:(*metaonly.MetadataOnlyObject)(0xc8216c4700)} doesn't change node references, ignore
I0106 20:08:58.860879       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-0bf6b3d17cd2c799d"/Node "ip-172-20-9-76.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.860907       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-0ea77e8a9c073a23b"/Node "ip-172-20-9-76.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.860916       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-042d1ff2bc99965b6"/Node "ip-172-20-9-76.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.860925       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-0a4ca57040fb08f63"/Node "ip-172-20-9-76.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.860943       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2a/vol-0f2673791201ba7fb"/Node "ip-172-20-4-37.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.860964       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2a/vol-06ca0376ff52bbd3b"/Node "ip-172-20-4-37.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.860985       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2c/vol-04627143777eaf0f3"/Node "ip-172-20-14-116.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.861005       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2c/vol-0ac2bcb95ff0777e8"/Node "ip-172-20-14-116.us-west-2.compute.internal" is attached--touching.
I0106 20:08:58.861016       7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2c/vol-03f9fbdcb672be6bd"/Node "ip-172-20-14-116.us-west-2.compute.internal" is attached--touching.

This is the 1.4.7 code base. It appears that we are issuing 6 DescribeInstances calls in parallel, which matches the cluster size, and then reconciling all of the volumes. At that rate you can hit the API limit for DescribeInstances quickly.
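As a rough back-of-the-envelope check (my numbers, not measured): one DescribeInstances per node every 5 seconds on a 6-node cluster is 6 × 12 × 60 = 4,320 calls per hour from a single controller, before any retries. Several clusters sharing one account, plus retry amplification once RequestLimitExceeded errors start, plausibly lands in the 24k-60k calls/hour range reported above.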

@chrislovecnm
Contributor Author

@saad-ali we need to talk about the consequences of that option; I think @gnufied may have a couple of ideas. Should all of us hop on a quick Zoom call this afternoon? I know that @kris-nova will probably want to help with a long-term fix.

Also, can we get the patch backported to 1.4.8?

@chrislovecnm
Contributor Author

chrislovecnm commented Jan 6, 2017

Also, a restart of the controller may sort of fix this ... I am wondering if we have a race condition. I am not seeing timeouts on the cluster that I am pulling logs from; it previously had timeouts.

@krisnova
Contributor

krisnova commented Jan 6, 2017

+1 for a call. I'm very much interested in helping out with this issue, ideally coding some or all of it.

@gnufied
Member

gnufied commented Jan 6, 2017

I did some investigation and part of that DescribeInstances flood is definitely coming from https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L111, which runs every 5s.

I am thinking we should make the SyncInterval user-configurable. Also, since we allow parallel executions of that check, it is possible to flood EC2 with API requests before earlier ones have even completed. :(
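Not the actual reconciler code, just a sketch of the two mitigations suggested here: a user-configurable sync interval, and a guard that skips a tick while the previous sync is still in flight so slow EC2 responses cannot stack up.

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// syncLoop runs checkAttachedVolumes at most once per syncPeriod and never
// concurrently with itself: if a sync is still in flight when the next tick
// fires, that tick is skipped instead of piling more API calls onto EC2.
func syncLoop(syncPeriod time.Duration, checkAttachedVolumes func() error, stopCh <-chan struct{}) {
	var inFlight int32
	ticker := time.NewTicker(syncPeriod)
	defer ticker.Stop()

	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			if !atomic.CompareAndSwapInt32(&inFlight, 0, 1) {
				log.Print("previous volume sync still running; skipping this interval")
				continue
			}
			go func() {
				defer atomic.StoreInt32(&inFlight, 0)
				if err := checkAttachedVolumes(); err != nil {
					log.Printf("volume sync failed: %v", err)
				}
			}()
		}
	}
}

func main() {
	stop := make(chan struct{})
	// 5s mirrors the current hard-coded interval; making this a flag is the
	// user-configurable SyncInterval suggested above.
	go syncLoop(5*time.Second, func() error {
		log.Print("verifying attached volumes against the cloud provider")
		return nil
	}, stop)
	time.Sleep(12 * time.Second)
	close(stop)
}
```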

@gnufied
Member

gnufied commented Jan 6, 2017

A quick workaround: restart your controller-manager; that will reduce those requests. :D

I guess that is what might have reduced the API calls for @guoshimin.

@saad-ali
Member

saad-ali commented Jan 6, 2017

Chris, let's do a call from 2:30 to 3:30 PM PT.

@guoshimin
Contributor

guoshimin commented Jan 6, 2017 via email

@saad-ali
Member

saad-ali commented Jan 6, 2017

To summarize, we suspect that the check introduced in #34859 is causing AWS to quickly hit quota limits.

The current plan is to make two changes:

  1. Reduce the rate of the GetAttachedVolumesPerNode check from once every 5 seconds to once per minute.
  2. Add a flag to the kubernetes-controller-manager to gate the GetAttachedVolumesPerNode check so that it can be disabled altogether.

@gnufied is working on making these changes hopefully within the next two hours. Once these changes are merged to master we will cherry pick them to both the 1.5 and 1.4 branches.

1.5.2 is planned to be cut on January 10.

@gnufied
Member

gnufied commented Jan 6, 2017

@saad-ali quick update @chrislovecnm has volunteered to do the patch instead of me. I will review etc.

@saad-ali
Member

saad-ali commented Jan 6, 2017

> @saad-ali quick update @chrislovecnm has volunteered to do the patch instead of me. I will review etc.

Ack, let's get the changes in ASAP and cherrypicked so that 1) we have time to verify the fix, and 2) we have time for the fix to bake over the weekend before the release.

1.5.2 with these changes should go out on Tuesday, Jan 10.
We pulled up the 1.4.8 release date to Tuesday, Jan 10 as well.

Speaking with @justinsb, it sounds like a better long-term fix is to batch requests to the AWS API. We don't have enough time to get that into the immediate releases, but hopefully we can get it into subsequent ones.
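To make the batching idea concrete (a sketch under my own assumptions, not the change that later landed): DescribeInstances already accepts many instance IDs per request, so the per-node fan-out could collapse into one paginated call for the whole cluster. The instance IDs below are placeholders.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// describeAllNodes fetches every node's instance data in a single paginated
// DescribeInstances request instead of one request per node.
func describeAllNodes(svc *ec2.EC2, instanceIDs []string) (map[string]*ec2.Instance, error) {
	result := make(map[string]*ec2.Instance)
	input := &ec2.DescribeInstancesInput{InstanceIds: aws.StringSlice(instanceIDs)}
	err := svc.DescribeInstancesPages(input, func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
		for _, res := range page.Reservations {
			for _, inst := range res.Instances {
				result[aws.StringValue(inst.InstanceId)] = inst
			}
		}
		return true // keep paging
	})
	return result, err
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	// Placeholder IDs; in the controller these would come from the node list.
	instances, err := describeAllNodes(svc, []string{"i-0aaaaaaaaaaaaaaaa", "i-0bbbbbbbbbbbbbbbb"})
	if err != nil {
		log.Fatal(err)
	}
	for id, inst := range instances {
		fmt.Printf("%s has %d block device mappings\n", id, len(inst.BlockDeviceMappings))
	}
}
```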

@patrickmcclory

@saad-ali - batching is not an acceptable solution: given that these calls will still incur the same 'cost' in terms of rate limiting on the AWS side, all this will do is concentrate the pain. We need a better solution overall, and a more intelligent way of meeting the underlying need to query AWS (or any underlying infrastructure) than constantly running in 'panic mode'.

@scottvidmar

AWS should really be a fallback, shouldn't it? What's stopping us from asking a local source of truth, the OS? We should be able to figure out what the attachment point is if we trust the kernel. So couldn't we check fdisk -l, and only query AWS if that poll hits some threshold? If the volume is gone, relaunch the container, alert someone, or emit some kind of event. The point is to reduce the number of calls instead of inflicting them in a batch; that would avoid, or at least reduce, the hard dependency on AWS. We could then run a far less frequent poll that reconciles against AWS for the full state of things.

This would seem to have a twofold benefit:
(1) If the describe API has any issues, a sole dependency on the AWS API kills the functionality.
(2) We reduce the call count to AWS.
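A rough sketch of the "ask the kernel first" idea (my illustration, not an existing Kubernetes code path): check whether the expected block device is still visible locally and only fall back to the cloud API when it is not. The /dev/xvd* naming is the common EBS convention and is an assumption here.

```go
package main

import (
	"fmt"
	"os"
)

// volumeLooksAttached reports whether the block device for an EBS volume is
// visible to the local kernel, e.g. /dev/xvdf (or /dev/nvme1n1 on newer
// instance types). A missing device is the signal to fall back to the cloud
// API for the authoritative answer.
func volumeLooksAttached(devicePath string) bool {
	info, err := os.Stat(devicePath)
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeDevice != 0
}

func main() {
	for _, dev := range []string{"/dev/xvdf", "/dev/xvdg"} {
		if volumeLooksAttached(dev) {
			fmt.Printf("%s: present locally, no cloud call needed\n", dev)
		} else {
			fmt.Printf("%s: missing locally, would query the cloud provider\n", dev)
		}
	}
}
```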

@chrislovecnm
Contributor Author

@saad-ali is this code AWS-specific, or shared across cloud providers? I believe it is part of the shared code base.

@chbatey
Contributor

chbatey commented Jan 9, 2017

Also seeing this and we only have a handful of volumes. Makes volumes unusable atm.

@jsravn
Contributor

jsravn commented Jan 9, 2017

I really don't trust any optimisations in the AWS code anymore, as long as calls to AWS remain synchronously triggered from non-AWS code. Regressions like this one keep happening, making k8s on AWS unstable. I think the AWS code should move to updating its view of the VPC periodically and asynchronously, then respond to calls with cached data. Then we can strictly control how often AWS calls are made and defend against these sorts of problems.

A batched request to AWS only counts as a single request, so @justinsb is correct.
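A minimal sketch of the caching approach described above, assuming a single refresh goroutine that polls the cloud on its own schedule while readers only ever see the cached snapshot; the fetch function and map types are placeholders, not real Kubernetes types.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// instanceCache refreshes cloud state on a fixed schedule; callers read the
// last snapshot instead of triggering synchronous cloud API calls.
type instanceCache struct {
	mu       sync.RWMutex
	snapshot map[string][]string // hypothetical: instance ID -> attached volume IDs
}

func (c *instanceCache) run(refresh time.Duration, fetch func() (map[string][]string, error), stopCh <-chan struct{}) {
	ticker := time.NewTicker(refresh)
	defer ticker.Stop()
	for {
		data, err := fetch()
		if err != nil {
			log.Printf("cloud refresh failed, keeping stale snapshot: %v", err)
		} else {
			c.mu.Lock()
			c.snapshot = data
			c.mu.Unlock()
		}
		select {
		case <-stopCh:
			return
		case <-ticker.C:
		}
	}
}

// attachedVolumes answers from the cache only; it never calls the cloud.
func (c *instanceCache) attachedVolumes(instanceID string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.snapshot[instanceID]
}

func main() {
	cache := &instanceCache{}
	stop := make(chan struct{})
	go cache.run(1*time.Minute, func() (map[string][]string, error) {
		// Placeholder for one batched DescribeInstances/DescribeVolumes sweep.
		return map[string][]string{"i-0aaaaaaaaaaaaaaaa": {"vol-0123456789abcdef0"}}, nil
	}, stop)
	time.Sleep(2 * time.Second)
	log.Printf("cached view: %v", cache.attachedVolumes("i-0aaaaaaaaaaaaaaaa"))
	close(stop)
}
```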

@jsravn
Contributor

jsravn commented Jan 9, 2017

Can someone label this issue with sig/aws?

@saad-ali
Member

saad-ali commented Jan 9, 2017

We just held a meeting to discuss the short-term fixes for this issue.

The options discussed were summarized here.

To summarize the meeting:

  1. Short term fix.
    • Leave the rate of the GetAttachedVolumesPerNode check unchanged, at every 5 seconds (so we minimize the untested changes and inadvertent side effects to other plugins).
    • Add a flag to the kubernetes-controller-manager to gate the GetAttachedVolumesPerNode check so that it can be disabled altogether and another flag that will allow users to set their own rate.
    • Push new k8s releases (1.4.8 and 1.5.2) with these changes on Wednesday, January 11.
  2. Communication
    • Update the release notes for the 1.4.x release in which PR #34859 (Add sync state loop in master's volume reconciler) went out, to indicate that while it fixes another issue it may cause API quota problems.
    • Communicate widely (kubernetes-dev@, the Kubernetes storage SIG mailing list, etc.) that this issue was introduced in the 1.4.x release, what issue the change fixes, and that to work around the problem, 1.5.2 and 1.4.8 allow setting a more conservative rate for this check.
  3. Long term fix.
    • Optimizing the calls that AWS makes for the GetAttachedVolumesPerNode check:
      • Make 1 call instead of 2 per node
      • Batch calls so that the number of calls doesn't increase linearly with the size of the cluster.
    • More architectural work to prevent AWS from mounting wrong volume (then the rate of this check can be safely further reduced by default).

I'd like to hear ASAP from those of you who weren't able to make it (@justinsb), if there are any disagreements.

CC @smarterclayton @chrislovecnm @liggitt @jingxu97 @gnufied @kris-nova @matchstick

@chrislovecnm
Contributor Author

Are we just adding a single flag??

@saad-ali
Member

saad-ali commented Jan 9, 2017

> Are we just adding a single flag??

Let's keep both: enable/disable, and duration. Add a warning to both indicating that if the check is disabled altogether, or if the duration is set too high, some volume types, like AWS EBS, may end up mounting the wrong volume in some cases.
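Roughly what those two knobs could look like as plain Go flags (illustrative only; the real kube-controller-manager wires its options through its own component configuration, and the exact flag names and defaults may differ from what ships):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Illustrative stand-ins for the proposed controller-manager options; the
	// final names/defaults may differ from what is shown here.
	disableSync := flag.Bool("disable-attach-detach-reconcile-sync", false,
		"Disable the volume attach/detach reconciler sync. WARNING: disabling this check means "+
			"some volume types (e.g. AWS EBS) may end up mounting the wrong volume in rare cases.")
	syncPeriod := flag.Duration("attach-detach-reconcile-sync-period", 5*time.Second,
		"Interval between volume attach/detach reconciler syncs. WARNING: setting this too high "+
			"carries the same wrong-volume risk as disabling the check.")
	flag.Parse()

	fmt.Printf("reconciler sync disabled=%v period=%s\n", *disableSync, *syncPeriod)
}
```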

@chrislovecnm
Contributor Author

Added warnings on both.

@justinsb
Member

justinsb commented Jan 9, 2017

Plan sounds great to me!

@justinsb
Member

@jsravn I've started tinkering with a "unified" caching layer in AWS, and have opened an issue to write up some notes here: #39647. I quoted your comment as the motivation for the issue, as I think you are spot-on.

k8s-github-robot pushed a commit that referenced this issue Jan 10, 2017
Automatic merge from submit-queue (batch tested with PRs 39628, 39551, 38746, 38352, 39607)

Increasing times on reconciling volumes fixing impact to AWS.

**What this PR does / why we need it**:

We are currently blocked by API timeouts with PV volumes.  See #39526.  This is a workaround, not a fix.

**Special notes for your reviewer**:

A second PR will be dropped with CLI cobra options in it, but we are starting with increasing the reconciliation periods.  I am dropping this without major testing and will test on our AWS account. Will be marked WIP until I run smoke tests.

**Release note**:

```release-note
Provide kubernetes-controller-manager flags to control volume attach/detach reconciler sync.  The duration of the syncs can be controlled, and the syncs can be shut off as well. 
```
@chrislovecnm
Contributor Author

Can we get the P0 tag removed?

@matchstick
Contributor

I am moving the priority to P1 as it looks appropriate right now.

k8s-github-robot pushed a commit that referenced this issue Jan 18, 2017
Automatic merge from submit-queue (batch tested with PRs 39625, 39842)

AWS: Remove duplicate calls to DescribeInstance during volume operations

This change removes all duplicate calls to describeInstance
from aws volume code path.

**What this PR does / why we need it**:

This PR removes the duplicate calls present in the disk check code paths in AWS. I can confirm that `getAWSInstance` already returns all instance information, and hence there is no need to make a separate `describeInstance` call.

Related to - #39526

cc @justinsb @jsafrane
@rimusz
Contributor

rimusz commented Jan 24, 2017

Any way we can get this merged into v1.5.3?
We have many PVCs, and our AWS API gets jammed by this bug.

@chrislovecnm
Contributor Author

We need to hold this open for the re-architecture. This issue effectively imposes a limit on the number of nodes and PVs a cluster can have.

@jingxu97 Can we determine a design pattern for this?

@jakexks
Contributor

jakexks commented Feb 14, 2017

Just a quick note to say this is not specific to AWS. Running in vSphere, after around 40 PVs the sheer number of reconciliation API requests killed the vSphere API and maxed out the CPU on both the vSphere API host and the kube master node running the controller-manager.

@jingxu97
Contributor

jingxu97 commented Feb 14, 2017 via email

@chrislovecnm
Contributor Author

@jingxu97 / @saad-ali should we close this?

@chrislovecnm
Contributor Author

Closing

@gnufied
Member

gnufied commented Jun 13, 2017

@jakexks btw, if polling for individual nodes is killing the vSphere server, the vSphere volume plugin should implement the bulk polling API introduced in https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L177

cc @divyenpatel
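A hypothetical shape for such a bulk check (the type and method names below are my own illustration, not the actual operation executor interface): the plugin receives every volume/node pair at once and answers with a single backend round trip instead of one call per node.

```go
package main

import "fmt"

// volumeSpec and bulkVerifier are illustrative stand-ins for the real
// operation executor types; the point is the shape of the call: every node's
// volumes are verified in one backend query.
type volumeSpec struct {
	VolumeID string
	NodeName string
}

type bulkVerifier interface {
	// BulkVerifyVolumes returns, per node, which of the given volumes are
	// still attached, using a single API round trip to the storage backend.
	BulkVerifyVolumes(specs []volumeSpec) (map[string][]string, error)
}

// fakeBackend pretends every volume it is asked about is still attached.
type fakeBackend struct{}

func (fakeBackend) BulkVerifyVolumes(specs []volumeSpec) (map[string][]string, error) {
	attached := make(map[string][]string)
	for _, s := range specs {
		attached[s.NodeName] = append(attached[s.NodeName], s.VolumeID)
	}
	return attached, nil
}

func main() {
	var v bulkVerifier = fakeBackend{}
	result, _ := v.BulkVerifyVolumes([]volumeSpec{
		{VolumeID: "vol-0123456789abcdef0", NodeName: "node-a"},
		{VolumeID: "vol-0fedcba9876543210", NodeName: "node-b"},
	})
	fmt.Println(result)
}
```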

nckturner pushed a commit to nckturner/aws-cloud-controller-manager that referenced this issue Feb 28, 2018