
AKS connection timeout on scaling #381

Closed
kkmsft opened this issue May 21, 2018 · 9 comments
@kkmsft

kkmsft commented May 21, 2018

While investigating the autoscaling issue with AKS, it was found that manual scaling also causes the same problem. Raising this bug to track scaling connection timeouts with AKS.
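
For reference, the manual scale that reproduces this is just an ordinary node-count change via the Azure CLI; a minimal sketch (resource group, cluster name, and node count below are placeholders):

```sh
# Manually scale the AKS node count; substitute your own resource group,
# cluster name, and target node count (all placeholders here).
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 3
```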

Here is the reference bug which was found during auto scaler testing.
kubernetes/autoscaler#849

cc @feiskyer @khenidak @slack

@zzh8829

zzh8829 commented Jun 13, 2018

👍 Manual scaling is causing timeouts for us as well.

@arsenvlad

I see the same thing. When scaling the cluster (either through the portal UI or with “az aks scale -n avaks1 -g avaks1 -c 3”), the connection to the API server endpoint (e.g. https://avaks1-avaks1-c9c8ae-2c1b8043.hcp.eastus.azmk8s.io/) gets closed or reset.

This happens within about 10-15 seconds of starting the scaling operation and lasts for a few seconds. It almost feels like something happens with the LB in front of the masters that causes existing socket connections to terminate.

You can easily reproduce the problem by running “kubectl get pods -n kube-system --watch” while scaling the cluster at the same time. You will see the --watch terminate with the error shown below. At the same time, other connections such as “az aks browse” also get broken and need to be restarted.
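
For anyone trying this, a rough repro sketch along those lines, reusing the example cluster and resource group names from the command above:

```sh
# Terminal 1: hold a long-lived watch against the API server.
kubectl get pods -n kube-system --watch

# Terminal 2: scale the cluster while the watch is running
# (same example cluster/resource group names as above).
az aks scale -n avaks1 -g avaks1 -c 3

# Shortly after the scale starts, the watch in terminal 1 terminates with a
# connection closed/reset error and has to be restarted.
```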

I tried browsing the API server endpoint at the same moment and saw it return ERR_CONNECTION_CLOSED in the browser.

Something is happening that severs API server connectivity during the scale operation. The existing pods themselves keep running just fine (i.e. they don’t move between nodes).

[Screenshots: the kubectl --watch terminating with a connection error, and the browser showing ERR_CONNECTION_CLOSED.]

@khenidak

@seanmck FYI @JackQuincy

@paulashbourne

@khenidak Can we get more information on when this will be resolved?

@JackQuincy

This is one thing we are tracking. We don't have an ETA today. I'll report back when we have one. @sauryadas @qike-ms

@schweikert

What may also be interesting is that it apparently takes much longer to add a node (41 minutes) than to create a single-node cluster (18 minutes). I tested with Standard_B2s VMs.

@JackQuincy

@schweikert I think that is a red herring where the original cluster is in a bad state. Our additive scale code does essentially the same things as create, minus a couple of steps. There are some small additional things we do, but those won't be time-consuming. I hope to address the nodes going unready as I work on the cluster autoscaler, which I should start back up next week. Not sure on the ETA, though.

@A141864

A141864 commented Aug 21, 2018

Any updates?

@jnoller

jnoller commented Apr 3, 2019

Closing as solved/stale

@jnoller jnoller closed this as completed Apr 3, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 7, 2020