
AKS connection timeout on scaling #381

Closed
kkmsft opened this issue May 21, 2018 · 9 comments
@kkmsft

kkmsft commented May 21, 2018

While investigating the autoscaling issue with AKS, it was found that manual scaling also causes the same problem. Raising this bug to track scaling connection timeouts with AKS.
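
For reference, the manual scale that reproduces this is just an ordinary node-count change via the Azure CLI; a minimal sketch (resource group, cluster name, and node count below are placeholders):

```sh
# Manually scale the AKS node count; substitute your own resource group,
# cluster name, and target node count (all placeholders here).
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 3
```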

Here is the reference bug which was found during auto scaler testing.
kubernetes/autoscaler#849

cc @feiskyer @khenidak @slack

@zzh8829

zzh8829 commented Jun 13, 2018

👍 Manual scaling is causing timeouts for us as well.

@arsenvlad

I see the same thing. When scaling the cluster (either through the portal UI or with “az aks scale -n avaks1 -g avaks1 -c 3”), the connection to the API server endpoint (e.g. https://avaks1-avaks1-c9c8ae-2c1b8043.hcp.eastus.azmk8s.io/) gets closed or reset.

This happens within about 10-15 seconds of starting the scaling operation and lasts for a few seconds. It almost feels like something happens with the LB in front of the masters that causes existing socket connections to terminate.

You can easily reproduce the problem by running “kubectl get pods -n kube-system --watch” while scaling the cluster at the same time. You will see the --watch terminate with the error shown below. At the same time, other connections such as “az aks browse” also get broken and need to be restarted.
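
For anyone trying this, a rough repro sketch along those lines, reusing the example cluster and resource group names from the command above:

```sh
# Terminal 1: hold a long-lived watch against the API server.
kubectl get pods -n kube-system --watch

# Terminal 2: scale the cluster while the watch is running
# (same example cluster/resource group names as above).
az aks scale -n avaks1 -g avaks1 -c 3

# Shortly after the scale starts, the watch in terminal 1 terminates with a
# connection closed/reset error and has to be restarted.
```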

I tried browsing the API server endpoint at the same moment and saw it return ERR_CONNECTION_CLOSED in the browser.

Something is happening that severs API server connectivity during the scale operation. The existing pods themselves keep running just fine (i.e. they don’t move between nodes).

[Screenshots: the kubectl --watch terminating with a connection error, and the browser showing ERR_CONNECTION_CLOSED.]

@khenidak

@seanmck FYI @JackQuincy

@paulashbourne

@khenidak Can we get more information on when this will be resolved?

@JackQuincy

This is one thing we are tracking. We don't have an ETA today. I'll report back when we have one. @sauryadas @qike-ms

@schweikert

What may also be interesting is that it apparently takes much longer to add a node (41 minutes) than to create a single-node cluster (18 minutes). I tested with Standard_B2s VMs.

@JackQuincy

@schweikert I think that is a red herring where the original cluster is in a bad state. Our additive scale code does essentially the same things as create, minus a couple of steps. There are some small additional things we do, but those won't be time-consuming. I hope to address the nodes going unready as I work on the cluster autoscaler, which I should start back up next week. Not sure on the ETA, though.

@A141864

A141864 commented Aug 21, 2018

Any updates?

@jnoller

jnoller commented Apr 3, 2019

Closing as solved/stale

@jnoller jnoller closed this as completed Apr 3, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 7, 2020