AKS connection timeout on scaling #381
Comments
👍 Manual scaling is causing timeouts for us as well.
I see the same thing. When scaling the cluster (either using the portal UI or “az aks scale -n avaks1 -g avaks1 -c 3”), the connection to the API server endpoint (e.g. https://avaks1-avaks1-c9c8ae-2c1b8043.hcp.eastus.azmk8s.io/) gets closed or reset. This happens within about 10-15 seconds of starting the scaling operation and lasts for a few seconds. It almost feels like something happens to the LB in front of the masters that causes existing socket connections to terminate. You can easily repro the problem by running “kubectl get pods -n kube-system --watch” while scaling the cluster at the same time: the --watch terminates with a connection error. At the same time other connections (“az aks browse”) also get broken and need to be restarted. I tried browsing the API server endpoint at the same moment and saw it return ERR_CONNECTION_CLOSED in the browser. Something is happening that severs API server connectivity during the scale. The existing pods themselves remain running just fine (i.e. they don’t move between nodes).
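A minimal sketch of that repro, assuming an existing cluster reachable via kubectl; the `avaks1` names match the commands quoted in this thread, while the log file and the background-process check are illustrative additions:

```bash
#!/usr/bin/env bash
# Hold a long-lived watch against the API server while a manual scale runs.
# Cluster/resource-group name "avaks1" is the example used in this thread.
kubectl get pods -n kube-system --watch > watch.log 2>&1 &
WATCH_PID=$!

# Trigger the manual scale while the watch connection is open.
az aks scale -n avaks1 -g avaks1 -c 3

# If API server connectivity was severed during the scale, the watch process
# has exited and its log should end with a connection closed/reset error.
if ! kill -0 "$WATCH_PID" 2>/dev/null; then
  echo "watch terminated during scale; last lines of output:"
  tail -n 3 watch.log
else
  echo "watch survived the scale operation"
  kill "$WATCH_PID"
fi
```

The same disconnect reportedly hits any open connection to the API server endpoint at that moment (e.g. `az aks browse` or the HTTPS endpoint in a browser).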
@seanmck FYI @JackQuincy
@khenidak Can we get more information on when this will be resolved?
This is one thing we are tracking. We don't have an ETA today. I'll report back when we have one. @sauryadas @qike-ms
What may also be interesting: adding a node apparently takes much longer (41 minutes) than creating a single-node cluster (18 minutes). I tested with Standard_B2s VMs.
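A rough way to compare the two durations (a sketch only; the `timing-test` / `timing-rg` names are placeholders, and it assumes the resource group already exists):

```bash
# Time creating a single-node cluster (placeholder names; assumes the
# resource group "timing-rg" already exists).
time az aks create -n timing-test -g timing-rg \
  --node-count 1 --node-vm-size Standard_B2s --generate-ssh-keys

# Time scaling the same cluster from 1 to 2 nodes and compare wall-clock times.
time az aks scale -n timing-test -g timing-rg -c 2
```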
@schweikert I think that is a red herring caused by the original cluster being in a bad state. Our additive scale code literally does the same things as create, minus a couple of steps; there are some small additional things we do, but those won't be time consuming. I hope to address the nodes going unready as I work on the cluster autoscaler, which I should start back up next week. Not sure on the ETA though.
Any updates?
Closing as solved/stale |
While investigating the autoscaling issue with AKS, it was found that manual scaling also causes the same problem. Raising this bug to track the scaling connection timeout with AKS.
Here is the reference bug, which was found during autoscaler testing:
kubernetes/autoscaler#849
cc @feiskyer @khenidak @slack