Autoscaling does not work on Azure Kubernetes Service #1730
Followed the official instructions for installing on AKS, then proceeded to deploy hello world following these instructions.

No amount of traffic causes the deployment to scale out. No amount of idle time causes the deployment to scale to zero. It's fixed, perpetually, at a single replica.

I've examined logs from all (of what I think are) the relevant components.

I do not believe that this is the cause, but the `queue-proxy` sidecar logs show that component is unable to connect to the autoscaler. I don't believe this is a factor because the traffic in question is arriving via `istio-proxy`, but it's still an interesting data point:

Perhaps more relevant are the `autoscaler` logs. I do not refer to the revision's autoscaler-- that shows no errors. I refer to the shared `autoscaler` controller. At startup it shows numerous messages stating that the apiserver has refused the connection:

But very quickly, those disappear and are replaced with another, similar error message that the controller continues to experience at (I estimate) 60s intervals:

apiserver logs aren't easy to get at in AKS (yet), but I managed to wire them into Azure Log Analytics just to find the alleged server error-- there isn't one. I guess that's not surprising, given that the early messages hinted that the connection was actually refused.

To thicken the plot a bit, other controllers are having no difficulty talking to the apiserver, and if my reading of the autoscaler code is correct, that controller also watches other resources, but the only one for which there are ever any errors is `Revisions`.

This seems to me, more than likely, to be an AKS issue. What I'm primarily looking for is additional guidance in narrowing down the root cause. Is there anything else I can look at?
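For anyone retracing these steps, a minimal sketch of how the logs above could be pulled (the namespace and deployment names assume a stock knative-serving install and may differ):

```sh
# Sketch only: resource names assume a default knative-serving install.
# Shared autoscaler controller logs (where the connection-refused errors appear):
kubectl -n knative-serving logs deployment/autoscaler

# queue-proxy sidecar logs for a revision pod (pod name is a placeholder):
kubectl logs <revision-pod> -c queue-proxy

# Sanity check: can our own credentials list the resource the controller fails to watch?
kubectl get revisions.serving.knative.dev --all-namespaces
```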
I should add that I have tried this multiple times with multiple AKS clusters and always encountered the same results. The exact same procedure on minikube leads to autoscaling working exactly as expected.
/area autoscale
Fast forward a couple months and I'm troubleshooting this same issue again. It seems to me that a lot has changed since the v0.1.1 release (for instance-- the multi-tenant autoscaler), so I am deploying Knative Serving from the head of master using the instructions from DEVELOPMENT.md. Not too much has changed from a few months ago. I see quite a variety of failures to list different resource types. For example:
I do have a new datapoint, however. What I have discovered is that this does not appear to be an AKS issue... If I edit the … I sense that this is probably not the right thing to do, so I've tried instead to set …
Note, comments in … Note that ACS is an older product that has been replaced by AKS, which uses IPs in the range 10.0.0.0/16. There are at least three problems of differing magnitude here.
One other data point I can offer is that the api-server that the service address …

If someone wouldn't mind helping me with this, I'd really appreciate it. At present, Knative simply does not work on Azure Kubernetes Service, and I'd imagine that's as big a problem for Knative as it is for Azure. Thanks for any help!
Ok... a little bit of progress this morning... I understand now that the … I'll keep looking.
Ok... so if I manipulate the … Then I found this... So it appears the root cause may lie within Istio and only manifests due to some idiosyncrasy of AKS.
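One way to see what traffic the injected sidecar is set up to intercept (a sketch assuming the standard `istio-init` container of Istio of this era; the pod name is a placeholder) is to inspect the init container's include-ranges argument:

```sh
# Assumes the standard istio-init container injected by Istio; its -i argument
# carries the CIDR ranges whose outbound traffic gets redirected through Envoy.
kubectl get pod <revision-pod> \
  -o jsonpath='{.spec.initContainers[?(@.name=="istio-init")].args}'
```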
@krancour Thanks for reporting this. This could be fixed by also adding a ServiceEntry for that IP address. Is it a fixed IP address? A related issue is istio/istio#6146, which was recently fixed by adding a ServiceEntry "*" to allow egress traffic by default. If you use the latest code, you could also try to run using …
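A minimal sketch of what such a ServiceEntry might look like (the host, address, and port below are assumptions on my part, not values from this thread; substitute the actual API server service IP):

```sh
# Sketch only: hosts/addresses/ports are placeholders, not values from this thread.
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: kube-apiserver
  namespace: knative-serving
spec:
  hosts:
  - kubernetes.default.svc.cluster.local
  addresses:
  - 10.0.0.1/32        # assumed in-cluster API server service IP
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  location: MESH_EXTERNAL
  resolution: NONE
EOF
```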
Hey @tcnghia
I considered that, but honestly I struggled with it a bit... I felt the docs for ServiceEntry didn't cover this scenario well.
In AKS, your Kubernetes API server is always a …
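Whatever form the AKS API server endpoint takes, a couple of standard commands show where the cluster thinks it lives (nothing AKS-specific assumed here):

```sh
# The ClusterIP that in-cluster clients (like the autoscaler) dial for the API server:
kubectl get svc kubernetes -n default
# The external endpoint your kubeconfig points at:
kubectl cluster-info
```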
I will try this. As a follow-up question, however, why (with the usual configuration) does the autoscaler use an …

Lastly, but probably most importantly, I am wondering why any of this is an issue at all in the first place. In other environments, everything just works. In AKS, I face these issues, yet I can't see anything that AKS is doing that is appreciably different from what goes on in other environments.
Having the …

Going forward, Istio has mechanisms to allow users to control this even in Pods that they own but didn't directly create (hence having no control over the Pod spec). This will allow Knative to remove all annotations relating to Istio installation and let the users choose for themselves. Knative Serving doesn't strictly require the Istio mesh, so these should not affect functionality.

Regarding AKS, I am not sure why API server traffic is blocked by the Istio sidecar. This may be due to some assumptions that Istio has about the API server address, but I don't know for sure.

/cc @sdake @lichuqiang
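One such mechanism is the `traffic.sidecar.istio.io/includeOutboundIPRanges` pod annotation, which limits which destination CIDRs the sidecar intercepts. A hedged sketch (the deployment name and CIDR are illustrative placeholders, not values from this thread):

```sh
# Sketch: restrict sidecar interception to an assumed pod CIDR so that
# API-server traffic bypasses Envoy. Names and ranges are illustrative only.
kubectl patch deployment my-app --type merge -p '
spec:
  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/includeOutboundIPRanges: "10.244.0.0/16"
'
```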
@tcnghia things seem to be working with … Can you confirm my understanding that at this point, the only thing Istio is being used for is ingress through the …
@tcnghia actually... scaling works, but traffic doesn't make it through to my pods anymore after that change. Does the fact that the pods now lack an …
@tcnghia I must have worked my cluster into a dirty state that defied my efforts to start fresh. Using a new cluster, the following (more or less) yielded the desired results on AKS:
Neither the Knative Serving components nor my own pods have … If you're keeping score, here's where we're at:
So... I'm going to consider this issue resolved.
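A rough way to verify the resolved behavior described above (the sample host header, ingress IP variable, and load pattern are assumptions on my part, not commands from this thread):

```sh
# Watch the revision's deployment scale out while driving traffic at the route.
# INGRESS_IP and the Host header are placeholders for the helloworld-go sample.
kubectl get deployments --all-namespaces -w &
while true; do
  curl -s -o /dev/null -H "Host: helloworld-go.default.example.com" "http://$INGRESS_IP/"
done
```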