
Autoscaling does not work on Azure Kubernetes Service #1730

Closed
krancour opened this issue Jul 27, 2018 · 12 comments

@krancour
Contributor

Followed official instructions for installing on AKS then proceeded to deploy hello world following these instructions.

No amount of traffic causes the deployment to scale out. No amount of idle time causes the deployment to scale to zero. It's fixed, perpetually, at a single replica.

I've examined logs from all (of what I think are) the relevant components.

I do not believe that this is the cause, but the queue-proxy sidecar logs show that the component is unable to connect to the autoscaler. I don't believe this is a factor because the traffic in question is arriving via istio-proxy, but it's still an interesting data point:

{"level":"error","ts":"2018-07-27T00:41:21.229Z","logger":"queueproxy","caller":"queue/main.go:138","msg":"Stat sink not connected.","knative.dev/namespace":"default","knative.dev/configuration":"helloworld-go","knative.dev/revision":"helloworld-go-00001","knative.dev/pod":"helloworld-go-00001-deployment-84887f66d7-kl2rb","stacktrace":"main.statReporter\n\t/go/src/github.com/knative/serving/cmd/queue/main.go:138"}

Perhaps more relevant are autoscaler logs. I do not refer to the revision's autoscaler-- that shows no errors. I refer to the shared autoscaler controller. At startup it shows numerous messages stating that the apiserver has refused the connection:

E0726 22:10:58.482320       1 reflector.go:205] github.com/knative/serving/pkg/client/informers/externalversions/factory.go:72: Failed to list *v1alpha1.Revision: Get https://10.0.0.1:443/apis/serving.knative.dev/v1alpha1/revisions?limit=500&resourceVersion=0: dial tcp 10.0.0.1:443: connect: connection refused

But very quickly, those disappear and are replaced with another, similar error message that the controller continues to experience at (I estimate) 60s intervals:

E0726 22:12:20.541687       1 reflector.go:205] github.com/knative/serving/pkg/client/informers/externalversions/factory.go:72: Failed to list *v1alpha1.Revision: an error on the server ("") has prevented the request from succeeding (get revisions.serving.knative.dev)

apiserver logs aren't easy to get at in AKS (yet), but I managed to wire them into Azure Log Analytics to track down the alleged server error-- and there isn't one. I guess that's not surprising, given that the earlier messages hinted that the connection was actually refused.

To thicken the plot a bit, other controllers are having no difficulty talking to the apiserver, and if my reading of the autoscaler code is correct, that controller also watches other resource types, yet the only one that ever produces errors is Revisions.

This seems to me, more than likely, to be an AKS issue. What I'm primarily looking for is additional guidance in narrowing down the root cause. Is there anything else I can look at?

@krancour
Contributor Author

I should add that I have tried this multiple times with multiple AKS clusters and always encountered the same results. The exact same procedure on minikube leads to autoscaling working exactly as expected.

@mattmoor
Member

/area autoscale
/assign @josephburnett

@krancour
Contributor Author

Fast forward a couple months and I'm troubleshooting this same issue again. It seems to me that a lot has changed since the v0.1.1 release (for instance-- the multi-tenant autoscaler), so I am deploying knative serving from the head of master using instructions from DEVELOPMENT.md.

Not much has changed from a few months ago. I see quite a variety of failures to list different resource types. For example:

E1019 00:57:58.252841       1 reflector.go:205] github.com/knative/serving/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ConfigMap: Get https://10.0.0.1:443/api/v1/namespaces/knative-serving/configmaps?limit=500&resourceVersion=0: dial tcp 10.0.0.1:443: connect: connection refused

I do have a new data point, however. What I have discovered is that this does not appear to be an AKS issue: if I edit the autoscaler deployment and remove the annotation that enables injection of an istio-proxy sidecar, the controller works fine.
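For reference, the annotation I removed lives on the autoscaler deployment's pod template. A sketch from memory (the surrounding manifest is elided; only the relevant annotation is shown):

spec:
  template:
    metadata:
      annotations:
        # Removing this annotation (or setting it to "false") stops the Istio
        # sidecar injector from adding an istio-proxy container to the pod.
        sidecar.istio.io/inject: "true"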

I sense that this is probably not the right thing to do, so I tried instead to set istio.sidecar.includeOutboundIPRanges to include all of my in-cluster IPs while excluding 10.0.0.1, which is the service IP for the Kubernetes apiserver. That is, I used these ranges (properly comma-delimited in practice; the format below is just for readability, and a sketch of the resulting ConfigMap follows the list):

10.0.0.2/31
10.0.0.4/30
10.0.0.8/29
10.0.0.16/28
10.0.0.32/27
10.0.0.64/26
10.0.0.128/25
10.0.1.0/24
10.0.2.0/23
10.0.4.0/22
10.0.8.0/21
10.0.16.0/20
10.0.32.0/19
10.0.64.0/18
10.0.128.0/17

Note that comments in config/config-network.yaml suggest these CIDR ranges for "Azure Container Service":

Azure Container Service(ACS): "10.244.0.0/16,10.240.0.0/16"

Note that ACS is an older product that has been replaced by AKS, which uses IPs in the range 10.0.0.0/16.
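For concreteness, here is a sketch of the change as applied to that ConfigMap (assuming the default knative-serving namespace from the standard install):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-network
  namespace: knative-serving
data:
  # Everything in-cluster except 10.0.0.1, the apiserver's ClusterIP.
  istio.sidecar.includeOutboundIPRanges: "10.0.0.2/31,10.0.0.4/30,10.0.0.8/29,10.0.0.16/28,10.0.0.32/27,10.0.0.64/26,10.0.0.128/25,10.0.1.0/24,10.0.2.0/23,10.0.4.0/22,10.0.8.0/21,10.0.16.0/20,10.0.32.0/19,10.0.64.0/18,10.0.128.0/17"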

There are at least three problems of differing magnitude here.

  1. Docs need updating for AKS (minor)
  2. Excluding 10.0.0.1 instead of disabling the sidecar should have worked (moderate)
  3. I shouldn't have to disable the istio-proxy sidecar or exclude 10.0.0.1 in the first place (major)

One other data point I can offer is that the apiserver that the service address 10.0.0.1 points to is technically off-cluster, in the "hosted control plane." (AKS offers the Kubernetes control plane as a service, while you simply provide worker nodes.) I feel like this might be a factor in some way, but it seems I'm at the limit of my Knative and Istio knowledge here and I can't quite prove that.

If someone wouldn't mind helping me with this, I'd really appreciate it. At present Knative simply does not work on Azure Kubernetes Service and I'd imagine that's as big a problem for Knative as it is for Azure.

Thanks for any help!

@krancour
Contributor Author

Ok... a little bit of progress this morning...

I understand now that the istio.sidecar.includeOutboundIPRanges setting in config-network.yaml only affects the istio-proxy for revision pods. It has no bearing at all on the istio-proxy sidecar used by the autoscaler-- that is configured elsewhere and likely needs a tweak to work well on AKS.

I'll keep looking.

@krancour
Contributor Author

Ok... so if I manipulate the traffic.sidecar.istio.io/includeOutboundIPRanges annotation on the autoscaler deployment's pod template directly to exclude 10.0.0.1 (which is where my apiserver is), things work.
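To be concrete, the edit amounts to something like this on the autoscaler deployment's pod template (a sketch; only the relevant annotations are shown, and the CIDR value is the same list from my earlier comment, truncated here for brevity):

spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        # Only these in-cluster ranges are captured by the sidecar; 10.0.0.1
        # (the apiserver's ClusterIP) is deliberately excluded, so apiserver
        # traffic bypasses istio-proxy.
        traffic.sidecar.istio.io/includeOutboundIPRanges: "10.0.0.2/31,10.0.0.4/30,10.0.0.8/29"  # ...plus the rest of the ranges listed earlier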

Then I found this...

istio/istio#8696

So it appears the root cause may lie within Istio and only manifests due to some idiosyncrasy of AKS.

@tcnghia
Contributor

tcnghia commented Oct 19, 2018

@krancour Thanks for reporting this. This could be fixed by also adding a ServiceEntry for that IP address. Is it a fixed IP address? A related issue is istio/istio#6146, which was recently fixed by adding a ServiceEntry "*" to allow egress traffic by default.
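For illustration, such a ServiceEntry could look roughly like this (a sketch only; the name and namespace are placeholders, and the hosts/protocol settings may need adjusting for your cluster):

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: kube-apiserver          # placeholder name
  namespace: knative-serving
spec:
  hosts:
  - kubernetes.default.svc.cluster.local
  addresses:
  - 10.0.0.1/32                 # the apiserver ClusterIP mentioned above
  ports:
  - number: 443
    name: tcp-https
    protocol: TCP
  location: MESH_EXTERNAL
  resolution: NONE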

If you are on the latest code, you could also try running with istio-lean.yaml, which does not have a sidecar injector webhook, to avoid using an Istio mesh. Knative functionality should not be affected.

@krancour
Contributor Author

Hey @tcnghia

This could be fixed by also adding a ServiceEntry for that IP address.

I considered that, but honestly struggled with it a bit... I felt the docs for ServiceEntry didn't cover this scenario well.

Is it a fixed IP address?

In AKS, your Kubernetes API server is always reachable at the ClusterIP 10.0.0.1, but its endpoint is an off-cluster IP. I'm not clear whether you are suggesting a ServiceEntry for 10.0.0.1 or for the off-cluster IP. If I understand correctly, since the autoscaler is trying to talk to 10.0.0.1, that is what is being intercepted by the istio-proxy sidecar. Let me know if I have this wrong.
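To illustrate what I mean, the default kubernetes Service and its Endpoints on AKS look roughly like this (the external address is a placeholder, not from a real cluster):

apiVersion: v1
kind: Service
metadata:
  name: kubernetes
  namespace: default
spec:
  clusterIP: 10.0.0.1           # what the autoscaler actually dials
  ports:
  - name: https
    port: 443
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: kubernetes
  namespace: default
subsets:
- addresses:
  - ip: 203.0.113.10            # placeholder for the hosted control plane's off-cluster IP
  ports:
  - name: https
    port: 443
    protocol: TCP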

If you are on the latest code, you could also try running with istio-lean.yaml, which does not have a sidecar injector webhook, to avoid using an Istio mesh. Knative functionality should not be affected.

I will try this. As a follow-up question, however: why (with the usual configuration) does the autoscaler use an istio-proxy sidecar at all? What specific benefit does it provide to that particular component? I notice that the knative-controller does not use it, so I'm trying to understand the rationale that requires one controller to use it and another not to.

Lastly, but probably most importantly, I am wondering why any of this is an issue at all in the first place. In other environments, everything just works. In AKS, I face these issues, yet I can't see anything AKS is doing that is appreciably different from what goes on in other environments.

@tcnghia
Contributor

tcnghia commented Oct 19, 2018

Having the istio-proxy enables request tracing (and other monitoring features). In the past, users were only able to control Istio sidecar injection at the cluster level, at the namespace level, or in Pods they create directly, so we had to put annotations on the Pods we create.

Going forward, Istio has mechanisms that allow users to control this even in Pods they own but didn't directly create (and hence have no control over the Pod spec). This will allow Knative to remove all annotations relating to Istio installation and let users choose for themselves. Knative Serving doesn't strictly require the Istio mesh, so this should not affect functionality.

Regarding AKS, I am not sure why API server traffic is blocked by the Istio sidecar. This may be due to some assumptions Istio makes about the API server address, but I don't know for sure. /cc @sdake @lichuqiang

@krancour
Contributor Author

@tcnghia things seem to be working with istio-lean.yaml.

Can you confirm my understanding that at this point, the only thing Istio is being used for is ingress through the knative-ingressgateway and subsequent routing via virtualservice to my pod(s), which are now not running an istio-proxy sidecar?

@krancour
Contributor Author

@tcnghia actually... scaling works, but traffic doesn't make it through to my pods anymore after that change. Does the fact that the pods now lack an istio-proxy sidecar impair routing from the ingress gateway to the pods?

@krancour
Contributor Author

krancour commented Oct 19, 2018

$ curl -H "Host: helloworld-go.default.example.com" http://<load balancer ip>
upstream request timeout

@krancour
Contributor Author

@tcnghia I must have worked my cluster into a dirty state that defied my efforts to start fresh.

Using a new cluster, the following (more or less) yielded the desired results on AKS:

$ kubectl apply -f ./third_party/istio-1.0.2/istio-lean.yaml
$ kubectl apply -f ./third_party/config/build/release.yaml
$ ko apply -f config/

Neither the Knative Serving components nor my own pods have istio-proxy sidecars. Autoscaling works as expected, with the caveat that on scale to/from zero, route updates seem very slow to take effect; this can be attributed to a known issue: Azure/AKS#620

If you're keeping score, here's where we're at:

  1. On AKS, Istio interferes with controllers talking to the apiserver. This is a known issue, tracked in istio/istio#8696 (Internal Kubernetes API Calls Blocked by Istio).
  2. The workaround for the above is to use istio-lean.yaml-- this should probably be documented in the instructions for getting started on AKS. I'll open a separate issue and a PR for that.
  3. Performance issues on AKS that affect this are tracked in Azure/AKS#620 (Performance degradation for high levels of in-cluster kube-apiserver traffic).

So... I'm going to consider this issue resolved.
