keda-operator-metrics-apiserver restarts frequently #2058

Closed
rangarb885 opened this issue Aug 26, 2021 · 10 comments
Labels
bug (Something isn't working) · stale (All issues that are marked as stale due to inactivity)

Comments

rangarb885 commented Aug 26, 2021

Report

keda-operator-metrics-apiserver gets restarted frequently when there are 6 SQS scalers. The metric is polled from SQS every 30s. This causes metrics to be missing for the window of time while the pod restarts.
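For context, a minimal sketch of what one of these ScaledObjects could look like; the queue URL, region, target value, and TriggerAuthentication name are hypothetical placeholders, while the name/namespace follow the HPA event quoted further down:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: adapter          # matches the scaledobject.keda.sh/name label in the HPA event below
  namespace: sbx
spec:
  scaleTargetRef:
    name: adapter        # hypothetical target Deployment
  pollingInterval: 30    # metric polled from SQS every 30s
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue  # hypothetical
        queueLength: "5"                                                     # hypothetical target
        awsRegion: us-east-1                                                 # hypothetical
      authenticationRef:
        name: keda-aws-credentials                                           # hypothetical TriggerAuthentication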

Expected Behavior

keda-operator-metrics-apiserver needs to be stable

Actual Behavior

keda-operator-metrics-apiserver restarts frequently.

kubectl get pods -n keda                                                                                                                                                                                                            
NAME                                               READY   STATUS    RESTARTS   AGE
keda-operator-75fdf45c49-8jqq5                     1/1     Running   0          7h20m
keda-operator-metrics-apiserver-596d7cf844-vz4dx   1/1     Running   3          7h20m

Steps to Reproduce the Problem

  1. Set up KEDA 2.4.0 using Helm on EKS 1.21 (install commands sketched below the list)
  2. Create 6 SQS-based scalers for a deployment
  3. Send a volume of messages to the SQS queue to trigger scaling
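A sketch of the install commands for step 1 (release name and namespace are the common defaults, not taken from the issue; the chart version is assumed to match the app version):

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
kubectl create namespace keda
helm install keda kedacore/keda --namespace keda --version 2.4.0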

Logs from KEDA operator

Strangely, there is no log on the metrics server when it restarts.

Below is the event from the HPA, which clearly shows the metrics server has some issue.

Warning  FailedComputeMetricsReplicas  35m (x8 over 56m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get AWS-SQS-Queue-adapter external metric: unable to get external metric sbx/AWS-SQS-Queue-adapter/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: adapter,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get AWS-SQS-Queue-adapter.external.metrics.k8s.io)
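For reference, these warnings can be pulled with something like the commands below; the HPA name is an assumption based on KEDA's keda-hpa-<scaledobject-name> naming convention:

kubectl describe hpa keda-hpa-adapter -n sbx
kubectl get events -n sbx --field-selector involvedObject.kind=HorizontalPodAutoscaler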

Also, during that time, when I tried to fetch the metric using kubectl, I received "Unable to connect to the server: unexpected EOF":

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/sbx/aws-sqs-queue-adapter?labelSelector=scaledobject.keda.sh%2Fname%3Dadapter" |  jq .

Unable to connect to the server: unexpected EOF

KEDA Version

2.4.0

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

SQS

Anything else?

General questions:

  • Do we have an option to set memory/CPU requests/limits for this metrics server?
  • What is the name of the Helm attribute to set the log level to debug for both the operator and the metrics server?
  • Can the metrics server and the operator be scaled to 3 replicas to provide high availability? Basically, a failure/restart of the metrics server causes scaling to stop, since there is no metric for the HPA to act on.
rangarb885 added the bug label Aug 26, 2021
rangarb885 (Author) commented Aug 26, 2021

Now I can see this error in the KEDA metrics server log:

I0826 05:00:09.067255       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="8.395775361s" userAgent="kube-probe/1.21+" srcIP="100.100.68.237:39662" resp=500 statusStack="\ngoroutine 6402 [running]:\nk8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc0013a59d0, 0x1f4)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/httplog/httplog.go:237 +0xcf\nk8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc0013a59d0, 0x1f4)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/httplog/httplog.go:216 +0x35\nk8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc000bfdfa0, 0x1f4)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/timeout.go:226 +0xb2\nnet/http.Error(0x7fbbf48bb638, 0xc001152598, 0xc00012c400, 0x33, 0x1f4)\n\t/usr/local/go/src/net/http/server.go:2054 +0x1f6\nk8s.io/apiserver/pkg/endpoints/handlers/responsewriters.InternalError(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00, 0x31cf780, 0xc000ac2980)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/handlers/responsewriters/errors.go:75 +0x11e\nk8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filters/authorization.go:69 +0x4d4\nnet/http.HandlerFunc.ServeHTTP(0xc0002a5e80, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186\nnet/http.HandlerFunc.ServeHTTP(0xc00012e440, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/maxinflight.go:184 +0x4cf\nnet/http.HandlerFunc.ServeHTTP(0xc000a5be60, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:95 +0x165\nnet/http.HandlerFunc.ServeHTTP(0xc000a5be90, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filters/impersonation.go:50 +0x23dd\nnet/http.HandlerFunc.ServeHTTP(0xc00012e500, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186\nnet/http.HandlerFunc.ServeHTTP(0xc00012e580, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:95 +0x165\nnet/http.HandlerFunc.ServeHTTP(0xc000a5bec0, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7fbbf48bb638, 
0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186\nnet/http.HandlerFunc.ServeHTTP(0xc00012e640, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:95 +0x165\nnet/http.HandlerFunc.ServeHTTP(0xc000a5bf20, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4f00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filters.WithAuthentication.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4e00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filters/authentication.go:70 +0x6d2\nnet/http.HandlerFunc.ServeHTTP(0xc0009b8f50, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4e00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7fbbf48bb638, 0xc001152598, 0xc0000c4d00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:80 +0x38a\nnet/http.HandlerFunc.ServeHTTP(0xc00012e6c0, 0x7fbbf48bb638, 0xc001152598, 0xc0000c4d00)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1(0xc001407bc0, 0xc000e3a040, 0x3230ba0, 0xc001152598, 0xc0000c4d00)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/timeout.go:111 +0xb8\ncreated by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/timeout.go:97 +0x1cc\n" addedInfo="\nlogging error output: \"Internal Server Error: \\\"/healthz\\\": context canceled\\n\"\n"

E0826 05:00:12.323028       1 webhook.go:202] Failed to make webhook authorizer request: webhook call failed: context canceled

Even for /readyz, I saw a similar "webhook call failed" error.

This happens when metrics from multiple scalers start to report values, i.e. when there is a surge in the SQS available-message count and scaling needs to happen.

Additional logs:

E0826 05:26:33.038633       1 webhook.go:202] Failed to make webhook authorizer request: webhook call failed: context canceled
E0826 05:26:35.062674       1 errors.go:77] webhook call failed: context canceled
I0826 05:26:35.062542       1 httplog.go:89] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="4.608199921s" userAgent="Go-http-client/2.0" srcIP="100.100.91.220:46058" resp=200
I0826 05:26:37.167736       1 httplog.go:89] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1?timeout=32s" latency="4.129833753s" userAgent="kube-controller-manager/v1.21.2 (linux/amd64) kubernetes/8a4e27b/system:serviceaccount:kube-system:generic-garbage-collector" srcIP="100.100.91.220:52886" resp=500 statusStack="\ngoroutine 19207 [running]:\nk8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc000975ce0, 0x1f4)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/httplog/httplog.go:237 +0xcf\nk8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc000975ce0, 0x1f4)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/httplog/httplog.go:216 +0x35\nk8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc000b34cc0, 0x1f4)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/timeout.go:226 +0xb2\nnet/http.Error(0x7f0435b3c298, 0xc000f77118, 0xc0016ae980, 0x71, 0x1f4)\n\t/usr/local/go/src/net/http/server.go:2054 +0x1f6\nk8s.io/apiserver/pkg/endpoints/handlers/responsewriters.InternalError(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900, 0x31cf780, 0xc000b352c0)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/handlers/responsewriters/errors.go:75 +0x11e\nk8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filters/authorization.go:69 +0x4d4\nnet/http.HandlerFunc.ServeHTTP(0xc0002b3240, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186\nnet/http.HandlerFunc.ServeHTTP(0xc0002b3480, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/maxinflight.go:184 +0x4cf\nnet/http.HandlerFunc.ServeHTTP(0xc000a9b830, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:95 +0x165\nnet/http.HandlerFunc.ServeHTTP(0xc000a9b860, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filters/impersonation.go:50 +0x23dd\nnet/http.HandlerFunc.ServeHTTP(0xc0002b34c0, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186\nnet/http.HandlerFunc.ServeHTTP(0xc0002b3500, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:95 +0x165\nnet/http.HandlerFunc.ServeHTTP(0xc000a9b890, 0x7f0435b3c298, 0xc000f77118, 
0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186\nnet/http.HandlerFunc.ServeHTTP(0xc0002b3540, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:95 +0x165\nnet/http.HandlerFunc.ServeHTTP(0xc000a9b8f0, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca900)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filters.WithAuthentication.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca800)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filters/authentication.go:70 +0x6d2\nnet/http.HandlerFunc.ServeHTTP(0xc000b16ff0, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca800)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7f0435b3c298, 0xc000f77118, 0xc0016ca700)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/endpoints/filterlatency/filterlatency.go:80 +0x38a\nnet/http.HandlerFunc.ServeHTTP(0xc0002b3580, 0x7f0435b3c298, 0xc000f77118, 0xc0016ca700)\n\t/usr/local/go/src/net/http/server.go:2042 +0x44\nk8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1(0xc0014a5200, 0xc0002fb340, 0x3230ba0, 0xc000f77118, 0xc0016ca700)\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/timeout.go:111 +0xb8\ncreated by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP\n\t/go/pkg/mod/k8s.io/apiserver@v0.20.8/pkg/server/filters/timeout.go:97 +0x1cc\n" addedInfo="\nlogging error output: \"Internal Server Error: \\\"/apis/external.metrics.k8s.io/v1beta1?timeout=32s\\\": webhook call failed: context canceled\\n\"\n"

arschles (Contributor) commented:

@rangarb885 to answer your questions first:

Do we have an option to set memory/CPU requests/limits for this metrics server?

Yes, they are set here, so the easiest way would be to write your own values.yaml file with a resources section that looks similar to this, and then apply it.
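For illustration, a minimal values.yaml override could look like the sketch below. The exact key layout is an assumption and depends on the chart version (older charts expose a single top-level resources block, newer ones split it per component), so check the chart's default values.yaml first:

resources:
  metricServer:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: "1"
      memory: 1000Mi

and then apply it with something like: helm upgrade keda kedacore/keda --namespace keda -f values.yaml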

What is the name of the Helm attribute to set the log level to debug for both the operator and the metrics server?

logging.metricServer.level - see here for more details.
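For example, something like the following should work. logging.metricServer.level is the value named above; logging.operator.level and the exact value semantics (a zap level such as debug for the operator, a klog verbosity number for the metrics server) are assumptions based on the chart docs, so double-check them against your chart version:

helm upgrade keda kedacore/keda --namespace keda \
  --set logging.operator.level=debug \
  --set logging.metricServer.level=4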

Can the metrics server and the operator be scaled to 3 replicas to provide high availability? Basically, a failure/restart of the metrics server causes scaling to stop, since there is no metric for the HPA to act on.

Not currently, but IIRC the operator could, because it elects a leader. I'll defer to @zroubalik and @tomkerkhove on whether they would want to take advantage of that.

On to the problem you filed this issue for: I'm not as familiar with the metrics server code as others, but my (slightly educated) hunch is that the max in-flight limit filter is to blame (see here). I'm still working on tracing the metrics server code down to that point, but it is in the stack trace you pasted. Perhaps others could chime in here.

ahmelsayed (Contributor) commented Aug 26, 2021

You should be able to scale the metrics server, as it uses a regular Kubernetes Service for load balancing.

The default MaxRequestsInFlight is 400, which seems large enough, unless the calls for metrics are getting blocked somewhere and the /healthz calls then start failing.
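As a sketch, scaling it manually would look something like this (the Service and Deployment names are taken from the pod listing above and assumed to match; note that later comments in this thread point to the KEDA HA docs, which limit the metrics server to a single replica):

kubectl get svc keda-operator-metrics-apiserver -n keda
kubectl scale deployment keda-operator-metrics-apiserver -n keda --replicas=2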

rangarb885 (Author) commented:

Thank you @arschles & @ahmelsayed.
In my case I have just 6 deployments in the environment, so MaxRequestsInFlight could be a problem here. Even when I scaled the metrics server to 2 pods, the same error repeated on both.

arschles (Contributor) commented Aug 27, 2021

You should be able to scale the metrics server, as it uses a regular Kubernetes Service for load balancing.

@ahmelsayed Good to know. Is there any technical requirement that the metrics server be a singleton? If not, does it make sense to update the Helm chart to allow for N>1 replicas?

JorTurFer (Member) commented Aug 31, 2021

You should be able to scale the metrics server, as it uses a regular Kubernetes Service for load balancing.

@ahmelsayed Good to know. Is there any technical requirement that the metrics server be a singleton? If not, does it make sense to update the Helm chart to allow for N>1 replicas?

If it is possible to have more than one instance of the metrics server, I think we should say so explicitly, because it's important for maintaining HA. I thought that 1 instance was a technical requirement :(

JorTurFer (Member) commented:

I knew I had read something about a metrics server limitation, and indeed I did, in the KEDA docs xD
kubernetes-sigs/custom-metrics-apiserver#70

tomkerkhove (Member) commented:

Yes, it's documented in our HA docs indeed: https://keda.sh/docs/latest/operate/cluster/#high-availability

stale bot commented Oct 31, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Oct 31, 2021
stale bot commented Nov 7, 2021

This issue has been automatically closed due to inactivity.

stale bot closed this as completed Nov 7, 2021