CPUThrottlingHigh alert for metrics-server-nanny (addon-resizer:1.8.11-gke.0) #4141

pgier · 2021-06-14T14:41:48Z

Which component are you using?: addon-resizer and metrics-server

What version of the component are you using?: 1.8.11-gke.0


What k8s version are you using (kubectl version)?:

kubectl version Output

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.9-gke.1900", GitCommit:"008fd38bf3dc201bebdd4fe26edf9bf87478309a", GitTreeState:"clean", BuildDate:"2021-04-14T09:22:08Z", GoVersion:"go1.15.8b5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.19) exceeds the supported minor version skew of +/-1

What environment is this in?:

GKE version 1.19.9-gke.1900

What did you expect to happen?:

Expected no Alert from this component.

What happened instead?:

Received the Alert, and the metrics seem to indicate that throttling is regularly over 90%.
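
For context, the CPUThrottlingHigh alert shipped with kube-prometheus is, as far as I know, based on the fraction of CFS periods in which a container was throttled, derived from the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. A minimal Python sketch of that ratio follows. The function name and the sample counter values are made up for illustration and are not measurements from this cluster.

```python
def throttled_fraction(periods_start, periods_end, throttled_start, throttled_end):
    """Fraction of CFS periods in the window during which the container was throttled."""
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0:
        return 0.0  # no scheduling periods elapsed, so there is nothing to report
    return throttled / periods


# Hypothetical counter samples taken at the start and end of one window.
fraction = throttled_fraction(
    periods_start=10_000, periods_end=13_000,
    throttled_start=4_000, throttled_end=6_760,
)
print(f"{fraction:.0%}")
```

A result above 0.9 would correspond to the "over 90%" throttling described here.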

How to reproduce it (as minimally and precisely as possible):

Create basic GKE cluster and install the kube-prometheus monitoring stack.

Anything else we need to know?:

Could be related to issue #3833

The metrics-server-nanny logs look fine; however, the metrics-server container in the same pod logs many errors saying that metrics cannot be collected:

I0614 02:22:57.188713       1 log.go:172] http: TLS handshake error from 10.128.0.15:33630: EOF
I0614 02:22:57.192135       1 log.go:172] http: TLS handshake error from 10.128.0.14:36984: EOF
I0614 02:22:57.389307       1 log.go:172] http: TLS handshake error from 10.128.0.14:36960: EOF
I0614 02:22:57.988045       1 log.go:172] http: TLS handshake error from 10.68.3.1:52464: EOF
I0614 02:22:57.988154       1 log.go:172] http: TLS handshake error from 10.68.3.1:52474: EOF
E0614 02:22:58.288218       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/event-exporter-gke-67986489c8-vmg4j: no metrics known for pod
E0614 02:22:58.288278       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-dns-6c7b8dc9f9-sn7p5: no metrics known for pod
E0614 02:22:58.288287       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/fluentbit-gke-wxqcn: no metrics known for pod
E0614 02:22:58.288293       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-gke-paul-test-1-default-pool-1166a98e-mx9n: no metrics known for pod
E0614 02:22:58.288299       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/stackdriver-metadata-agent-cluster-level-68ffcbb78c-sqwtn: no metrics known for pod
E0614 02:22:58.288305       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/pdcsi-node-kfnfj: no metrics known for pod
E0614 02:22:58.288312       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-gke-paul-test-1-default-pool-1166a98e-3zkp: no metrics known for pod
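
To get a rough picture of which messages dominate a log like the one above, the lines can be tallied by severity and source file. The sketch below assumes the standard klog header layout (severity letter, month and day, time, thread id, file:line). The regular expression and the `tally` helper are illustrative and not part of metrics-server.

```python
import re
from collections import Counter

# klog header: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
KLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([^:\s]+):(\d+)\] (.*)$"
)


def tally(lines):
    """Count log lines per (severity, source file); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = KLOG_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(5))] += 1
    return counts


sample = [
    "I0614 02:22:57.188713       1 log.go:172] http: TLS handshake error from 10.128.0.15:33630: EOF",
    "E0614 02:22:58.288218       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-dns-6c7b8dc9f9-sn7p5: no metrics known for pod",
]
counts = tally(sample)
for (severity, source), count in counts.most_common():
    print(severity, source, count)
```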

I created a separate issue for the metrics-server: kubernetes-sigs/metrics-server#783


LucasRouckhout · 2021-06-22T13:34:29Z

We are seeing the same issue on GKE: more than 90% CPU throttling on the metrics-server pod and the same logs as you have posted. I will let you know if I discover anything. This sounds similar to a known issue surrounding CFS quotas.

kubernetes/kubernetes#67577
https://bugzilla.kernel.org/show_bug.cgi?id=198197
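
As background on CFS quotas: a container CPU limit is enforced as a quota of CPU time per scheduling period, and the default period is 100 ms. The sketch below shows that arithmetic only. The 100m limit is a hypothetical value, since the actual limit on the affected container is not stated in this thread.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms, in microseconds


def cfs_quota_us(cpu_limit_millicores, period_us=CFS_PERIOD_US):
    """CPU time (in microseconds) the container may use in each CFS period."""
    return cpu_limit_millicores * period_us // 1000


limit_m = 100  # hypothetical limit, for illustration only
quota = cfs_quota_us(limit_m)
print(f"{limit_m}m -> {quota} us of CPU time per {CFS_PERIOD_US} us period")
```

A container that needs more than its quota within a single period is paused until the next period begins, so short bursts can register as throttling even when average CPU usage is low.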

gboor · 2021-07-08T14:47:00Z

I just ran into this as well. For now I've disabled the alert, but it would be nice to get some sort of fix. Strangely this suddenly started happening on a cluster that is over a year old. 8 hours ago the metrics-server restarted (same version) and these alerts started popping up... no idea why it didn't happen before.

obataku · 2021-07-12T21:32:59Z

I came across this a couple of months ago, and I believe the problem should be resolved by #3833 (in 1.8.12) and/or #4112 (which has since landed in 1.8.14). If you manage your own metrics-server deployment, you can pass --use-metrics=false and/or upgrade the addon-resizer image used for metrics-server-nanny; those using GKE will likely need to wait for a fix.
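
For anyone checking their own deployment, here is a rough sketch that compares an addon-resizer image tag against 1.8.14, the release mentioned above. The helper names are hypothetical, and vendor builds such as 1.8.11-gke.0 may carry backported patches that a plain version comparison cannot detect.

```python
import re

FIXED_IN = (1, 8, 14)  # release said above to include #4112 (and #3833 from 1.8.12)


def parse_version(tag):
    """Return (major, minor, patch) from a tag such as '1.8.11-gke.0'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def has_fixes(tag):
    """True if the upstream version in the tag is at least FIXED_IN."""
    return parse_version(tag) >= FIXED_IN


for tag in ("1.8.11-gke.0", "1.8.12", "1.8.14"):
    print(tag, has_fixes(tag))
```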

@LucasRouckhout:

We are seeing the same issue on GKE. > 90% CPU throttling on the metric server pod and same logs as you have posted. Will let you know if I discover anything. This sounds familiar to a known issue surrounding CFS quotas.

kubernetes/kubernetes#67577
https://bugzilla.kernel.org/show_bug.cgi?id=198197

The poll period was broken, so the nanny attempts to scrape the apiserver metrics endpoint (which exports tens of thousands of metrics) every 10 s rather than the intended 5 min:

      - command:
        - /pod_nanny
        - --config-dir=/etc/config
        - --cpu=40m
        - --extra-cpu=0.5m
        - --memory=35Mi
        - --extra-memory=4Mi
        - --threshold=5
        - --deployment=metrics-server-v0.3.6
        - --container=metrics-server
        - --poll-period=300000
        - --estimator=exponential
        - --scale-down-delay=24h
        - --minClusterSize=5
        - --use-metrics=true
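
To put the two intervals in perspective, the sketch below compares how often the apiserver metrics endpoint is polled at each one. It assumes --poll-period is given in milliseconds, which is consistent with 300000 being the intended 5 min; the function name is illustrative.

```python
def polls_per_hour(poll_period_ms):
    """Number of polls in one hour for a given poll period in milliseconds."""
    return 3_600_000 / poll_period_ms


configured_ms = 300_000  # --poll-period from the command above (5 min)
effective_ms = 10_000    # the 10 s interval described above

ratio = polls_per_hour(effective_ms) / polls_per_hour(configured_ms)
print(polls_per_hour(configured_ms), polls_per_hour(effective_ms), ratio)
```

Under these assumptions the endpoint is scraped 30 times more often than configured, which would plausibly account for the extra CPU demand.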

bskiba · 2021-07-13T08:21:41Z

Release 1.8.14 has just come out; it contains the fixes for both #3833 and #4112: https://github.com/kubernetes/autoscaler/releases/tag/addon-resizer-1.8.14

Could you verify if this fixes the issue with CPU?

k8s-triage-robot · 2021-12-16T10:15:19Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2022-01-15T10:28:27Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2022-02-14T10:32:24Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-02-14T10:32:43Z

@k8s-triage-robot: Closing this issue.


pgier added the kind/bug label on Jun 14, 2021

jbartosik added the area/addon-resizer label on Sep 15, 2021

k8s-ci-robot added the lifecycle/stale label on Dec 16, 2021

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 15, 2022

k8s-ci-robot closed this as completed on Feb 14, 2022
