Cluster autoscaler version 1.16.0 doesn't notice pending pods #2345

Closed
seh opened this issue Sep 16, 2019 · 26 comments · Fixed by #2357

seh commented Sep 16, 2019

We've been using cluster autoscaler version 1.15.0 patched with #2008 for some time in AWS, to good effect. Today we attempted to put the new version 1.16.0 into service using all the same configuration, and found that it no longer seems to notice pending pods.

The cluster autoscaler starts fine, and the logs don't indicate anything failing. It goes through its periodic "main loop" and the "Regenerating instance to ASG map for ASGs" step regularly, again without any obvious problems. However, when we create pods that require that the cluster autoscaler take note and adjust a suitable ASG's size, the cluster autoscaler's logs don't show any evidence of it noticing these pods. In prior versions, we see messages to the following effect:

  • Pod ns/name is unschedulable
  • Pod name can't be scheduled on node, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient nvidia.com/gpu
  • Event(v1.ObjectReference{Kind:"Pod", Namespace:"ns", Name:"name", UID:"ae32b9cc-081e-4323-ba33-7810457a0ddf", APIVersion:"v1", ResourceVersion:"58735432", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{asg 0->13 (max: 125)}]

Instead, the new cluster autoscaler exhibits no reaction to these pending pods.

Here are the flag values reported at start time:

flags.go:52] FLAG: --add-dir-header="false"
flags.go:52] FLAG: --address=":8085"
flags.go:52] FLAG: --alsologtostderr="false"
flags.go:52] FLAG: --balance-similar-node-groups="false"
flags.go:52] FLAG: --cloud-config=""
flags.go:52] FLAG: --cloud-provider="aws"
flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
flags.go:52] FLAG: --cluster-name=""
flags.go:52] FLAG: --cores-total="0:320000"
flags.go:52] FLAG: --estimator="binpacking"
flags.go:52] FLAG: --expander="least-waste"
flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
flags.go:52] FLAG: --filter-out-schedulable-pods-uses-packing="true"
flags.go:52] FLAG: --gpu-total="[]"
flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
flags.go:52] FLAG: --ignore-taint="[]"
flags.go:52] FLAG: --kubeconfig=""
flags.go:52] FLAG: --kubernetes=""
flags.go:52] FLAG: --leader-elect="true"
flags.go:52] FLAG: --leader-elect-lease-duration="15s"
flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
flags.go:52] FLAG: --leader-elect-resource-name=""
flags.go:52] FLAG: --leader-elect-resource-namespace=""
flags.go:52] FLAG: --leader-elect-retry-period="2s"
flags.go:52] FLAG: --log-backtrace-at=":0"
flags.go:52] FLAG: --log-dir=""
flags.go:52] FLAG: --log-file=""
flags.go:52] FLAG: --log-file-max-size="1800"
flags.go:52] FLAG: --logtostderr="true"
flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
flags.go:52] FLAG: --max-empty-bulk-delete="10"
flags.go:52] FLAG: --max-failing-time="15m0s"
flags.go:52] FLAG: --max-graceful-termination-sec="600"
flags.go:52] FLAG: --max-inactivity="10m0s"
flags.go:52] FLAG: --max-node-provision-time="3m0s"
flags.go:52] FLAG: --max-nodes-total="0"
flags.go:52] FLAG: --max-total-unready-percentage="45"
flags.go:52] FLAG: --memory-total="0:6400000"
flags.go:52] FLAG: --min-replica-count="0"
flags.go:52] FLAG: --namespace="our-system"
flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=kubernetes.io/cluster-autoscaler/enabled,kubernetes.io/cluster/redacted]"
flags.go:52] FLAG: --nodes="[]"
flags.go:52] FLAG: --ok-total-unready-count="3"
flags.go:52] FLAG: --regional="false"
flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
flags.go:52] FLAG: --scale-down-delay-after-add="3m0s"
flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
flags.go:52] FLAG: --scale-down-enabled="true"
flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
flags.go:52] FLAG: --scale-down-non-empty-candidates-count="50"
flags.go:52] FLAG: --scale-down-unneeded-time="13m0s"
flags.go:52] FLAG: --scale-down-unready-time="7m0s"
flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
flags.go:52] FLAG: --scale-up-from-zero="true"
flags.go:52] FLAG: --scan-interval="10s"
flags.go:52] FLAG: --skip-headers="false"
flags.go:52] FLAG: --skip-log-headers="false"
flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
flags.go:52] FLAG: --skip-nodes-with-system-pods="true"
flags.go:52] FLAG: --stderrthreshold="0"
flags.go:52] FLAG: --test.bench=""
flags.go:52] FLAG: --test.benchmem="false"
flags.go:52] FLAG: --test.benchtime="1s"
flags.go:52] FLAG: --test.blockprofile=""
flags.go:52] FLAG: --test.blockprofilerate="1"
flags.go:52] FLAG: --test.count="1"
flags.go:52] FLAG: --test.coverprofile=""
flags.go:52] FLAG: --test.cpu=""
flags.go:52] FLAG: --test.cpuprofile=""
flags.go:52] FLAG: --test.failfast="false"
flags.go:52] FLAG: --test.list=""
flags.go:52] FLAG: --test.memprofile=""
flags.go:52] FLAG: --test.memprofilerate="0"
flags.go:52] FLAG: --test.mutexprofile=""
flags.go:52] FLAG: --test.mutexprofilefraction="1"
flags.go:52] FLAG: --test.outputdir=""
flags.go:52] FLAG: --test.parallel="8"
flags.go:52] FLAG: --test.run=""
flags.go:52] FLAG: --test.short="false"
flags.go:52] FLAG: --test.testlogfile=""
flags.go:52] FLAG: --test.timeout="0s"
flags.go:52] FLAG: --test.trace=""
flags.go:52] FLAG: --test.v="false"
flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
flags.go:52] FLAG: --v="4"
flags.go:52] FLAG: --vmodule=""
flags.go:52] FLAG: --write-status-configmap="true"
main.go:363] Cluster Autoscaler 1.16.0

I didn't see any other issues complaining of this problem. Given that this release has been out for seven days now, I assume someone else would have run into this same problem.

Reverting to our previous patched container image worked fine, but we'd like to move forward. Is there some new configuration that we need to adjust in order to restore the previous behavior of the cluster autoscaler?

seh commented Sep 17, 2019

For what it's worth, we're attempting to run this new CA version 1.16.0 on Kubernetes version 1.15.1. Would that version skew cause this change in behavior?

losipiuk (Contributor) commented:

For what it's worth, we're attempting to run this new CA version 1.16.0 on Kubernetes version 1.15.1. Would that version skew cause this change in behavior?

This is an unsupported and untested configuration. The CA simulates the k8s scheduler's behaviour, so the minor versions of the CA binary and of k8s should always match. That said, I am still a bit surprised that it does not work at all; I would rather expect bad handling of corner cases due to slight scheduler changes. All the E2E tests pass (https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling). Those run on GCE, yet the part which lists the pods is cloud-provider agnostic, so the fact that you are running it on AWS should not matter.

seh commented Sep 17, 2019

I'm still looking into this, mystified as I am. This morning I noticed something new: the cluster autoscaler container has failed its liveness probe 54 times overnight.

seh commented Sep 17, 2019

Checking the metrics published by the cluster autoscaler, I see one error emitted every ten seconds of type "cloudProviderError." These errors don't appear in the cluster autoscaler log file, though, even at verbosity level 6. I'll have to build a custom container image in order to see what the errors are about.
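
For anyone wanting to reproduce that check, here is a minimal Go sketch that scrapes the autoscaler's metrics endpoint and prints only the error counters. The metric name cluster_autoscaler_errors_total and the localhost:8085 address are assumptions on my part (the port comes from the --address flag above, and you would typically reach it via kubectl port-forward):

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Scrape the cluster autoscaler's Prometheus endpoint and print only the
// error counters, e.g. cluster_autoscaler_errors_total{type="cloudProviderError"}.
func main() {
	resp, err := http.Get("http://localhost:8085/metrics") // --address=":8085", port-forwarded locally
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "cluster_autoscaler_errors_total") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}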

seh commented Sep 17, 2019

The Autoscaler.RunOnce function fails periodically with the following error:

failed to build node infos for node groups: cannot find ASG for node i-0c0bb28525d36a503

Note, though, that there is indeed a Kubernetes node named "i-0c0bb28525d36a503" on an EC2 instance with the same ID. That happens to be one of our master machines, not intended to be under the cluster autoscaler's control. That machine sits within an ASG that is not tagged with "kubernetes.io/cluster-autoscaler/enabled." There's a second master machine from the same ASG that's also mentioned in similar errors.
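
As a sanity check, one way to confirm which ASG owns an instance and which tags that ASG carries is to query the Auto Scaling API directly. The sketch below uses the AWS SDK for Go; the instance ID is the one from the error above, and the region and credentials are assumed to come from the environment:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	instanceID := "i-0c0bb28525d36a503" // the master node named in the error

	// Assumes AWS_REGION and credentials are already configured in the environment.
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Find the ASG (if any) that owns this instance.
	inst, err := svc.DescribeAutoScalingInstances(&autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(inst.AutoScalingInstances) == 0 {
		fmt.Println("instance is not in any ASG")
		return
	}
	asgName := aws.StringValue(inst.AutoScalingInstances[0].AutoScalingGroupName)

	// List the ASG's tags; the autoscaler only manages ASGs carrying the keys
	// given in --node-group-auto-discovery above.
	groups, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("instance %s belongs to ASG %s\n", instanceID, asgName)
	for _, tag := range groups.AutoScalingGroups[0].Tags {
		fmt.Printf("  tag %s=%s\n", aws.StringValue(tag.Key), aws.StringValue(tag.Value))
	}
}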

seh commented Sep 17, 2019

It looks like #2247 changed what happens when the cluster autoscaler goes looking for information about a Kubernetes node that's not part of a managed ASG.

@Jeffwan, have you tested that change with a cluster that has ASGs not meant to be managed by the cluster autoscaler? There's a unit test for it (TestNodeGroupForNodeWithOutNodeGroup), but that doesn't capture the effects higher up in the various call paths leading down to (*awsCloudProvider).NodeGroupForNode.
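
The crux, as far as I can tell, is what NodeGroupForNode reports for a node that isn't in any managed ASG: a nil node group with no error lets the callers skip the node, while an error propagates upward and aborts the iteration. Here is a toy, self-contained sketch of that distinction; the types and names are hypothetical stand-ins, not the real provider code:

package main

import "fmt"

// Hypothetical stand-ins for the real cloud-provider types; this only
// illustrates the nil-versus-error distinction described above.
type NodeGroup struct{ Name string }

type provider struct {
	asgByInstance map[string]*NodeGroup // instance ID -> managed ASG
}

// nodeGroupForNode returns (nil, nil) for an instance outside any managed
// ASG, so callers treat it as "not ours" and skip it. Reporting an error
// instead is what surfaced as "cannot find ASG for node ...".
func (p *provider) nodeGroupForNode(instanceID string) (*NodeGroup, error) {
	return p.asgByInstance[instanceID], nil
}

func main() {
	p := &provider{asgByInstance: map[string]*NodeGroup{"i-worker": {Name: "workers-asg"}}}
	for _, id := range []string{"i-worker", "i-0c0bb28525d36a503"} {
		ng, err := p.nodeGroupForNode(id)
		switch {
		case err != nil:
			fmt.Println("lookup error, the caller aborts:", err)
		case ng == nil:
			fmt.Printf("%s is unmanaged (e.g. in an untagged master ASG), so skip it\n", id)
		default:
			fmt.Printf("%s belongs to ASG %s\n", id, ng.Name)
		}
	}
}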

seh commented Sep 17, 2019

Due to this early failure in (*StaticAutoscaler).RunOnce, we never make it down to the place where the autoscaler lists unschedulable pods, currently phrased as this statement:

unschedulablePods, err := unschedulablePodLister.List()

That explains why the log file never shows the autoscaler recognizing and doing anything about these pending pods.
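
To illustrate why the pending pods never show up in the logs, here is a simplified, self-contained sketch of that ordering; the function names are shortened stand-ins rather than the upstream identifiers:

package main

import (
	"errors"
	"fmt"
)

// Illustrative stubs only; the real functions live in the autoscaler core and
// the Kubernetes listers. They exist here so the ordering sketch compiles.
func getNodeInfosForGroups() (map[string]string, error) {
	return nil, errors.New("cannot find ASG for node i-0c0bb28525d36a503")
}

func listUnschedulablePods() ([]string, error) { return nil, nil }

// runOnce sketches the ordering inside the autoscaler's main iteration: the
// node-info step runs first, so its error returns early and the
// unschedulable-pod listing quoted above is never reached.
func runOnce() error {
	// Step 1: build per-node-group node infos; in 1.16.0 this is where the
	// "cannot find ASG for node ..." error surfaced on every iteration.
	if _, err := getNodeInfosForGroups(); err != nil {
		return fmt.Errorf("failed to build node infos for node groups: %v", err)
	}

	// Step 2: only reached when step 1 succeeds.
	unschedulablePods, err := listUnschedulablePods()
	if err != nil {
		return err
	}
	fmt.Printf("found %d unschedulable pods\n", len(unschedulablePods))
	// ... scale-up and scale-down logic would follow ...
	return nil
}

func main() {
	if err := runOnce(); err != nil {
		fmt.Println("RunOnce failed:", err)
	}
}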

seh commented Sep 17, 2019

Reverting #2247 atop version 1.16.0 allows the cluster autoscaler to work, insofar as it takes note of pending pods and adjusts ASG sizes to host them. I haven't seen #2161 occur myself, but assuming that it's a real problem too, #2247 is not a solution that addresses both of them.

Jeffwan commented Sep 17, 2019

@seh I did test this at the time I fixed the issue. Let me check out the latest changes and see whether there are any problems there.

Jeffwan commented Sep 18, 2019

@seh I looked at the problem, and I didn't cover your test case. The fix was for specific errors in 1.15, and since there's logic in master to handle the empty node group case, I think it's safe to revert #2247.

But before that, we need to cherry-pick this PR to 1.15, since it's still a problem there.

Jeffwan commented Sep 18, 2019

#2353

Jeffwan commented Sep 18, 2019

Please help file a PR to revert the changes in #2247, and I can approve it.

Jeffwan commented Sep 18, 2019

/platform aws

k8s-ci-robot (Contributor) commented:

@Jeffwan: The label(s) sig/aws cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/sig aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jeffwan commented Sep 18, 2019

/platform aws

losipiuk (Contributor) commented:

@Jeffwan is it safe to cherry-pick #2247 to 1.15? Wouldn't it create the same issue as @seh is seeing here with 1.16?

seh commented Sep 18, 2019

Please help file a PR to revert the changes in #2247, and I can approve it.

See #2357.

Jeffwan commented Sep 18, 2019

@Jeffwan is it safe to cherry-pick #2247 to 1.15? Wouldn't it create the same issue as @seh is seeing here with 1.16?

Hi @losipiuk. This PR was originally filed to resolve the issue in the 1.15 branch. We have not cherry-picked it to 1.15 yet. @seh's case may only exist in 1.16 and later versions. I need to do more tests on this.

Extra logic checking nodeGroup == nil was added in clusterstate.go here:

if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
	klog.Warningf("Nodegroup is nil for %s", unregistered.Node.Name)
	continue
}
perNgCopy := perNodeGroup[nodeGroup.Id()]

Let me double-check the differences between 1.15 and 1.16. It's safer to add the nodeGroup == nil logic to the 1.15 version. Thanks for the reminder.

seh commented Sep 20, 2019

Given the way that the cluster autoscaler releases track the Kubernetes version names so closely, can we expect this patch to appear in a container image release before Kubernetes version 1.16.1 is available?

losipiuk (Contributor) commented:

Yes. I will prepare the release over the weekend.

losipiuk (Contributor) commented:

Just to double-check: can you confirm that a version built from the tip of cluster-autoscaler-release-1.16 works fine?

seh commented Sep 20, 2019

What I had tested was an image built with my patch in it: the tip of the "cluster-autoscaler-release-1.16" branch plus that one reverting commit. I think that's a "yes" to your question.

losipiuk commented Sep 23, 2019

I released 1.16.1 (gcr.io/google-containers/cluster-autoscaler:v1.16.1). @seh can you verify that the issue is fixed in that release?

seh commented Sep 23, 2019

@seh can you verify that the issue is fixed in that release?

Testing this new cluster autoscaler image at version 1.16.1 with a Kubernetes cluster in AWS at version 1.15.1, I find that the autoscaler once again works as expected. Thank you for the prompt release.

losipiuk (Contributor) commented:

Thanks for testing, and for reporting the issue in the first place :)

Snehil03 commented Apr 12, 2021

Hello @seh @losipiuk,
I am facing an issue with the liveness probe again while using the image k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.7. Thanks for checking it again.

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  22m                   default-scheduler  Successfully assigned aws/cluster-autoscaler-aws-cluster-autoscaler-8486df49c9-bs9bm to ip-10-0-175-211.eu-west-1.compute.internal
  Warning  Unhealthy  21m                   kubelet            Liveness probe failed: Get http://10.0.154.48:8085/health-check: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Normal   Started    20m (x4 over 22m)     kubelet            Started container aws-cluster-autoscaler
  Normal   Pulled     19m (x5 over 22m)     kubelet            Container image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.7" already present on machine
  Normal   Created    19m (x5 over 22m)     kubelet            Created container aws-cluster-autoscaler
  Warning  BackOff    2m17s (x92 over 21m)  kubelet            Back-off restarting failed container 
