Cluster autoscaler version 1.16.0 doesn't notice pending pods #2345

Closed
seh opened this issue Sep 16, 2019 · 26 comments · Fixed by #2357

seh commented Sep 16, 2019

We've been using cluster autoscaler version 1.15.0 patched with #2008 for some time in AWS, to good effect. Today we attempted to put the new version 1.16.0 into service using all the same configuration, and found that it no longer seems to notice pending pods.

The cluster autoscaler starts fine, and the logs don't indicate anything failing. It goes through its periodic "main loop" and the "Regenerating instance to ASG map for ASGs" step regularly, again without any obvious problems. However, when we create pods that require that the cluster autoscaler take note and adjust a suitable ASG's size, the cluster autoscaler's logs don't show any evidence of it noticing these pods. In prior versions, we see messages to the following effect:

  • Pod ns/name is unschedulable
  • Pod name can't be scheduled on node, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient nvidia.com/gpu
  • Event(v1.ObjectReference{Kind:"Pod", Namespace:"ns", Name:"name", UID:"ae32b9cc-081e-4323-ba33-7810457a0ddf", APIVersion:"v1", ResourceVersion:"58735432", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{asg 0->13 (max: 125)}]

Instead, the new cluster autoscaler exhibits no reaction to these pending pods.

Here are the flag values reported at start time:

flags.go:52] FLAG: --add-dir-header="false"
flags.go:52] FLAG: --address=":8085"
flags.go:52] FLAG: --alsologtostderr="false"
flags.go:52] FLAG: --balance-similar-node-groups="false"
flags.go:52] FLAG: --cloud-config=""
flags.go:52] FLAG: --cloud-provider="aws"
flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
flags.go:52] FLAG: --cluster-name=""
flags.go:52] FLAG: --cores-total="0:320000"
flags.go:52] FLAG: --estimator="binpacking"
flags.go:52] FLAG: --expander="least-waste"
flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
flags.go:52] FLAG: --filter-out-schedulable-pods-uses-packing="true"
flags.go:52] FLAG: --gpu-total="[]"
flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
flags.go:52] FLAG: --ignore-taint="[]"
flags.go:52] FLAG: --kubeconfig=""
flags.go:52] FLAG: --kubernetes=""
flags.go:52] FLAG: --leader-elect="true"
flags.go:52] FLAG: --leader-elect-lease-duration="15s"
flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
flags.go:52] FLAG: --leader-elect-resource-name=""
flags.go:52] FLAG: --leader-elect-resource-namespace=""
flags.go:52] FLAG: --leader-elect-retry-period="2s"
flags.go:52] FLAG: --log-backtrace-at=":0"
flags.go:52] FLAG: --log-dir=""
flags.go:52] FLAG: --log-file=""
flags.go:52] FLAG: --log-file-max-size="1800"
flags.go:52] FLAG: --logtostderr="true"
flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
flags.go:52] FLAG: --max-empty-bulk-delete="10"
flags.go:52] FLAG: --max-failing-time="15m0s"
flags.go:52] FLAG: --max-graceful-termination-sec="600"
flags.go:52] FLAG: --max-inactivity="10m0s"
flags.go:52] FLAG: --max-node-provision-time="3m0s"
flags.go:52] FLAG: --max-nodes-total="0"
flags.go:52] FLAG: --max-total-unready-percentage="45"
flags.go:52] FLAG: --memory-total="0:6400000"
flags.go:52] FLAG: --min-replica-count="0"
flags.go:52] FLAG: --namespace="our-system"
flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=kubernetes.io/cluster-autoscaler/enabled,kubernetes.io/cluster/redacted]"
flags.go:52] FLAG: --nodes="[]"
flags.go:52] FLAG: --ok-total-unready-count="3"
flags.go:52] FLAG: --regional="false"
flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
flags.go:52] FLAG: --scale-down-delay-after-add="3m0s"
flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
flags.go:52] FLAG: --scale-down-enabled="true"
flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
flags.go:52] FLAG: --scale-down-non-empty-candidates-count="50"
flags.go:52] FLAG: --scale-down-unneeded-time="13m0s"
flags.go:52] FLAG: --scale-down-unready-time="7m0s"
flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
flags.go:52] FLAG: --scale-up-from-zero="true"
flags.go:52] FLAG: --scan-interval="10s"
flags.go:52] FLAG: --skip-headers="false"
flags.go:52] FLAG: --skip-log-headers="false"
flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
flags.go:52] FLAG: --skip-nodes-with-system-pods="true"
flags.go:52] FLAG: --stderrthreshold="0"
flags.go:52] FLAG: --test.bench=""
flags.go:52] FLAG: --test.benchmem="false"
flags.go:52] FLAG: --test.benchtime="1s"
flags.go:52] FLAG: --test.blockprofile=""
flags.go:52] FLAG: --test.blockprofilerate="1"
flags.go:52] FLAG: --test.count="1"
flags.go:52] FLAG: --test.coverprofile=""
flags.go:52] FLAG: --test.cpu=""
flags.go:52] FLAG: --test.cpuprofile=""
flags.go:52] FLAG: --test.failfast="false"
flags.go:52] FLAG: --test.list=""
flags.go:52] FLAG: --test.memprofile=""
flags.go:52] FLAG: --test.memprofilerate="0"
flags.go:52] FLAG: --test.mutexprofile=""
flags.go:52] FLAG: --test.mutexprofilefraction="1"
flags.go:52] FLAG: --test.outputdir=""
flags.go:52] FLAG: --test.parallel="8"
flags.go:52] FLAG: --test.run=""
flags.go:52] FLAG: --test.short="false"
flags.go:52] FLAG: --test.testlogfile=""
flags.go:52] FLAG: --test.timeout="0s"
flags.go:52] FLAG: --test.trace=""
flags.go:52] FLAG: --test.v="false"
flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
flags.go:52] FLAG: --v="4"
flags.go:52] FLAG: --vmodule=""
flags.go:52] FLAG: --write-status-configmap="true"
main.go:363] Cluster Autoscaler 1.16.0

I didn't see any other issues complaining of this problem. Given that this release has been out for seven days now, I assume someone else would have run into this same problem.

Reverting to our previous patched container image worked fine, but we'd like to move forward. Is there some new configuration that we need to adjust in order to restore the previous behavior of the cluster autoscaler?

seh commented Sep 17, 2019

For what it's worth, we're attempting to run this new CA version 1.16.0 on Kubernetes version 1.15.1. Would that version skew cause this change in behavior?

losipiuk (Contributor) commented:

For what it's worth, we're attempting to run this new CA version 1.16.0 on Kubernetes version 1.15.1. Would that version skew cause this change in behavior?

This is an unsupported and untested configuration. The CA simulates the k8s scheduler's behaviour, so the minor versions of the CA binary and of k8s should always match. That said, I am still a bit surprised that it does not work at all; I would rather expect bad handling of corner cases due to slight scheduler changes. All the E2E tests pass (https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling). Those run on GCE, yet the part which lists the pods is cloud-provider agnostic, so the fact that you are running it on AWS should not matter.

seh commented Sep 17, 2019

I'm still looking into this, mystified as I am. This morning I noticed something new: the cluster autoscaler container has failed its liveness probe 54 times overnight.

seh commented Sep 17, 2019

Checking the metrics published by the cluster autoscaler, I see one error emitted every ten seconds of type "cloudProviderError." These errors don't appear in the cluster autoscaler log file, though, even at verbosity level 6. I'll have to build a custom container image in order to see what the errors are about.
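
For anyone wanting to reproduce that check, here is a minimal Go sketch that scrapes the autoscaler's metrics endpoint and prints only the error counters. The metric name cluster_autoscaler_errors_total and the localhost:8085 address are assumptions on my part (the port comes from the --address flag above, and you would typically reach it via kubectl port-forward):

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Scrape the cluster autoscaler's Prometheus endpoint and print only the
// error counters, e.g. cluster_autoscaler_errors_total{type="cloudProviderError"}.
func main() {
	resp, err := http.Get("http://localhost:8085/metrics") // --address=":8085", port-forwarded locally
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "cluster_autoscaler_errors_total") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}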

seh commented Sep 17, 2019

The Autoscaler.RunOnce function fails periodically with the following error:

failed to build node infos for node groups: cannot find ASG for node i-0c0bb28525d36a503

Note, though, that there is indeed a Kubernetes node named "i-0c0bb28525d36a503" on an EC2 instance with the same ID. That happens to be one of our master machines, not intended to be under the cluster autoscaler's control. That machine sits within an ASG that is not tagged with "kubernetes.io/cluster-autoscaler/enabled." There's a second master machine from the same ASG that's also mentioned in similar errors.
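
As a sanity check, one way to confirm which ASG owns an instance and which tags that ASG carries is to query the Auto Scaling API directly. The sketch below uses the AWS SDK for Go; the instance ID is the one from the error above, and the region and credentials are assumed to come from the environment:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	instanceID := "i-0c0bb28525d36a503" // the master node named in the error

	// Assumes AWS_REGION and credentials are already configured in the environment.
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Find the ASG (if any) that owns this instance.
	inst, err := svc.DescribeAutoScalingInstances(&autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(inst.AutoScalingInstances) == 0 {
		fmt.Println("instance is not in any ASG")
		return
	}
	asgName := aws.StringValue(inst.AutoScalingInstances[0].AutoScalingGroupName)

	// List the ASG's tags; the autoscaler only manages ASGs carrying the keys
	// given in --node-group-auto-discovery above.
	groups, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("instance %s belongs to ASG %s\n", instanceID, asgName)
	for _, tag := range groups.AutoScalingGroups[0].Tags {
		fmt.Printf("  tag %s=%s\n", aws.StringValue(tag.Key), aws.StringValue(tag.Value))
	}
}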

seh commented Sep 17, 2019

It looks like #2247 changed what happens when the cluster autoscaler goes looking for information about a Kubernetes node that's not part of a managed ASG.

@Jeffwan, have you tested that change with a cluster that has ASGs not meant to be managed by the cluster autoscaler? There's a unit test for it (TestNodeGroupForNodeWithOutNodeGroup), but that doesn't capture the effects higher up in the various call paths leading down to (*awsCloudProvider).NodeGroupForNode.
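
The crux, as far as I can tell, is what NodeGroupForNode reports for a node that isn't in any managed ASG: a nil node group with no error lets the callers skip the node, while an error propagates upward and aborts the iteration. Here is a toy, self-contained sketch of that distinction; the types and names are hypothetical stand-ins, not the real provider code:

package main

import "fmt"

// Hypothetical stand-ins for the real cloud-provider types; this only
// illustrates the nil-versus-error distinction described above.
type NodeGroup struct{ Name string }

type provider struct {
	asgByInstance map[string]*NodeGroup // instance ID -> managed ASG
}

// nodeGroupForNode returns (nil, nil) for an instance outside any managed
// ASG, so callers treat it as "not ours" and skip it. Reporting an error
// instead is what surfaced as "cannot find ASG for node ...".
func (p *provider) nodeGroupForNode(instanceID string) (*NodeGroup, error) {
	return p.asgByInstance[instanceID], nil
}

func main() {
	p := &provider{asgByInstance: map[string]*NodeGroup{"i-worker": {Name: "workers-asg"}}}
	for _, id := range []string{"i-worker", "i-0c0bb28525d36a503"} {
		ng, err := p.nodeGroupForNode(id)
		switch {
		case err != nil:
			fmt.Println("lookup error, the caller aborts:", err)
		case ng == nil:
			fmt.Printf("%s is unmanaged (e.g. in an untagged master ASG), so skip it\n", id)
		default:
			fmt.Printf("%s belongs to ASG %s\n", id, ng.Name)
		}
	}
}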

seh commented Sep 17, 2019

Due to this early failure in (*StaticAutoscaler).RunOnce, we never make it down to the place where the autoscaler lists unschedulable pods, currently phrased as this statement:

unschedulablePods, err := unschedulablePodLister.List()

That explains why the log file never shows the autoscaler recognizing and doing anything about these pending pods.
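
To illustrate why the pending pods never show up in the logs, here is a simplified, self-contained sketch of that ordering; the function names are shortened stand-ins rather than the upstream identifiers:

package main

import (
	"errors"
	"fmt"
)

// Illustrative stubs only; the real functions live in the autoscaler core and
// the Kubernetes listers. They exist here so the ordering sketch compiles.
func getNodeInfosForGroups() (map[string]string, error) {
	return nil, errors.New("cannot find ASG for node i-0c0bb28525d36a503")
}

func listUnschedulablePods() ([]string, error) { return nil, nil }

// runOnce sketches the ordering inside the autoscaler's main iteration: the
// node-info step runs first, so its error returns early and the
// unschedulable-pod listing quoted above is never reached.
func runOnce() error {
	// Step 1: build per-node-group node infos; in 1.16.0 this is where the
	// "cannot find ASG for node ..." error surfaced on every iteration.
	if _, err := getNodeInfosForGroups(); err != nil {
		return fmt.Errorf("failed to build node infos for node groups: %v", err)
	}

	// Step 2: only reached when step 1 succeeds.
	unschedulablePods, err := listUnschedulablePods()
	if err != nil {
		return err
	}
	fmt.Printf("found %d unschedulable pods\n", len(unschedulablePods))
	// ... scale-up and scale-down logic would follow ...
	return nil
}

func main() {
	if err := runOnce(); err != nil {
		fmt.Println("RunOnce failed:", err)
	}
}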

seh commented Sep 17, 2019

Reverting #2247 atop version 1.16.0 allows the cluster autoscaler to work, insofar as it takes note of pending pods and adjusts ASG sizes to host them. I haven't seen #2161 occur myself, but assuming that it's a real problem too, #2247 is not a solution that addresses both of them.

Jeffwan commented Sep 17, 2019

@seh I did test this at the time I fixed the issue. Let me check out the latest changes and see whether there are any problems there.

Jeffwan commented Sep 18, 2019

@seh I looked at the problem, and I didn't cover your test case. The fix was for specific errors in 1.15, and since there's logic in master to handle the empty node group case, I think it's safe to revert #2247.

But before that, we need to cherry-pick this PR to 1.15, since it's still a problem there.

Jeffwan commented Sep 18, 2019

#2353

Jeffwan commented Sep 18, 2019

Please help file a PR to revert the changes in #2247, and I can approve it.

Jeffwan commented Sep 18, 2019

/platform aws

k8s-ci-robot (Contributor) commented:

@Jeffwan: The label(s) sig/aws cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/sig aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jeffwan commented Sep 18, 2019

/platform aws

losipiuk (Contributor) commented:

@Jeffwan is it safe to cherry-pick #2247 to 1.15? Wouldn't it create the same issue as @seh is seeing here with 1.16?

seh commented Sep 18, 2019

Please help file a PR to revert the changes in #2247, and I can approve it.

See #2357.

Jeffwan commented Sep 18, 2019

@Jeffwan is it safe to cherry-pick #2247 to 1.15? Wouldn't it create the same issue as @seh is seeing here with 1.16?

Hi @losipiuk. This PR was originally filed to resolve the issue in the 1.15 branch. We have not cherry-picked it to 1.15 yet. @seh's case may only exist in 1.16 and later versions. I need to do more tests on this.

Extra logic checking nodeGroup == nil was added in clusterstate.go here:

if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
	klog.Warningf("Nodegroup is nil for %s", unregistered.Node.Name)
	continue
}
perNgCopy := perNodeGroup[nodeGroup.Id()]

Let me double-check the differences between 1.15 and 1.16. It's safer to add the nodeGroup == nil logic to the 1.15 version. Thanks for the reminder.

seh commented Sep 20, 2019

Given the way that the cluster autoscaler releases track the Kubernetes version names so closely, can we expect this patch to appear in a container image release before Kubernetes version 1.16.1 is available?

losipiuk (Contributor) commented:

Yes. I will prepare the release over the weekend.

losipiuk (Contributor) commented:

Just to double-check: can you confirm that a version built from the tip of cluster-autoscaler-release-1.16 works fine?

seh commented Sep 20, 2019

What I had tested was an image built with my patch in it: the tip of the "cluster-autoscaler-release-1.16" branch plus that one reverting commit. I think that's a "yes" to your question.

losipiuk commented Sep 23, 2019

I released 1.16.1 (gcr.io/google-containers/cluster-autoscaler:v1.16.1). @seh can you verify that the issue is fixed in that release?

seh commented Sep 23, 2019

@seh can you verify that the issue is fixed in that release?

Testing this new cluster autoscaler image at version 1.16.1 with a Kubernetes cluster in AWS at version 1.15.1, I find that the autoscaler once again works as expected. Thank you for the prompt release.

losipiuk (Contributor) commented:

Thanks for testing, and for reporting the issue in the first place :)

Snehil03 commented Apr 12, 2021

Hello @seh @losipiuk,
I am facing an issue with the liveness probe again while using the image k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.7. Thanks for checking it again.

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  22m                   default-scheduler  Successfully assigned aws/cluster-autoscaler-aws-cluster-autoscaler-8486df49c9-bs9bm to ip-10-0-175-211.eu-west-1.compute.internal
  Warning  Unhealthy  21m                   kubelet            Liveness probe failed: Get http://10.0.154.48:8085/health-check: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Normal   Started    20m (x4 over 22m)     kubelet            Started container aws-cluster-autoscaler
  Normal   Pulled     19m (x5 over 22m)     kubelet            Container image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.7" already present on machine
  Normal   Created    19m (x5 over 22m)     kubelet            Created container aws-cluster-autoscaler
  Warning  BackOff    2m17s (x92 over 21m)  kubelet            Back-off restarting failed container 
