
README: Add note about cluster-autoscaler not supporting multiple AZs #647

Merged 2 commits into eksctl-io:master from readme-autoscaler-single-az on Mar 25, 2019

Conversation

@mgalgs (Contributor) commented Mar 19, 2019

Description

As discussed on Slack, cluster-autoscaler doesn't support ASGs that span multiple AZs. Made a few clarifying notes in the README to that effect.

Checklist

  • Added/modified documentation as required (such as the README.md, and examples directory)
  • Added yourself to the humans.txt file

@whereisaaron commented Mar 20, 2019

This is often said but not entirely true. We use cluster-autoscaler with multi-AZ ASGs all the time and it works perfectly. This is because we don't have any AZ-specific dependencies in our workloads, e.g. all of our PVC volume types can be mounted in any AZ. Failure-zone Pod anti-affinity could also be an issue, but we generally only have soft/preferred anti-affinity rules.

The mechanism/'issue' is just as explained. The cluster-autoscaler takes a random node and assesses whether another node like that would enable the 'Pending' Pod to be scheduled. If it would, then it asks AWS to make that ASG larger. Of course the ASG could add a node in any AZ (favoring balance). But if your workload doesn't care about the AZ, then there is just no problem with this mechanism, and cluster-autoscaler is perfect for multi-AZ ASGs.

Because overall our ASGs and workloads are very AZ-balanced, even our soft Pod anti-affinity is almost always satisfied.

If your workloads are all single-AZ PVCs and hard anti-affinity requirements (e.g. etcd or other quorum hosting), then the advice to have single AZ node pools is of course completely valid.
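
To make that concrete, the kind of rule that does force per-AZ node pools is a 'hard' zone anti-affinity along these lines (a minimal sketch; the app: etcd label is just illustrative, and newer clusters use topology.kubernetes.io/zone in place of the older failure-domain.beta.kubernetes.io/zone key):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: etcd                                   # illustrative label
        # hard requirement: replicas must land in different zones
        topologyKey: failure-domain.beta.kubernetes.io/zone

With a spec like this, cluster-autoscaler can't know which AZ a new instance from a multi-AZ ASG would land in, so it can't guarantee the Pending pod becomes schedulable.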

@mumoshu (Contributor) commented Mar 20, 2019

@mgalgs Hey! Thanks for your contribution.

Yep, I believe that @whereisaaron's explanation is valid, too. You may already have read it, but for more context, I'm sharing the original discussion regarding the gotcha of CA: kubernetes-retired/contrib#1552 (comment)

Maybe we'd better add a dedicated section in the README for this?

I'm not a good writer but I'd propose something like the below as a foundation:


Ensure that you have a separate nodegroup per availability zone when your workload is zone-aware!

By design, cluster-autoscaler is unable to reliably add the necessary nodes when you have a nodegroup that spans multiple AZs.

To create a separate nodegroup per AZ, just replicate your nodegroup config for each AZ.

BEFORE:

nodeGroups:
  - name: ng1-public
    instanceType: m5.xlarge
    # availabilityZones: ["eu-west-2a", "eu-west-2b"]

AFTER:

nodeGroups:
  - name: ng1-public-2a
    instanceType: m5.xlarge
    availabilityZones: ["eu-west-2a"]
  - name: ng1-public-2b
    instanceType: m5.xlarge
    availabilityZones: ["eu-west-2b"]

@errordeveloper (Contributor) commented:

Yes, it sounds like it should be up to the user whether to use a single AZ or not; we should just make a note of how to do it, in case they think they must.

@mgalgs force-pushed the readme-autoscaler-single-az branch from 617ce43 to e9c5a9e on March 23, 2019 22:48
@mgalgs (Contributor, Author) commented Mar 23, 2019

Thanks for the feedback! I totally agree that we should inform the user about the constraints and let them make a decision. I've revised my PR based on @mumoshu's draft.

@mgalgs force-pushed the readme-autoscaler-single-az branch from e9c5a9e to 2e3e607 on March 23, 2019 22:51
@whereisaaron commented:

Cheers @mgalgs. Suggestions and explanations:

  • The AZRebalance scaling process is [suspended]

There is no need to do this. If your workload is not AZ-specific, then by definition it doesn't mind being re-balanced. This setting would be a work-around if you have (unbalanced) AZ-specific requests that drive unbalanced ASGs and you don't want a re-balance undoing that. But in that case you should be using per-AZ ASGs anyway, as your other criteria recommend.

  • No required podAffinity with topology other than host
  • No required nodeAffinity on zone label
  • No nodeSelector on a zone label

'Soft' affinity requirements that use preferredDuringSchedulingIgnoredDuringExecution do not prevent scheduling even if they are not satisfied, so again they are not a problem in a multi-AZ ASG. It is a problem to use 'hard' affinity requirements that use requiredDuringSchedulingIgnoredDuringExecution. A nodeSelector is also a form of 'hard' affinity.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
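
For comparison, the 'soft' form that remains compatible with a multi-AZ ASG looks roughly like this (a sketch; the app: web label is a placeholder):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web                                  # placeholder label
          # preference only: the scheduler spreads across zones when it can,
          # but still places the pod when it can't
          topologyKey: failure-domain.beta.kubernetes.io/zone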

  • Never scale any zone to 0

You can certainly scale to and from zero nodes with a multi-AZ ASG - on AWS at least. This is because you can add the labels needed by node/affinity selectors as AWS tags on the ASG. The cluster-autoscaler will use those tags to determine whether making that ASG larger would enable the pending pod to be scheduled (in place of inspecting a random ASG instance, since there are none). Thus, so long as your node selector / affinity is not requesting a particular failure domain (AZ), you are still sweet. I've done and tested this with multi-AZ ASGs and the cluster-autoscaler.
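
For reference, a sketch of what that looks like in eksctl terms (the ng2-workers name and workload-type label are hypothetical; the k8s.io/cluster-autoscaler/node-template/label/... tag key is the convention documented by cluster-autoscaler's AWS provider):

nodeGroups:
  - name: ng2-workers
    instanceType: m5.xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 0
    labels:
      workload-type: general
    tags:
      # mirrors the node label onto the ASG so cluster-autoscaler can tell,
      # even at zero instances, that a scale-up would satisfy the selector
      k8s.io/cluster-autoscaler/node-template/label/workload-type: general

As long as the pod's selector asks only for that label and not a specific zone, either AZ's capacity will do.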

@mgalgs force-pushed the readme-autoscaler-single-az branch from 2e3e607 to df9a1d0 on March 24, 2019 16:43
@mgalgs (Contributor, Author) commented Mar 24, 2019

@whereisaaron thanks for the feedback, I've incorporated your suggestions.

You can certainly scale to and from zero nodes with a multi-AZ ASG - on AWS at least.

Got it. I was pretty much blindly transcribing the comments I got from a CA contributor (here), but that makes sense.

@whereisaaron commented:

Yep, I read it @mgalgs. I think that is sensible advice for how you might possibly be able to get it working if you do have AZ-specific workloads/resources. But I'd say don't even try that, just have per-AZ ASGs in that situation.

@errordeveloper (Contributor) left a review:

LGTM!

@errordeveloper (Contributor) commented:

Thanks a lot @mgalgs for contributing this, and thanks @whereisaaron and @mumoshu for the review!

@errordeveloper merged commit 6e0136a into eksctl-io:master on Mar 25, 2019
@mgalgs deleted the readme-autoscaler-single-az branch on March 25, 2019 14:21