
README: Add note about cluster-autoscaler not supporting multiple AZs #647

Merged 2 commits into eksctl-io:master from readme-autoscaler-single-az on Mar 25, 2019

Conversation

@mgalgs (Contributor) commented Mar 19, 2019

Description

As discussed on Slack, cluster-autoscaler doesn't support ASGs that span multiple AZs. Made a few clarifying notes in the README to that effect.

Checklist

  • Added/modified documentation as required (such as the README.md, and examples directory)
  • Added yourself to the humans.txt file

@whereisaaron commented Mar 20, 2019

This is often said but not entirely true. We use cluster-autoscaler with multi-AZ ASGs all the time and it works perfectly. This is because we don't have any AZ-specific dependencies in our workloads, e.g. all of our PVC volume types can be mounted in any AZ. Failure-zone Pod anti-affinity could also be an issue, but we generally only have soft/preferred anti-affinity rules.

The mechanism/'issue' is just as explained. The cluster-autoscaler takes a random node and assesses whether another node like that would enable the 'Pending' Pod to be scheduled. If it would, then it asks AWS to make that ASG larger. Of course the ASG could add a node in any AZ (favoring balance). But if your workload doesn't care about the AZ, then there is just no problem with this mechanism, and cluster-autoscaler is perfect for multi-AZ ASGs.

Because overall our ASGs and workloads are very AZ-balanced, even our soft Pod anti-affinity is almost always satisfied.

If your workloads are all single-AZ PVCs and hard anti-affinity requirements (e.g. etcd or other quorum hosting), then the advice to have single AZ node pools is of course completely valid.
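
To make that concrete, the kind of rule that does force per-AZ node pools is a 'hard' zone anti-affinity along these lines (a minimal sketch; the app: etcd label is just illustrative, and newer clusters use topology.kubernetes.io/zone in place of the older failure-domain.beta.kubernetes.io/zone key):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: etcd                                   # illustrative label
        # hard requirement: replicas must land in different zones
        topologyKey: failure-domain.beta.kubernetes.io/zone

With a spec like this, cluster-autoscaler can't know which AZ a new instance from a multi-AZ ASG would land in, so it can't guarantee the Pending pod becomes schedulable.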

@mumoshu (Contributor) commented Mar 20, 2019

@mgalgs Hey! Thanks for your contribution.

Yep, I believe that @whereisaaron's explanation is valid, too. You may already have read it, but for more context, I'm sharing the original discussion regarding the gotcha of CA: kubernetes-retired/contrib#1552 (comment)

Maybe we'd better add a dedicated section in the README for this?

I'm not a good writer but I'd propose something like the below as a foundation:


Ensure that you have a separate nodegroup per availability zone when your workload is zone-aware!

By design, cluster-autoscaler is unable to reliably add the necessary nodes when you have a nodegroup that spans multiple AZs.

To create a separate nodegroup per AZ, just replicate your nodegroup config for each AZ.

BEFORE:

nodeGroups:
  - name: ng1-public
    instanceType: m5.xlarge
    # availabilityZones: ["eu-west-2a", "eu-west-2b"]

AFTER:

nodeGroups:
  - name: ng1-public-2a
    instanceType: m5.xlarge
    availabilityZones: ["eu-west-2a"]
  - name: ng1-public-2b
    instanceType: m5.xlarge
    availabilityZones: ["eu-west-2b"]

@errordeveloper (Contributor) commented:

Yes, it sounds like it should be up to the user whether to use a single AZ or not; we should just make a note of how to do it, in case they think they must.

@mgalgs force-pushed the readme-autoscaler-single-az branch from 617ce43 to e9c5a9e on March 23, 2019 22:48
@mgalgs (Contributor, Author) commented Mar 23, 2019

Thanks for the feedback! I totally agree that we should inform the user about the constraints and let them make a decision. I've revised my PR based on @mumoshu's draft.

@mgalgs force-pushed the readme-autoscaler-single-az branch from e9c5a9e to 2e3e607 on March 23, 2019 22:51
@whereisaaron commented:

Cheers @mgalgs. Suggestions and explanations:

  • The AZRebalance scaling process is [suspended]

There is no need to do this. If your workload is not AZ-specific, then by definition it doesn't mind being re-balanced. This setting would be a work-around if you have (unbalanced) AZ-specific requests that drive unbalanced ASGs and you don't want a re-balance undoing that. But in that case you should be using per-AZ ASGs anyway, as your other criteria recommend.

  • No required podAffinity with topology other than host
  • No required nodeAffinity on zone label
  • No nodeSelector on a zone label

'Soft' affinity requirements that use preferredDuringSchedulingIgnoredDuringExecution do not prevent scheduling even if they are not satisfied, so again they are not a problem in a multi-AZ ASG. It is a problem to use 'hard' affinity requirements that use requiredDuringSchedulingIgnoredDuringExecution. A nodeSelector is also a form of 'hard' affinity.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
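
For comparison, the 'soft' form that remains compatible with a multi-AZ ASG looks roughly like this (a sketch; the app: web label is a placeholder):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web                                  # placeholder label
          # preference only: the scheduler spreads across zones when it can,
          # but still places the pod when it can't
          topologyKey: failure-domain.beta.kubernetes.io/zone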

  • Never scale any zone to 0

You can certainly scale to and from zero nodes with a multi-AZ ASG - on AWS at least. This is because you can add the labels needed by node/affinity selectors as AWS tags on the ASG. The cluster-autoscaler will use those tags to determine whether making that ASG larger would enable the pending pod to be scheduled (in place of inspecting a random ASG instance, since there are none). Thus, so long as your node selector / affinity is not requesting a particular failure domain (AZ), you are still sweet. I've done and tested this with multi-AZ ASGs and the cluster-autoscaler.
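
For reference, a sketch of what that looks like in eksctl terms (the ng2-workers name and workload-type label are hypothetical; the k8s.io/cluster-autoscaler/node-template/label/... tag key is the convention documented by cluster-autoscaler's AWS provider):

nodeGroups:
  - name: ng2-workers
    instanceType: m5.xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 0
    labels:
      workload-type: general
    tags:
      # mirrors the node label onto the ASG so cluster-autoscaler can tell,
      # even at zero instances, that a scale-up would satisfy the selector
      k8s.io/cluster-autoscaler/node-template/label/workload-type: general

As long as the pod's selector asks only for that label and not a specific zone, either AZ's capacity will do.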

@mgalgs force-pushed the readme-autoscaler-single-az branch from 2e3e607 to df9a1d0 on March 24, 2019 16:43
@mgalgs (Contributor, Author) commented Mar 24, 2019

@whereisaaron thanks for the feedback, I've incorporated your suggestions.

You can certainly scale to and from zero nodes with a multi-AZ ASG - on AWS at least.

Got it. I was pretty much blindly transcribing the comments I got from a CA contributor (here), but that makes sense.

@whereisaaron commented:

Yep, I read it @mgalgs. I think that is sensible advice for how you might possibly be able to get it working if you do have AZ-specific workloads/resources. But I'd say don't even try that, just have per-AZ ASGs in that situation.

@errordeveloper (Contributor) left a review:

LGTM!

@errordeveloper (Contributor) commented:

Thanks a lot @mgalgs for contributing this, and thanks @whereisaaron and @mumoshu for the review!

@errordeveloper merged commit 6e0136a into eksctl-io:master on Mar 25, 2019
@mgalgs deleted the readme-autoscaler-single-az branch on March 25, 2019 14:21