This repository has been archived by the owner on Apr 17, 2019. It is now read-only.

AWS Cluster Autoscaler README #1552

Merged
merged 1 commit into kubernetes-retired:master from andrewsykim:aws_docs on Aug 25, 2016

Conversation

@andrewsykim commented Aug 17, 2016

under #1311



@andrewsykim (Author)

@mwielgus @pbitty @osxi @iterion r?


## Deployment Specification
Your deployment configuration should look something like this:
```
@iterion Aug 17, 2016

You can hint to GitHub that this should be formatted as YAML with "```yaml". Same with the JSON above.

@andrewsykim (Author)

Cool, didn't know that.

```
Note:
- The `/etc/ssl/certs/ca-certificates.crt` file should exist by default on your EC2 instance (see the sketch below for how it is mounted).
- At the time of writing, the cluster autoscaler is unaware of availability zones; the availability zone of each instance is determined by its autoscaling group. Although autoscaling groups can contain instances in multiple availability zones, the autoscaling group should span exactly one availability zone for the cluster autoscaler to work.
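(The YAML example itself is not rendered in this view. For orientation, here is a minimal sketch of what such a deployment might look like; the manifest fields, the `--cloud-provider=aws` and `--nodes=MIN:MAX:ASG_NAME` flags, the `AWS_REGION` environment variable, and the ASG name `my-asg` are illustrative assumptions rather than the README's actual example, and the image tag is the beta one mentioned later in this thread.)

```yaml
apiVersion: extensions/v1beta1   # Deployment API group in use at the time (2016)
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          # Beta image mentioned later in this thread; swap in the release image when available.
          image: gcr.io/google_containers/cluster-autoscaler:v0.3.0-beta2
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=aws
            # MIN:MAX:ASG_NAME; the ASG should span a single availability zone.
            - --nodes=1:10:my-asg
          env:
            - name: AWS_REGION
              value: us-east-1   # placeholder region
          volumeMounts:
            # Matches the note above: this file should already exist on the EC2 instance.
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: /etc/ssl/certs/ca-certificates.crt
```

The ASG name, node counts, and region are placeholders; substitute your own values.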
@mumoshu (Contributor) Aug 19, 2016

Hi, really excited to see this PR!!

Excuse me if I'm misreading, but I couldn't figure out why the autoscaling group should span one availability zone for the cluster autoscaler to work.

As noted, the cluster autoscaler is unaware of AZs, but an ASG is aware of them. If configured, spreading EC2 instances over multiple AZs is handled completely out of band by AWS AutoScaling. So I believe you can just assign two or more AZs to an ASG to effectively make the cluster autoscaler multi-AZ aware, as long as the AWS implementation for Cluster Autoscaler delegates adding/removing instances to AWS AutoScaling.

And it seems it does (according to https://github.com/kubernetes/contrib/pull/1377/files#diff-ade7b95627ea0dd6b6f4deee7f24fa7eR124, it calls SetDesiredCapacity to delegate adding/removing instances).
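(As a side note on where the zone information lives: the AZ list is a property of the ASG itself, not of the cluster autoscaler. A hypothetical CloudFormation-style fragment, purely for illustration and not part of this PR, would put it here, while the autoscaler only ever adjusts the desired capacity.)

```yaml
# Hypothetical fragment: the multi-AZ spread is configured on the ASG,
# completely out of band from the cluster autoscaler.
WorkerGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    LaunchConfigurationName: worker-launch-config   # placeholder
    MinSize: "1"
    MaxSize: "10"
    AvailabilityZones:
      - us-east-1a
      - us-east-1b
```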

Contributor

Hey @mumoshu, I can't really speak to multi-AZ setups since our use case was a single-AZ ASG but that sounds right. You're correct regarding SetDesiredCapacity for instance creation, but this is where the instances are deleted.

@pbitty

@mumoshu, it will likely work with a multi-AZ ASG, but there may be some caveats if Kubernetes is configured to be zone-aware. Pardon the long-winded explanation. Here is the reasoning:

The cluster-autoscaler asks the AWS CloudProvider for a sample Node from the NodeGroup (backed by an ASG) and uses it to make scaling decisions. It assumes that the sample Node is equivalent to all other nodes in the ASG, e.g. same instance type, storage, etc. When it needs to scale up, for example, it will know with certainty that new Nodes will have the same capacity and will be able to accommodate the pending pods.

The cluster-autoscaler has logic that simulates the Scheduler's decisions to see if a new Node will be able to accommodate pending workloads. If the Scheduler is zone-aware, it may specifically want to distribute workloads across AZs.

Consider this scenario:

  • The Scheduler wants to run some Pods in Zone A because there are already equivalent pods in Zone B and it wants a multi-AZ Pod distribution.
  • There are not enough Nodes in Zone A
  • In this situation the Pods will be pending, waiting for a Node in Zone A
  • The cluster autoscaler sees pending pods and starts looking for a NodeGroup (an ASG) that can accommodate these Pods
  • Say we have a NodeGroup whose ASG spans Zones A and B
  • This NodeGroup returns a random sampleNode that is in Zone A
  • The cluster-autoscaler says "Great, this NodeGroup has an appropriate node, let me scale it up", and it increases the DesiredCapacity by 1
  • The ASG launches a Node in Zone B (because it is trying to keep the Zones balanced)
  • The new Node appears in Kubernetes and the Scheduler sees that it is in Zone B
  • The Scheduler continues to wait for a Node in Zone A in order to schedule the pending Pods
  • The new Node did not help accommodate the pending Pods

Eventually the cluster-autoscaler will try to launch another Node because there are still Pods pending. If, by chance, a new Node lands in Zone A, the Pods will get scheduled and things will work, but this is non-deterministic and makes the process less reliable.

The crux of the matter is the contract between the NodeGroup, which provides a sampleNode, and the scaling logic, which expects that sampleNode to be equivalent to all other nodes in that NodeGroup.

This is theoretical; I haven't tested these scenarios, but from my understanding of the Node sampling logic, something like this would happen.
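(To make the scenario concrete, here is a hypothetical Pod that is pinned to Zone A via a `nodeSelector` on the standard zone label; the label key, zone name, and image are illustrative assumptions, not something from this PR.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-a-pod
spec:
  # The scheduler will only place this Pod on a node carrying this zone label,
  # so it stays Pending until a node exists in us-east-1a.
  nodeSelector:
    failure-domain.beta.kubernetes.io/zone: us-east-1a
  containers:
    - name: app
      image: nginx
```

If the cluster-autoscaler scales up a multi-AZ ASG on behalf of this Pod and the new instance happens to land in us-east-1b, the Pod stays Pending, which is exactly the non-determinism described above.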

@pbitty

On that note, maybe we should say in the README that if one wants the Scheduler to be Zone-aware and distribute workloads evenly across Zones, then the 1-zone-per-ASG rule must be followed. Otherwise, it doesn't matter.

Thoughts?

@andrewsykim (Author) Aug 19, 2016

  • At the time of writing, the cluster autoscaler is unaware of availability zones; the availability zone of each instance is determined by the autoscaling group. Although autoscaling groups can contain instances in multiple availability zones, the cluster autoscaler will not evenly distribute pods across zones. Use a multi-AZ ASG at your own risk.

??

@mumoshu (Contributor)

@osxi @pbitty @andrewsykim Your explanation really helped me understand what's under the hood. Thanks!

So, to be clear,

  • As long as we keep a single AZ per ASG, cluster autoscaling works reliably. That's because the cluster autoscaler isn't AZ-aware: with multiple AZs per ASG, a node may be added in an unneeded AZ, because the cluster autoscaler doesn't know, and is unable to control, which AZ a node gets added to.
  • If you are O.K. with the non-deterministic behavior @pbitty described, you can assign multiple AZs to an ASG, but it isn't necessary, because as @pbitty pointed out in AWS Cluster Autoscaler README #1552 (comment), keeping 1-zone-per-ASG fixes the reliability gotcha without any unwanted side-effect.

?

Then, IMHO, something like the below makes sense to me:

The autoscaling group should span exactly 1 availability zone for the cluster autoscaler to work. If you want to distribute workloads evenly across zones, set up multiple ASGs, each in a distinct availability zone.

At the time of writing, the cluster autoscaler is unaware of availability zones. Although autoscaling groups can be configured to contain instances in multiple availability zones, the cluster autoscaler cannot reliably add nodes to the desired zones in that case, because AWS AutoScaling decides which zone to add a node to completely out of band from the cluster autoscaler. For more information, see #1552 (comment)
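(A sketch of what the multiple-ASG setup might look like in the autoscaler's flags, assuming hypothetical per-zone ASG names and that the deployed version accepts one `--nodes` flag per node group.)

```yaml
# Container command fragment: one ASG (node group) per availability zone.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=1:10:k8s-workers-us-east-1a   # ASG confined to us-east-1a (hypothetical name)
  - --nodes=1:10:k8s-workers-us-east-1b   # ASG confined to us-east-1b (hypothetical name)
```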

@andrewsykim (Author)

looks good to me, I'll make the changes

@mwielgus (Contributor)

Is the current state of the doc acceptable?

]
}
```
Unfortunately, AWS does not support ARNs for autoscaling groups yet, so you must use "*" as the resource.
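(The policy JSON above is truncated in this view. For orientation only, here is a minimal sketch of the kind of policy this section describes, assuming the autoscaling actions the cluster autoscaler is commonly granted; the exact action list in the README may differ, and as noted the resource has to be "*".)

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*"
    }
  ]
}
```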
@osxi (Contributor) commented Aug 23, 2016

LGTM after adding a link to the autoscaling permissions docs!

@andrewsykim (Author)

Comments addressed. @mwielgus I was wondering if there's an official gcr image for the CA that supports AWS as a cloud provider we can put here?

@mwielgus (Contributor)

There is one: gcr.io/google_containers/cluster-autoscaler:v0.3.0-beta2, but it is a beta image. I will update the doc once we get the final release.

@mwielgus added the lgtm label (Indicates that a PR is ready to be merged.) on Aug 25, 2016
@mwielgus (Contributor)

lgtm

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot merged commit 0fca8c7 into kubernetes-retired:master on Aug 25, 2016
@andrewsykim deleted the aws_docs branch on September 13, 2016
mwielgus pushed a commit to kubernetes/autoscaler that referenced this pull request Apr 18, 2017
Automatic merge from submit-queue

AWS Cluster Autoscaler README

under kubernetes-retired/contrib#1311
johanneswuerbach added a commit to johanneswuerbach/autoscaler that referenced this pull request Jun 19, 2017
kubernetes-retired/contrib#1552 (comment) seems to explain the reasoning behind multiple ASGs much better than the previous link target.
Labels: area/autoscaler, lgtm (Indicates that a PR is ready to be merged.)
8 participants