## Deployment Specification
Your deployment configuration should look something like this:
```
You can hint to GitHub that this should be formatted as yaml with: "```yaml". Same with the json above.
cool didn't know that
```
Note:
- The `/etc/ssl/certs/ca-certificates.crt` file should exist by default on your EC2 instance.
- At the time of writing this, the cluster autoscaler is unaware of availability zones; the availability zone of the instance should be configured by the autoscaling group. Although autoscaling groups can contain instances in multiple availability zones, the autoscaling group should span 1 availability zone for the cluster autoscaler to work.
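As an illustration of the note above, here is a hedged sketch of creating an ASG pinned to a single availability zone with the aws-sdk-go autoscaling client; the group name, launch configuration name, and zone are placeholders (not values from this PR), and you would create one such group per zone if you want to cover multiple zones:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	// All names and the zone below are placeholders.
	svc := autoscaling.New(session.Must(session.NewSession()))

	_, err := svc.CreateAutoScalingGroup(&autoscaling.CreateAutoScalingGroupInput{
		AutoScalingGroupName:    aws.String("k8s-workers-us-east-1a"),
		LaunchConfigurationName: aws.String("k8s-worker-lc"),
		MinSize:                 aws.Int64(1),
		MaxSize:                 aws.Int64(10),
		// Exactly one availability zone, so every node this group launches
		// looks the same to the cluster autoscaler.
		AvailabilityZones: []*string{aws.String("us-east-1a")},
	})
	if err != nil {
		log.Fatalf("creating ASG: %v", err)
	}
}
```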
Hi, really excited to see this PR!!
Excuse me if I'm misreading, but I couldn't figure out why the autoscaling group should span 1 availability zone for the cluster autoscaler to work.
As noted, the cluster autoscaler is unaware of AZs, but an ASG is aware of them. If so configured, spreading EC2 instances over multiple AZs is done completely out of band by AWS AutoScaling. So I believe you can just assign 2 or more AZs to an ASG to effectively make the cluster autoscaler multi-AZ aware, as long as the AWS implementation of Cluster Autoscaler delegates adding/removing instances to AWS AutoScaling.
And it seems so (according to https://github.com/kubernetes/contrib/pull/1377/files#diff-ade7b95627ea0dd6b6f4deee7f24fa7eR124 it is calling SetDesiredCapacity to delegate adding/removing instances).
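For reference, a minimal sketch of that delegation using the aws-sdk-go autoscaling client; the group name and capacity below are placeholders for illustration, not values taken from this PR:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Scaling is delegated to AWS AutoScaling: the caller only sets the
	// desired capacity; which AZ a new instance lands in is decided by the ASG.
	_, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("k8s-workers"), // placeholder name
		DesiredCapacity:      aws.Int64(4),              // placeholder target size
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("setting desired capacity: %v", err)
	}
}
```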
@mumoshu, it will likely work with a multi-AZ ASG, but there may be some caveats if Kubernetes is configured to be zone-aware. Pardon the long-winded explanation. Here is the reasoning:
The cluster-autoscaler asks the AWS CloudProvider for a sample Node from the NodeGroup (backed by an ASG), and uses it to make scaling decisions. It assumes that the sample Node is equivalent to all other nodes in the ASG - e.g. same instance type, storage, etc. When it needs to scale up, for example, it will know with certainty that new Nodes will have the same capacity and will be able to accommodate the pending pods.
The cluster-autoscaler has logic that simulates the Scheduler's decisions to see if a new Node will be able to accommodate pending workloads. If the Scheduler is zone-aware, it may specifically want to distribute workloads across AZs.
Consider this scenario:
- The Scheduler wants to run some Pods in Zone A because there are already equivalent pods in Zone B and it wants a multi-AZ Pod distribution.
- There are not enough Nodes in Zone A
- In this situation the Pods will be pending, waiting for a Node in Zone A
- The cluster autoscaler sees pending pods and starts looking for a NodeGroup (an ASG) that can accommodate these Pods
- Say we have a NodeGroup whose ASG spans Zones A and B
- This NodeGroup returns a random sampleNode that is in Zone A
- The cluster-autoscaler says "Great, this NodeGroup has an appropriate node, let me scale it up", and it increases the DesiredCapacity by 1
- The ASG launches a Node in Zone B (because it is trying to keep the Zones balanced)
- The new Node appears in Kubernetes and the Scheduler sees that it is in Zone B
- The Scheduler continues to wait for a Node in Zone A in order to schedule the pending Pods
- The new Node did not help accommodate the pending Pods
Eventually the cluster-autoscaler will try to launch another Node because there are still Pods pending. If, by chance, a new Node lands in Zone A, the Pods will get scheduled and things will work, but this would be non-deterministic and would make the process less reliable.
The crux of the matter is the contract between the NodeGroup, which provides a sampleNode, and the scaling logic, which expects that a sampleNode is equivalent to all other nodes in that NodeGroup.
This is theoretical and I haven't tested these scenarios, but from my understanding of the Node sampling logic, something like this would happen.
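To make that contract concrete, here is a small, self-contained sketch of the failure mode described above; the type and method names are simplified stand-ins for illustration, not the actual cluster-autoscaler API:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Simplified stand-ins for illustration; not the real cluster-autoscaler types.

type Node struct {
	Zone string
}

// NodeGroup models a group backed by one ASG: the scaler asks for a sample
// Node and assumes every Node the group adds will be equivalent to it.
type NodeGroup interface {
	SampleNode() Node
	IncreaseSize(delta int) Node // returns the node the ASG actually launched
}

// multiAZGroup simulates an ASG spanning two zones. Which zone a new
// instance lands in is decided out of band by AWS AutoScaling's balancing,
// not by the cluster autoscaler.
type multiAZGroup struct {
	zones []string
}

func (g multiAZGroup) SampleNode() Node {
	return Node{Zone: g.zones[rand.Intn(len(g.zones))]}
}

func (g multiAZGroup) IncreaseSize(delta int) Node {
	return Node{Zone: g.zones[rand.Intn(len(g.zones))]}
}

func main() {
	var group NodeGroup = multiAZGroup{zones: []string{"us-east-1a", "us-east-1b"}}
	pendingPodZone := "us-east-1a" // the Scheduler wants capacity here

	sample := group.SampleNode()
	if sample.Zone == pendingPodZone {
		// The scaler assumes the new Node will look like the sample...
		launched := group.IncreaseSize(1)
		fmt.Printf("wanted a node in %s, sample was in %s, ASG launched one in %s\n",
			pendingPodZone, sample.Zone, launched.Zone)
		// ...but with a multi-AZ ASG the launched zone may differ, so the
		// pending Pods can stay pending and scale-up becomes non-deterministic.
	}
}
```

Run it a few times: the launched zone sometimes differs from the sample's zone even though the scale-up decision was based on the sample, which is exactly the non-determinism described above.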
On that note, maybe we should say in the README that if one wants the Scheduler to be Zone-aware and distribute workloads evenly across Zones, then the 1-zone-per-ASG rule must be followed. Otherwise, it doesn't matter.
Thoughts?
- At the time of writing this, the cluster autoscaler is unaware of availability zones; the availability zone of the instance should be configured by the autoscaling group. Although autoscaling groups can contain instances in multiple availability zones, the cluster autoscaler will not evenly distribute pods across zones. Use a multi-AZ ASG at your own risk.
??
@osxi @pbitty @andrewsykim Your explanation really helped me understand what's under the hood. Thanks!
So, to be clear,
- As long as we keep a single AZ per ASG, cluster autoscaling works reliably. That's because the cluster autoscaler isn't AZ-aware. Multiple AZs per ASG can result in a node being added to an unneeded AZ, because the cluster autoscaler doesn't know, and is unable to control, which AZ a node gets added to.
- If you are O.K. with the non-deterministic behavior @pbitty described, you can assign multiple AZs to an ASG, but it isn't necessary because, as @pbitty pointed out in AWS Cluster Autoscaler README #1552 (comment), keeping 1-zone-per-ASG fixes the reliability gotcha without any unwanted side effect
?
Then, IMHO, something like the below makes sense to me:
The autoscaling group should span exactly 1 availability zone for the cluster autoscaler to work. If you want to distribute workloads evenly across zones, set up multiple ASGs, each in a distinct availability zone.
At the time of writing this, the cluster autoscaler is unaware of availability zones. Although autoscaling groups can contain instances in multiple availability zones when configured to do so, the cluster autoscaler can't reliably add nodes to the desired zone in that case. That's because AWS AutoScaling determines which zone to add a node to completely out of band from the cluster autoscaler. For more information, see #1552 (comment)
looks good to me, I'll make the changes
Is the current state of the doc acceptable?
]
}
```
Unfortunately, AWS does not support ARNs for autoscaling groups yet, so you must use "*" as the resource.
We could probably add this link: http://docs.aws.amazon.com/autoscaling/latest/userguide/IAM.html#UsingWithAutoScaling_Actions
LGTM after adding a link to the autoscaling permissions docs!
There is one: gcr.io/google_containers/cluster-autoscaler:v0.3.0-beta2, but it is a beta image. I will update the doc once we get the final release.
lgtm
Automatic merge from submit-queue AWS Cluster Autoscaler README under kubernetes-retired/contrib#1311
kubernetes-retired/contrib#1552 (comment) seems to explain the reasoning behind multiple ASGs much better than the previous link target.