
Extend current worker_group ASG creation behavior (1 AZ per ASG) #346

Closed · 1 of 4 tasks

jphuynh opened this issue Apr 15, 2019 · 18 comments

Comments

jphuynh commented Apr 15, 2019

I have issues

I'm submitting a...

  • [ ] bug report
  • [x] feature request
  • [ ] support request
  • [ ] kudos, thank you, warm fuzzy

What is the current behavior?

Currently, worker_groups takes a list of subnets and creates a single ASG spread across all the requested subnets (Availability Zones).

While this is useful behavior in some conditions, it's incompatible with the cluster-autoscaler, which does not support Auto Scaling Groups that span multiple Availability Zones.

Assuming that a fair number of people will use Kubernetes alongside the cluster-autoscaler, would it be possible to consider extending the feature?

What is the desired behavior?

I'm suggesting keeping the current behavior behind a flag like:

single_asg = true

and implementing the new default behavior of creating as many ASGs as there are declared subnets.
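For illustration only (the single_asg flag is hypothetical and does not exist in the module), the configuration could then look like this:

worker_groups = [
    {
      # Hypothetical flag: with the proposed default of one ASG per
      # subnet, this single block would create one ASG per subnet below;
      # setting single_asg = true would restore today's single multi-AZ ASG.
      single_asg           = false
      subnets              = "${join(",", data.aws_subnet_ids.private.ids)}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
]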

Environment details

  • Affected module version: v2.3.1
  • OS: Linux
  • Terraform version: 0.11.13

Notes:

This is not currently impossible to achieve with the existing code, but it leads to a lot of code duplication to make it happen.

Example:

worker_groups = [
    # Multi AZ ASG
    {
      subnets              = "${join(",",data.aws_subnet_ids.private.ids)}"
      asg_desired_capacity = 1
      asg_max_size         = 1
      asg_min_size         = 1
      instance_type        = "t3.small"
    },
    # Single AZ ASGs
    {
      subnets              = "${data.aws_subnet_ids.private.ids[0]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
    {
      subnets              = "${data.aws_subnet_ids.private.ids[1]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
    {
      subnets              = "${data.aws_subnet_ids.private.ids[2]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
]
@max-rocket-internet (Contributor)

Interesting issue!

> Assuming that a fair number of people will use Kubernetes alongside the cluster-autoscaler

I would say almost everyone will be running the cluster-autoscaler.

> the cluster-autoscaler, which does not support Auto Scaling Groups that span multiple Availability Zones

This is the first I've heard of this and I have always run the cluster-autoscaler with multi-AZ ASGs. But I've never seen rebalancing happening or an instance terminated by the ASG (unless EC2 healthcheck fails).

> takes a list of subnets and creates a single ASG spread across all the requested subnets (Availability Zones)

Interesting idea!

I'm keen to get other opinions here as I've never seen or heard of this issue before.

Also, couldn't this problem simply be avoided by passing suspended_processes=["AZRebalance"]? What are the implications of that?
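For example, something along these lines (just a sketch — the exact key name and value type may differ between module versions, so check the module docs):

worker_groups = [
    {
      name                 = "workers"
      instance_type        = "t3.small"
      subnets              = "${join(",", data.aws_subnet_ids.private.ids)}"
      asg_desired_capacity = 3
      # Stop the ASG from terminating instances to rebalance AZs.
      suspended_processes  = ["AZRebalance"]
    },
]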

jphuynh (Author) commented Apr 15, 2019

> Also, couldn't this problem simply be avoided by passing suspended_processes=["AZRebalance"]? What are the implications of that?

This is a good point. I guess that would indeed prevent the issues with rebalancing.

Like I said, it's not blocking at the moment, but I still think adding this flexibility could be interesting for users of the module.

mascah commented Apr 25, 2019

I would say this problem isn't fully avoided just by passing suspended_processes=["AZRebalance"], though it does help in certain cases.

If you have AZ-aware workloads, CA can request that an ASG launch an instance, but the ASG may place it into a zone that does not meet the requirements of the pod.

The issue has been explained/addressed by people who are much more familiar with CA and its use cases:

kubernetes-retired/contrib#1552 (comment)
eksctl-io/eksctl#647 (comment)

@max-rocket-internet (Contributor)

> If you have AZ-aware workloads, CA can request that an ASG launch an instance, but the ASG may place it into a zone that does not meet the requirements of the pod.

I think this is helped in 1.12: https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/

A common example is pods that have an EBS volume.

But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre-1.12 k8s can help with this.

@BrianChristie

@mascah is correct: use of the cluster-autoscaler with Auto Scaling Groups that span more than one Availability Zone is unsupported.

A separate ASG should be created per zone. As noted, this is primarily because EBS volumes are scoped to a single zone.

> But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre-1.12 k8s can help with this.

When there is an ASG per AZ, cluster-autoscaler can increase the desired capacity in the group which is located in the appropriate zone. I believe it goes something like this:

  • A Pod with a PVC is created
  • The volume provisioner creates an EBS volume in a random zone
  • kube-scheduler schedules the Pod into the zone matching the volume
  • Insufficient capacity in the group for that zone ⇒ the cluster-autoscaler increases its desired capacity

The 1.12 Topology-Aware Volume Provisioning comes at this from the other direction, and primarily helps when the cluster-autoscaler is not in use.

> This new setting instructs the volume provisioner to not create a volume immediately, and instead, wait for a pod using an associated PVC to run through scheduling

The default remains as before:

> By default, the Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created
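To illustrate, here's a minimal sketch of such a StorageClass, written with the Terraform kubernetes provider for consistency with this thread (the resource and class names are made up):

resource "kubernetes_storage_class" "gp2_topology_aware" {
  metadata {
    name = "gp2-wait-for-consumer"
  }

  storage_provisioner = "kubernetes.io/aws-ebs"

  # Delay volume binding and provisioning until a pod using the PVC is
  # scheduled, instead of the default "Immediate" mode.
  volume_binding_mode = "WaitForFirstConsumer"
}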

@max-rocket-internet (Contributor)

> A separate ASG should be created per zone.

Unless you suspend AZRebalance (which is now the default in this module) and don't use PVCs, which I suspect covers most use cases.

We have a large PVC for Prometheus in one cluster, so we defined a second worker group like this, with a label for scheduling in k8s (see the sketch after the block):

  worker_groups_launch_template = [
    {
      name               = "prom-1"
      instance_type      = "r5.4xlarge"
      kubelet_extra_args = "--node-labels=kubernetes.io/prometheus=true"
      subnets            = "${element(module.vpc1.private_subnets, 1)}"
    }
  ]
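On the Kubernetes side, pods are then pinned to those nodes via a nodeSelector. A minimal sketch, again using the Terraform kubernetes provider (the deployment name and image are made up):

resource "kubernetes_deployment" "prometheus" {
  metadata {
    name = "prometheus"
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = "prometheus"
      }
    }

    template {
      metadata {
        labels = {
          app = "prometheus"
        }
      }

      spec {
        # Schedule only onto nodes carrying the label set via
        # kubelet_extra_args in the worker group above.
        node_selector = {
          "kubernetes.io/prometheus" = "true"
        }

        container {
          name  = "prometheus"
          image = "prom/prometheus:v2.14.0"
        }
      }
    }
  }
}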

@max-rocket-internet (Contributor)

> I'm suggesting keeping the current behavior behind a flag like: single_asg = true

Maybe do the reverse so as not to break compatibility, and see if there's a clean way of doing it in TF? All the worker-group-related code is quite complicated at the moment 😬

mascah commented May 16, 2019

  worker_groups_launch_template = [
    {
      name               = "prom-1"
      instance_type      = "r5.4xlarge"
      kubelet_extra_args = "--node-labels=kubernetes.io/prometheus=true"
      subnets            = "${element(module.vpc1.private_subnets, 1)}"
    }
  ]

That's a reasonable workaround for the time being. I'm also optimistic that CA will support multi-AZ ASGs before we need to dump a bunch more worker-group code here to handle it 😄

@Stephanzf

It is required to have a separate ASG for each Availability Zone configured in an EKS cluster. If you like, try killing a pod or increasing your workloads to see whether the cluster autoscales.

eperdeme commented Oct 8, 2019

This is still needed. When scaling for resources pinned to AZ-A, the autoscaler has a chance of spawning the additional capacity in another AZ, as it has no ability to scale by zone when an ASG spans multiple AZs.

mgar commented Oct 22, 2019

> This is not currently impossible to achieve with the existing code, but it leads to a lot of code duplication to make it happen.

To avoid code duplication in worker_groups, we ended up doing this:

worker_groups = [
  for subnet in data.aws_subnet_ids.private.ids :
  {
    subnets              = [subnet],
    asg_desired_capacity = 3,
    instance_type        = "m5.large",
    ami_id               = data.aws_ami.eks_worker.id
  }
]

It's also useful in case you don't know how many subnets you're going to have initially.
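Note for anyone copying this: the for expression requires Terraform 0.12 or later, so on 0.11 (as in the original report) the duplicated blocks remain necessary.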

@max-rocket-internet (Contributor)

> we ended up doing this:

Nice 😃

@ipnextgen

Good to see this thread. Any plan to natively integrate this into the module, without any need for "for loops"?

@max-rocket-internet (Contributor)

Do we need to? That solution by @mgar is very elegant.

@bmcustodio (Contributor)

I'd also need this, but I'll (hopefully) try @mgar's solution over the next few days and report back. 👍

@bmcustodio (Contributor)

Quick update: @mgar's solution seems to be working well for us 🤞

@max-rocket-internet (Contributor)

Great. I think we can close this issue then, right?

github-actions (bot)

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022