
Extend current worker_group ASG creation behavior (1 AZ per ASG) #346

Closed · 1 of 4 tasks

jphuynh opened this issue Apr 15, 2019 · 18 comments

Comments

jphuynh commented Apr 15, 2019

I have issues

I'm submitting a...

  • [ ] bug report
  • [x] feature request
  • [ ] support request
  • [ ] kudos, thank you, warm fuzzy

What is the current behavior?

Currently, worker_groups takes a list of subnets and creates a single ASG spread across all the requested subnets (Availability Zones).

While this is useful behavior in some conditions, it's incompatible with the cluster-autoscaler, which does not support Auto Scaling Groups that span multiple Availability Zones.

Assuming that a fair number of people will use Kubernetes alongside the cluster-autoscaler, would it be possible to consider extending the feature?

What is the desired behavior?

I'm suggesting keeping the current behavior behind a flag like:

single_asg = true

and implementing the new default behavior of creating as many ASGs as there are declared subnets.
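For illustration only (the single_asg flag is hypothetical and does not exist in the module), the configuration could then look like this:

worker_groups = [
    {
      # Hypothetical flag: with the proposed default of one ASG per
      # subnet, this single block would create one ASG per subnet below;
      # setting single_asg = true would restore today's single multi-AZ ASG.
      single_asg           = false
      subnets              = "${join(",", data.aws_subnet_ids.private.ids)}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
]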

Environment details

  • Affected module version: v2.3.1
  • OS: Linux
  • Terraform version: 0.11.13

Notes:

This is not currently impossible to achieve with the existing code, but it leads to a lot of code duplication to make it happen.

Example:

worker_groups = [
    # Multi AZ ASG
    {
      subnets              = "${join(",",data.aws_subnet_ids.private.ids)}"
      asg_desired_capacity = 1
      asg_max_size         = 1
      asg_min_size         = 1
      instance_type        = "t3.small"
    },
    # Single AZ ASGs
    {
      subnets              = "${data.aws_subnet_ids.private.ids[0]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
    {
      subnets              = "${data.aws_subnet_ids.private.ids[1]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
    {
      subnets              = "${data.aws_subnet_ids.private.ids[2]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
]
@max-rocket-internet (Contributor)

Interesting issue!

> Assuming that a fair number of people will use Kubernetes alongside the cluster-autoscaler

I would say almost everyone will be running the cluster-autoscaler.

> the cluster-autoscaler, which does not support Auto Scaling Groups that span multiple Availability Zones

This is the first I've heard of this and I have always run the cluster-autoscaler with multi-AZ ASGs. But I've never seen rebalancing happening or an instance terminated by the ASG (unless EC2 healthcheck fails).

> takes a list of subnets and creates a single ASG spread across all the requested subnets (Availability Zones)

Interesting idea!

I'm keen to get other opinions here as I've never seen or heard of this issue before.

Also, couldn't this problem simply be avoided by passing suspended_processes=["AZRebalance"]? What are the implications of that?
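For example, something along these lines (just a sketch — the exact key name and value type may differ between module versions, so check the module docs):

worker_groups = [
    {
      name                 = "workers"
      instance_type        = "t3.small"
      subnets              = "${join(",", data.aws_subnet_ids.private.ids)}"
      asg_desired_capacity = 3
      # Stop the ASG from terminating instances to rebalance AZs.
      suspended_processes  = ["AZRebalance"]
    },
]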

jphuynh (Author) commented Apr 15, 2019

> Also, couldn't this problem simply be avoided by passing suspended_processes=["AZRebalance"]? What are the implications of that?

This is a good point. I guess that would indeed prevent the issues with rebalancing.

Like I said, it's not blocking at the moment, but I still think adding this flexibility could be interesting for users of the module.

mascah commented Apr 25, 2019

I would say this problem isn't fully avoided just by passing suspended_processes=["AZRebalance"], though it does help in certain cases.

If you have AZ-aware workloads, CA can request that an ASG launch an instance, but the ASG may place it into a zone that does not meet the requirements of the pod.

The issue has been explained/addressed by people who are much more familiar with CA and its use cases:

kubernetes-retired/contrib#1552 (comment)
eksctl-io/eksctl#647 (comment)

@max-rocket-internet (Contributor)

> If you have AZ-aware workloads, CA can request that an ASG launch an instance, but the ASG may place it into a zone that does not meet the requirements of the pod.

I think this is helped in 1.12: https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/

A common example is pods that have an EBS volume.

But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre-1.12 k8s can help with this.

@BrianChristie

@mascah is correct: use of the cluster-autoscaler with Auto Scaling Groups that span more than one Availability Zone is unsupported.

A separate ASG should be created per zone. As noted, this is primarily because EBS volumes are scoped to a single zone.

> But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre-1.12 k8s can help with this.

When there is an ASG per AZ, cluster-autoscaler can increase the desired capacity in the group which is located in the appropriate zone. I believe it goes something like this:

  • A Pod with a PVC is created
  • The volume provisioner creates an EBS volume in a random zone
  • kube-scheduler schedules the Pod into the zone matching the volume
  • Insufficient capacity in the group for that zone ⇒ the cluster-autoscaler increases its desired capacity

The 1.12 Topology-Aware Volume Provisioning comes at this from the other direction, and primarily helps when the cluster-autoscaler is not in use.

> This new setting instructs the volume provisioner to not create a volume immediately, and instead, wait for a pod using an associated PVC to run through scheduling

The default remains as before:

> By default, the Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created
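To illustrate, here's a minimal sketch of such a StorageClass, written with the Terraform kubernetes provider for consistency with this thread (the resource and class names are made up):

resource "kubernetes_storage_class" "gp2_topology_aware" {
  metadata {
    name = "gp2-wait-for-consumer"
  }

  storage_provisioner = "kubernetes.io/aws-ebs"

  # Delay volume binding and provisioning until a pod using the PVC is
  # scheduled, instead of the default "Immediate" mode.
  volume_binding_mode = "WaitForFirstConsumer"
}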

@max-rocket-internet (Contributor)

> A separate ASG should be created per zone.

Unless you suspend AZRebalance (which is now the default in this module) and don't use PVCs, which I suspect covers most use cases.

We have a large PVC for Prometheus in one cluster, so we defined a second worker group like this, with a label for scheduling in k8s (see the sketch after the block):

  worker_groups_launch_template = [
    {
      name               = "prom-1"
      instance_type      = "r5.4xlarge"
      kubelet_extra_args = "--node-labels=kubernetes.io/prometheus=true"
      subnets            = "${element(module.vpc1.private_subnets, 1)}"
    }
  ]
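On the Kubernetes side, pods are then pinned to those nodes via a nodeSelector. A minimal sketch, again using the Terraform kubernetes provider (the deployment name and image are made up):

resource "kubernetes_deployment" "prometheus" {
  metadata {
    name = "prometheus"
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = "prometheus"
      }
    }

    template {
      metadata {
        labels = {
          app = "prometheus"
        }
      }

      spec {
        # Schedule only onto nodes carrying the label set via
        # kubelet_extra_args in the worker group above.
        node_selector = {
          "kubernetes.io/prometheus" = "true"
        }

        container {
          name  = "prometheus"
          image = "prom/prometheus:v2.14.0"
        }
      }
    }
  }
}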

@max-rocket-internet (Contributor)

> I'm suggesting keeping the current behavior behind a flag like: single_asg = true

Maybe do the reverse so as not to break compatibility, and see if there's a clean way of doing it in TF? All the worker-group-related code is quite complicated at the moment 😬

mascah commented May 16, 2019

  worker_groups_launch_template = [
    {
      name               = "prom-1"
      instance_type      = "r5.4xlarge"
      kubelet_extra_args = "--node-labels=kubernetes.io/prometheus=true"
      subnets            = "${element(module.vpc1.private_subnets, 1)}"
    }
  ]

That's a reasonable workaround for the time being. I'm also optimistic that CA will support multi-AZ ASGs before we need to dump a bunch more worker-group code here to handle it 😄

@Stephanzf

It is required to have a separate ASG for each Availability Zone configured in an EKS cluster. If you like, try killing a pod or increasing your workloads to see whether the cluster autoscales.

eperdeme commented Oct 8, 2019

This is still needed. When scaling for resources pinned to AZ-A, the autoscaler has a chance of spawning the additional capacity in another AZ, as it has no ability to scale by zone when an ASG spans multiple AZs.

mgar commented Oct 22, 2019

> This is not currently impossible to achieve with the existing code, but it leads to a lot of code duplication to make it happen.

To avoid code duplication in worker_groups, we ended up doing this:

worker_groups = [
  for subnet in data.aws_subnet_ids.private.ids :
  {
    subnets              = [subnet],
    asg_desired_capacity = 3,
    instance_type        = "m5.large",
    ami_id               = data.aws_ami.eks_worker.id
  }
]

It's also useful in case you don't know how many subnets you're going to have initially.
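Note for anyone copying this: the for expression requires Terraform 0.12 or later, so on 0.11 (as in the original report) the duplicated blocks remain necessary.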

@max-rocket-internet (Contributor)

> we ended up doing this:

Nice 😃

@ipnextgen

Good to see this thread. Any plan to natively integrate this into the module, without any need for "for loops"?

@max-rocket-internet (Contributor)

Do we need to? That solution by @mgar is very elegant.

@bmcustodio (Contributor)

I'd also need this, but I'll (hopefully) try @mgar's solution over the next few days and report back. 👍

@bmcustodio (Contributor)

Quick update: @mgar's solution seems to be working well for us 🤞

@max-rocket-internet (Contributor)

Great. I think we can close this issue then, right?

github-actions (bot)

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022