eks-prow-build-cluster: Use dedicated Managed Node Groups (MNGs) per Availability Zone (AZ) #6320

Merged: 4 commits, Jan 24, 2024

Conversation

xmudrii
Member

@xmudrii xmudrii commented Jan 24, 2024

xref #6303 (see that issue for more details and context)

As per recommendations received from AWS support and @tzneal, we're replacing the blue and green node groups with one node group per AZ. The old node groups each had instances in all three AZs; now each AZ has its own dedicated node group. This is a short-term fix for the stability issues that we're facing.

This has been successfully rolled out to canary. I'll do the prod rollout once this PR is reviewed and merged.

Notes:

  • We plan to switch to Karpenter long-term (eks-prow-build-cluster: Use Karpenter instead of cluster-autoscaler #5168)
  • We didn't want to suspend the AZRebalance process because that's not natively supported by EKS, so we opted for dedicated node groups, as that solution can be fully automated using Terraform
  • We have dedicated Terraform objects and variables for each Node Group. I explicitly didn't want to use count because that would make rolling out upgrades too complicated
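
For illustration, the per-AZ layout described in the notes above might look roughly like the following. This is a minimal sketch, not the PR's actual code: it assumes the terraform-aws-modules/eks module (in a version that accepts `cluster_version` per node group), and every name marked "hypothetical" is invented here. Only `var.node_group_version_us_east_2a` and `var.node_taints_build` appear in this PR.

```hcl
# Sketch only: one managed node group per AZ, each with its own objects and
# variables, instead of blue/green groups spanning all three AZs.
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # ... cluster-level settings elided ...

  eks_managed_node_groups = {
    "build-us-east-2a" = {
      # Pinning the group to the subnet(s) of one AZ keeps all of its
      # instances in that AZ.
      subnet_ids      = [var.private_subnet_id_us_east_2a] # hypothetical variable
      cluster_version = var.node_group_version_us_east_2a
      taints          = var.node_taints_build
    }

    "build-us-east-2b" = {
      subnet_ids      = [var.private_subnet_id_us_east_2b]  # hypothetical variable
      cluster_version = var.node_group_version_us_east_2b   # hypothetical variable
      taints          = var.node_taints_build
    }

    # A third group for the remaining AZ would follow the same pattern.
  }
}
```

One plausible reading of the design choice: with a separate version variable per group, each AZ's node group can be upgraded on its own by changing a single variable, which is harder to express when the groups are generated with `count`.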

Follow-ups:

  • Remove blue/green Terraform variables and objects
  • Update docs to reflect changes in the upgrade procedure

/assign @upodroid @ameukam @dims

Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
@k8s-ci-robot k8s-ci-robot added labels Jan 24, 2024:

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)
  • area/bash (Bash scripts, testing them, writing less of them, code in infra/gcp/)
  • area/infra (Infrastructure management, infrastructure design, code in infra/)
  • area/infra/aws (Issues or PRs related to Kubernetes AWS infrastructure)
  • sig/k8s-infra (Categorizes an issue or PR as relevant to SIG K8s Infra.)

Member

@upodroid upodroid left a comment

LGTM

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: upodroid, xmudrii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@upodroid
Member

cancel the hold when you are ready to merge
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2024
@xmudrii
Member Author

xmudrii commented Jan 24, 2024

Let's get this merged and I'll create a new PR for follow-ups
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2024

cluster_version = var.node_group_version_us_east_2a

taints = var.node_taints_build
Contributor

FYI - you can use eks_managed_node_group_defaults to set the default values you wish to use across all of the managed nodegroups; any settings unique to a particular nodegroup can then be set in that nodegroup's definition. A value set in the defaults can still be overridden in the nodegroup definition.

So roughly speaking, something like the following might help cut down on the copy+pasta across nodegroup definitions:

  eks_managed_node_group_defaults = {
    use_name_prefix = true
    
    taints = var.node_taints_build
    labels = var.node_labels_build
    # ... anything else that you want common across the nodegroups
  }
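
To make the defaults-plus-override behaviour concrete, here is a slightly fuller sketch. It assumes a version of the terraform-aws-modules/eks module that supports `eks_managed_node_group_defaults`; the node group keys and the extra label are hypothetical and do not come from this PR.

```hcl
# Sketch only: shared settings live in the defaults block, and each node group
# sets just what is unique to it.
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # ... cluster-level settings elided ...

  eks_managed_node_group_defaults = {
    use_name_prefix = true

    taints = var.node_taints_build
    labels = var.node_labels_build
  }

  eks_managed_node_groups = {
    # Inherits taints and labels from the defaults above.
    "build-us-east-2a" = {
      cluster_version = var.node_group_version_us_east_2a
    }

    # Overrides one default: a value set here takes precedence over the
    # defaults block for this group only.
    "build-us-east-2b" = {
      cluster_version = var.node_group_version_us_east_2b # hypothetical variable
      labels = merge(var.node_labels_build, {
        "example.com/override-demo" = "true" # hypothetical label
      })
    }
  }
}
```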
  
  

Member Author

That's neat, thanks for pointing it out! I'll keep it in mind and see if we can make use of it the next time we do some rollout

@k8s-ci-robot k8s-ci-robot merged commit 33a5d9d into kubernetes:main Jan 24, 2024
3 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Jan 24, 2024
@xmudrii xmudrii deleted the eks-mng-per-az branch January 24, 2024 15:49
6 participants