[EKS Prow Cluster] Add karpenter terraform module for eks-prow-build cluster #6895

Status: Closed (wants to merge 5 commits)
17 changes: 17 additions & 0 deletions infra/aws/terraform/modules/eks-prow-iam/policy_apply.tf
@@ -60,6 +60,7 @@ data "aws_iam_policy_document" "eks_apply" {
"ec2:ModifyInstanceAttribute",
"ec2:TerminateInstances",
"ec2:ImportKeyPair",
"eks:CreateAccessEntry",
"eks:CreateAddon",
"eks:CreateCluster",
"eks:CreateNodegroup",
@@ -69,9 +70,19 @@
"eks:UpdateClusterVersion",
"eks:UpdateNodegroupConfig",
"eks:UpdateNodegroupVersion",
"events:DeleteRule",
"events:DescribeRule",
"events:ListTagsForResource",
"events:ListTargetsByRule",
[Member comment on lines +74 to +76]: This should be part of the plan policy; we only keep the write permissions here.
"events:PutRule",
"events:PutTargets",
"events:RemoveTargets",
"events:TagResource",
"iam:AttachRolePolicy",
"iam:CreateOpenIDConnectProvider",
"iam:CreatePolicy",
"iam:CreatePolicyVersion",
"iam:CreateRole",
"iam:PassRole",
"iam:TagOpenIDConnectProvider",
"iam:TagPolicy",
@@ -90,6 +101,12 @@
"logs:PutRetentionPolicy",
"logs:TagLogGroup",
"s3:PutObject",
"sqs:createqueue",
[Contributor comment]: Are these permissions required by Prow, or by Karpenter? Likewise for the EventBridge API permissions.
[Contributor (author) comment]: These are required by the Karpenter module; it creates an SQS queue for the event messaging.
"sqs:deletequeue",
"sqs:getqueueattributes",
"sqs:listqueuetags",
[Member comment on lines +106 to +107]: This should be part of the plan policy; we only keep the write permissions here.
[Member comment]: Why are these permissions lowercase while all the others are CamelCase? Is that on the AWS side, or does AWS not care at all?
"sqs:setqueueattributes",
"sqs:tagqueue",
# TODO(xmudrii-ubuntu): remove after removing ECR repo
"ecr-public:*"
]
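Per the review comments on this file, the read-only `events:*` and `sqs:*` actions would move out of the apply policy and into the plan policy's actions list. A sketch of the relevant entries as they might sit in policy_plan.tf (the surrounding statement structure is assumed to match the existing `eks_plan` document):

```hcl
# Sketch only: read-only actions relocated per review; statement shape
# and resources scope are assumptions to verify against policy_plan.tf.
data "aws_iam_policy_document" "eks_plan" {
  statement {
    effect = "Allow"
    actions = [
      # ...existing read-only actions...
      "events:DescribeRule",
      "events:ListTagsForResource",
      "events:ListTargetsByRule",
      "sqs:GetQueueAttributes",
      "sqs:ListQueueTags",
    ]
    resources = ["*"]
  }
}
```

The write actions (`events:PutRule`, `sqs:CreateQueue`, and so on) would stay in the apply policy.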
2 changes: 2 additions & 0 deletions infra/aws/terraform/modules/eks-prow-iam/policy_plan.tf
@@ -31,6 +31,7 @@ data "aws_iam_policy_document" "eks_plan" {
"acm:DescribeCertificate",
"acm:ListTagsForCertificate",
"ec2:DescribeAddresses",
"ec2:DescribeAddressesAttribute",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeEgressOnlyInternetGateways",
"ec2:DescribeInternetGateways",
@@ -84,6 +85,7 @@ data "aws_iam_policy_document" "eks_plan" {
"kms:ListAliases",
"kms:ListResourceTags",
"logs:DescribeLogGroups",
"logs:ListTagsForResource",
"logs:ListTagsLogGroup",
"s3:GetObject",
"s3:ListBucket"
18 changes: 18 additions & 0 deletions infra/aws/terraform/prow-build-cluster/hack/flux-update.bash
@@ -91,6 +91,23 @@ flux create hr kubecost \
--interval=${sync_interval} \
--export >> ${resources_dir}/kubecost/flux-hr-kubecost.yaml

boilerplate > ${resources_dir}/flux-system/flux-source-helm-karpenter-chart.yaml
flux create source helm karpenter \
--url oci://public.ecr.aws/karpenter/karpenter \
--interval=${sync_interval} \
--export >> ${resources_dir}/flux-system/flux-source-helm-karpenter-chart.yaml

boilerplate > ${resources_dir}/karpenter/flux-hr-karpenter.yaml
flux create hr karpenter \
--source HelmRepository/karpenter.flux-system \
--namespace=karpenter \
--chart karpenter \
--chart-version 0.36.2 \
--values ${resources_dir}/karpenter/${PROW_ENV}-cluster-values \
--interval=${sync_interval} \
--export >> ${resources_dir}/karpenter/flux-hr-karpenter.yaml


# This list contains names of folders inside ./resources directory
# that are used for generating FluxCD kustomizations.
kustomizations=(
@@ -104,6 +121,7 @@ kustomizations=(
external-secrets
kubecost
cluster-autoscaler
karpenter
)

# Code below is used to figure out a relative path of
36 changes: 36 additions & 0 deletions infra/aws/terraform/prow-build-cluster/karpenter.tf
@@ -0,0 +1,36 @@
/*
Copyright 2024 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

module "karpenter" {
source = "terraform-aws-modules/eks/aws//modules/karpenter"

cluster_name = module.eks.cluster_name
create_access_entry = false

enable_irsa = true
[Contributor comment]: Why not use EKS Pod Identity here?
irsa_namespace_service_accounts = ["karpenter:karpenter"]
irsa_oidc_provider_arn = module.eks.oidc_provider_arn

# Attach additional IAM policies to the Karpenter node IAM role
node_iam_role_additional_policies = {
AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

tags = {
Environment = var.cluster_name == "prow-canary-cluster" ? "canary" : "prod"
Terraform = "true"
}
}
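On the Pod Identity question raised in review: newer releases of the terraform-aws-modules Karpenter submodule can create an EKS Pod Identity association instead of relying on IRSA, which would also remove the hardcoded role ARN annotation from the Helm values. A hedged sketch (input names are assumptions to verify against the pinned module version):

```hcl
# Sketch only, not validated on this cluster; check the module docs
# for the exact inputs in the version in use.
module "karpenter" {
  source = "terraform-aws-modules/eks/aws//modules/karpenter"

  cluster_name = module.eks.cluster_name

  enable_irsa                     = false
  create_pod_identity_association = true
  namespace                       = "karpenter"
  service_account                 = "karpenter"
}
```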
26 changes: 25 additions & 1 deletion infra/aws/terraform/prow-build-cluster/locals.tf
@@ -37,6 +37,7 @@ locals {
auto_scaling_tags = {
"k8s.io/cluster-autoscaler/${var.cluster_name}" = "owned"
"k8s.io/cluster-autoscaler/enabled" = true
"karpenter.sh/discovery" = var.cluster_name
}

azs = slice(data.aws_availability_zones.available.names, 0, 3)
@@ -81,6 +82,27 @@ locals {
}
]

karpenter_roles = [
[Contributor comment]: Should we use a cluster access entry instead of adding to the aws-auth ConfigMap?
[Contributor (author) comment]: That's a good question. I tried this code on the Canary cluster first and could not run Karpenter with the access entry. I may have missed something, so I turned it off and continued with the aws-auth file.
{
rolearn = module.karpenter.node_iam_role_arn
username = "system:node:{{EC2PrivateDNSName}}"
groups = [
"system:bootstrappers",
"system:nodes"
]
}
]

sso_roles = [
{
rolearn = "arn:aws:iam::468814281478:role/AWSReservedSSO_AdministratorAccess_abaef4db15a2c055"
username = "sso-admins"
groups = [
"eks-cluster-admin"
[Contributor comment]: Not sure what permissions this group maps to, but similar to above: would a cluster access entry work here instead?
]
}
]

aws_auth_roles = flatten([
local.configure_prow ? [
# Allow access to the Prow-EKS-Admin IAM role (used by Prow directly).
@@ -93,6 +115,8 @@ locals {
}
] : [],
local.cluster_admin_roles,
local.cluster_viewer_roles
local.cluster_viewer_roles,
local.karpenter_roles,
local.sso_roles
])
}
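As an alternative to the `karpenter_roles` aws-auth entry above, the node role could be registered via a cluster access entry; an `EC2_LINUX` entry maps the role into `system:nodes` automatically. A sketch using the AWS provider's `aws_eks_access_entry` resource (the author notes above that access entries did not work on the canary cluster, so this remains an open question):

```hcl
resource "aws_eks_access_entry" "karpenter_node" {
  cluster_name  = module.eks.cluster_name
  principal_arn = module.karpenter.node_iam_role_arn
  # EC2_LINUX entries carry the system:nodes group mapping built in,
  # so no aws-auth ConfigMap change is needed.
  type = "EC2_LINUX"
}
```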
@@ -0,0 +1,13 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: karpenter
namespace: flux-system
spec:
interval: 5m0s
path: ./infra/aws/terraform/prow-build-cluster/resources/karpenter
prune: true
sourceRef:
kind: GitRepository
name: k8s-io
namespace: flux-system
@@ -0,0 +1,23 @@
settings:
clusterName: prow-canary-cluster
interruptionQueue: Karpenter-prow-canary-cluster
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::054318140392:role/KarpenterController-20240527081538529900000002"
[Contributor comment]: EKS Pod Identity would remove this hardcoded role ARN mapping in code.

controller:
resources:
requests:
cpu: 1
memory: 1Gi
limits:
cpu: 1
memory: 1Gi

# default affinity rule works well in our case
# https://github.com/aws/karpenter-provider-aws/blob/main/charts/karpenter/values.yaml#L70
tolerations:
[Member comment]: Do we want to force scheduling Karpenter on the stable nodes?
- effect: NoSchedule
key: node-group
operator: Equal
value: stable
@@ -0,0 +1,42 @@
---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
name: karpenter
namespace: flux-system
spec:
chart:
spec:
chart: karpenter
reconcileStrategy: ChartVersion
sourceRef:
kind: HelmRepository
name: karpenter
namespace: flux-system
version: 0.36.2
install:
createNamespace: true
interval: 1m0s
releaseName: karpenter
targetNamespace: karpenter
values:
controller:
resources:
limits:
cpu: 1
memory: 1Gi
requests:
cpu: 1
memory: 1Gi
serviceAccount:
annotations:
# terraform state show module.karpenter.aws_iam_role.controller\[0\] | grep " arn "
eks.amazonaws.com/role-arn: arn:aws:iam::468814281478:role/KarpenterController-20240527081538529900000002
[Contributor comment]: Same comment, EKS Pod Identity removes this hardcoding.
settings:
clusterName: prow-build-cluster
interruptionQueue: Karpenter-prow-build-cluster
tolerations:
- effect: NoSchedule
key: node-group
operator: Equal
value: stable
@@ -0,0 +1,23 @@
settings:
clusterName: prow-build-cluster
interruptionQueue: Karpenter-prow-build-cluster
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::468814281478:role/KarpenterController-20240527081538529900000002"

controller:
resources:
requests:
cpu: 1
memory: 1Gi
limits:
cpu: 1
memory: 1Gi

# default affinity rule works well in our case
# https://github.com/aws/karpenter-provider-aws/blob/main/charts/karpenter/values.yaml#L70
tolerations:
[Member comment]: Same here, we should check if we want to force Karpenter on the stable nodes.
- effect: NoSchedule
key: node-group
operator: Equal
value: stable
2 changes: 1 addition & 1 deletion infra/aws/terraform/prow-build-cluster/vpc.tf
@@ -24,7 +24,6 @@ limitations under the License.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 5.1"

name = "${var.cluster_name}-vpc"

cidr = var.vpc_cidr
@@ -65,6 +64,7 @@ module "vpc" {
public_subnet_tags = {
"kubernetes.io/role/elb" = 1
"kubernetes.io/cluster/${var.cluster_name}" = "owned"
"karpenter.sh/discovery" = var.cluster_name
[Member comment]: Can we add a comment explaining why this is required?
}

private_subnet_tags = {
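For context on the last comment: Karpenter discovers subnets and security groups through tag selector terms in its node class, so the `karpenter.sh/discovery` tag is what those selectors match on. A hypothetical `EC2NodeClass` for the 0.36 chart (the metadata name and node role are illustrative, not taken from this PR):

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default  # illustrative name
spec:
  amiFamily: AL2
  role: KarpenterNode-example  # illustrative; use the Terraform-created node role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prow-build-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prow-build-cluster
```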