zalando-incubator/kubernetes-on-aws

Kubernetes on AWS

WORK IN PROGRESS

This repo contains configuration templates to provision Kubernetes clusters on AWS using CloudFormation and Ubuntu Linux.

Many values are parameterized, and the values themselves are not always visible in this repository. We are focused on solving our own, Zalando-specific use case. However, we are open to ideas from the community about turning this into a project that provides general value to others. Please contact us via our Issues Tracker with your thoughts and suggestions.

The configuration in this repository was initially based on kube-aws, but it now depends on four components, not all of which are open source yet:

  • Cluster Registry to keep desired cluster states (e.g. the config channel and version in use)
  • Cluster Lifecycle Manager to provision the cluster's CloudFormation stack and apply Kubernetes manifests for system components
  • Cluster Lifecycle Controller that handles rolling updates from inside the cluster, for example node termination
  • Authnz Webhook to validate OAuth tokens and authorize access

Learn more about Zalando's cloud native journey by reading the Zalando Case Study on kubernetes.io. See our Running Kubernetes in Production on AWS document for details on the setup.

Features

  • Highly available master nodes (ASG) behind ELB
  • Worker Auto Scaling Group with node pools support
  • Flannel overlay networking
  • Cluster autoscaling (using cluster-autoscaler)
  • Kubernetes DNS with node-local dnsmasq as daemonset and CoreDNS resolver for cluster.local domain running in the same pod.
  • Route53 DNS integration via External DNS
  • AWS IAM integration via kube2iam, AWS OIDC IAM
  • Standard components are installed: node exporter, kube-state-metrics, see also cluster/manifests directory
  • Webhook authentication and authorization (roles "ReadOnly", "PowerUser", "Manual", "Emergency", "Administrator")
  • Emergency access via the internal emergency-access-service, which grants the roles "Manual" and "Emergency" under the four-eyes principle with audit logging
  • Log shipping via Scalyr
  • Full Ingress support with ALB/NLB and TLS integration via kube-ingress-aws-controller and HTTP routing via skipper
  • Enhanced usability with managed stacks and blue green deployments via stackset-controller and skipper
  • Fabric API Gateway, which can be used in combination with stackset-controller
  • Static Egress IPs to route through NAT Gateways with Elastic IPs via kube-static-egress-controller
  • Horizontal Pod Autoscaling with scaling by requests per second, SQS queue size or other metrics via kube-metrics-adapter
  • Vertical Pod Autoscaling, for example to scale Prometheus
  • EFS support
  • GPU support
  • etcd backup via a Kubernetes CronJob that takes an etcdctl snapshot and uploads it to S3
  • Monitoring via Prometheus and OpenTracing
  • Fully automated cluster updates via Cluster Lifecycle Manager
  • Automated downscaling for test clusters with kube-downscaler
  • Fallback node pools
  • Spot node pool integration
  • Automated PDB creation with pdb-controller

Notes

  • Node and user authentication is done via tokens (using the webhook feature)
  • SSL client-cert authentication is disabled
  • Many values are hardcoded
  • Secrets (e.g. shared token) are not KMS-encrypted in the cluster

Assumptions

  • The AWS account has one or more hosted zones in Route53 including a proper SSL cert (you can use the free ACM service)
  • The VPC has at least one public subnet per AZ (either AWS default VPC setup or public subnet named "dmz-<REGION>-<AZ>")
  • The VPC is in region eu-central-1 or eu-west-1
  • etcd cluster is available via DNS discovery (SRV records) at etcd.<YOUR-HOSTED-ZONE>
  • OAuth Token Info is available to validate user tokens
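
The etcd assumption above can be made concrete with a small sketch. It assumes the record names follow etcd's standard DNS discovery convention (`_etcd-server._tcp.<domain>` and `_etcd-client._tcp.<domain>`) and that the discovery domain is `etcd.<YOUR-HOSTED-ZONE>` as stated; the helper name `etcd_srv_names` and the zone `example.org` are hypothetical, and whether this setup uses the plain or the `-ssl` record variants is not stated here.

```python
def etcd_srv_names(hosted_zone: str) -> dict:
    """Return the SRV record names that etcd DNS discovery would look up.

    Assumption: the discovery domain is etcd.<hosted_zone>, and the names
    follow etcd's plain (non-SSL) DNS discovery convention.
    """
    # Tolerate a fully qualified zone name with a trailing dot.
    domain = f"etcd.{hosted_zone.strip('.')}"
    return {
        "server": f"_etcd-server._tcp.{domain}",  # peer (member-to-member) endpoints
        "client": f"_etcd-client._tcp.{domain}",  # client endpoints
    }


if __name__ == "__main__":
    # "example.org" is a placeholder hosted zone, not a real deployment.
    for role, name in etcd_srv_names("example.org").items():
        print(f"{role}: {name}")
```

The printed names can be checked against Route53 (for example with `dig SRV <name>`) to confirm that the records exist before provisioning a cluster.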

Directory Structure

  • cluster/cluster.yaml: CloudFormation template for the cluster (will be applied by Cluster Lifecycle Manager)
  • cluster/config-defaults.yaml: Default values for different kinds of use that can be overridden by values from our cluster-registry (will be applied by Cluster Lifecycle Manager)
  • cluster/etcd-cluster.yaml: Senza CloudFormation template to deploy etcd
  • cluster/manifests: Kubernetes manifests for system components (will be applied by Cluster Lifecycle Manager)
  • cluster/node-pools: CloudFormation template files and userdata (cloud-init) for ContainerLinux node-pools (will be applied by Cluster Lifecycle Manager)
  • docs: Extracts from internal Zalando documentation
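
The relationship between cluster/config-defaults.yaml and the cluster-registry values can be pictured as a simple override of defaults. The sketch below is only an illustration under assumptions: the function name `effective_config` and the keys shown are hypothetical, and the actual merge logic in Cluster Lifecycle Manager may differ (for example in how nested values are handled).

```python
def effective_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults with per-cluster overrides applied on top.

    Assumption: a flat, key-by-key override. Neither input is modified.
    """
    merged = dict(defaults)   # copy so the defaults stay untouched
    merged.update(overrides)  # registry values win over defaults
    return merged


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    defaults = {"example_feature_enabled": "false", "example_replicas": "2"}
    registry_values = {"example_feature_enabled": "true"}
    print(effective_config(defaults, registry_values))
```

Keeping the defaults in the repository and the overrides in the registry means a cluster only needs to record the values in which it differs.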