Kubernetes imposes the following fundamental requirements:
- All containers can communicate with all other containers without NAT
- All nodes can communicate with all containers (and vice-versa) without NAT
- The IP address that a container sees itself as is the same IP address that others see it as
To run Kubernetes over AWS VPC, we would like to achieve the following additional goals:
- Networking for Pods must support high throughput and availability, low latency and minimal jitter comparable to the characteristics a user would get from EC2 networking.
- Users must be able to express and enforce granular network policies and isolation comparable to those achievable with native EC2 networking and security groups.
- Network operation must be simple and secure. Users must be able to apply existing AWS VPC networking and security best practices for building Kubernetes clusters over AWS VPC. For example, a user should be able to:
- use VPC flow logs for troubleshooting and compliance auditing,
- apply VPC routing polices for traffic engineering,
- apply security groups to enforce isolation and meet regulatory requirements.
- Pod networking should be set up in a matter of seconds.
- Administrators should be able to scale clusters up to 2000 nodes.
Here we propose:
- For each Kubernetes node (EC2 instance), create multiple elastic network interfaces (ENIs) and allocate their secondary IP addresses.
- For each pod, pick a free secondary IP address, assign it to the pod, and wire the host and pod networking stacks to allow:
- Pod-to-Pod communication on a single host
- Pod-to-Pod communication between different hosts
- Pod communication with other AWS services
- Pod communication with on-premises data centers
- Pod communication with the internet
In EC2-VPC, each instance can have multiple elastic network interfaces (ENIs), and each ENI can have multiple IPv4 or IPv6 addresses. The EC2-VPC fabric delivers any packet destined to one of these addresses to the instance.
An elastic network interface is a virtual network interface that you can attach to an instance in a VPC. When an ENI is attached to an instance, a corresponding interface is created on the instance. The ENI's primary IP address is automatically assigned to that interface; all secondary addresses remain unassigned, and it is up to the host owner to decide how to configure them.
When you create a VPC, you must specify a range of IPv4 addresses for the VPC in the form of a Classless Inter-Domain Routing (CIDR) block; for example, 10.0.0.0/16. This is the primary CIDR block for your VPC.
A VPC spans all the Availability Zones in the region. After creating a VPC, you can add one or more subnets in each Availability Zone. When you create a subnet, you specify the CIDR block for the subnet, which is a subset of the VPC CIDR block. Each subnet must reside entirely within one Availability Zone and cannot span zones. Availability Zones are distinct locations that are engineered to be isolated from failures in other Availability Zones. By launching instances in separate Availability Zones, you can protect your applications from the failure of a single location. We assign a unique ID to each subnet.
When you create an ENI, you must specify the subnet. Based on the instance type's limits, each instance can have up to N ENIs, and each ENI can have up to M IP addresses. One address on each ENI is always marked as primary, while the others are secondary addresses. The primary IP address is not used by any Pods, as it is required for routing traffic between Pods and the outside world.
The maximum number of IP addresses available to Pods on each EC2 instance can be calculated as:
Max IPs = min((N * M - N), subnet's free IP)
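For example, for a hypothetical instance type that allows N = 3 ENIs with M = 10 IP addresses per ENI, and a subnet with plenty of free IPs:
Max IPs = min((3 * 10 - 3), subnet's free IP) = 27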
Here is an overview of how we set up the host-side and pod-side network stacks to allow pod-to-pod communication.
Inside the Pod's network namespace:
IP address
/ # ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if231: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 56:41:95:26:17:41 brd ff:ff:ff:ff:ff:ff
inet 10.0.97.30/32 brd 10.0.97.226 scope global eth0 <<<<<<< ENI's secondary IP address
valid_lft forever preferred_lft forever
inet6 fe80::5441:95ff:fe26:1741/64 scope link
valid_lft forever preferred_lft forever
routes
# ip route show
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0
static arp
# arp -a
? (169.254.1.1) at 2a:09:74:cd:c4:62 [ether] PERM on eth0
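The pod-side state above can also be inspected from the host with nsenter, assuming the PID of any process inside the Pod is known (the PID 4567 below is hypothetical):
/* inspect the Pod's network namespace from the host; 4567 is a hypothetical PID of a process in the Pod */
nsenter -t 4567 -n ip addr show
nsenter -t 4567 -n ip route show
nsenter -t 4567 -n arp -a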
On the host side, multiple routing tables are used to route the Pod's incoming and outgoing traffic.
- The main (toPod) route table is used to route traffic destined to Pods:
# ip route show
default via 10.0.96.1 dev eth0
10.0.96.0/19 dev eth0 proto kernel scope link src 10.0.104.183
10.0.97.30 dev aws8db0408c9a8 scope link <------------------------Pod's IP
10.0.97.159 dev awsbcd978401eb scope link
10.0.97.226 dev awsc2f87dc4cdd scope link
10.0.102.98 dev aws4914061689b scope link
...
- Each ENI has its own route table, which is used to route the outgoing traffic of Pods that were allocated one of that ENI's secondary IP addresses:
# ip route show table eni-1
default via 10.0.96.1 dev eth1
10.0.96.1 dev eth1 scope link
- Here are the routing rules that enforce policy routing (a sketch of how these could be installed follows the listing):
# ip rule list
0: from all lookup local
512: from all to 10.0.97.30 lookup main <---------- to Pod's traffic
1025: not from all to 10.0.0.0/16 lookup main
1536: from 10.0.97.30 lookup eni-1 <-------------- from Pod's traffic
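As a sketch, the per-ENI route table and the policy-routing rules above could be installed with commands like the following. This assumes the example Pod IP 10.0.97.30, a secondary ENI eth1, and a route table named eni-1 that has been registered in /etc/iproute2/rt_tables; the priorities simply mirror the output above:
/* populate the per-ENI route table for eth1 */
ip route add default via 10.0.96.1 dev eth1 table eni-1
ip route add 10.0.96.1 dev eth1 scope link table eni-1
/* traffic destined to the Pod looks up the main table (host route via the veth) */
ip rule add to 10.0.97.30/32 priority 512 table main
/* traffic originating from the Pod looks up the ENI-specific table */
ip rule add from 10.0.97.30/32 priority 1536 table eni-1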
Here are the wiring steps to enable pod-to-pod communication:
- Create a veth pair, with one end in the host namespace and the other in the Pod's namespace:
ip link add veth-1 type veth peer name veth-1c    /* on host namespace */
ip link set veth-1c netns ns1                     /* move veth-1c to Pod's namespace ns1 */
ip link set veth-1 up                             /* bring up veth-1 */
ip netns exec ns1 ip link set veth-1c up          /* bring up veth-1c */
- Get a secondary IP address assigned to the instance and perform the following inside the Pod's namespace:
- Assign this IP address to Pod's eth0
- Add default gateway and default route to Pod's route table
- Add a static ARP entry for default gateway
/* To assign IP address 20.0.49.215 to Pod's namespace ns1 */
ip netns exec ns1 ip addr add 20.0.49.215/32 dev veth-1c              /* assign an IP address to veth-1c */
ip netns exec ns1 ip route add 169.254.1.1 dev veth-1c                /* add default gateway */
ip netns exec ns1 ip route add default via 169.254.1.1 dev veth-1c    /* add default route */
ip netns exec ns1 arp -i veth-1c -s 169.254.1.1 <veth-1's mac>        /* add static ARP entry for default gateway */
- On the host side, add a host route so that traffic destined to the Pod can be routed to it:
/* Pod's IP address is 20.0.49.215 */
ip route add 20.0.49.215/32 dev veth-1    /* add host route */
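To sanity-check the wiring, ip route get can be run in both namespaces; this is a quick verification sketch reusing the example names and addresses above:
/* from the host: traffic to the Pod IP should leave via the host end of the veth pair */
ip route get 20.0.49.215
/* expected output similar to: 20.0.49.215 dev veth-1 src <node's primary IP> */
/* from the Pod's namespace: the default route should point at the dummy gateway */
ip netns exec ns1 ip route get 8.8.8.8
/* expected output similar to: 8.8.8.8 via 169.254.1.1 dev veth-1c src 20.0.49.215 */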
For Pod-to-external (outside the VPC) traffic, we will program iptables to SNAT using the Primary IP address on the Primary ENI.
Here is the NAT rule:
-A POSTROUTING ! -d <VPC-CIDR> -m comment --comment "kubernetes: SNAT for outbound traffic from cluster" -m addrtype ! --dst-type LOCAL -j SNAT --to-source <Primary IP on the Primary ENI>
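As a concrete sketch, assuming the VPC CIDR 10.0.0.0/16 and taking 10.0.104.183 (the src address in the main route table shown earlier) as the Primary IP on the Primary ENI, the rule could be installed with:
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/16 -m comment --comment "kubernetes: SNAT for outbound traffic from cluster" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 10.0.104.183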
To keep Pod networking setup time short, we will run a small, single binary (L-IPAM) on each host to maintain a warm pool of available secondary IP addresses. Whenever Kubelet receives an ADD pod request, L-IPAM can immediately take one available secondary IP address from its warm pool and assign it to the Pod.
L-IPAM learns the available ENIs and their secondary IP addresses from the EC2 instance's metadata service.
// retrieve all attached ENIs
curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/
// retrieve all IPv4 addresses on an ENI
curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:da:9d:51:47:28/local-ipv4s
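L-IPAM can also learn each ENI's subnet and security groups from the same metadata tree; these are needed later when it creates new ENIs. The paths below are standard EC2 instance metadata, with the MAC address reused from the example above:
// retrieve the subnet of an ENI
curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:da:9d:51:47:28/subnet-id
// retrieve the security groups of an ENI
curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:da:9d:51:47:28/security-group-ids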
Whenever the L-IPAM daemon restarts (e.g. for an upgrade), it also queries the local Kubelet introspection service to get information about currently running Pods, such as Pod name, Pod namespace, and Pod IP address.
curl --stderr /dev/null http://localhost:10255/pods
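As a sketch (assuming jq is available and the Kubelet read-only port 10255 is enabled), the relevant fields can be extracted from the returned PodList like this:
curl --stderr /dev/null http://localhost:10255/pods | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.status.podIP)"'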
With the information from these two sources, L-IPAM can build a warm pool that contains all available secondary IP addresses on the instance.
L-IPAM monitors the size of the secondary IP address warm pool:
- whenever the number of available IP addresses goes below a configured min threshold, L-IPAM will:
- create a new ENI and attach it to instance
- allocate all available IP addresses on this new ENI
- once these IP addresses become available on the instance (confirmed through instance's metadata service), add these IP addresses to warm-pool.
- whenever the number of available IP addresses exceeds a configured max threshold, L-IPAM will pick an ENI whose secondary IP addresses are all in the warm pool, detach that ENI, and release it back to the EC2-VPC ENI pool.
Fragmentation of address assignments across ENIs may prevent freeing ENIs even when there are many unused IP addresses.
Right now, ENIs are allocated and freed by L-IPAM on each instance. All ENIs on an instance share the same subnet and the same security groups.
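As a sketch of the EC2 operations involved, shown here as AWS CLI commands (L-IPAM would make the equivalent EC2 API calls through the SDK; all identifiers in angle brackets are placeholders):
// grow the pool: create a new ENI in the node's subnet and security groups, attach it, and allocate its secondary IPs
aws ec2 create-network-interface --subnet-id <subnet-id> --groups <security-group-id>
aws ec2 attach-network-interface --network-interface-id <eni-id> --instance-id <instance-id> --device-index <index>
aws ec2 assign-private-ip-addresses --network-interface-id <eni-id> --secondary-private-ip-address-count <M-1>
// shrink the pool: detach and then delete an ENI whose secondary IPs are all unused
aws ec2 detach-network-interface --attachment-id <eni-attachment-id>
aws ec2 delete-network-interface --network-interface-id <eni-id>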
When a Pod is deleted, we will keep its IP address in "cooling mode" for a cooling period. When the cooling period expires, we will return the IP to the warm pool. The cooling period prevents the CNI plugin from recycling this Pod's IP address and assigning it to a new Pod before the controller has finished updating all nodes in the cluster about the deleted Pod.
L-IPAM will have a command to show:
- Number of currently allocated IP addresses
- How many addresses are still available
- Detailed information about each allocated IP address, such as pod name and namespace
curl http://localhost:6666/ip-manager
We will use gRPC for inter-process communication between CNI-plugin and L-IPAM.
L-IPAM runs as a DaemonSet on all nodes.
In the future, we would like to investigate whether to have a cluster ENI manager which manages allocating and freeing ENIs for all worker nodes in the cluster. Here are a few benefits of having a cluster ENI manager:
- Administrators can remove EC2 ENI write permission from worker nodes.
- It may be easier to troubleshoot.
- It may be easier to allow ENIs with different security groups and subnets to be attached to a single worker node.