Add a configuration knob to allow Pod to use different VPC SecurityGroups and Subnet #165

Merged
merged 2 commits into from
Sep 21, 2018

liwenwu-amazon
Contributor

Issue #131

Problem

Today, ipamD uses the SecurityGroups and Subnet of the node's primary ENI when allocating secondary ENIs for Pods.

Here are a few use cases that require Pods to use different SecurityGroups and Subnets than the primary ENI:

  • The primary ENI's subnet has a limited number of IP addresses available, which limits the number of Pods that can be created in the cluster.
  • For security reasons, Pods need to use different SecurityGroups and a different Subnet than the Node.
  • For security and availability reasons, some Pods in the cluster need to use one set of SecurityGroups and Subnet, whereas other Pods need a different set.

Pod's Custom Network Config

ENIConfig CRD

Here we define a new CRD, ENIConfig, that lets users configure the SecurityGroups and Subnet for the Pod network config:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: eniconfigs.crd.k8s.amazonaws.com
spec:
  scope: Cluster
  group: crd.k8s.amazonaws.com
  version: v1alpha1
  names:
    plural: eniconfigs
    singular: eniconfig
    kind: ENIConfig

Node Annotation

We will use Node's Annotation to indicate which ENIConfig will be used for this Node's Pod network.

kubectl annotate node <node-name> k8s.amazonaws.com/eniConfig=<ENIConfig name>

Default ENIConfig

If a node does not have the annotation k8s.amazonaws.com/eniConfig, it uses the ENIConfig named default.
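
The annotation lookup with its fallback could be sketched roughly as follows. This is only an illustration of the rule described above, not the ipamD code from this PR; the helper name eniConfigName is hypothetical, while the annotation key and the name default come from the description.

```go
package main

import "fmt"

// Annotation key and fallback name as given in the PR description.
const (
	eniConfigAnnotation = "k8s.amazonaws.com/eniConfig"
	defaultENIConfig    = "default"
)

// eniConfigName is a hypothetical helper: it returns the ENIConfig name
// stored in the node's annotations, or "default" when the annotation is
// absent or empty.
func eniConfigName(annotations map[string]string) string {
	if name, ok := annotations[eniConfigAnnotation]; ok && name != "" {
		return name
	}
	return defaultENIConfig
}

func main() {
	annotated := map[string]string{eniConfigAnnotation: "group1-pod-netconfig"}
	fmt.Println(eniConfigName(annotated)) // annotated node
	fmt.Println(eniConfigName(nil))       // node without any annotations
}
```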

Workflow

  • Set environment variable AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG to true
    • This will cause ipamD to use the SecurityGroups and Subnet in node's ENIConfig for ENI allocation
  • Create a VPC Subnet for the pod network config (e.g. subnet-0c4678ec01ce68b24)
  • Create VPC Security Groups for the pod network config (e.g. sg-066c7927a794cf7e7, sg-08f5f22cfb70d8405, sg-0bb293fc16f5c2058)
  • Create an ENIConfig custom resource, for example:
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
 name: group1-pod-netconfig
spec:
 subnet: subnet-0c4678ec01ce68b24
 securityGroups:
 - sg-066c7927a794cf7e7
 - sg-08f5f22cfb70d8405
 - sg-0bb293fc16f5c2058
  • Annotate a node to use ENIConfig group1-pod-netconfig
kubectl annotate node <node-xxx> k8s.amazonaws.com/eniConfig=group1-pod-netconfig
  • ipamD will then use this subnet and these security groups for ENI allocation
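
For readers who want to handle such a spec programmatically, here is a minimal sketch that decodes the two spec fields from the example manifest (in JSON form) and applies some basic sanity checks. The Go type, the function name parseENIConfigSpec, and the ID-prefix checks are assumptions made for illustration; they are not claimed to match the types or validation in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ENIConfigSpec mirrors the spec fields shown in the example manifest.
// The Go type itself is an assumption made for this sketch.
type ENIConfigSpec struct {
	Subnet         string   `json:"subnet"`
	SecurityGroups []string `json:"securityGroups"`
}

// parseENIConfigSpec decodes a JSON-encoded spec and applies illustrative
// checks on the ID prefixes (these checks are not taken from ipamD).
func parseENIConfigSpec(raw []byte) (ENIConfigSpec, error) {
	var spec ENIConfigSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		return ENIConfigSpec{}, err
	}
	if !strings.HasPrefix(spec.Subnet, "subnet-") {
		return ENIConfigSpec{}, fmt.Errorf("invalid subnet ID %q", spec.Subnet)
	}
	for _, sg := range spec.SecurityGroups {
		if !strings.HasPrefix(sg, "sg-") {
			return ENIConfigSpec{}, fmt.Errorf("invalid security group ID %q", sg)
		}
	}
	return spec, nil
}

func main() {
	raw := []byte(`{
		"subnet": "subnet-0c4678ec01ce68b24",
		"securityGroups": ["sg-066c7927a794cf7e7", "sg-08f5f22cfb70d8405", "sg-0bb293fc16f5c2058"]
	}`)
	spec, err := parseENIConfigSpec(raw)
	if err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	fmt.Println(spec.Subnet, spec.SecurityGroups)
}
```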

Behavior when there is an ENIConfig Configuration Change

If the user changes the ENIConfig definition (e.g. a different subnet or different security groups), or changes the node's annotation to point at a different ENIConfig, ipamD will use the new configuration when allocating new ENIs.

Alternative Solutions Considered But Not Adopted:

Use Environment for Pod Subnet and SecurityGroups

  • Every config change will trigger CNI/aws-node daemonSet rolling upgrade
  • All Pods in the cluster MUST use same Subnet and SecurityGroups

Use /etc/cni/net.d/aws.conf

  • In addition to issues mentioned above, you need to rebuild aws-vpc-cni docker image.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@stuartnelson3

Currently running this on a test cluster, seems to be working as advertised

curl http://localhost:61678/v1/env-settings > ${LOG_DIR}/env.output
curl http://localhost:61678/v1/networkutils-env-settings > ${LOG_DIR}/networkutils-env.output
curl http://localhost:61678/v1/ipamd-env-settings > ${LOG_DIR}/ipamd-env.output
curl http://localhost:61678/v1/eni-configs > ${LOG_DIR}/eni-configs.out
Contributor

Nit: Why does the eni-configs log file have a .out extension when all the others have .output?

ipamd/ipamd.go Outdated
@@ -43,11 +45,40 @@ const (
ipPoolMonitorInterval = 5 * time.Second
maxRetryCheckENI = 5
eniAttachTime = 10 * time.Second
defaultWarmENITarget = 1
nodeIPPoolReconcileInterval = 60 * time.Second
maxK8SRetries = 12
retryK8SInterval = 5 * time.Second
noWarmIPTarget = 0
Contributor

Maybe rename this to defaultWarmIPTarget and move it to line 62?

@stuartnelson3

I've noticed some unexpected behavior while using this:

I'm attempting to ping from one physical machine inside a non-AWS datacenter, to a pod running on a k8s cluster running on instances in EC2. The two are linked via a direct connect.

When pinging from the pod to the physical machine, the ping request and reply packets make it through fine:

13:52:40.197673 IP pod-host-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo request, id 968, seq 4, length 64
13:52:40.197708 IP bare-metal-XX.XX.XX.XX > pod-host-XX.XX.XX.XX: ICMP echo reply, id 968, seq 4, length 64

Note that I seem to be receiving packets from the EC2 instance itself, over eth0; tcpdump does not record the pod's IP address as the origin/destination of the request/reply messages. The pod is attached to device=3 according to the output of /v1/enis. Checking ip a on the EC2 instance, there are eth0, eth1, and eth2.

When pinging from the physical machine to the pod ip, the physical machine sees no replies.

If I tcpdump on the EC2 instance while pinging from the physical machine to the pod, I see requests coming in on eth2, but then leaving on eth0:

$ tcpdump -n -i eth2 icmp and host bare-metal-XX.XX.XX.XX
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
13:37:31.639213 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 1, length 64
13:37:32.693195 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 2, length 64
13:37:33.717124 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 3, length 64
13:37:34.741126 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 4, length 64
13:37:35.765154 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 5, length 64
$ tcpdump -n -i eth0 icmp and host bare-metal-XX.XX.XX.XX
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:37:46.265847 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 1, length 64
13:37:47.285361 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 2, length 64
13:37:48.309237 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 3, length 64

So it appears something is incorrectly configured with ip routes/rules, since the reply is leaving on the wrong device.

When I check the routes/rules:

$ ip rule show | grep pod-XX.XX.XX.XX
512:    from all to pod-XX.XX.XX.XX lookup main 
1536:    from pod-XX.XX.XX.XX lookup 2 
$ ip route show table 2
default via 10.132.0.1 dev eth1 
10.132.0.1 dev eth1  scope link 

According to this, it seems like traffic from the pod should in fact be leaving over eth1. This contradictory behavior is confirmed:

$ ip route get bare-metal-XX.XX.XX.XX from pod-XX.XX.XX.XX iif eniab037e126c3
bare-metal-XX.XX.XX.XX from pod-XX.XX.XX.XX via pod-host-XX.XX.XX.XX dev eth0 
    cache  iif eniab037e126c3
$ ip route get 10.132.0.1 from pod-XX.XX.XX.XX iif eniab037e126c3
10.132.0.1 from pod-XX.XX.XX.XX via pod-host-XX.XX.XX.XX dev eth0 
    cache  iif eniab037e126c3

I'm hoping that this isn't the expected behavior? Is there any further information I can provide in helping to debug this? Thanks for the great work, we're very excited to start using this feature!

@liwenwu-amazon
Contributor Author

@stuartnelson3 can you set AWS_VPC_K8S_CNI_EXTERNALSNAT to true and see if it works.

@mogren
Contributor

mogren commented Sep 21, 2018

LGTM!

Contributor

@mogren mogren left a comment

Make sure all the tests have passed with the latest changes.

@liwenwu-amazon liwenwu-amazon merged commit c30ede2 into aws:master Sep 21, 2018
@stuartnelson3

When I set AWS_VPC_K8S_CNI_EXTERNALSNAT to true, the container is unable to ping the baremetal machine, and the baremetal machine cannot ping the container.

The default setting, AWS_VPC_K8S_CNI_EXTERNALSNAT=false, allows the container to ping the baremetal machine, but the baremetal machine is not able to ping the container.

Do you have any suggestions on how to troubleshoot this further, or is there any more information I could add to my previous comment (#165 (comment))?

@liwenwu-amazon
Contributor Author

@stuartnelson3 What's your VPC topology? In other words, how is your baremetal machine connected to the Pods in the Pod's VPC? Are your pods using the Pod-specific subnet and security groups from this PR?

@stuartnelson3

My mistake! The routers in our baremetal datacenter were blocking the CIDR block of the secondary subnet. Setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true fixed the issue!

- crd.k8s.amazonaws.com
resources:
- "*"
- namespaecs

@liwenwu-amazon I realize this is already merged, but it looks like there's a typo in this ClusterRole rule: namespaecs -> namespaces, though with the wildcard above it this line may be redundant.

@sdavids13

Are there instructions on how this could be used in a multi-AZ autoscaling group? When a machine comes up, how can it know that if it comes up in us-east-1a it should use alternate subnet X (eniConfig=groupX-pod-netconfig), while anything launching in us-east-1b uses alternate subnet Y (eniConfig=groupY-pod-netconfig)? Also, where can you set the node annotation so that it is already configured before the node joins the cluster (i.e. you don't need to run kubectl annotate node <node-name> k8s.amazonaws.com/eniConfig=<ENIConfig name> after it has already joined the cluster)? An EC2 tag?

@bnutt

bnutt commented Oct 18, 2018

I am wondering this as well @sdavids13. My thought is that the user_data in the launch config for the autoscaling group could check which AZ the instance is in with curl http://169.254.169.254/latest/meta-data/placement/availability-zone. Based on the response, you could map it to an ENIConfig that contains a subnet in that AZ and then set the label for the node in kubelet on startup. If you use the Amazon AMI, you can pass the labels you want to the bootstrap script https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh.
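
The AZ lookup and mapping idea can be sketched as below. This is a rough illustration only: it assumes the metadata endpoint answers plain unauthenticated GET requests (as the curl command implies), and the AZ-to-ENIConfig table, with names borrowed from the question above, is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// Metadata URL quoted in the comment; assumed to answer plain GET requests.
const azMetadataURL = "http://169.254.169.254/latest/meta-data/placement/availability-zone"

// fetchAZ reads the instance's availability zone from the given URL.
func fetchAZ(url string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("metadata request failed: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

// eniConfigForAZ looks up the ENIConfig name for an AZ in a
// hypothetical, operator-maintained mapping.
func eniConfigForAZ(az string, azToConfig map[string]string) (string, error) {
	name, ok := azToConfig[az]
	if !ok {
		return "", fmt.Errorf("no ENIConfig mapped for availability zone %q", az)
	}
	return name, nil
}

func main() {
	azToConfig := map[string]string{
		"us-east-1a": "groupX-pod-netconfig",
		"us-east-1b": "groupY-pod-netconfig",
	}
	az, err := fetchAZ(azMetadataURL)
	if err != nil {
		fmt.Println("metadata unavailable (not on EC2?):", err)
		return
	}
	fmt.Println(eniConfigForAZ(az, azToConfig))
}
```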

Edit: Sorry, I realized you need annotations, not labels. I don't see a way to assign annotations in kubelet either, so it somehow needs to be done after the node has joined the cluster, which really isn't feasible since you may have nodes scaling up or down. If an EC2 tag could set the ENIConfig, it would be easy to specify which one to use for each subnet.

After testing this more, it is mainly usable when you run all your instances in one AZ: you can specify a default ENIConfig, and all your instances will automatically allocate IPs from the subnet specified there. However, if you want to run across different AZs, you need multiple ENIConfigs, say one for each AZ. If you don't have a default config, your nodes will not fall back to allocating pod IPs from the subnet the worker node was created in, so the worker node is not operational. This means you would need to annotate every node that comes up by hand if you cannot do it during node bootstrap. Would it be possible to have the default behavior just use the worker node subnet to allocate pod IPs even if the flag is enabled, instead of having to define a default ENIConfig? Or is there a programmatic approach you can recommend, @liwenwu-amazon, for annotating nodes? Or even switch to using labels?

@sdavids13

@liwenwu-amazon @bchav Could either of you please help explain how this feature can be used in a multi-AZ environment on node startup? Looking at the kubelet documentation there doesn't appear to be a way to specify annotations at node startup/registration time. Could you please provide a mechanism to allow us to set the eniConfig value somewhere in the node startup process?

@liwenwu-amazon
Contributor Author

@sdavids13 One note on eniConfig is that:

  • There is NO hard requirement that a node MUST be annotated at node startup time.
  • If a node comes up with NO eniConfig annotation, or NO matching eniConfig CRD is found, ipamD will NOT try to allocate any ENIs or IPs.
  • You can write an external controller to watch the node object and annotate the node based on your business needs.
  • Once the node is annotated with an eniConfig name and the matching eniConfig CRD is configured, ipamD will start allocating ENIs and IPs using the security groups and subnet specified in the eniConfig CRD.
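
The gating behavior in these notes could be modeled roughly like this. The sketch is an assumption-laden illustration, not ipamD's implementation: the type, the function selectENIConfig, and the in-memory map of known ENIConfigs are hypothetical, and it combines the notes above with the default-name fallback from the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

const eniConfigAnnotation = "k8s.amazonaws.com/eniConfig"

// ENIConfigSpec is a hypothetical stand-in for the CRD's spec.
type ENIConfigSpec struct {
	Subnet         string
	SecurityGroups []string
}

// errNoENIConfig signals that ENI/IP allocation should be skipped for now.
var errNoENIConfig = errors.New("no matching ENIConfig; skipping ENI allocation")

// selectENIConfig resolves the node's ENIConfig name (falling back to
// "default" when the annotation is missing) and returns the matching spec.
// When no matching ENIConfig exists, it returns errNoENIConfig.
func selectENIConfig(annotations map[string]string, configs map[string]ENIConfigSpec) (ENIConfigSpec, error) {
	name := annotations[eniConfigAnnotation]
	if name == "" {
		name = "default"
	}
	spec, ok := configs[name]
	if !ok {
		return ENIConfigSpec{}, errNoENIConfig
	}
	return spec, nil
}

func main() {
	configs := map[string]ENIConfigSpec{
		"group1-pod-netconfig": {Subnet: "subnet-0c4678ec01ce68b24"},
	}
	// Before annotation: no "default" ENIConfig exists, so nothing is allocated.
	_, err := selectENIConfig(nil, configs)
	fmt.Println("before annotation:", err)

	// After annotation: the matching ENIConfig is used.
	annotated := map[string]string{eniConfigAnnotation: "group1-pod-netconfig"}
	spec, err := selectENIConfig(annotated, configs)
	fmt.Println("after annotation:", spec.Subnet, err)
}
```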

@liwenwu-amazon
Contributor Author

@bnutt ,

"Would it be possible to have the default behavior to just use the worker node subnet to allocate pod ip's even if the flag is enabled instead of having to define a default eniconfig, or is there a programatic approach you can recommend @liwenwu-amazon on how to annotate nodes? Or even switch to use labels?"

Yes, today's default behavior (AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = false) is that all ENIs use the same security groups and subnet as the worker node.

@sdavids13

sdavids13 commented Oct 24, 2018

@liwenwu-amazon The problem with doing it after the worker node joins the cluster is that pods are already going to start being scheduled on the host. There will effectively be a race condition between when pods will be scheduled/run on the node and when the external "watcher" process can update the annotation. Then what ends up happening? The pods might be launched into the incorrect/undesired subnet and then you will need to kill all of those pods to then be rescheduled in the correct subnet? Please correct me if there is a better approach that won't allow pods to be scheduled on the host until the annotation is applied. (important to note that if you don't supply a "default" eniConfig then the critical "watcher" pod wouldn't be able to be deployed and hence will never allow any other pods to start, hence nothing would be able to be launched from a new cluster)

Alternatively could we provide an environment variable through kubelet (or a similar mechanism) that can be applied in the user-data script to provide the "default" eniConfig value if the annotation isn't present on the node via the k8s API? @liwenwu-amazon / @bchav

@liwenwu-amazon
Contributor Author

@sdavids13 I have a question with your deployment:

After a node has been annotated with an eniConfig, how do you prevent a new Pod (e.g. a pod that is supposed to use a different subnet) from being scheduled on this node?

@sdavids13

@liwenwu-amazon I'm not quite following your question. I would only want pods to be scheduled after the node has been annotated, so I'm not sure why I would want to prevent them from being scheduled after that. But to answer your question: you would generally cordon the node to prevent new pods from being scheduled. Unfortunately, I don't believe you can cordon the node when the node registers itself, and even if you could, you would still need to get the node watcher pod up and running in order to perform the annotation, hence we have a chicken-and-egg problem.

Taking a step back this is my goal:

  1. Configure a multi-AZ EKS cluster where the primary EC2 ENI/IP is in a routable subnet (to other peered VPCs) while all pod ENIs/IPs are run in a subnet in a different CIDR range/non-routable space in each corresponding AZ. This was described in the original issue.
  2. A cluster can be launched via a terraform script, install helm, various helm charts, and requires 0 human intervention throughout the process.
  3. Minimize/have 0 "false errors" coming from the cluster.

@liwenwu-amazon
Contributor Author

@sdavids13 Here is one way to achieve your goal:

  • Configure the node watcher pod to use hostNetwork: true, so that the node watcher can run before the CNI allocates ENIs and IPs.
  • If a pod is scheduled to a node before the node gets annotated with the eniConfig, this pod will NOT get an IP. After the node is annotated with the eniConfig, ipamD will start allocating IPs and ENIs. After this, the Pod will get an IP from the subnet configured in the eniConfig CRD.

Will this satisfy your requirement?

@ewbankkit
Contributor

@sdavids13 I have exactly the same scenario I need to solve.
@liwenwu-amazon My understanding of the proposed solution is:

  1. Set AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true in the aws-node Daemonset
  2. Deploy the node watcher pod with hostNetwork: true
  3. Start EC2 instance workers without the k8s.amazonaws.com/eniConfig=<ENIConfig name> annotation
  4. node watcher (and other pods with hostNetwork: true) will get scheduled and use the IP address of the worker node it's scheduled on
  5. node watcher runs a standard controller loop, ensuring Nodes are annotated with the "correct" ENIConfig annotation (see below)
  6. Once a worker node is annotated correctly, regular hostNetwork: false pods will run on it and use IP addresses based on the associated ENIConfig

Determining the "correct" ENIConfig could be done in a number of ways:

  • Using the failure-domain.beta.kubernetes.io/zone label on the node to determine the node's AZ and then looking up the corresponding ENIConfig in a ConfigMap
  • Using the node's ExternalID attribute (=EC2 instance ID) to determine the node's AZ by making EC2 API calls and then looking up the ENIConfig
  • Look at all the ENIConfigs registered and from the associated subnet ID build an AZ to ENIConfig map instead of using a ConfigMap
  • ...

Sound about right?
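
One pass of such a controller loop, using the zone-label option from the list above, might look like the sketch below. It is purely illustrative: the Node struct is a simplified stand-in for the Kubernetes Node object, the function pendingAnnotations and the AZ-to-ENIConfig map are hypothetical, and a real watcher would read nodes from and write annotations to the API server.

```go
package main

import "fmt"

const (
	zoneLabel           = "failure-domain.beta.kubernetes.io/zone"
	eniConfigAnnotation = "k8s.amazonaws.com/eniConfig"
)

// Node is a simplified, hypothetical view of a Kubernetes Node object.
type Node struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
}

// pendingAnnotations returns, per node name, the ENIConfig annotation value
// that still has to be applied. Nodes that already carry the right value, or
// whose zone has no mapping, are skipped.
func pendingAnnotations(nodes []Node, azToConfig map[string]string) map[string]string {
	pending := make(map[string]string)
	for _, n := range nodes {
		want, ok := azToConfig[n.Labels[zoneLabel]]
		if !ok {
			continue
		}
		if n.Annotations[eniConfigAnnotation] != want {
			pending[n.Name] = want
		}
	}
	return pending
}

func main() {
	azToConfig := map[string]string{
		"us-east-1a": "groupX-pod-netconfig",
		"us-east-1b": "groupY-pod-netconfig",
	}
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{zoneLabel: "us-east-1a"}},
		{
			Name:        "node-b",
			Labels:      map[string]string{zoneLabel: "us-east-1b"},
			Annotations: map[string]string{eniConfigAnnotation: "groupY-pod-netconfig"},
		},
	}
	for name, cfg := range pendingAnnotations(nodes, azToConfig) {
		fmt.Printf("annotate %s with %s=%s\n", name, eniConfigAnnotation, cfg)
	}
}
```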

@taylorb-syd
Contributor

taylorb-syd commented Feb 1, 2019

Just FYI for those of you who find yourself here trying to solve the problem that @sdavids13 raised.

Currently in the master build, which theoretically should be included in the 1.4.0 and later releases, the following PR was made:

Feature: ENIConfig set by custom annotation or label names #280

What this feature does is two things:

  • Expands the control of ENIConfig to a Label as well as an Annotation.
  • Adds control variables ENI_CONFIG_LABEL_DEF and ENI_CONFIG_ANNOTATION_DEF to change the controlling label/annotation from k8s.amazonaws.com/eniConfig to an arbitrary label/annotation.

The upside of this is that if you set ENI_CONFIG_LABEL_DEF to failure-domain.beta.kubernetes.io/zone and then create an ENIConfig for each Availability Zone in your VPC (e.g. us-east-1a and us-east-1b), it will automatically select the correct ENIConfig for your availability zone, without requiring a watcher, custom labels, or any external infrastructure.

Additionally, because the code is written to prefer an annotation over a label, you can annotate a node to override these "default for availability zone" ENIConfigs.
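
The precedence described here can be illustrated with a short sketch. It is not the code from #280: the helpers keyFromEnv and resolveENIConfig are hypothetical, and only the environment variable names, the default key, and the annotation-over-label rule are taken from the comment.

```go
package main

import (
	"fmt"
	"os"
)

// Default key for both the annotation and the label, per the comment above.
const defaultKey = "k8s.amazonaws.com/eniConfig"

// keyFromEnv returns the value of envVar, or fallback when it is unset or empty.
func keyFromEnv(envVar, fallback string) string {
	if v := os.Getenv(envVar); v != "" {
		return v
	}
	return fallback
}

// resolveENIConfig prefers the annotation over the label. It returns an
// empty string when neither is set on the node.
func resolveENIConfig(annotations, labels map[string]string, annotationKey, labelKey string) string {
	if name := annotations[annotationKey]; name != "" {
		return name
	}
	return labels[labelKey]
}

func main() {
	annotationKey := keyFromEnv("ENI_CONFIG_ANNOTATION_DEF", defaultKey)
	labelKey := keyFromEnv("ENI_CONFIG_LABEL_DEF", defaultKey)

	labels := map[string]string{labelKey: "us-east-1a"}
	annotations := map[string]string{annotationKey: "group1-pod-netconfig"}

	// Label only: the label value names the ENIConfig.
	fmt.Println(resolveENIConfig(nil, labels, annotationKey, labelKey))
	// Annotation present: it overrides the label.
	fmt.Println(resolveENIConfig(annotations, labels, annotationKey, labelKey))
}
```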

Edit: Changed expected release based upon comment by @mogren

@mogren
Contributor

mogren commented Feb 1, 2019

Unfortunately, 1.3.1 won't have this change. I created a tracking ticket on the AWS container roadmap board for the next CNI release.
