Add a configuration knob to allow Pod to use different VPC SecurityGroups and Subnet #165
Currently running this on a test cluster; it seems to be working as advertised.
scripts/aws-cni-support.sh
Outdated
curl http://localhost:61678/v1/env-settings > ${LOG_DIR}/env.output
curl http://localhost:61678/v1/networkutils-env-settings > ${LOG_DIR}/networkutils-env.output
curl http://localhost:61678/v1/ipamd-env-settings > ${LOG_DIR}/ipamd-env.output
curl http://localhost:61678/v1/eni-configs > ${LOG_DIR}/eni-configs.out
Nit: Why does the eni-configs log file have a .out extension when all the others have .output?
ipamd/ipamd.go
Outdated
@@ -43,11 +45,40 @@ const (
ipPoolMonitorInterval = 5 * time.Second
maxRetryCheckENI = 5
eniAttachTime = 10 * time.Second
defaultWarmENITarget = 1
nodeIPPoolReconcileInterval = 60 * time.Second
maxK8SRetries = 12
retryK8SInterval = 5 * time.Second
noWarmIPTarget = 0
Maybe rename this to defaultWarmIPTarget and move it to line 62?
I've noticed some unexpected behavior while using this: I'm attempting to ping from a physical machine inside a non-AWS datacenter to a pod running on a k8s cluster on EC2 instances. The two are linked via a Direct Connect. When pinging from the pod to the physical machine, the ping request and reply packets make it through fine:
13:52:40.197673 IP pod-host-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo request, id 968, seq 4, length 64
13:52:40.197708 IP bare-metal-XX.XX.XX.XX > pod-host-XX.XX.XX.XX: ICMP echo reply, id 968, seq 4, length 64
To note is that I seem to be receiving packets from the EC2 instance itself, over
When pinging from the physical machine to the pod IP, the physical machine sees no replies. If I
$ tcpdump -n -i eth2 icmp and host bare-metal-XX.XX.XX.XX
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
13:37:31.639213 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 1, length 64
13:37:32.693195 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 2, length 64
13:37:33.717124 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 3, length 64
13:37:34.741126 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 4, length 64
13:37:35.765154 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 5, length 64
$ tcpdump -n -i eth0 icmp and host bare-metal-XX.XX.XX.XX
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:37:46.265847 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 1, length 64
13:37:47.285361 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 2, length 64
13:37:48.309237 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 3, length 64
So it appears something is incorrectly configured with the ip routes/rules, since the reply is leaving on the wrong device. When I check the routes/rules:
$ ip rule show | grep pod-XX.XX.XX.XX
512: from all to pod-XX.XX.XX.XX lookup main
1536: from pod-XX.XX.XX.XX lookup 2
$ ip route show table 2
default via 10.132.0.1 dev eth1
10.132.0.1 dev eth1 scope link
According to this, it seems like traffic from the pod should in fact be leaving over
$ ip route get bare-metal-XX.XX.XX.XX from pod-XX.XX.XX.XX iif eniab037e126c3
bare-metal-XX.XX.XX.XX from pod-XX.XX.XX.XX via pod-host-XX.XX.XX.XX dev eth0
cache iif eniab037e126c3
I'm hoping that this isn't the expected behavior? Is there any further information I can provide to help debug this? Thanks for the great work, we're very excited to start using this feature!
@stuartnelson3 can you set |
LGTM!
Make sure all the tests have passed with the latest changes.
When I set The default setting, Do you have any suggestions on how to troubleshoot this further, or is there any more information I could add to my previous comment (#165 (comment))?
@stuartnelson3 What's your VPC topology? In other words, how is your bare-metal machine connected to the Pods in the Pods' VPC? Are your pods using the pod-specific subnet and security groups from this PR?
my mistake! the routers in our baremetal datacenter were blocking the CIDR block of the secondary subnet. Setting |
- crd.k8s.amazonaws.com
resources:
- "*"
- namespaecs
@liwenwu-amazon I realize this is already merged, but it looks like there's a typo in this ClusterRole rule: namespaecs -> namespaces, though with the wildcard above it this line may be redundant.
Are there instructions on how this could be used in a multi-AZ autoscaling group? When a machine comes up how can it know that if it comes up in us-east-1a to use the alternate subnet X (eniConfig=groupX-pod-netconfig) while anything launching in us-east-1b uses alternate subnet Y (eniConfig=groupY-pod-netconfig). Also, where can you set the node annotation where it is already configured before it joins the cluster (i.e. you don't need to run |
I am wondering this as well @sdavids13. My thought is that you could have the user_data in the launch config for the autoscaling group check what AZ the instance is in by
Edit: Sorry, I realized you need annotations, not labels. I don't see a way to assign annotations in kubelet either, so it somehow needs to be done after the node has joined the cluster, which really isn't feasible since you may have nodes scaling up or down. If it could have an EC2 tag to set the ENIConfig, it would be easy to specify which to use for each subnet.
After testing this more, it is mainly usable when you run all your instances in one AZ; by doing this you could specify a default ENIConfig and all your instances would automatically allocate IPs from the subnet specified. However, if you want to run across different AZs, you would need to make multiple ENIConfigs, say one for each AZ. If you don't have a default config, your nodes will not fall back to allocating pod IPs from the subnet the worker node was created in, so your worker node is not operational. This means you would need to annotate every node that comes up by hand if you cannot do it during node bootstrap.
Would it be possible to have the default behavior to just use the worker node subnet to allocate pod IPs even if the flag is enabled, instead of having to define a default ENIConfig, or is there a programmatic approach you can recommend @liwenwu-amazon on how to annotate nodes? Or even switch to use labels?
@liwenwu-amazon @bchav Could either of you please help explain how this feature can be used in a multi-AZ environment on node startup? Looking at the kubelet documentation there doesn't appear to be a way to specify annotations at node startup/registration time. Could you please provide a mechanism to allow us to set the |
@sdavids13 One note on
|
@bnutt, "Would it be possible to have the default behavior to just use the worker node subnet to allocate pod IPs even if the flag is enabled, instead of having to define a default ENIConfig, or is there a programmatic approach you can recommend @liwenwu-amazon on how to annotate nodes? Or even switch to use labels?" Yes, today's default behavior (AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = false) is that all ENIs use the same security groups and subnet as the worker node.
@liwenwu-amazon The problem with doing it after the worker node joins the cluster is that pods are already going to start being scheduled on the host. There will effectively be a race condition between when pods will be scheduled/run on the node and when the external "watcher" process can update the annotation. Then what ends up happening? The pods might be launched into the incorrect/undesired subnet and then you will need to kill all of those pods to then be rescheduled in the correct subnet? Please correct me if there is a better approach that won't allow pods to be scheduled on the host until the annotation is applied. (important to note that if you don't supply a "default" eniConfig then the critical "watcher" pod wouldn't be able to be deployed and hence will never allow any other pods to start, hence nothing would be able to be launched from a new cluster) Alternatively could we provide an environment variable through kubelet (or a similar mechanism) that can be applied in the user-data script to provide the "default" |
@sdavids13 I have a question about your deployment: after a node has been annotated with an eniConfig, how do you prevent a new Pod (e.g. a pod that is supposed to use a different subnet) from being scheduled on this node?
@liwenwu-amazon I'm not quite following your question - I would only want pods to be scheduled after the node has been annotated, so I'm not sure why I would want to prevent them from being scheduled after being annotated. But to answer your question... you would generally Taking a step back this is my goal:
|
@sdavids13 Here is one way to achieve your goal:
Will this satisfy your requirement? |
@sdavids13 I have exactly the same scenario I need to solve.
Determining the "correct" ENIConfig could be done in a number of ways:
Sound about right? |
Just FYI for those of you who find yourself here trying to solve the problem that @sdavids13 raised. Currently in the master build, which theoretically should be included in the 1.4.0 and later releases, the following PR was made: Feature: ENIConfig set by custom annotation or label names #280 What this feature does is two things:
The upside of this is that if you set
Additionally, as the code is written to prefer an annotation over a label, this means that if you want to override the ENI Config, you can annotate to override these "default for availability zone".
Edit: Changed expected release based upon comment by @mogren
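The annotation-over-label preference described above can be sketched roughly as follows. This is a minimal illustration, not the code from #280: the helper name `resolveENIConfigName`, the passed-in key parameters, and the example label key `example.com/zone` are all assumptions made for the sketch.

```go
package main

import "fmt"

// resolveENIConfigName is a hypothetical helper (not taken from the plugin).
// It returns the ENIConfig name for a node, preferring the annotation value,
// then the label value, then the supplied fallback name.
func resolveENIConfigName(annotations, labels map[string]string, annotationKey, labelKey, fallback string) string {
	// An annotation, when present and non-empty, always wins.
	if name, ok := annotations[annotationKey]; ok && name != "" {
		return name
	}
	// Otherwise fall back to the label, e.g. one carrying a per-zone default.
	if name, ok := labels[labelKey]; ok && name != "" {
		return name
	}
	return fallback
}

func main() {
	// Assumed keys for illustration only.
	const annotationKey = "k8s.amazonaws.com/eniConfig"
	const labelKey = "example.com/zone"

	labels := map[string]string{labelKey: "us-east-1a"}

	// No annotation: the label value is used as the ENIConfig name.
	fmt.Println(resolveENIConfigName(nil, labels, annotationKey, labelKey, "default"))

	// Annotation present: it overrides the label-derived default.
	annotations := map[string]string{annotationKey: "groupX-pod-netconfig"}
	fmt.Println(resolveENIConfigName(annotations, labels, annotationKey, labelKey, "default"))
}
```

With this shape, a per-zone default comes from the label, and annotating a single node is enough to override it for that node.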
Unfortunately, 1.3.1 won't have this change. I created a tracking ticket on the AWS container roadmap board for the next CNI release.
Issue #131
Problem
Today ipamD uses the SecurityGroups and Subnet of the node's primary ENI when allocating secondary ENIs for Pods.
Here are a few use cases which require Pods to use different SecurityGroups and Subnets than the primary ENI:
Pod's Custom Network Config
ENIConfig CRD
Here we define a new CRD, ENIConfig, to allow users to configure the SecurityGroups and Subnet for the Pods' network config.
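As a rough Go sketch of the data such a resource carries, assuming a spec with just a subnet ID and a list of security group IDs: the type names, field names, JSON tags, and the `newENIConfig` helper below are illustrative assumptions rather than the definitions from this PR. The example values are the ones used in the Workflow section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ENIConfigSpec is an assumed shape for the spec: the subnet and the
// security groups that ENIs allocated for Pods should use.
type ENIConfigSpec struct {
	SecurityGroups []string `json:"securityGroups"`
	Subnet         string   `json:"subnet"`
}

// ENIConfig pairs a resource name with its spec (Kubernetes metadata is
// reduced to a bare name to keep the sketch self-contained).
type ENIConfig struct {
	Name string        `json:"name"`
	Spec ENIConfigSpec `json:"spec"`
}

// newENIConfig is a hypothetical constructor used only for this example.
func newENIConfig(name, subnet string, securityGroups ...string) ENIConfig {
	return ENIConfig{
		Name: name,
		Spec: ENIConfigSpec{SecurityGroups: securityGroups, Subnet: subnet},
	}
}

func main() {
	cfg := newENIConfig(
		"group1-pod-netconfig",
		"subnet-0c4678ec01ce68b24",
		"sg-066c7927a794cf7e7", "sg-08f5f22cfb70d8405", "sg-0bb293fc16f5c2058",
	)
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```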
Node Annotation
We will use a Node annotation to indicate which ENIConfig will be used for this Node's Pod network.
default ENIConfig
If a node does not have the annotation k8s.amazonaws.com/eniConfig, it will use the ENIConfig whose name is default.
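The fallback rule can be illustrated with a small Go sketch. The annotation key and the name `default` come from the description above; the helper name `eniConfigNameForNode` is hypothetical, and treating an empty annotation value like a missing one is an assumption of this sketch.

```go
package main

import "fmt"

// Annotation key and fallback name as given in the description.
const (
	eniConfigAnnotation  = "k8s.amazonaws.com/eniConfig"
	defaultENIConfigName = "default"
)

// eniConfigNameForNode is a hypothetical helper: it returns the ENIConfig
// name selected by the node's annotations, or "default" when the annotation
// is absent. An empty value is treated as absent (an assumption).
func eniConfigNameForNode(annotations map[string]string) string {
	if name, ok := annotations[eniConfigAnnotation]; ok && name != "" {
		return name
	}
	return defaultENIConfigName
}

func main() {
	annotated := map[string]string{eniConfigAnnotation: "group1-pod-netconfig"}
	fmt.Println(eniConfigNameForNode(annotated))           // annotated node
	fmt.Println(eniConfigNameForNode(map[string]string{})) // node without the annotation
}
```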
Workflow
ENIConfig for ENI allocation
(subnet-0c4678ec01ce68b24)
(sg-066c7927a794cf7e7, sg-08f5f22cfb70d8405, sg-0bb293fc16f5c2058)
group1-pod-netconfig
ipamD will use these for ENI allocation
Behavior when there is an ENIConfig Configuration Change
If a user changes the ENIConfig CRD definition (e.g. to use a different subnet or different security groups), or changes the node's annotation to use a different ENIConfig CRD, ipamD will use the new configuration when allocating new ENIs.
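One way to picture this behavior is a store that is read at allocation time, so a change only affects ENIs allocated afterwards. The sketch below is an assumption-laden illustration, not ipamD's implementation: `configStore`, `allocateENI`, and the placeholder subnet and ENI IDs are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// eniConfig is a simplified, assumed view of the active configuration.
type eniConfig struct {
	Subnet         string
	SecurityGroups []string
}

// configStore holds the configuration currently in effect for a node.
type configStore struct {
	mu      sync.RWMutex
	current eniConfig
}

// update replaces the active configuration, e.g. after the ENIConfig or the
// node annotation changes.
func (s *configStore) update(c eniConfig) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current = c
}

// get returns the configuration in effect right now.
func (s *configStore) get() eniConfig {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.current
}

// eni records which subnet an ENI was allocated in.
type eni struct {
	ID     string
	Subnet string
}

// allocateENI reads the configuration at allocation time, so only ENIs
// allocated after an update see the new subnet.
func allocateENI(s *configStore, id string) eni {
	cfg := s.get()
	return eni{ID: id, Subnet: cfg.Subnet}
}

func main() {
	store := &configStore{current: eniConfig{Subnet: "subnet-old"}}
	first := allocateENI(store, "eni-1")

	store.update(eniConfig{Subnet: "subnet-new"})
	second := allocateENI(store, "eni-2")

	// The earlier ENI keeps its subnet; the later one uses the new config.
	fmt.Println(first.Subnet, second.Subnet)
}
```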
Alternative Solutions Considered But Not Adopted:
Use Environment for Pod Subnet and SecurityGroups
same Subnet and SecurityGroups
Use /etc/cni/net.d/aws.conf
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.