Describe what you are trying to solve
We have already implemented ClusterIP Service support in Antrea, but Kube-Proxy is still required to support NodePort Services. Since Kube-Proxy cannot run only for NodePort Services, its ClusterIP Service processing wastes CPU cycles and memory. Once we implement NodePort support in Antrea Proxy and remove Kube-Proxy from the cluster, this overhead will be eliminated. Furthermore, the traffic generated by watching Service resources should also decrease, which lowers the pressure on the APIServer.
Describe the solution you have in mind
For every kind of Kubernetes Service, traffic that accesses it should always be DNATed to one Endpoint. Thus, we can reuse the Endpoint selection flows in OVS, no matter whether it is a NodePort Service or a ClusterIP Service. To achieve this, traffic that reaches the host must be redirected to OVS correctly. We therefore use IPTables to redirect NodePort traffic to OVS to complete the load balancing.
Describe how your solution impacts user flows
After we implement this feature, we should in theory be able to remove the Kube-Proxy deployment from the cluster.
We can also start to consider how to set up Antrea without Kube-Proxy.
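As a rough illustration, assuming kube-proxy was installed the kubeadm way, as a DaemonSet named kube-proxy in the kube-system namespace (an assumption; adjust to the actual installation), removing it could look like:

```bash
# Remove the kube-proxy DaemonSet (names assume a kubeadm-style install).
kubectl -n kube-system delete daemonset kube-proxy
# Optionally remove its ConfigMap as well, if one exists.
kubectl -n kube-system delete configmap kube-proxy
```

Any iptables rules previously installed by kube-proxy on each Node would still need to be cleaned up separately.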
Describe the main design/architecture of your solution
From our prior experiments, IPTables performance degrades significantly when there are too many rules, so we should keep the number of IPTables rules as small as possible. By using an IPSet, we only need one IPTables rule to redirect traffic, and the matching complexity is O(1) since we can use a set of the hash type. For each valid NodePort Service, there should be two entries in the NodePort IPSet. E.g., for a NodePort Service that has node port 31091 and accepts TCP connections, there should be entries like:
127.0.0.1,tcp:31091
192.168.77.100,tcp:31091
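As a minimal sketch (the set name ANTREA-NODEPORT is a placeholder, and 192.168.77.100 stands in for the Node IP from the example above), the set could be created and populated like this:

```bash
# Create a hash-type set keyed on (IP, protocol:port); lookups are O(1).
ipset create ANTREA-NODEPORT hash:ip,port
# Add one entry per local address for the example NodePort 31091/TCP.
ipset add ANTREA-NODEPORT 127.0.0.1,tcp:31091
ipset add ANTREA-NODEPORT 192.168.77.100,tcp:31091
```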
In the nat table, the IPTables workflow looks like this:
No matter whether traffic comes from a remote host or the current host, once its destination matches an entry in the set, we need to forward it to OVS. By DNATing it to the link-local address 169.254.169.254, the packets can be forwarded to OVS. To make the forwarding actually happen, we still need an IP route rule: 169.254.169.254/32 via 169.254.169.254 dev antrea-gw0 onlink. Traffic may be sent from 127.0.0.1, in which case we need to SNAT it so that replies can be routed back: in the POSTROUTING chain of the nat table, we SNAT traffic whose source IP is 127.0.0.1 to the Node IP.
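The following is a rough sketch of the host-side rules described above, not Antrea's actual implementation; the set name ANTREA-NODEPORT and the Node IP 192.168.77.100 are the same placeholders as before:

```bash
# Redirect matched NodePort traffic to the virtual link-local IP, both for
# traffic arriving from other hosts (PREROUTING) and generated locally (OUTPUT).
iptables -t nat -A PREROUTING -m set --match-set ANTREA-NODEPORT dst,dst \
  -j DNAT --to-destination 169.254.169.254
iptables -t nat -A OUTPUT -m set --match-set ANTREA-NODEPORT dst,dst \
  -j DNAT --to-destination 169.254.169.254
# SNAT loopback-originated traffic to the Node IP so that replies can return.
# (Sending 127.0.0.1-sourced packets off the loopback interface may also
# require enabling the route_localnet sysctl on the gateway interface.)
iptables -t nat -A POSTROUTING -s 127.0.0.1 -d 169.254.169.254 \
  -j SNAT --to-source 192.168.77.100
# Route the virtual IP towards OVS through the Antrea gateway interface.
ip route add 169.254.169.254/32 via 169.254.169.254 dev antrea-gw0 onlink
```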
Once traffic goes into OVS, it can take several different paths; let's discuss them case by case.
Topology cases
Here are the traffic paths of three complex cases:
Request NodePort on the same node
Request NodePort on the remote node
Use a client in a Pod:
  Request NodePort on the same node
  Request NodePort on the remote node
Based on the discussion above, we need the following flows:
A virtual IP ARP responder (a rough sketch is given below)
A flow that sends NodePort packets coming from the tunnel back to the tunnel
A flow that sends NodePort packets coming from the gateway back to the gateway, with ServiceCTMark
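As an illustration of the first flow only, a generic OVS ARP responder for the virtual IP could look roughly like the sketch below; the bridge name br-int, table number, priority, and MAC address aa:bb:cc:00:00:01 are placeholders and do not reflect Antrea's actual pipeline:

```bash
# Answer ARP requests for 169.254.169.254 (0xa9fea9fe) directly in OVS,
# using a placeholder MAC, and send the reply back out of the ingress port.
ovs-ofctl add-flow br-int "table=0,priority=200,arp,arp_tpa=169.254.169.254,arp_op=1,actions=\
move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],\
mod_dl_src:aa:bb:cc:00:00:01,\
load:0x2->NXM_OF_ARP_OP[],\
move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],\
load:0xaabbcc000001->NXM_NX_ARP_SHA[],\
move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],\
load:0xa9fea9fe->NXM_OF_ARP_SPA[],\
IN_PORT"
```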
Alternative solutions that you considered
For the host traffic forwarding part, we could use alternatives like eBPF or IPVS. But for now, I do not see any significant drawback to using IPTables.
Test plan
We can verify and protect this feature by using e2e tests.
Additional context
Since we use an IPSet to match NodePort Services, the time complexity of matching is O(1). As the time complexity of OVS flow matching is also O(1), overall performance should not decrease significantly. Moreover, since the number of IPTables rules is reduced significantly, connection setup latency should decrease.
Based on this analysis, we believe the implementation will match or improve upon Kube-Proxy's performance.
As we can see, traffic from a Pod to a NodePort Service goes through a complex path. But given what NodePort Services are designed for, Pod-to-NodePort access should not be a common use case. To keep the implementation clear and efficient for the common use cases, this trade-off is reasonable.
This draft only covers Linux Nodes; we still need to design a solution for Windows Nodes.