Antrea Proxy NodePort Service Support #79

weiqiangt commented Oct 14, 2020

Describe what you are trying to solve

We have already implemented ClusterIP Service support in Antrea, but Kube-Proxy is still needed to support NodePort Services. Since Kube-Proxy cannot be run for NodePort Services only, its ClusterIP Service calculations waste a lot of CPU cycles and memory. Once we implement NodePort Service support in Antrea Proxy and remove Kube-Proxy from the cluster, that overhead will be reduced. Furthermore, the traffic for watching Service resources should also decrease, and the pressure on the APIServer should be lower.

Describe the solution you have in mind

When any kind of Kubernetes Service is accessed, the traffic should always be DNATed to one Endpoint. Thus, we can reuse the Endpoint selection flows in OVS regardless of whether it is a NodePort Service or a ClusterIP Service. To achieve this, traffic that goes to the host must be redirected to OVS correctly. We therefore need to use IPTables to redirect NodePort traffic to OVS to complete the load balancing.

Describe how your solution impacts user flows

After this feature is implemented, we can in theory remove the Kube-Proxy Deployments.
We can also begin to consider how to set up Antrea without Kube-Proxy.

Describe the main design/architecture of your solution

From our prior experiments, IPTables performance degrades significantly when there are too many rules, so we should keep the number of IPTables rules as small as possible. By using IPSet, we can redirect traffic with only one IPTables rule, and the matching complexity will be O(1) since we can use a hash-type set. For each valid NodePort Service, there should be two entries in the NodePort IPSet. E.g. for a NodePort Service that uses node port 31091 and accepts TCP connections, there should be entries like:

127.0.0.1,tcp:31091
192.168.77.100,tcp:31091
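
As a minimal sketch of how such a set could be populated (the set name ANTREA-NODEPORT is hypothetical; the actual name and lifecycle of the set are implementation details):

ipset create ANTREA-NODEPORT hash:ip,port
ipset add ANTREA-NODEPORT 127.0.0.1,tcp:31091
ipset add ANTREA-NODEPORT 192.168.77.100,tcp:31091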

In the nat table, the iptables workflow looks like this:

Whether the traffic comes from a remote host or from the current host, once its destination matches an entry in the IPSet, we need to forward it to OVS. By DNATing it to the link-local address 169.254.169.254, we make it possible for the packets to be forwarded to OVS. For the forwarding to actually happen, we also need an IP route rule: 169.254.169.254/32 via 169.254.169.254 dev antrea-gw0 onlink. Traffic may be sent from 127.0.0.1, in which case we need to SNAT it so that the reply can be delivered. In the POSTROUTING chain of the nat table, we SNAT traffic that has source IP 127.0.0.1 to the Node IP.
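
A rough sketch of the nat rules and the route described above, reusing the hypothetical ANTREA-NODEPORT set from the earlier example and the Node IP 192.168.77.100; the exact chains and rule layout are illustrative, not the final implementation:

# Remote traffic is matched in PREROUTING; locally generated traffic in OUTPUT.
iptables -t nat -A PREROUTING -m set --match-set ANTREA-NODEPORT dst,dst -j DNAT --to-destination 169.254.169.254
iptables -t nat -A OUTPUT -m set --match-set ANTREA-NODEPORT dst,dst -j DNAT --to-destination 169.254.169.254
# SNAT loopback-sourced NodePort traffic to the Node IP so replies can return.
iptables -t nat -A POSTROUTING -s 127.0.0.1 -d 169.254.169.254 -j SNAT --to-source 192.168.77.100
# Route the virtual IP towards OVS through the Antrea gateway interface.
ip route add 169.254.169.254/32 via 169.254.169.254 dev antrea-gw0 onlink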

Once traffic enters OVS, it can take several different paths; let's discuss them case by case.

Topology cases

Here are the traffic paths of three complex cases:

- Request NodePort on the same node
- Request NodePort on the remote node
- Use a client in a Pod:
  - Request NodePort on the same node
  - Request NodePort on the remote node

Based on the discussion above, we need the following flows:

Virtual IP ARP responder

[]binding.Flow{
    // In the SpoofGuard table, allow ARP packets from the gateway port whose
    // target is the NodePort virtual IP and whose sender is the Node IP.
    c.pipeline[spoofGuardTable].BuildFlow(priorityNormal).MatchProtocol(binding.ProtocolARP).
        MatchInPort(gatewayOFPort).
        MatchARPTpa(NodePortVirtualIP).
        MatchARPSpa(nodeIP).
        Action().GotoTable(arpResponderTable).
        Cookie(c.cookieAllocator.Request(cookie.Service).Raw()).
        Done(),
    // In the ARP responder table, answer ARP requests for the NodePort virtual
    // IP with the global virtual MAC and send the reply back out the in-port.
    c.pipeline[arpResponderTable].BuildFlow(priorityNormal).MatchProtocol(binding.ProtocolARP).
        MatchARPOp(1).
        MatchARPTpa(NodePortVirtualIP).
        Action().Move(binding.NxmFieldSrcMAC, binding.NxmFieldDstMAC).
        Action().SetSrcMAC(globalVirtualMAC).
        Action().LoadARPOperation(2).
        Action().Move(binding.NxmFieldARPSha, binding.NxmFieldARPTha).
        Action().SetARPSha(globalVirtualMAC).
        Action().Move(binding.NxmFieldARPSpa, binding.NxmFieldARPTpa).
        Action().SetARPSpa(NodePortVirtualIP).
        Action().OutputInPort().
        Cookie(c.cookieAllocator.Request(cookie.Service).Raw()).
        Done(),
}

Flows that send NodePort packets which arrived from the tunnel back to the tunnel

[]binding.Flow{
    // Commit connections that come from the tunnel (peer Node IP to local Pod
    // CIDR) into CtZone and mark them with nodePortCTMark.
    ctStateTable.BuildFlow(priorityNormal).
        MatchProtocol(binding.ProtocolIP).
        MatchRegRange(int(marksReg), markTrafficFromTunnel, binding.Range{0, 15}).
        MatchSrcIP(tunnelPeerIP).
        MatchDstIPNet(localPodCIDR).
        Action().CT(true, ctStateTable.GetNext(), CtZone).
        LoadToMark(nodePortCTMark).
        CTDone().
        Done(),
    // Forward packets of connections carrying nodePortCTMark back out through
    // the tunnel to the peer Node.
    c.pipeline[l3ForwardingTable].BuildFlow(priorityNormal).
        MatchProtocol(binding.ProtocolIP).
        MatchCTMark(nodePortCTMark).
        Action().DecTTL().
        Action().LoadRegRange(int(portCacheReg), tunOFPort, ofPortRegRange).
        Action().LoadRegRange(int(marksReg), portFoundMark, ofPortMarkRange).
        Action().SetTunnelDst(tunnelPeerIP).
        Action().GotoTable(conntrackCommitTable).
        Cookie(c.cookieAllocator.Request(cookie.Service).Raw()).
        Done(),
}

Flow that sends NodePort packets which arrived from the gateway back to the gateway, using ServiceCTMark

// New, tracked Service connections that entered via the gateway are sent
// directly to the L2 forwarding output table, so the packets go back out
// through the gateway.
c.pipeline[conntrackCommitTable].BuildFlow(priorityHigh).
    MatchProtocol(binding.ProtocolIP).
    MatchCTMark(serviceCTMark).
    MatchCTStateNew(true).
    MatchCTStateTrk(true).
    MatchRegRange(int(marksReg), markTrafficFromGateway, binding.Range{0, 15}).
    Action().GotoTable(L2ForwardingOutTable).
    Done()

Alternative solutions that you considered

For the host traffic forwarding part, we could use alternatives such as eBPF or IPVS. But for now, I do not see any significant drawback to using IPTables.

Test plan

We can verify and protect this feature by using e2e tests.
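
As a rough illustration of the kind of check such an e2e test could perform, here is a minimal sketch; the package name, the checkNodePort helper, and the hard-coded Node IP and node port are hypothetical, and a real test would first create the NodePort Service and backend Pods through the e2e framework:

package e2e

import (
	"fmt"
	"net"
	"testing"
	"time"
)

// checkNodePort dials the given Node IP and node port and fails the test if
// the connection cannot be established, i.e. the DNAT/forwarding path is broken.
func checkNodePort(t *testing.T, nodeIP string, nodePort int) {
	addr := fmt.Sprintf("%s:%d", nodeIP, nodePort)
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		t.Fatalf("failed to connect to NodePort %s: %v", addr, err)
	}
	conn.Close()
}

func TestNodePortSameNode(t *testing.T) {
	// Hypothetical values matching the example above.
	checkNodePort(t, "192.168.77.100", 31091)
}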

Additional context

Since we use IPSet to match NodePort Services, the time complexity of matching should be O(1). The time complexity of OVS flow matching is also O(1), so performance should not decrease significantly. Moreover, since the number of IPTables rules will be reduced significantly, connection setup delay should decrease.

Based on this analysis, we believe the implementation will match or improve performance compared to Kube-Proxy.

As we can see, traffic from a Pod to a NodePort Service goes through a complex path. However, given what NodePort Services are designed for, Pod-to-NodePort traffic should not be a common use case. To keep the implementation clear and efficient for the common use cases, this approach is reasonable.

This draft only covers Linux Nodes; we still need to design the solution for Windows Nodes.
