Fix Misc DSR Issues (#1174)
* refactor(NSC): consolidate constants to top

* fix(NSC): increase IPVS add service logging

* fix(NSC): improve logging for FWMark IPVS entries

* fix(NSC): add missing parameter to logging

* feat(NSC): generate unique FW marks

Because we trim the 32-bit FNV-1a hash to 16 bits, there is the potential
for FW marks to collide with each other even for unique inputs of IP,
protocol, and port. This change reduces that chance (up to the 16-bit
maximum of unique marks) by keeping track of which FW marks we've already
allocated and which IP, protocol, and port combination each was allocated
for.

Fixes #1045
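
As a rough illustration of the scheme described above, the sketch below uses only Go's standard-library `hash/fnv`; the type, method names, and the linear-probing fallback are illustrative, not kube-router's actual code:

```
package main

import (
	"fmt"
	"hash/fnv"
)

// markAllocator tracks which 16-bit FW marks are already in use and
// which ip/protocol/port tuple owns each one (hypothetical names).
type markAllocator struct {
	allocated map[uint32]string // fwmark -> owning "ip-protocol-port" key
}

// lookup hashes the tuple with 32-bit FNV-1a, trims the result to
// 16 bits, and probes forward past marks owned by other tuples.
func (m *markAllocator) lookup(ip, protocol string, port uint16) (uint32, error) {
	key := fmt.Sprintf("%s-%s-%d", ip, protocol, port)
	h := fnv.New32a()
	h.Write([]byte(key))
	mark := h.Sum32() & 0xFFFF // trim the 32-bit hash to 16 bits
	for i := 0; i < 1<<16; i++ {
		owner, taken := m.allocated[mark]
		if !taken || owner == key {
			m.allocated[mark] = key // remember the allocation
			return mark, nil
		}
		mark = (mark + 1) & 0xFFFF // collision with a different tuple: try the next value
	}
	return 0, fmt.Errorf("all 65536 FW marks are allocated")
}

func main() {
	m := &markAllocator{allocated: map[uint32]string{}}
	mark, _ := m.lookup("10.255.0.1", "tcp", 80)
	fmt.Printf("0x%x\n", mark)
}
```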

* refactor(NSC): move utility funcs to utils

* fix(NSC): reduce IPVS service shell outs

This also aligns it more closely with the almost identical function used
for non-FWMarked services, ipvsAddService(), which is also called from
setupExternalIPServices and is passed this same list of ipvsServices.

* fix(NSC): fix & consolidate DSR cleanup code

A lot of this is refactor work, but it's important to know why the DSR
mangle tables were not being cleaned up in the first place. When we
transitioned to iptables-save to look over the mangle rules, we didn't
realize that iptables-save changes the format of the marks from the
integer values the CLI works with to hexadecimal.

This meant that we never actually matched a mangle rule, and so left them
all behind. When these mangle rules were left behind, IPs that used to be
part of a DSR service were essentially black-holed on the system and were
no longer routable.

Fixes #1167
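
To see the mismatch concretely (the address, port, and mark value here are only illustrative):

```
# A mark rule added via the CLI uses a decimal mark value...
iptables -t mangle -A PREROUTING -d 10.255.0.1/32 -p tcp --dport 80 \
    -j MARK --set-mark 10250

# ...but iptables-save renders the same rule with the mark in hex,
# so a string match against "10250" never finds it:
iptables-save -t mangle | grep MARK
# -A PREROUTING -d 10.255.0.1/32 -p tcp --dport 80 -j MARK --set-xmark 0x280a/0xffffffff
```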

* doc(dsr): expand DSR documentation

fixes #1055

* ensure active service map is updated for non-DSR services

Co-authored-by: Murali Reddy <muralimmreddy@gmail.com>
aauren and murali-reddy authored Oct 14, 2021
1 parent 6a5a296 commit c3f90c5
Showing 8 changed files with 527 additions and 207 deletions.
125 changes: 125 additions & 0 deletions docs/dsr.md
@@ -0,0 +1,125 @@
# Direct Server Return

## More Information

For a more detailed explanation on how to use Direct Server Return (DSR) to build a highly scalable and available ingress for Kubernetes see the following blog post:
https://cloudnativelabs.github.io/post/2017-11-01-kube-high-available-ingress/

## What is DSR?

When enabled, DSR allows the service endpoint to respond directly to the client request, bypassing the service proxy. When DSR is enabled, kube-router uses LVS's tunneling mode to achieve this (more on how below).

## Quick Start

You can enable DSR functionality on a per service basis.

Requirements:
* The service is either a ClusterIP type service with an externalIP set on it, or a LoadBalancer type service
* kube-router has been started with `--service-external-ip-range` configured at least once. This option can be specified multiple times for multiple ranges, and the external IPs or LoadBalancer IPs must be included in these ranges (see the example flags after this list)
* kube-router must be run in service proxy mode with `--run-service-proxy` (this option defaults to `true` if left unspecified)
* If you are advertising the service outside the cluster `--advertise-external-ip` must be set
* If kube-router is deployed as a Kubernetes pod:
    * `hostIPC: true` must be set for the pod
    * `hostPID: true` must be set for the pod
    * The container runtime socket must be mounted into the kube-router pod via a `hostPath` volume mount.
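
For example, the flag-related portion of a kube-router invocation might look like the following (the CIDRs are illustrative; pick ranges that cover your external or LoadBalancer IPs):

```
- --run-service-proxy=true
- --advertise-external-ip
- --service-external-ip-range=198.51.100.0/24
- --service-external-ip-range=203.0.113.0/24
```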

To enable DSR you need to annotate the service with the `kube-router.io/service.dsr=tunnel` annotation:
```
kubectl annotate service my-service "kube-router.io/service.dsr=tunnel"
```

## Things To Look Out For
* In the current implementation, **DSR will only be available to the external IPs or LoadBalancer IPs**
* **The current implementation does not support port remapping**, so the service's port and target port must be the same (see the example service after this list)
* In order for DSR to work correctly, an `ipip` tunnel to the pod is used. This reduces the [MTU](https://en.wikipedia.org/wiki/Maximum_transmission_unit) for the packet by 20 bytes. Because of the way DSR works, it is not possible for clients to use [PMTU](https://en.wikipedia.org/wiki/Path_MTU_Discovery) to discover this MTU reduction. For TCP based services, we mitigate this by using iptables to set the [TCP MSS](https://en.wikipedia.org/wiki/Maximum_segment_size) value to 20 bytes less than the MTU of kube-router's primary interface. However, it is not possible to do this for UDP streams. Therefore, UDP streams that continuously use large packets may see a performance impact due to packet fragmentation. Additionally, if clients set the `DF` (Don't Fragment) bit, UDP services may see packet loss.
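
For reference, a minimal service that satisfies these constraints might look like the following (the names and address are illustrative; the externalIP must fall within a configured `--service-external-ip-range`):

```
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    kube-router.io/service.dsr: tunnel
spec:
  selector:
    app: my-app
  externalIPs:
  - 198.51.100.10
  ports:
  - port: 80
    targetPort: 80 # port remapping is unsupported, so these must match
```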

## Kubernetes Pod Examples
As mentioned previously, if kube-router is run as a Kubernetes DaemonSet, there are a couple of things needed in the manifest. Below is an example of what is necessary to get going (this is NOT a full manifest; it is just meant to highlight the elements needed for DSR):
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: kube-router
    tier: node
  name: kube-router
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-router
      tier: node
  template:
    metadata:
      labels:
        k8s-app: kube-router
        tier: node
    spec:
      hostNetwork: true
      hostIPC: true
      hostPID: true
      volumes:
      - name: run
        hostPath:
          path: /var/run/docker.sock
      ...
      containers:
      - name: kube-router
        image: docker.io/cloudnativelabs/kube-router:latest
        ...
        volumeMounts:
        - name: run
          mountPath: /var/run/docker.sock
          readOnly: true
        ...
```

For an example manifest please look at the [kube-router all features manifest](../daemonset/kubeadm-kuberouter-all-features-dsr.yaml) with DSR requirements for Docker enabled.

### DSR with containerd or cri-o

As of kube-router 1.2.X and later, kube-router's DSR mode works with non-Docker container runtimes. Officially, only containerd has been tested, but this solution should work with cri-o as well.

Most of what was said above also applies to non-Docker container runtimes; however, there are some adjustments that you'll need to make:
* You'll need to let kube-router know what container runtime socket to use via the `--runtime-endpoint` CLI parameter
* If running kube-router as a Kubernetes deployment, you'll need to make sure that you expose the correct socket via a `hostPath` volume mount

Here is an example kube-router DaemonSet manifest with just the changes needed to enable DSR with containerd (this is not a full manifest; it is just meant to highlight differences):
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-router
spec:
  template:
    spec:
      ...
      volumes:
      - name: containerd-sock
        hostPath:
          path: /run/containerd/containerd.sock
      ...
      containers:
      - name: kube-router
        args:
        - --runtime-endpoint=unix:///run/containerd/containerd.sock
        ...
        volumeMounts:
        - name: containerd-sock
          mountPath: /run/containerd/containerd.sock
          readOnly: true
        ...
```

## More Details About DSR

In order to facilitate troubleshooting, it is worthwhile to explain how kube-router accomplishes DSR functionality. The inspection commands after the list show where each step can be observed on a node.

1. kube-router adds iptables rules to the `mangle` table which marks incoming packets destined for DSR based services with a unique FW mark. This mark is then used in later stages to identify the packet and route it correctly. Additionally, for TCP streams, there are rules that enable [TCP MSS](https://en.wikipedia.org/wiki/Maximum_segment_size) since the packets will change MTU when traversing an ipip tunnel later on.
2. kube-router adds the marks to an `ip rule` (see: [ip-rule(8)](https://man7.org/linux/man-pages/man8/ip-rule.8.html)). This ip rule then forces the incoming DSR service packets to use a specific routing table.
3. kube-router adds a new `ip route` table (at the time of this writing the table number is `78`) which forces the packet to route to the local host even though there are no interfaces on the host that carry the DSR IP address.
4. kube-router adds an IPVS server configured for the custom FW mark. When packets arrive on the localhost interface because of the above `ip rule` and `ip route`, IPVS will intercept them based on their unique FW mark.
5. When pods selected by the DSR service become ready, kube-router adds endpoints configured for tunnel mode to the above IPVS server. Each endpoint is configured in tunnel mode (as opposed to masquerade mode), which then encapsulates the incoming packet in an ipip packet. It is at this point that the pod's destination IP is placed on the ipip packet header so that a packet can be routed to the pod via the kube-bridge on either this host or the destination host.
6. kube-router then finds the targeted pod and enters its network namespace. Once inside the pod's Linux network namespace, it sets up two new interfaces called `kube-dummy-if` and `ipip`. `kube-dummy-if` is configured with the externalIP address of the service.
7. When the ipip packet arrives inside the pod, the original source packet with the externalIP is then extracted from the ipip packet via the `ipip` interface and is accepted to the listening application via the `kube-dummy-if` interface.
8. When the application sends its response back to the client, it responds to the client's public IP address (since that is what it saw in the request's IP header) and the packet is returned directly to the client (as opposed to traversing the Kubernetes internal network and potentially making multiple intermediate hops).
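
The following commands are a useful starting point when troubleshooting the setup described above (the commands are illustrative; table `78` is the value at the time of writing):

```
# Steps 1: FW mark and TCPMSS rules in the mangle table (marks print in hex)
iptables-save -t mangle | grep -iE 'mark|tcpmss'

# Step 2: the ip rule that steers marked packets to the custom routing table
ip rule show

# Step 3: the custom routing table that delivers DSR traffic to the local host
ip route show table 78

# Steps 4 and 5: IPVS virtual servers keyed by FW mark, with tunnel-mode endpoints
ipvsadm -Ln
```
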
27 changes: 0 additions & 27 deletions docs/user-guide.md
@@ -225,33 +225,6 @@ If you want to also hairpin externalIPs declared for Service `my-service` (note,
kubectl annotate service my-service "kube-router.io/service.hairpin.externalips="
```

## Direct server return

Please read the blog post below on how to use DSR in combination with `--advertise-external-ip` to build a highly scalable and available ingress.
https://cloudnativelabs.github.io/post/2017-11-01-kube-high-available-ingress/

You can enable DSR (Direct Server Return) functionality per service. When enabled, the service endpoint
will respond directly to the client, bypassing the service proxy. When DSR is enabled, kube-router
will use LVS's tunneling mode to achieve this.

To enable DSR you need to annotate the service with the `kube-router.io/service.dsr=tunnel` annotation. For example:

```
kubectl annotate service my-service "kube-router.io/service.dsr=tunnel"
```

**In the current implementation, when the annotation is applied to the service, DSR will be applicable only to the external IPs.**

**Also, when DSR is used, the current implementation does not support port remapping, so you need to use the same port and target port for the service.**

You will need to enable `hostIPC: true` and `hostPID: true` in the kube-router daemonset manifest.
Also, the host path `/var/run/docker.sock` must be made a volume mount for kube-router.

The above changes are required for kube-router to enter the pod namespace, create the ipip tunnel in the pod,
and assign the external IP to the VIP.

For an example manifest please look at this [manifest](../daemonset/kubeadm-kuberouter-all-features-dsr.yaml) with DSR requirements enabled.

## SNATing Service Traffic
By default, as traffic ingresses into the cluster, kube-router will source NAT the traffic to ensure symmetric routing if it needs to proxy that traffic to get it to a node with a service pod capable of serving it. This can cause issues when network policies are applied to that service, since the traffic will appear to come from a node in your cluster instead of from the traffic's originator.

