From 1d68a02353fdb11d6e4cff6469dcb5f6a7f560e3 Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Wed, 8 Dec 2021 17:43:33 +0000 Subject: [PATCH 1/6] Migrate reference details out Service concept Migrate away: - details of virtual IP mechanism for Services - detailed information about protocols for Services (UDP, TCP, SCTP) Co-authored-by: Antonio Ojea Co-authored-by: Qiming Teng --- .../concepts/services-networking/service.md | 398 +----------------- .../en/docs/reference/networking/_index.md | 10 + .../reference/networking/service-protocols.md | 124 ++++++ .../docs/reference/networking/virtual-ips.md | 337 +++++++++++++++ 4 files changed, 493 insertions(+), 376 deletions(-) create mode 100644 content/en/docs/reference/networking/_index.md create mode 100644 content/en/docs/reference/networking/service-protocols.md create mode 100644 content/en/docs/reference/networking/virtual-ips.md diff --git a/content/en/docs/concepts/services-networking/service.md b/content/en/docs/concepts/services-networking/service.md index 56b3c8fa9ef44..28be93a444eb7 100644 --- a/content/en/docs/concepts/services-networking/service.md +++ b/content/en/docs/concepts/services-networking/service.md @@ -145,14 +145,16 @@ spec: targetPort: http-web-svc ``` + This works even if there is a mixture of Pods in the Service using a single configured name, with the same network protocol available via different port numbers. This offers a lot of flexibility for deploying and evolving your Services. For example, you can change the port numbers that Pods expose in the next version of your backend software, without breaking clients. -The default protocol for Services is TCP; you can also use any other -[supported protocol](#protocol-support). +The default protocol for Services is +[TCP](/docs/reference/networking/service-protocols/#protocol-tcp); you can also +use any other [supported protocol](/docs/reference/networking/service-protocols/). As many Services need to expose more than one port, Kubernetes supports multiple port definitions on a Service object. @@ -316,150 +318,6 @@ This field follows standard Kubernetes label syntax. Values should either be [IANA standard service names](https://www.iana.org/assignments/service-names) or domain prefixed names such as `mycompany.com/my-custom-protocol`. -## Virtual IPs and service proxies - -Every node in a Kubernetes cluster runs a `kube-proxy`. `kube-proxy` is -responsible for implementing a form of virtual IP for `Services` of type other -than [`ExternalName`](#externalname). - -### Why not use round-robin DNS? - -A question that pops up every now and then is why Kubernetes relies on -proxying to forward inbound traffic to backends. What about other -approaches? For example, would it be possible to configure DNS records that -have multiple A values (or AAAA for IPv6), and rely on round-robin name -resolution? - -There are a few reasons for using proxying for Services: - -* There is a long history of DNS implementations not respecting record TTLs, - and caching the results of name lookups after they should have expired. -* Some apps do DNS lookups only once and cache the results indefinitely. -* Even if apps and libraries did proper re-resolution, the low or zero TTLs - on the DNS records could impose a high load on DNS that then becomes - difficult to manage. - -Later in this page you can read about how various kube-proxy implementations work. 
Overall, -you should note that, when running `kube-proxy`, kernel level rules may be -modified (for example, iptables rules might get created), which won't get cleaned up, -in some cases until you reboot. Thus, running kube-proxy is something that should -only be done by an administrator which understands the consequences of having a -low level, privileged network proxying service on a computer. Although the `kube-proxy` -executable supports a `cleanup` function, this function is not an official feature and -thus is only available to use as-is. - -### Configuration - -Note that the kube-proxy starts up in different modes, which are determined by its configuration. -- The kube-proxy's configuration is done via a ConfigMap, and the ConfigMap for kube-proxy - effectively deprecates the behavior for almost all of the flags for the kube-proxy. -- The ConfigMap for the kube-proxy does not support live reloading of configuration. -- The ConfigMap parameters for the kube-proxy cannot all be validated and verified on startup. - For example, if your operating system doesn't allow you to run iptables commands, - the standard kernel kube-proxy implementation will not work. - Likewise, if you have an operating system which doesn't support `netsh`, - it will not run in Windows userspace mode. - -### User space proxy mode {#proxy-mode-userspace} - -In this (legacy) mode, kube-proxy watches the Kubernetes control plane for the addition and -removal of Service and Endpoint objects. For each Service it opens a -port (randomly chosen) on the local node. Any connections to this "proxy port" -are proxied to one of the Service's backend Pods (as reported via -Endpoints). kube-proxy takes the `SessionAffinity` setting of the Service into -account when deciding which backend Pod to use. - -Lastly, the user-space proxy installs iptables rules which capture traffic to -the Service's `clusterIP` (which is virtual) and `port`. The rules -redirect that traffic to the proxy port which proxies the backend Pod. - -By default, kube-proxy in userspace mode chooses a backend via a round-robin algorithm. - -![Services overview diagram for userspace proxy](/images/docs/services-userspace-overview.svg) - -### `iptables` proxy mode {#proxy-mode-iptables} - -In this mode, kube-proxy watches the Kubernetes control plane for the addition and -removal of Service and Endpoint objects. For each Service, it installs -iptables rules, which capture traffic to the Service's `clusterIP` and `port`, -and redirect that traffic to one of the Service's -backend sets. For each Endpoint object, it installs iptables rules which -select a backend Pod. - -By default, kube-proxy in iptables mode chooses a backend at random. - -Using iptables to handle traffic has a lower system overhead, because traffic -is handled by Linux netfilter without the need to switch between userspace and the -kernel space. This approach is also likely to be more reliable. - -If kube-proxy is running in iptables mode and the first Pod that's selected -does not respond, the connection fails. This is different from userspace -mode: in that scenario, kube-proxy would detect that the connection to the first -Pod had failed and would automatically retry with a different backend Pod. - -You can use Pod [readiness probes](/docs/concepts/workloads/pods/pod-lifecycle/#container-probes) -to verify that backend Pods are working OK, so that kube-proxy in iptables mode -only sees backends that test out as healthy. 
Doing this means you avoid -having traffic sent via kube-proxy to a Pod that's known to have failed. - -![Services overview diagram for iptables proxy](/images/docs/services-iptables-overview.svg) - -### IPVS proxy mode {#proxy-mode-ipvs} - -{{< feature-state for_k8s_version="v1.11" state="stable" >}} - -In `ipvs` mode, kube-proxy watches Kubernetes Services and Endpoints, -calls `netlink` interface to create IPVS rules accordingly and synchronizes -IPVS rules with Kubernetes Services and Endpoints periodically. -This control loop ensures that IPVS status matches the desired -state. -When accessing a Service, IPVS directs traffic to one of the backend Pods. - -The IPVS proxy mode is based on netfilter hook function that is similar to -iptables mode, but uses a hash table as the underlying data structure and works -in the kernel space. -That means kube-proxy in IPVS mode redirects traffic with lower latency than -kube-proxy in iptables mode, with much better performance when synchronizing -proxy rules. Compared to the other proxy modes, IPVS mode also supports a -higher throughput of network traffic. - -IPVS provides more options for balancing traffic to backend Pods; -these are: - -* `rr`: round-robin -* `lc`: least connection (smallest number of open connections) -* `dh`: destination hashing -* `sh`: source hashing -* `sed`: shortest expected delay -* `nq`: never queue - -{{< note >}} -To run kube-proxy in IPVS mode, you must make IPVS available on -the node before starting kube-proxy. - -When kube-proxy starts in IPVS proxy mode, it verifies whether IPVS -kernel modules are available. If the IPVS kernel modules are not detected, then kube-proxy -falls back to running in iptables proxy mode. -{{< /note >}} - -![Services overview diagram for IPVS proxy](/images/docs/services-ipvs-overview.svg) - -In these proxy models, the traffic bound for the Service's IP:Port is -proxied to an appropriate backend without the clients knowing anything -about Kubernetes or Services or Pods. - -If you want to make sure that connections from a particular client -are passed to the same Pod each time, you can select the session affinity based -on the client's IP addresses by setting `service.spec.sessionAffinity` to "ClientIP" -(the default is "None"). -You can also set the maximum session sticky time by setting -`service.spec.sessionAffinityConfig.clientIP.timeoutSeconds` appropriately. -(the default value is 10800, which works out to be 3 hours). - -{{< note >}} -On Windows, setting the maximum session sticky time for Services is not supported. -{{< /note >}} - ## Multi-Port Services For some Services, you need to expose more than one port. @@ -507,40 +365,6 @@ The IP address that you choose must be a valid IPv4 or IPv6 address from within If you try to create a Service with an invalid clusterIP address value, the API server will return a 422 HTTP status code to indicate that there's a problem. -## Traffic policies - -### External traffic policy - -You can set the `spec.externalTrafficPolicy` field to control how traffic from external sources is routed. -Valid values are `Cluster` and `Local`. Set the field to `Cluster` to route external traffic to all ready endpoints -and `Local` to only route to ready node-local endpoints. If the traffic policy is `Local` and there are no node-local -endpoints, the kube-proxy does not forward any traffic for the relevant Service. 
-
-{{< note >}}
-{{< feature-state for_k8s_version="v1.22" state="alpha" >}}
-If you enable the `ProxyTerminatingEndpoints`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-for the kube-proxy, the kube-proxy checks if the node
-has local endpoints and whether or not all the local endpoints are marked as terminating.
-If there are local endpoints and **all** of those are terminating, then the kube-proxy ignores
-any external traffic policy of `Local`. Instead, whilst the node-local endpoints remain as all
-terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere,
-as if the external traffic policy were set to `Cluster`.
-This forwarding behavior for terminating endpoints exists to allow external load balancers to
-gracefully drain connections that are backed by `NodePort` Services, even when the health check
-node port starts to fail. Otherwise, traffic can be lost between the time a node is still in the node pool of a load
-balancer and traffic is being dropped during the termination period of a pod.
-{{< /note >}}
-
-### Internal traffic policy
-
-{{< feature-state for_k8s_version="v1.22" state="beta" >}}
-
-You can set the `spec.internalTrafficPolicy` field to control how traffic from internal sources is routed.
-Valid values are `Cluster` and `Local`. Set the field to `Cluster` to route internal traffic to all ready endpoints
-and `Local` to only route to ready node-local endpoints. If the traffic policy is `Local` and there are no node-local
-endpoints, traffic is dropped by kube-proxy.
-
 ## Discovering services
 
 Kubernetes supports 2 primary modes of finding a Service - environment
@@ -666,6 +490,12 @@ Kubernetes `ServiceTypes` allow you to specify what kind of Service you want.
    to use the `ExternalName` type.
 {{< /note >}}
 
+The `type` field was designed as nested functionality - each level adds to the
+previous. This is not strictly required on all cloud providers (for example, Google
+Compute Engine does not need to allocate a node port to make `type: LoadBalancer` work,
+but another cloud provider integration might). Even though strict nesting is not required,
+the Kubernetes API design for Service requires it anyway.
+
 You can also use [Ingress](/docs/concepts/services-networking/ingress/) to expose your Service.
 Ingress is not a Service type, but it acts as the entry point for your cluster. It lets you
 consolidate your routing rules into a single resource as it can expose multiple
@@ -793,6 +623,7 @@ _As an alpha feature_, you can configure a load balanced Service to
 [omit](#load-balancer-nodeport-allocation) assigning a node port, provided that the
 cloud provider implementation supports this.
 
+
 {{< note >}}
 On **Azure**, if you want to use a user-specified public type `loadBalancerIP`, you first need
@@ -1352,211 +1183,26 @@ spec:
   - 80.11.12.10
 ```
 
-## Shortcomings
+## Session stickiness
 
-Using the userspace proxy for VIPs works at small to medium scale, but will
-not scale to very large clusters with thousands of Services. The
-[original design proposal for portals](https://github.com/kubernetes/kubernetes/issues/1107)
-has more details on this.
-
-Using the userspace proxy obscures the source IP address of a packet accessing
-a Service.
-This makes some kinds of network filtering (firewalling) impossible. The iptables
-proxy mode does not
-obscure in-cluster source IPs, but it does still impact clients coming through
-a load balancer or node-port.
- -The `Type` field is designed as nested functionality - each level adds to the -previous. This is not strictly required on all cloud providers (e.g. Google Compute Engine does -not need to allocate a `NodePort` to make `LoadBalancer` work, but AWS does) -but the Kubernetes API design for Service requires it anyway. - -## Virtual IP implementation {#the-gory-details-of-virtual-ips} - -The previous information should be sufficient for many people who want to -use Services. However, there is a lot going on behind the scenes that may be -worth understanding. - -### Avoiding collisions - -One of the primary philosophies of Kubernetes is that you should not be -exposed to situations that could cause your actions to fail through no fault -of your own. For the design of the Service resource, this means not making -you choose your own port number if that choice might collide with -someone else's choice. That is an isolation failure. - -In order to allow you to choose a port number for your Services, we must -ensure that no two Services can collide. Kubernetes does that by allocating each -Service its own IP address from within the `service-cluster-ip-range` -CIDR range that is configured for the API server. - -To ensure each Service receives a unique IP, an internal allocator atomically -updates a global allocation map in {{< glossary_tooltip term_id="etcd" >}} -prior to creating each Service. The map object must exist in the registry for -Services to get IP address assignments, otherwise creations will -fail with a message indicating an IP address could not be allocated. - -In the control plane, a background controller is responsible for creating that -map (needed to support migrating from older versions of Kubernetes that used -in-memory locking). Kubernetes also uses controllers to check for invalid -assignments (e.g. due to administrator intervention) and for cleaning up allocated -IP addresses that are no longer used by any Services. - -#### IP address ranges for `type: ClusterIP` Services {#service-ip-static-sub-range} - -{{< feature-state for_k8s_version="v1.25" state="beta" >}} -However, there is a problem with this `ClusterIP` allocation strategy, because a user -can also [choose their own address for the service](#choosing-your-own-ip-address). -This could result in a conflict if the internal allocator selects the same IP address -for another Service. - -The `ServiceIPStaticSubrange` -[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled by default in v1.25 -and later, using an allocation strategy that divides the `ClusterIP` range into two bands, based on -the size of the configured `service-cluster-ip-range` by using the following formula -`min(max(16, cidrSize / 16), 256)`, described as _never less than 16 or more than 256, -with a graduated step function between them_. Dynamic IP allocations will be preferentially -chosen from the upper band, reducing risks of conflicts with the IPs -assigned from the lower band. -This allows users to use the lower band of the `service-cluster-ip-range` for their -Services with static IPs assigned with a very low risk of running into conflicts. - -### Service IP addresses {#ips-and-vips} - -Unlike Pod IP addresses, which actually route to a fixed destination, -Service IPs are not actually answered by a single host. Instead, kube-proxy -uses iptables (packet processing logic in Linux) to define _virtual_ IP addresses -which are transparently redirected as needed. 
When clients connect to the -VIP, their traffic is automatically transported to an appropriate endpoint. -The environment variables and DNS for Services are actually populated in -terms of the Service's virtual IP address (and port). - -kube-proxy supports three proxy modes—userspace, iptables and IPVS—which -each operate slightly differently. - -#### Userspace - -As an example, consider the image processing application described above. -When the backend Service is created, the Kubernetes master assigns a virtual -IP address, for example 10.0.0.1. Assuming the Service port is 1234, the -Service is observed by all of the kube-proxy instances in the cluster. -When a proxy sees a new Service, it opens a new random port, establishes an -iptables redirect from the virtual IP address to this new port, and starts accepting -connections on it. - -When a client connects to the Service's virtual IP address, the iptables -rule kicks in, and redirects the packets to the proxy's own port. -The "Service proxy" chooses a backend, and starts proxying traffic from the client to the backend. - -This means that Service owners can choose any port they want without risk of -collision. Clients can connect to an IP and port, without being aware -of which Pods they are actually accessing. - -#### iptables - -Again, consider the image processing application described above. -When the backend Service is created, the Kubernetes control plane assigns a virtual -IP address, for example 10.0.0.1. Assuming the Service port is 1234, the -Service is observed by all of the kube-proxy instances in the cluster. -When a proxy sees a new Service, it installs a series of iptables rules which -redirect from the virtual IP address to per-Service rules. The per-Service -rules link to per-Endpoint rules which redirect traffic (using destination NAT) -to the backends. - -When a client connects to the Service's virtual IP address the iptables rule kicks in. -A backend is chosen (either based on session affinity or randomly) and packets are -redirected to the backend. Unlike the userspace proxy, packets are never -copied to userspace, the kube-proxy does not have to be running for the virtual -IP address to work, and Nodes see traffic arriving from the unaltered client IP -address. - -This same basic flow executes when traffic comes in through a node-port or -through a load-balancer, though in those cases the client IP does get altered. - -#### IPVS - -iptables operations slow down dramatically in large scale cluster e.g. 10,000 Services. -IPVS is designed for load balancing and based on in-kernel hash tables. -So you can achieve performance consistency in large number of Services from IPVS-based kube-proxy. -Meanwhile, IPVS-based kube-proxy has more sophisticated load balancing algorithms -(least conns, locality, weighted, persistence). +If you want to make sure that connections from a particular client are passed to +the same Pod each time, you can configure session affinity based on the client's +IP address. Read [session affinity](/docs/reference/networking/virtual-ips/#session-affinity) +to learn more. ## API Object Service is a top-level resource in the Kubernetes REST API. You can find more details about the [Service API object](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#service-v1-core). -## Supported protocols {#protocol-support} - -### TCP - -You can use TCP for any kind of Service, and it's the default network protocol. - -### UDP - -You can use UDP for most Services. 
For type=LoadBalancer Services, UDP support -depends on the cloud provider offering this facility. - -### SCTP - -{{< feature-state for_k8s_version="v1.20" state="stable" >}} - -When using a network plugin that supports SCTP traffic, you can use SCTP for -most Services. For type=LoadBalancer Services, SCTP support depends on the cloud -provider offering this facility. (Most do not). - -#### Warnings {#caveat-sctp-overview} - -##### Support for multihomed SCTP associations {#caveat-sctp-multihomed} - -{{< warning >}} -The support of multihomed SCTP associations requires that the CNI plugin can support the -assignment of multiple interfaces and IP addresses to a Pod. - -NAT for multihomed SCTP associations requires special logic in the corresponding kernel modules. -{{< /warning >}} - -##### Windows {#caveat-sctp-windows-os} - -{{< note >}} -SCTP is not supported on Windows based nodes. -{{< /note >}} - -##### Userspace kube-proxy {#caveat-sctp-kube-proxy-userspace} - -{{< warning >}} -The kube-proxy does not support the management of SCTP associations when it is in userspace mode. -{{< /warning >}} - -### HTTP - -If your cloud provider supports it, you can use a Service in LoadBalancer mode -to set up external HTTP / HTTPS reverse proxying, forwarded to the Endpoints -of the Service. - -{{< note >}} -You can also use {{< glossary_tooltip term_id="ingress" >}} in place of Service -to expose HTTP/HTTPS Services. -{{< /note >}} - -### PROXY protocol - -If your cloud provider supports it, -you can use a Service in LoadBalancer mode to configure a load balancer outside -of Kubernetes itself, that will forward connections prefixed with -[PROXY protocol](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt). - -The load balancer will send an initial series of octets describing the -incoming connection, similar to this example - -``` -PROXY TCP4 192.0.2.202 10.0.42.7 12345 7\r\n -``` - -followed by the data from the client. - ## {{% heading "whatsnext" %}} * Follow the [Connecting Applications with Services](/docs/tutorials/services/connect-applications-service/) tutorial * Read about [Ingress](/docs/concepts/services-networking/ingress/) * Read about [EndpointSlices](/docs/concepts/services-networking/endpoint-slices/) + +For more context: +* Read [Virtual IPs and Service Proxies](/docs/reference/networking/virtual-ips/) +* Read the [API reference](/docs/reference/kubernetes-api/service-resources/service-v1/) for the Service API +* Read the [API reference](/docs/reference/kubernetes-api/service-resources/endpoints-v1/) for the Endpoints API +* Read the [API reference](/docs/reference/kubernetes-api/service-resources/endpoint-slice-v1/) for the EndpointSlice API diff --git a/content/en/docs/reference/networking/_index.md b/content/en/docs/reference/networking/_index.md new file mode 100644 index 0000000000000..e40a09ded38eb --- /dev/null +++ b/content/en/docs/reference/networking/_index.md @@ -0,0 +1,10 @@ +--- +title: Networking Reference +content_type: reference +--- + + +This section of the Kubernetes documentation provides reference details +of Kubernetes networking. 
+ + \ No newline at end of file diff --git a/content/en/docs/reference/networking/service-protocols.md b/content/en/docs/reference/networking/service-protocols.md new file mode 100644 index 0000000000000..4643cd48735fe --- /dev/null +++ b/content/en/docs/reference/networking/service-protocols.md @@ -0,0 +1,124 @@ +--- +title: Protocols for Services +content_type: reference +--- + + +If you configure a {{< glossary_tooltip text="Service" term_id="service" >}}, +you can select from any network protocol that Kubernetes supports. + +Kubernetes supports the following protocols with Services: + +- [`SCTP`](#protocol-sctp) +- [`TCP`](#protocol-tcp) _(the default)_ +- [`UDP`](#protocol-udp) + +When you define a Service, you can also specify the +[application protocol](/docs/concepts/services-networking/service/#application-protocol) +that it uses. + +This document details some special cases, all of them typically using TCP +as a transport protocol: + +- [HTTP](#protocol-http-special) and [HTTPS](#protocol-http-special) +- [PROXY protocol](#protocol-proxy-special) +- [TLS](#protocol-tls-special) termination at the load balancer + + +## Supported protocols {#protocol-support} + +There are 3 valid values for the `protocol` of a port for a Service: + +### `SCTP` {#protocol-sctp} + +{{< feature-state for_k8s_version="v1.20" state="stable" >}} + +When using a network plugin that supports SCTP traffic, you can use SCTP for +most Services. For `type: LoadBalancer` Services, SCTP support depends on the cloud +provider offering this facility. (Most do not). + +SCTP is not supported on nodes that run Windows. + +#### Support for multihomed SCTP associations {#caveat-sctp-multihomed} + +The support of multihomed SCTP associations requires that the CNI plugin can support the assignment of multiple interfaces and IP addresses to a Pod. + +NAT for multihomed SCTP associations requires special logic in the corresponding kernel modules. + +{{< note >}} +The kube-proxy does not support the management of SCTP associations when it is in userspace mode. +{{< /note >}} + + +### `TCP` {#protocol-tcp} + +You can use TCP for any kind of Service, and it's the default network protocol. + +### `UDP` {#protocol-udp} + +You can use UDP for most Services. For `type: LoadBalancer` Services, +UDP support depends on the cloud provider offering this facility. + + +## Special cases + +### HTTP {#protocol-http-special} + +If your cloud provider supports it, you can use a Service set to `type: LoadBalancer` as a way +to set up external HTTP / HTTPS reverse proxying, forwarded to the EndpointSlices / Endpoints of that Service. + +Typically, you set the protocol to `TCP` and add an +{{< glossary_tooltip text="annotation" term_id="annotation" >}} +(usually specific to your cloud provider) that configures the load balancer +to handle traffic at the HTTP level. +This configuration might also include serving HTTPS (HTTP over TLS) and +reverse-proxying plain HTTP to your workload. + +{{< note >}} +You can also use an {{< glossary_tooltip term_id="ingress" >}} to expose +HTTP/HTTPS Services. +{{< /note >}} + +You might additionally want to specify that the +[application protocol](/docs/concepts/services-networking/service/#application-protocol) +of the connection is `http` or `https`. Use `http` if the session from the +load balancer to your workload is HTTP without TLS, and use `https` if the +session from the load balancer to your workload uses TLS encryption. 
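+
+As an illustration only, here is a sketch of a `type: LoadBalancer` Service prepared for
+HTTP-aware load balancing. The annotation key `example.com/load-balancer-protocol` is a
+placeholder rather than a real annotation; check your cloud provider's documentation for
+the annotation (if any) that enables HTTP-level handling.
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: web
+  annotations:
+    # Hypothetical annotation: replace this with the key your cloud provider
+    # documents for enabling HTTP (layer 7) handling on its load balancer.
+    example.com/load-balancer-protocol: "http"
+spec:
+  type: LoadBalancer
+  selector:
+    app.kubernetes.io/name: web
+  ports:
+  - name: http
+    protocol: TCP      # transport protocol for the Service port
+    appProtocol: http  # the session from the load balancer to the workload is plain HTTP
+    port: 80
+    targetPort: 8080
+```
+
+From the point of view of the Service itself this is still TCP traffic; only the external
+load balancer interprets it at the HTTP level.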
+ +### PROXY protocol {#protocol-proxy-special} + +If your cloud provider supports it, you can use a Service set to `type: LoadBalancer` +to configure a load balancer outside of Kubernetes itself, that will forward connections +wrapped with the +[PROXY protocol](https://www.haproxy.org/download/2.5/doc/proxy-protocol.txt). + +The load balancer then sends an initial series of octets describing the +incoming connection, similar to this example (PROXY protocol v1): + +``` +PROXY TCP4 192.0.2.202 10.0.42.7 12345 7\r\n +``` + +The data after the proxy protocol preamble are the original +data from the client. When either side closes the connection, +the load balancer also triggers a connection close and sends +any remaining data where feasible. + +Typically, you define a Service with the protocol to `TCP`. +You also set an annotation, specific to your +cloud provider, that configures the load balancer to wrap each incoming connection in the PROXY protocol. + +### TLS {#protocol-tls-special} + +If your cloud provider supports it, you can use a Service set to `type: LoadBalancer` as +a way to set up external reverse proxying, where the connection from client to load +balancer is TLS encrypted and the load balancer is the TLS server peer. +The connection from the load balancer to your workload can also be TLS, +or might be plain text. The exact options available to you depend on your +cloud provider or custom Service implementation. + +Typically, you set the protocol to `TCP` and set an annotation +(usually specific to your cloud provider) that configures the load balancer +to act as a TLS server. You would configure the TLS identity (as server, +and possibly also as a client that connects to your workload) using +mechanisms that are specific to your cloud provider. diff --git a/content/en/docs/reference/networking/virtual-ips.md b/content/en/docs/reference/networking/virtual-ips.md new file mode 100644 index 0000000000000..583a12f096378 --- /dev/null +++ b/content/en/docs/reference/networking/virtual-ips.md @@ -0,0 +1,337 @@ +--- +title: Virtual IPs and Service Proxies +content_type: reference +--- + + +Every {{< glossary_tooltip term_id="node" text="node" >}} in a Kubernetes +cluster runs a [kube-proxy](/docs/reference/command-line-tools-reference/kube-proxy/) +(unless you have deployed your own alternative component in place of `kube-proxy`). + +The `kube-proxy` component is responsible for implementing a _virtual IP_ +mechanism for {{< glossary_tooltip term_id="service" text="Services">}} +of `type` other than +[`ExternalName`](/docs/concepts/services-networking/service/#externalname). + + +A question that pops up every now and then is why Kubernetes relies on +proxying to forward inbound traffic to backends. What about other +approaches? For example, would it be possible to configure DNS records that +have multiple A values (or AAAA for IPv6), and rely on round-robin name +resolution? + +There are a few reasons for using proxying for Services: + +* There is a long history of DNS implementations not respecting record TTLs, + and caching the results of name lookups after they should have expired. +* Some apps do DNS lookups only once and cache the results indefinitely. +* Even if apps and libraries did proper re-resolution, the low or zero TTLs + on the DNS records could impose a high load on DNS that then becomes + difficult to manage. + +Later in this page you can read about how various kube-proxy implementations work. 
+Overall, you should note that, when running `kube-proxy`, kernel level rules may be modified
+(for example, iptables rules might get created), which won't get cleaned up, in some
+cases until you reboot. Thus, running kube-proxy is something that should only be done
+by an administrator who understands the consequences of having a low level, privileged
+network proxying service on a computer. Although the `kube-proxy` executable supports a
+`cleanup` function, this function is not an official feature and thus is only available
+to use as-is.
+
+
+Some of the details in this reference refer to an example: the back end Pods for a stateless
+image-processing workload, running with three replicas. Those replicas are
+fungible—frontends do not care which backend they use. While the actual Pods that
+compose the backend set may change, the frontend clients should not need to be aware of that,
+nor should they need to keep track of the set of backends themselves.
+
+
+## Proxy modes
+
+Note that the kube-proxy starts up in different modes, which are determined by its configuration.
+
+- The kube-proxy's configuration is done via a ConfigMap, and the ConfigMap for
+  kube-proxy effectively deprecates the behavior of almost all of the flags for
+  the kube-proxy.
+- The ConfigMap for the kube-proxy does not support live reloading of configuration.
+- The ConfigMap parameters for the kube-proxy cannot all be validated and verified on startup.
+  For example, if your operating system doesn't allow you to run iptables commands,
+  the standard kernel kube-proxy implementation will not work.
+  Likewise, if you have an operating system which doesn't support `netsh`,
+  it will not run in Windows userspace mode.
+
+### User space proxy mode {#proxy-mode-userspace}
+
+{{< feature-state for_k8s_version="v1.23" state="deprecated" >}}
+
+This (legacy) mode uses iptables to install interception rules, and then performs
+traffic forwarding with the assistance of the kube-proxy tool.
+The kube-proxy watches the Kubernetes control plane for the addition, modification
+and removal of Service and Endpoints objects. For each Service, the kube-proxy
+opens a port (randomly chosen) on the local node. Any connections to this _proxy port_
+are proxied to one of the Service's backend Pods (as reported via
+Endpoints). The kube-proxy takes the `sessionAffinity` setting of the Service into
+account when deciding which backend Pod to use.
+
+The user-space proxy installs iptables rules which capture traffic to the
+Service's `clusterIP` (which is virtual) and `port`. Those rules redirect that traffic
+to the proxy port which proxies the backend Pod.
+
+By default, kube-proxy in userspace mode chooses a backend via a round-robin algorithm.
+
+{{< figure src="/images/docs/services-userspace-overview.svg" title="Services overview diagram for userspace proxy" class="diagram-medium" >}}
+
+
+#### Example {#packet-processing-userspace}
+
+As an example, consider the image processing application described [earlier](#example)
+in the page.
+When the backend Service is created, the Kubernetes control plane assigns a virtual
+IP address, for example 10.0.0.1. Assuming the Service port is 1234, the
+Service is observed by all of the kube-proxy instances in the cluster.
+When a proxy sees a new Service, it opens a new random port, establishes an
+iptables redirect from the virtual IP address to this new port, and starts accepting
+connections on it.
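+
+For concreteness, the Service in this example could have been declared with a manifest
+similar to the following sketch. The name, selector and `targetPort` are illustrative;
+the cluster IP address (10.0.0.1 in this example) is assigned by the control plane when
+the Service is created.
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: image-processor   # illustrative name for the example workload
+spec:
+  # No clusterIP is set here: the control plane assigns the virtual IP address
+  # (10.0.0.1 in this example) automatically.
+  selector:
+    app.kubernetes.io/name: image-processor
+  ports:
+  - protocol: TCP
+    port: 1234        # the Service port used in this example
+    targetPort: 8080  # illustrative container port on the backend Pods
+```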
+ +When a client connects to the Service's virtual IP address, the iptables +rule kicks in, and redirects the packets to the proxy's own port. +The "Service proxy" chooses a backend, and starts proxying traffic from the client to the backend. + +This means that Service owners can choose any port they want without risk of +collision. Clients can connect to an IP and port, without being aware +of which Pods they are actually accessing. + +#### Scaling challenges {#scaling-challenges-userspace} + +Using the userspace proxy for VIPs works at small to medium scale, but will +not scale to very large clusters with thousands of Services. The +[original design proposal for portals](https://github.com/kubernetes/kubernetes/issues/1107) +has more details on this. + +Using the userspace proxy obscures the source IP address of a packet accessing +a Service. +This makes some kinds of network filtering (firewalling) impossible. The iptables +proxy mode does not +obscure in-cluster source IPs, but it does still impact clients coming through +a load balancer or node-port. + +### `iptables` proxy mode {#proxy-mode-iptables} + +In this mode, kube-proxy watches the Kubernetes control plane for the addition and +removal of Service and Endpoints objects. For each Service, it installs +iptables rules, which capture traffic to the Service's `clusterIP` and `port`, +and redirect that traffic to one of the Service's +backend sets. For each endpoint, it installs iptables rules which +select a backend Pod. + +By default, kube-proxy in iptables mode chooses a backend at random. + +Using iptables to handle traffic has a lower system overhead, because traffic +is handled by Linux netfilter without the need to switch between userspace and the +kernel space. This approach is also likely to be more reliable. + +If kube-proxy is running in iptables mode and the first Pod that's selected +does not respond, the connection fails. This is different from userspace +mode: in that scenario, kube-proxy would detect that the connection to the first +Pod had failed and would automatically retry with a different backend Pod. + +You can use Pod [readiness probes](/docs/concepts/workloads/pods/pod-lifecycle/#container-probes) +to verify that backend Pods are working OK, so that kube-proxy in iptables mode +only sees backends that test out as healthy. Doing this means you avoid +having traffic sent via kube-proxy to a Pod that's known to have failed. + +{{< figure src="/images/docs/services-iptables-overview.svg" title="Services overview diagram for iptables proxy" class="diagram-medium" >}} + +#### Example {#packet-processing-iptables} + +Again, consider the image processing application described [earlier](#example). +When the backend Service is created, the Kubernetes control plane assigns a virtual +IP address, for example 10.0.0.1. For this example, assume that the +Service port is 1234. +All of the kube-proxy instances in the cluster observe the creation of the new +Service. + +When kube-proxy on a node sees a new Service, it installs a series of iptables rules +which redirect from the virtual IP address to more iptables rules, defined per Service. +The per-Service rules link to further rules for each backend endpoint, and the per- +endpoint rules redirect traffic (using destination NAT) to the backends. + +When a client connects to the Service's virtual IP address the iptables rule kicks in. +A backend is chosen (either based on session affinity or randomly) and packets are +redirected to the backend. 
Unlike the userspace proxy, packets are never +copied to userspace, the kube-proxy does not have to be running for the virtual +IP address to work, and Nodes see traffic arriving from the unaltered client IP +address. + +This same basic flow executes when traffic comes in through a node-port or +through a load-balancer, though in those cases the client IP address does get altered. + +### IPVS proxy mode {#proxy-mode-ipvs} + +In `ipvs` mode, kube-proxy watches Kubernetes Services and Endpoints, +calls `netlink` interface to create IPVS rules accordingly and synchronizes +IPVS rules with Kubernetes Services and Endpoints periodically. +This control loop ensures that IPVS status matches the desired +state. +When accessing a Service, IPVS directs traffic to one of the backend Pods. + +The IPVS proxy mode is based on netfilter hook function that is similar to +iptables mode, but uses a hash table as the underlying data structure and works +in the kernel space. +That means kube-proxy in IPVS mode redirects traffic with lower latency than +kube-proxy in iptables mode, with much better performance when synchronizing +proxy rules. Compared to the other proxy modes, IPVS mode also supports a +higher throughput of network traffic. + +IPVS provides more options for balancing traffic to backend Pods; +these are: + +* `rr`: round-robin +* `lc`: least connection (smallest number of open connections) +* `dh`: destination hashing +* `sh`: source hashing +* `sed`: shortest expected delay +* `nq`: never queue + +{{< note >}} +To run kube-proxy in IPVS mode, you must make IPVS available on +the node before starting kube-proxy. + +When kube-proxy starts in IPVS proxy mode, it verifies whether IPVS +kernel modules are available. If the IPVS kernel modules are not detected, then kube-proxy +falls back to running in iptables proxy mode. +{{< /note >}} + +{{< figure src="/images/docs/services-ipvs-overview.svg" title="Services overview diagram for IPVS proxy" class="diagram-medium" >}} + +## Session affinity + +In these proxy models, the traffic bound for the Service's IP:Port is +proxied to an appropriate backend without the clients knowing anything +about Kubernetes or Services or Pods. + +If you want to make sure that connections from a particular client +are passed to the same Pod each time, you can select the session affinity based +on the client's IP addresses by setting `.spec.sessionAffinity` to `ClientIP` +for a Service (the default is `None`). + +### Session stickiness timeout + +You can also set the maximum session sticky time by setting +`.spec.sessionAffinityConfig.clientIP.timeoutSeconds` appropriately for a Service. +(the default value is 10800, which works out to be 3 hours). + +{{< note >}} +On Windows, setting the maximum session sticky time for Services is not supported. +{{< /note >}} + +## IP address assignment to Services + +Unlike Pod IP addresses, which actually route to a fixed destination, +Service IPs are not actually answered by a single host. Instead, kube-proxy +uses packet processing logic (such as Linux iptables) to define _virtual_ IP +addresses which are transparently redirected as needed. + +When clients connect to the VIP, their traffic is automatically transported to an +appropriate endpoint. The environment variables and DNS for Services are actually +populated in terms of the Service's virtual IP address (and port). 
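+
+As an illustration, the following sketch requests a specific virtual IP address through
+`spec.clusterIP`. The address shown is an example only: it must be a valid address inside
+the `service-cluster-ip-range` configured for the API server and, as described below,
+preferably one from the lower band of that range.
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: backend
+spec:
+  clusterIP: 10.96.0.20   # example only; must fall within your cluster's service-cluster-ip-range
+  selector:
+    app.kubernetes.io/name: backend
+  ports:
+  - protocol: TCP
+    port: 80
+    targetPort: 8080
+```
+
+If the requested address is not valid for that range, or is already in use, creating the
+Service fails.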
+
+### Avoiding collisions
+
+One of the primary philosophies of Kubernetes is that you should not be
+exposed to situations that could cause your actions to fail through no fault
+of your own. For the design of the Service resource, this means not making
+you choose your own port number if that choice might collide with
+someone else's choice. That is an isolation failure.
+
+In order to allow you to choose a port number for your Services, we must
+ensure that no two Services can collide. Kubernetes does that by allocating each
+Service its own IP address from within the `service-cluster-ip-range`
+CIDR range that is configured for the API server.
+
+To ensure each Service receives a unique IP, an internal allocator atomically
+updates a global allocation map in {{< glossary_tooltip term_id="etcd" >}}
+prior to creating each Service. The map object must exist in the registry for
+Services to get IP address assignments, otherwise creations will
+fail with a message indicating an IP address could not be allocated.
+
+In the control plane, a background controller is responsible for creating that
+map (needed to support migrating from older versions of Kubernetes that used
+in-memory locking). Kubernetes also uses controllers to check for invalid
+assignments (e.g. due to administrator intervention) and for cleaning up allocated
+IP addresses that are no longer used by any Services.
+
+#### IP address ranges for Service virtual IP addresses {#service-ip-static-sub-range}
+
+{{< feature-state for_k8s_version="v1.25" state="beta" >}}
+
+Kubernetes divides the `ClusterIP` range into two bands, based on
+the size of the configured `service-cluster-ip-range`, using the following formula:
+`min(max(16, cidrSize / 16), 256)`. That formula can be paraphrased as _never less than 16 or
+more than 256, with a graduated step function between them_.
+
+Kubernetes prefers to allocate dynamic IP addresses to Services by choosing from the upper band,
+which means that if you want to assign a specific IP address to a `type: ClusterIP`
+Service, you should manually assign an IP address from the **lower** band. That approach
+reduces the risk of a conflict over allocation.
+
+If you disable the `ServiceIPStaticSubrange`
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) then Kubernetes
+uses a single shared pool for both manually and dynamically assigned IP addresses
+for `type: ClusterIP` Services.
+
+## Traffic policies
+
+You can set the `.spec.internalTrafficPolicy` and `.spec.externalTrafficPolicy` fields
+to control how Kubernetes routes traffic to healthy (“ready”) backends.
+
+### External traffic policy
+
+You can set the `.spec.externalTrafficPolicy` field to control how traffic from
+external sources is routed. Valid values are `Cluster` and `Local`. Set the field
+to `Cluster` to route external traffic to all ready endpoints and `Local` to only
+route to ready node-local endpoints. If the traffic policy is `Local` and there
+are no node-local endpoints, the kube-proxy does not forward any traffic for the
+relevant Service.
+
+{{< note >}}
+{{< feature-state for_k8s_version="v1.22" state="alpha" >}}
+
+If you enable the `ProxyTerminatingEndpoints`
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+for the kube-proxy, the kube-proxy checks if the node
+has local endpoints and whether or not all the local endpoints are marked as terminating.
+If there are local endpoints and **all** of those are terminating, then the kube-proxy ignores +any external traffic policy of `Local`. Instead, whilst the node-local endpoints remain as all +terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere, +as if the external traffic policy were set to `Cluster`. + +This forwarding behavior for terminating endpoints exists to allow external load balancers to +gracefully drain connections that are backed by `NodePort` Services, even when the health check +node port starts to fail. Otherwise, traffic can be lost between the time a node is +still in the node pool of a load balancer and traffic is being dropped during the +termination period of a pod. +{{< /note >}} + +### Internal traffic policy + +{{< feature-state for_k8s_version="v1.22" state="beta" >}} + +You can set the `.spec.internalTrafficPolicy` field to control how traffic from +internal sources is routed. Valid values are `Cluster` and `Local`. Set the field to +`Cluster` to route internal traffic to all ready endpoints and `Local` to only route +to ready node-local endpoints. If the traffic policy is `Local` and there are no +node-local endpoints, traffic is dropped by kube-proxy. + +## {{% heading "whatsnext" %}} + +To learn more about Services, +read [Connecting Applications with Services](/docs/concepts/services-networking/connect-applications-service/). + +You can also: + +* Read about [Services](/docs/concepts/services-networking/service/) +* Read the [API reference](/docs/reference/kubernetes-api/service-resources/service-v1/) for the Service API \ No newline at end of file From d76017635e2919eccb986612299ce247054ba9e5 Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Sat, 22 Oct 2022 01:17:29 +0100 Subject: [PATCH 2/6] Document kube-proxy querying EndpointSlices The Endpoints API is deprecated; adjust docs to match. --- content/en/docs/reference/networking/virtual-ips.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/en/docs/reference/networking/virtual-ips.md b/content/en/docs/reference/networking/virtual-ips.md index 583a12f096378..245611afa628f 100644 --- a/content/en/docs/reference/networking/virtual-ips.md +++ b/content/en/docs/reference/networking/virtual-ips.md @@ -70,10 +70,10 @@ Note that the kube-proxy starts up in different modes, which are determined by i This (legacy) mode uses iptables to install interception rules, and then performs traffic forwarding with the assistance of the kube-proxy tool. The kube-procy watches the Kubernetes control plane for the addition, modification -and removal of Service and Endpoints objects. For each Service, the kube-proxy +and removal of Service and EndpointSlice objects. For each Service, the kube-proxy opens a port (randomly chosen) on the local node. Any connections to this _proxy port_ are proxied to one of the Service's backend Pods (as reported via -Endpoints). The kube-proxy takes the `sessionAffinity` setting of the Service into +EndpointSlices). The kube-proxy takes the `sessionAffinity` setting of the Service into account when deciding which backend Pod to use. The user-space proxy installs iptables rules which capture traffic to the @@ -121,7 +121,7 @@ a load balancer or node-port. ### `iptables` proxy mode {#proxy-mode-iptables} In this mode, kube-proxy watches the Kubernetes control plane for the addition and -removal of Service and Endpoints objects. For each Service, it installs +removal of Service and EndpointSlice objects. 
For each Service, it installs iptables rules, which capture traffic to the Service's `clusterIP` and `port`, and redirect that traffic to one of the Service's backend sets. For each endpoint, it installs iptables rules which @@ -171,9 +171,9 @@ through a load-balancer, though in those cases the client IP address does get al ### IPVS proxy mode {#proxy-mode-ipvs} -In `ipvs` mode, kube-proxy watches Kubernetes Services and Endpoints, +In `ipvs` mode, kube-proxy watches Kubernetes Services and EndpointSlices, calls `netlink` interface to create IPVS rules accordingly and synchronizes -IPVS rules with Kubernetes Services and Endpoints periodically. +IPVS rules with Kubernetes Services and EndpointSlices periodically. This control loop ensures that IPVS status matches the desired state. When accessing a Service, IPVS directs traffic to one of the backend Pods. From ca9e39658911e7dc2664b1f1d8008a1075e9f047 Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Sat, 22 Oct 2022 01:28:28 +0100 Subject: [PATCH 3/6] Improve wording about HTTP L7 proxying integration Some cloud providers implement custom HTTP-aware reverse proxying that integrates with Service (no Ingress or Gateway). Tweak the wording around the docs for this detail. --- content/en/docs/reference/networking/service-protocols.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/content/en/docs/reference/networking/service-protocols.md b/content/en/docs/reference/networking/service-protocols.md index 4643cd48735fe..a8c0fe75e8a24 100644 --- a/content/en/docs/reference/networking/service-protocols.md +++ b/content/en/docs/reference/networking/service-protocols.md @@ -64,10 +64,12 @@ UDP support depends on the cloud provider offering this facility. ### HTTP {#protocol-http-special} -If your cloud provider supports it, you can use a Service set to `type: LoadBalancer` as a way -to set up external HTTP / HTTPS reverse proxying, forwarded to the EndpointSlices / Endpoints of that Service. +If your cloud provider supports it, you can use a Service in LoadBalancer mode to +configure a load balancer outside of your Kubernetes cluster, in a special mode +where your cloud provider's load balancer implements HTTP / HTTPS reverse proxying, +with traffic forwarded to the backend endpoints for that Service. -Typically, you set the protocol to `TCP` and add an +Typically, you set the protocol for the Service to `TCP` and add an {{< glossary_tooltip text="annotation" term_id="annotation" >}} (usually specific to your cloud provider) that configures the load balancer to handle traffic at the HTTP level. From 0912bf0a8415e31a8f6d963b73fb091f34033e5a Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Thu, 24 Nov 2022 18:04:05 +0000 Subject: [PATCH 4/6] Preserve existing hyperlinks in Service concept --- content/en/docs/concepts/services-networking/service.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/content/en/docs/concepts/services-networking/service.md b/content/en/docs/concepts/services-networking/service.md index 28be93a444eb7..9bd08bc8ac690 100644 --- a/content/en/docs/concepts/services-networking/service.md +++ b/content/en/docs/concepts/services-networking/service.md @@ -1195,6 +1195,14 @@ to learn more. Service is a top-level resource in the Kubernetes REST API. You can find more details about the [Service API object](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#service-v1-core). 
+ + + +## Virtual IP addressing mechanism + +Read [Virtual IPs and Service Proxies](/docs/reference/networking/virtual-ips/) to learn about the +mechanism Kubernetes provides to expose a Service with a virtual IP address. + ## {{% heading "whatsnext" %}} * Follow the [Connecting Applications with Services](/docs/tutorials/services/connect-applications-service/) tutorial From 25c25864b831f17bae9bbfc4de9e5119ed96c137 Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Sat, 22 Oct 2022 01:33:09 +0100 Subject: [PATCH 5/6] Move ports and protocols page into network reference --- .../en/docs/reference/{ => networking}/ports-and-protocols.md | 0 static/_redirects | 2 ++ 2 files changed, 2 insertions(+) rename content/en/docs/reference/{ => networking}/ports-and-protocols.md (100%) diff --git a/content/en/docs/reference/ports-and-protocols.md b/content/en/docs/reference/networking/ports-and-protocols.md similarity index 100% rename from content/en/docs/reference/ports-and-protocols.md rename to content/en/docs/reference/networking/ports-and-protocols.md diff --git a/static/_redirects b/static/_redirects index 981fa104a543a..99c40ce0cda32 100644 --- a/static/_redirects +++ b/static/_redirects @@ -235,6 +235,8 @@ /docs/reference/kubernetes-api/labels-annotations-taints/ /docs/reference/labels-annotations-taints/ 301 +/docs/reference/ports-and-protocols/ /docs/reference/networking/ports-and-protocols/ 301 + /docs/reporting-security-issues/ /docs/reference/issues-security/security/ 301 /security/ /docs/reference/issues-security/security/ 302 From b581bd417a97418b71100a19d249fa2f042f8f96 Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Thu, 24 Nov 2022 18:11:56 +0000 Subject: [PATCH 6/6] Set page weights --- content/en/docs/reference/networking/_index.md | 1 + content/en/docs/reference/networking/ports-and-protocols.md | 2 +- content/en/docs/reference/networking/service-protocols.md | 1 + content/en/docs/reference/networking/virtual-ips.md | 1 + 4 files changed, 4 insertions(+), 1 deletion(-) diff --git a/content/en/docs/reference/networking/_index.md b/content/en/docs/reference/networking/_index.md index e40a09ded38eb..e771d23f8619a 100644 --- a/content/en/docs/reference/networking/_index.md +++ b/content/en/docs/reference/networking/_index.md @@ -1,6 +1,7 @@ --- title: Networking Reference content_type: reference +weight: 85 --- diff --git a/content/en/docs/reference/networking/ports-and-protocols.md b/content/en/docs/reference/networking/ports-and-protocols.md index cdba8383c7b4f..2e716e4d46fd9 100644 --- a/content/en/docs/reference/networking/ports-and-protocols.md +++ b/content/en/docs/reference/networking/ports-and-protocols.md @@ -1,7 +1,7 @@ --- title: Ports and Protocols content_type: reference -weight: 90 +weight: 40 --- When running Kubernetes in an environment with strict network boundaries, such diff --git a/content/en/docs/reference/networking/service-protocols.md b/content/en/docs/reference/networking/service-protocols.md index a8c0fe75e8a24..578020d30cbc3 100644 --- a/content/en/docs/reference/networking/service-protocols.md +++ b/content/en/docs/reference/networking/service-protocols.md @@ -1,6 +1,7 @@ --- title: Protocols for Services content_type: reference +weight: 10 --- diff --git a/content/en/docs/reference/networking/virtual-ips.md b/content/en/docs/reference/networking/virtual-ips.md index 245611afa628f..bf08efb1a91b3 100644 --- a/content/en/docs/reference/networking/virtual-ips.md +++ b/content/en/docs/reference/networking/virtual-ips.md @@ -1,6 +1,7 @@ --- title: Virtual IPs 
and Service Proxies content_type: reference +weight: 50 ---