From 0026fe07c0399d3feaa27577ed03b0fa6e95c2af Mon Sep 17 00:00:00 2001
From: Yury Kulazhenkov
Date: Thu, 6 Jun 2024 13:46:45 +0300
Subject: [PATCH] Update README.md with docs for CIDRPool

Signed-off-by: Yury Kulazhenkov
---
 README.md | 323 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 258 insertions(+), 65 deletions(-)

diff --git a/README.md b/README.md
index 5e9a12c..8ae439a 100644
--- a/README.md
+++ b/README.md
@@ -11,8 +11,59 @@ An IP Address Management (IPAM) CNI plugin designed to operate in a Kubernetes environment
 This Plugins allows to assign IP addresses dynamically across the cluster while keeping speed
 and performance in mind.
 
-IP subnets are defined by the user as named _IP Pools_, then for each IP Pool a unique _IP Block_ is assigned
-to each K8s Node which is used to assign IPs to container network interfaces.
+The NVIDIA IPAM plugin supports the allocation of IP ranges and network prefixes for nodes.
+
+* An [IPPool CR](#ippool-cr) creates an IP Pool. This type of pool splits a single IP network into multiple unique IP ranges and allocates them to nodes. All nodes use the same network mask as the original IP network.
+
+  This pool type is useful for flat networks where Pods from all nodes have L2 connectivity with each other.
+
+  **Example:**
+
+  ```
+  network: 192.168.0.0/16
+  gateway: 192.168.0.1
+  perNodeBlockSize: 4 (number of IPs)
+
+  node1 will allocate IPs from the following range: 192.168.0.1-192.168.0.4 (the gateway is part of the range)
+  node2 will allocate IPs from the following range: 192.168.0.5-192.168.0.8
+
+  The first Pod on node1 will get the following IP config:
+    IP: 192.168.0.2/16 (the gateway IP was skipped)
+    Gateway: 192.168.0.1
+
+  The first Pod on node2 will get the following IP config:
+    IP: 192.168.0.5/16
+    Gateway: 192.168.0.1
+
+  Pods from different nodes have L2 connectivity with each other.
+  ```
+
+* A [CIDRPool CR](#cidrpool-cr) creates a CIDR Pool. This type of pool splits a large network into multiple unique smaller subnets and allocates them to nodes. Each node gets a node-specific gateway and a network mask that matches the node's subnet size.
+
+  This pool type is useful for routed networks where Pods on each node should use a unique subnet and a node-specific gateway to communicate with Pods on other nodes.
+
+  **Example:**
+
+  ```
+  cidr: 192.168.0.0/16
+  perNodeNetworkPrefix: 24 (subnet size)
+  gatewayIndex: 1
+
+  node1 will allocate IPs from the following subnet: 192.168.0.0/24
+  node2 will allocate IPs from the following subnet: 192.168.1.0/24
+
+  The first Pod on node1 will get the following IP config:
+    IP: 192.168.0.2/24
+    Gateway: 192.168.0.1
+
+  The first Pod on node2 will get the following IP config:
+    IP: 192.168.1.2/24
+    Gateway: 192.168.1.1
+
+  Pods from different nodes don't have L2 connectivity; routing is required.
+  ```
+
 NVIDIA IPAM plugin currently support only IP allocation for K8s Secondary Networks.
 e.g Additional networks provided by [multus CNI plugin](https://github.com/k8snetworkplumbingwg/multus-cni).
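+
+For reference, such a secondary network is typically defined through a Multus `NetworkAttachmentDefinition` whose config embeds nv-ipam as the IPAM plugin. The following is a minimal sketch; the network name `mynet` and the macvlan settings are illustrative only:
+
+```yaml
+apiVersion: k8s.cni.cncf.io/v1
+kind: NetworkAttachmentDefinition
+metadata:
+  name: mynet
+  namespace: default
+spec:
+  config: '{
+    "type": "macvlan",
+    "cniVersion": "0.3.1",
+    "master": "enp3s0f0np0",
+    "mode": "bridge",
+    "ipam": {
+      "type": "nv-ipam",
+      "poolName": "pool1"
+    }
+  }'
+```
+
+Pods can then reference `mynet` in their `k8s.v1.cni.cncf.io/networks` annotation.
+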
@@ -29,13 +80,13 @@ NVIDIA IPAM plugin consists of 3 main components:
 
 ### ipam-controller
 
-A Kubernetes(K8s) controller that Watches on IPPools CRs in a predefined Namespace.
-It then proceeds by assiging each node via IPPools Status a cluster unique range of IPs of the defined IP Pools.
+A Kubernetes (K8s) controller that watches IPPool and CIDRPool CRs in a predefined Namespace.
+It then assigns to each node, via the CR's Status, a cluster-unique range of IPs or a subnet, which is consumed by ipam-node.
 
 #### Validation webhook
 
-ipam-controller implements validation webhook for IPPool resource.
-The webhook can prevent the creation of IPPool resources with invalid configurations.
+ipam-controller implements a validation webhook for IPPool and CIDRPool resources.
+The webhook can prevent the creation of resources with invalid configurations.
 Supported X.509 certificate management system should be available in the cluster to enable the webhook.
 Currently supported systems are [certmanager](https://cert-manager.io/) and
 [Openshift certificate management](https://docs.openshift.com/container-platform/4.13/security/certificates/service-serving-certificate.html)
 
@@ -50,9 +101,10 @@ The daemon is responsible for:
 - run periodic jobs, such as cleanup of the stale IP address allocations
 
 A node daemon provides GRPC service, which nv-ipam CNI plugin uses to request IP address allocation/deallocation.
-IPs are allocated from the provided IP Block assigned by ipam-controller for the node.
-To determine the cluster unique IP Block for the defined IP Pool, ipam-node watches K8s API
-for the IPPool objects and extracts IP Block information from IPPool Status.
+
+IPs are allocated from the IP Blocks and prefixes assigned to the node by ipam-controller.
+ipam-node watches the K8s API for IPPool and CIDRPool objects and extracts the allocated
+ranges or prefixes from their status fields.
 
 ### nv-ipam
 
@@ -61,6 +113,8 @@ To allocate/deallocate IP address nv-ipam calls GRPC API of ipam-node daemon.
 
 ### IP allocation flow
 
+#### IPPool
+
 1. User (cluster administrator) defines a set of named IP Pools to be used for IP allocation
 of container interfaces via IPPool CRD (more information in [Configuration](#configuration) section)
 
@@ -139,6 +193,81 @@ _Example macvlan CNI configuration_:
 
 4. nv-ipam plugin, as a result of CNI ADD command allocates a free IP from the IP Block of
 the corresponding IP Pool that was allocated for the node
 
+#### CIDRPool
+
+1. User (cluster administrator) defines a set of named CIDR Pools to be used for prefix allocation
+of container interfaces via the CIDRPool CRD (more information in the [Configuration](#configuration) section).
+
+_Example_:
+
+```yaml
+apiVersion: nv-ipam.nvidia.com/v1alpha1
+kind: CIDRPool
+metadata:
+  name: pool1
+  namespace: kube-system
+spec:
+  cidr: 192.168.0.0/16
+  gatewayIndex: 1
+  perNodeNetworkPrefix: 24
+  nodeSelector:
+    nodeSelectorTerms:
+    - matchExpressions:
+      - key: node-role.kubernetes.io/worker
+        operator: Exists
+```
+
+2. ipam-controller calculates and assigns a unique subnet (of `perNodeNetworkPrefix` size) to each matching Node via the CIDRPool Status:
+
+_Example_:
+
+```yaml
+apiVersion: nv-ipam.nvidia.com/v1alpha1
+kind: CIDRPool
+metadata:
+  name: pool1
+  namespace: kube-system
+spec:
+  cidr: 192.168.0.0/16
+  gatewayIndex: 1
+  perNodeNetworkPrefix: 24
+  nodeSelector:
+    nodeSelectorTerms:
+    - matchExpressions:
+      - key: node-role.kubernetes.io/worker
+        operator: Exists
+status:
+  allocations:
+  - gateway: 192.168.0.1
+    nodeName: host-a
+    prefix: 192.168.0.0/24
+  - gateway: 192.168.1.1
+    nodeName: host-b
+    prefix: 192.168.1.0/24
+```
+
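+The controller-assigned allocations can be inspected directly from the CIDRPool status, for example:
+
+```shell
+kubectl -n kube-system get cidrpools.nv-ipam.nvidia.com pool1 -o jsonpath='{.status.allocations}'
+```
+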
+3. User specifies nv-ipam as the IPAM plugin in the CNI configuration.
+
+_Example macvlan CNI configuration_:
+
+```json
+{
+  "type": "macvlan",
+  "cniVersion": "0.3.1",
+  "master": "enp3s0f0np0",
+  "mode": "bridge",
+  "ipam": {
+    "type": "nv-ipam",
+    "poolName": "pool1",
+    "poolType": "CIDRPool"
+  }
+}
+```
+
+4. As a result of the CNI ADD command, the nv-ipam plugin allocates a free IP from the prefix that was allocated for the node.
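+
+A workload Pod then requests the secondary network through the Multus networks annotation. The following is a minimal sketch, assuming a NetworkAttachmentDefinition named `mynet` that embeds the CNI configuration above:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: example-pod
+  annotations:
+    k8s.v1.cni.cncf.io/networks: mynet
+spec:
+  containers:
+  - name: app
+    image: alpine # illustrative image
+    command: ["sleep", "infinity"]
+```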
 
 ## Configuration
 
 ### ipam-controller configuration
 
@@ -239,17 +368,111 @@ spec:
       operator: Exists
 ```
 
-* `spec`: contains the IP pool configuration
-  * `subnet`: IP Subnet of the pool
-  * `gateway`: Gateway IP of the subnet
+##### Fields
+
+* `spec`: contains the IP pool configuration.
+  * `subnet`: IP Subnet of the pool.
+  * `gateway` (optional): Gateway IP of the subnet.
   * `perNodeBlockSize`: the number of IPs of IP Blocks allocated to Nodes.
-  * `nodeSelector`: A list of node selector terms. The terms are ORed. Each term can have a list of matchExpressions that are ANDed. Only the nodes that match the provided labels will get assigned IP Blocks for the defined pool.
+  * `nodeSelector` (optional): A list of node selector terms. The terms are ORed. Each term can have a list of matchExpressions that are ANDed. Only the nodes that match the provided labels will get assigned IP Blocks for the defined pool.
 
 > __Notes:__
 >
-> * pool name is composed of alphanumeric letters separated by dots(`.`) underscores(`_`) or hyphens(`-`)
-> * `perNodeBlockSize` minimum size is 2
-> * `subnet` must be large enough to accommodate at least one `perNodeBlockSize` block of IPs
+> * The pool name is composed of alphanumeric letters separated by dots (`.`), underscores (`_`) or hyphens (`-`).
+> * The minimum value of `perNodeBlockSize` is 2.
+> * `subnet` must be large enough to accommodate at least one `perNodeBlockSize` block of IPs.
+
+#### CIDRPool CR
+
+ipam-controller accepts CIDR pool configuration via CIDRPool CRs.
+Multiple CIDRPool CRs can be created, each with a different NodeSelector.
+
+##### IPv4 example
+
+```yaml
+apiVersion: nv-ipam.nvidia.com/v1alpha1
+kind: CIDRPool
+metadata:
+  name: pool1
+  namespace: kube-system
+spec:
+  cidr: 192.168.0.0/16
+  gatewayIndex: 1
+  perNodeNetworkPrefix: 24
+  exclusions: # optional
+  - startIP: 192.168.0.10
+    endIP: 192.168.0.20
+  staticAllocations: # optional
+  - nodeName: node-33
+    prefix: 192.168.33.0/24
+    gateway: 192.168.33.10
+  - prefix: 192.168.1.0/24
+  nodeSelector: # optional
+    nodeSelectorTerms:
+    - matchExpressions:
+      - key: node-role.kubernetes.io/worker
+        operator: Exists
+```
+
+##### IPv6 example
+
+```yaml
+apiVersion: nv-ipam.nvidia.com/v1alpha1
+kind: CIDRPool
+metadata:
+  name: pool1
+  namespace: kube-system
+spec:
+  cidr: fd52:2eb5:44::/48
+  gatewayIndex: 1
+  perNodeNetworkPrefix: 120
+  nodeSelector: # optional
+    nodeSelectorTerms:
+    - matchExpressions:
+      - key: node-role.kubernetes.io/worker
+        operator: Exists
+```
+
+##### Point-to-point prefixes
+
+```yaml
+apiVersion: nv-ipam.nvidia.com/v1alpha1
+kind: CIDRPool
+metadata:
+  name: pool1
+  namespace: kube-system
+spec:
+  cidr: 192.168.100.0/24
+  perNodeNetworkPrefix: 31
+```
+
+##### Fields
+
+* `spec`: contains the CIDR pool configuration.
+  * `cidr`: the pool's CIDR block, which will be split into smaller per-node prefixes (of the size defined in `perNodeNetworkPrefix`) and distributed between matching nodes.
+  * `gatewayIndex` (optional): if not set, no gateway is configured; if set, the IP with this index within each node's prefix is used as the node's gateway.
+  * `perNodeNetworkPrefix`: the size of the network prefix for each host; the network defined in the `cidr` field will be split into multiple networks of this size.
+  * `exclusions` (optional, list): contains reserved IP addresses that should not be allocated by the nv-ipam node component.
+    * `startIP`: start IP of the exclude range (inclusive).
+    * `endIP`: end IP of the exclude range (inclusive).
+  * `staticAllocations` (optional, list): static allocations for the pool.
+    * `nodeName` (optional): the name of the node for the static allocation; can be empty if the prefix should be preallocated without assigning it to a specific node.
+    * `gateway` (optional): the gateway for the node; if not set, the gateway is computed from `gatewayIndex`.
+    * `prefix`: the statically allocated prefix.
+  * `nodeSelector` (optional): A list of node selector terms. The terms are ORed. Each term can have a list of matchExpressions that are ANDed. Only the nodes that match the provided labels will get assigned prefixes for the defined pool.
+
+> __Notes:__
+>
+> * The pool name is composed of alphanumeric letters separated by dots (`.`), underscores (`_`) or hyphens (`-`).
+> * The network defined by `perNodeNetworkPrefix` must be equal to or smaller than (i.e., have at least as many network bits as) the pool's `cidr`; see the sizing example below.
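+
+For example, splitting the `cidr` from the IPv4 example above works out as follows:
+
+```
+cidr: 192.168.0.0/16
+perNodeNetworkPrefix: 24
+
+per-node subnets: 2^(24-16) = 256 (192.168.0.0/24, 192.168.1.0/24, ..., 192.168.255.0/24)
+usable IPs per /24 subnet (IPv4): 2^(32-24) - 2 = 254 (network and broadcast addresses excluded)
+```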
 
 ### ipam-node configuration
 
@@ -335,6 +558,7 @@ nv-ipam accepts the following CNI configuration:
 {
   "type": "nv-ipam",
   "poolName": "pool1,pool2",
+  "poolType": "IPPool",
   "daemonSocket": "unix:///var/lib/cni/nv-ipam/daemon.sock",
   "daemonCallTimeoutSeconds": 5,
   "confDir": "/etc/cni/net.d/nv-ipam.d",
@@ -348,6 +572,7 @@
 It is possible to allocate two IPs for the interface from different pools by specifying pool names separated by coma, e.g. `"my-ipv4-pool,my-ipv6-pool"`. The primary intent to support multiple pools is a dual-stack use-case when an interface should have two IP addresses: one IPv4 and one IPv6. (default: network name as provided in CNI call)
+* `poolType` (string, optional): the type (`IPPool` or `CIDRPool`) of the pool referred to by `poolName`. The field is case-insensitive. (default: `"IPPool"`)
 * `daemonSocket` (string, optional): address of GRPC server socket served by IPAM daemon
 * `daemonCallTimeoutSeconds` (integer, optional): timeout for GRPC calls to IPAM daemon
 * `confDir` (string, optional): path to configuration dir. (default: `"/etc/cni/net.d/nv-ipam.d"`)
 
@@ -382,57 +607,15 @@ kubectl kustomize https://github.com/mellanox/nvidia-k8s-ipam/deploy/overlays/ce
 kubectl kustomize https://github.com/mellanox/nvidia-k8s-ipam/deploy/overlays/openshift?ref=main | kubectl apply -f -
 ```
 
-### Create IPPool CR
-
-```shell
-cat <
-[... removed example lines truncated in this excerpt, including a heredoc that wrote /etc/cni/net.d/10-mynet.conf ...]
-host-a => Start IP: 192.168.0.1 End IP: 192.168.0.24
+ host-a => Start IP: 192.168.0.1 End IP: 192.168.0.24
 host-b => Start IP: 192.168.0.25 End IP: 192.168.0.48
 host-c => Start IP: 192.168.0.49 End IP: 192.168.0.72
 k8s-master => Start IP: 192.168.0.73 End IP: 192.168.0.96
@@ -444,25 +627,36 @@ pool2
 k8s-master => Start IP: 172.16.0.151 End IP: 172.16.0.200
 ```
 
-View network status of pods:
+### View allocated IP Prefixes from CIDRPool:
+
+```shell
+kubectl get cidrpools.nv-ipam.nvidia.com -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"} {range .status.allocations[*]}{"\t"}{.nodeName} => Prefix: {.prefix} Gateway: {.gateway}{"\n"}{end}{"\n"}{end}'
+
+pool1
+  host-a => Prefix: 10.0.0.0/24 Gateway: 10.0.0.1
+  host-b => Prefix: 10.0.1.0/24 Gateway: 10.0.1.1
+```
+
+### View network status of pods:
 
 ```shell
 kubectl get pods -o=custom-columns='NAME:metadata.name,NODE:spec.nodeName,NETWORK-STATUS:metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status'
 ```
 
-View ipam-controller logs:
+### View ipam-controller logs:
 
 ```shell
 kubectl -n kube-system logs <ipam-controller-pod-name>
 ```
 
-View ipam-node logs:
+### View ipam-node logs:
 
 ```shell
 kubectl -n kube-system logs <ipam-node-pod-name>
 ```
 
-View nv-ipam CNI logs:
+### View nv-ipam CNI logs:
 
 ```shell
 cat /var/log/nv-ipam-cni.log
@@ -476,7 +670,6 @@ cat /var/log/nv-ipam-cni.log
 ```
 
 * Before removing a node from cluster, drain all workloads to ensure proper cleanup of IPs on node.
 * IP Block allocated to a node with Gateway IP in its range will have one less IP than what defined in perNodeBlockSize, deployers should take this into account.
-* Defining multiple IP Pools while supported, was not thoroughly testing
 
 ## Contributing