distributed provisioning #524

Merged: 7 commits, Dec 16, 2020
145 changes: 141 additions & 4 deletions README.md
@@ -74,7 +74,7 @@ Note that the external-provisioner does not scale with more replicas. Only one e

See the [storage capacity section](#capacity-support) below for details.

* `--capacity-controller-deployment-mode=central`: Setting this enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. 'central' is currently the only supported mode. Use it when there is just one active provisioner in the cluster. The default is to not produce CSIStorageCapacity objects.
* `--capacity-controller-deployment-mode=central|local`: Setting this enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. Use `central` when there is just one active external-provisioner in the cluster. Use `local` when deploying external-provisioner on each node with distributed provisioning. The default is to not produce CSIStorageCapacity objects.

* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet).

@@ -84,6 +84,16 @@ See the [storage capacity section](#capacity-support) below for details.

* `--capacity-for-immediate-binding <bool>`: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging. Defaults to `false`.

##### Distributed provisioning

* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.

**Collaborator:** Can this be consolidated with `--capacity-controller-deployment-mode=local`?

**Contributor Author:** We need some way to do local deployment without enabling storage capacity because that might not be supported by the cluster or not desired, so we cannot have just one flag that enables both.

**Collaborator:** I'm wondering if we can avoid flag skew issues like `--node-deployment=true` and `--capacity-controller-deployment-mode=central`. Can we change the capacity flag to be a boolean, and then determine which mode to use based on node deployment?

**Contributor Author:** In retrospect that would be nicer, but it is a bit late to change the semantics of `--capacity-controller-deployment-mode` that way - we would have to do a major update of external-provisioner for that because it is a breaking change.

**Collaborator:** It's an alpha feature, I think we can make breaking changes to it.

**Contributor Author:** So `--enable-capacity=true/false` instead of `--capacity-controller-deployment-mode=central`, and the actual mode of operation (central vs. local) determined by `--node-deployment`? Works for me, I just won't get to it today.

**Collaborator:** That sounds good to me! Do you think `enable-capacity` will always be a permanent flag, or would it fit a feature gate better?

**Contributor Author:** I think a permanent flag is better. The feature causes additional work and might not be that relevant for some drivers (like those where all volumes are available everywhere), so being able to keep it turned off will remain useful.

* `--node-deployment-immediate-binding`: Determines whether immediate binding is supported when deployed on each node. Enabled by default; use `--node-deployment-immediate-binding=false` if not desired. Disabling it may be useful, for example, when a custom controller will select nodes for PVCs.

* `--node-deployment-base-delay`: Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding. Defaults to 20 seconds.

* `--node-deployment-max-delay`: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds.

#### Other recognized arguments
* `--feature-gates <gates>`: A set of comma separated `<feature-name>=<true|false>` pairs that describe feature gates for alpha/experimental features. See [list of features](#feature-status) or `--help` output for list of recognized features. Example: `--feature-gates Topology=true` to enable Topology feature that's disabled by default.

@@ -139,7 +149,7 @@ determine with the `POD_NAME/POD_NAMESPACE` environment variables and
the `--capacity-ownerref-level` parameter. Other solutions will be
added in the future.

To enable this feature in a driver deployment (see also the
To enable this feature in a driver deployment with a central controller (see also the
[`deploy/kubernetes/storage-capacity.yaml`](deploy/kubernetes/storage-capacity.yaml)
example):

@@ -155,7 +165,7 @@ example):
fieldRef:
fieldPath: metadata.name
```
- Add `--enable-capacity=central` to the command line flags.
- Add `--capacity-controller-deployment-mode=central` to the command line flags.
- Add `StorageCapacity: true` to the CSIDriver information object.
Without it, external-provisioner will publish information, but the
Kubernetes scheduler will ignore it. This can be used to first
@@ -170,7 +180,7 @@ example):
with `--capacity-threads`.
- Optional: enable producing information also for storage classes that
use immediate volume binding with
`--enable-capacity=immediate-binding`. This is usually not needed
`--capacity-for-immediate-binding`. This is usually not needed
because such volumes are created by the driver without involving the
Kubernetes scheduler and thus the published information would just
be ignored.
@@ -220,6 +230,14 @@ CSIStorageCapacity objects, so in theory a malfunctioning or malicious
driver deployment could also publish incorrect information about some
other driver.

The deployment with [distributed
provisioning](#distributed-provisioning) is almost the same as above,
with some minor changes:
- Add `--capacity-controller-deployment-mode=local` to the command line flags.
- Use `--capacity-ownerref-level=0` and the `POD_NAMESPACE/POD_NAME`
variables to make the pod that contains the external-provisioner
the owner of CSIStorageCapacity objects for the node.

### CSI error and timeout handling
The external-provisioner invokes all gRPC calls to CSI driver with timeout provided by `--timeout` command line argument (15 seconds by default).

@@ -242,6 +260,125 @@ The external-provisioner optionally exposes an HTTP endpoint at address:port spe
* Metrics path, as set by `--metrics-path` argument (default is `/metrics`).
* Leader election health check at `/healthz/leader-election`. It is recommended to run a liveness probe against this endpoint when leader election is used to kill external-provisioner leader that fails to connect to the API server to renew its leadership. See https://github.com/kubernetes-csi/csi-lib-utils/issues/66 for details.

### Deployment on each node

Normally, external-provisioner is deployed once in a cluster and
communicates with a control instance of the CSI driver which then
provisions volumes via some kind of storage backend API. CSI drivers
which manage local storage on a node don't have such an API that a
central controller could use.

To support this case, external-provisioner can be deployed alongside
each CSI driver on different nodes. The CSI driver deployment must:
- support topology, usually with one topology key
("csi.example.org/node") and the Kubernetes node name as value
- use a service account that has the same RBAC rules as for a normal
deployment
- invoke external-provisioner with `--node-deployment`
- tweak `--node-deployment-base-delay` and `--node-deployment-max-delay`
to match the expected cluster size and desired response times
(only relevant when there are storage classes with immediate binding,
see below for details)
- set the `NODE_NAME` environment variable to the name of the Kubernetes node
- implement `GetCapacity`

Usage of `--strict-topology` and `--immediate-topology=false` is
recommended because it makes the `CreateVolume` invocations simpler.
Topology information is always derived exclusively from the
information returned by the CSI driver that runs on the same node,
without combining that with information stored for other nodes. This
works as long as each node is in its own topology segment,
i.e. usually with a single topology key and one unique value for each
node.

Volume provisioning with late binding works as before, except that
each external-provisioner instance checks the "selected node"
annotation and only creates volumes if that node is the one it runs
on. It also only deletes volumes on its own node.

Immediate binding is also supported, but not recommended. It is
implemented by letting the external-provisioner instances assign a PVC
to one of them: when they see a new PVC with immediate binding, they
all attempt to set the "selected node" annotation with their own node
name as value. Only one update request can succeed; all others get a
"conflict" error and then know that some other instance was faster. To
avoid the thundering herd problem, each instance waits for a random
period before issuing an update request.

When a `CreateVolume` call fails with `ResourcesExhausted`, the normal
recovery mechanism is used, i.e. the external-provisioner instance
removes the "selected node" annotation and the process repeats. But
this triggers events for the PVC and delays volume creation, in
particular when storage is exhausted on most nodes. Therefore the
external-provisioner checks with `GetCapacity` *before* attempting to
own a PVC whether the currently available capacity is sufficient for
the volume. If it isn't, the PVC is ignored and some other instance
can own it.
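
The flow described in the previous paragraphs for immediate binding can be
summarized in code. The following is only a minimal sketch, assuming the
standard `volume.kubernetes.io/selected-node` annotation and a hypothetical
`becomeOwner` helper; the actual implementation in this PR is structured
differently:

```go
package sketch

import (
	"context"
	"math/rand"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// annSelectedNode is the annotation that records which node a PVC has been
// assigned to (normally set by the scheduler for late binding, here set by
// the winning external-provisioner instance for immediate binding).
const annSelectedNode = "volume.kubernetes.io/selected-node"

// becomeOwner checks the local capacity first and then races for the PVC by
// setting the "selected node" annotation. It returns true if this instance
// won the race and should provision the volume.
func becomeOwner(ctx context.Context, kubeClient kubernetes.Interface, csiClient csi.ControllerClient,
	claim *v1.PersistentVolumeClaim, nodeName string, delay time.Duration) (bool, error) {
	// Step 1: ask the local CSI driver whether it has enough space at all.
	// A real implementation would fill in Parameters, VolumeCapabilities and
	// AccessibleTopology from the storage class and the node's topology segment.
	capacity, err := csiClient.GetCapacity(ctx, &csi.GetCapacityRequest{})
	if err != nil {
		return false, err
	}
	requested := claim.Spec.Resources.Requests[v1.ResourceStorage]
	if capacity.AvailableCapacity < requested.Value() {
		// Not enough space on this node, let some other instance own the PVC.
		return false, nil
	}

	// Step 2: sleep a random period to avoid a thundering herd of updates.
	if delay > 0 {
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
	}

	// Step 3: try to set the annotation; only one update can succeed.
	clone := claim.DeepCopy()
	if clone.Annotations == nil {
		clone.Annotations = map[string]string{}
	}
	if _, taken := clone.Annotations[annSelectedNode]; taken {
		return false, nil // someone else already owns it
	}
	clone.Annotations[annSelectedNode] = nodeName
	_, err = kubeClient.CoreV1().PersistentVolumeClaims(clone.Namespace).Update(ctx, clone, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return false, nil // another instance was faster
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

A losing instance simply goes back to watching; the winner proceeds with
`CreateVolume` as usual.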

The `--node-deployment-base-delay` parameter determines the initial
wait period. It also sets the jitter, so in practice the initial wait period will be
in the range from zero to the base delay. If the value is high,
volumes with immediate binding get created more slowly. If it is low,
then the risk of conflicts while setting the "selected node"
annotation increases and the apiserver load will be higher.

There is an exponential backoff per PVC which is used for unexpected
problems. Normally, an owner for a PVC is chosen during the first
attempt, so most PVCs will use the base delay. A maximum can be set
with `--node-deployment-max-delay` to avoid very long delays when
something goes wrong repeatedly.
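
Continuing the sketch above, the per-PVC delay could be derived roughly like
this (again only an illustration; the exact formula used by
external-provisioner may differ):

```go
// delayForAttempt sketches one way to combine --node-deployment-base-delay
// and --node-deployment-max-delay: exponential growth per failed attempt,
// capped at the maximum. The caller then sleeps a random duration in
// [0, delay), as becomeOwner does in the sketch above.
func delayForAttempt(baseDelay, maxDelay time.Duration, attempt int) time.Duration {
	delay := baseDelay
	for i := 0; i < attempt && delay < maxDelay; i++ {
		delay *= 2 // exponential backoff per PVC for repeated problems
	}
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}
```

The result would be passed as the `delay` parameter of `becomeOwner`.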

During scale testing with 100 external-provisioner instances, a base
delay of 20 seconds worked well. When provisioning 3000 volumes, there
were only 500 conflicts which the apiserver handled without getting
overwhelmed. The average provisioning rate of around 40 volumes/second
was the same as with a delay of 10 seconds. The worst-case latency per
volume was probably higher, but that wasn't measured.

**Collaborator:** Can we point to links to results anywhere?

**Contributor Author:** Not at the moment. I intend to write up my experience with scale testing and will publish it as a .md file in perf-tests/log, but I haven't gotten around to it yet. I can link to that once it is available.

**Contributor Author:** I got a chance to run on 1000 nodes. Distributed provisioning was almost as fast as central provisioning in terms of volumes/second for immediate binding, with no overload of the apiserver - see the report in kubernetes/perf-tests#1676.

Note that the QPS settings of kube-controller-manager and
external-provisioner have to be increased at the moment (Kubernetes
1.19) to provision volumes faster than around 4 volumes/second. Those
settings will eventually get replaced with [flow control in the API
server itself](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/).

**Collaborator:** Point to bug tracking api fairness effort?

**Contributor Author:** Added "Those settings will eventually get replaced with flow control in the API server itself."

Beware that if *no* node has sufficient storage available, then no
`CreateVolume` call is attempted at all and thus no events are
generated for the PVC, i.e. some other means of tracking remaining
storage capacity must be used to detect when the cluster runs out of
storage.

Because PVCs with immediate binding get distributed randomly among
nodes, they get spread evenly. If that is not desirable, then it is
possible to disable support for immediate binding in distributed
provisioning with `--node-deployment-immediate-binding=false` and
instead implement a custom policy in a separate controller which sets
the "selected node" annotation to trigger local provisioning on the
desired node.
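
Such a controller only needs to set the same annotation that the
external-provisioner instances otherwise race for. As a minimal,
hypothetical illustration (assuming the `volume.kubernetes.io/selected-node`
key, the imports from the sketch above, plus `fmt` and
`k8s.io/apimachinery/pkg/types`):

```go
// assignClaimToNode sketches how a custom policy controller could trigger
// local provisioning on a specific node by setting the "selected node"
// annotation itself instead of leaving the choice to random assignment.
func assignClaimToNode(ctx context.Context, client kubernetes.Interface, claim *v1.PersistentVolumeClaim, nodeName string) error {
	patch := fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, annSelectedNode, nodeName)
	_, err := client.CoreV1().PersistentVolumeClaims(claim.Namespace).Patch(
		ctx, claim.Name, types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}
```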

### Deleting local volumes after a node failure or removal

When a node with local volumes gets removed from a cluster before
deleting those volumes, the PV and PVC objects may still exist. It may
be possible to remove the PVC normally if the volume was not in use by
any pod on the node, but normal deletion of the volume and thus
deletion of the PV is not possible anymore because the CSI driver
instance on the node is not available or reachable anymore and therefore
Kubernetes cannot be sure that it is okay to remove the PV.

When an administrator is sure that the node is never going to come
back, then the local volumes can be removed manually:
- force-delete objects: `kubectl delete pv <pv> --wait=false --grace-period=0 --force`
- remove all finalizers: `kubectl patch pv <pv> -p '{"metadata":{"finalizers":null}}'`

It may also be necessary to scrub disks before reusing them because
the CSI driver had no chance to do that.

**Collaborator:** Note that they also should make sure the data on the disks are deleted before bringing the disk back into service.

**Contributor Author:** Added "It may also be necessary to scrub disks before reusing them because the CSI driver had no chance to do that."

If there was still a PVC bound to that PV, it will then be moved to
phase "Lost". It has to be deleted and re-created if still needed
because no new volume will be created for it. Editing the PVC to
revert it to phase "Unbound" is not allowed by the Kubernetes
API server.

**Collaborator:** I don't think we need to mention the last sentence. There's a lot of things we can't do :)

**Contributor Author:** Yes, but in this case it is something that might occur to a reader. It certainly occurred to me, so I tried it - unsuccessfully 😅 By saying upfront that it won't work we save someone who is in that situation the time to try that out.

## Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the [community page](http://kubernetes.io/community/).
117 changes: 104 additions & 13 deletions cmd/csi-provisioner/csi-provisioner.go
@@ -29,11 +29,14 @@ import (

"github.com/container-storage-interface/spec/lib/go/csi"
flag "github.com/spf13/pflag"
v1 "k8s.io/api/core/v1"
storagev1 "k8s.io/api/storage/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime/schema"
utilfeature "k8s.io/apiserver/pkg/util/feature"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
v1 "k8s.io/client-go/listers/core/v1"
listersv1 "k8s.io/client-go/listers/core/v1"
storagelistersv1 "k8s.io/client-go/listers/storage/v1"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
@@ -90,6 +93,11 @@ var (
capacityPollInterval = flag.Duration("capacity-poll-interval", time.Minute, "How long the external-provisioner waits before checking for storage capacity changes.")
capacityOwnerrefLevel = flag.Int("capacity-ownerref-level", 1, "The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc.")

enableNodeDeployment = flag.Bool("node-deployment", false, "Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes.")
nodeDeploymentImmediateBinding = flag.Bool("node-deployment-immediate-binding", true, "Determines whether immediate binding is supported when deployed on each node.")
nodeDeploymentBaseDelay = flag.Duration("node-deployment-base-delay", 20*time.Second, "Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding.")
nodeDeploymentMaxDelay = flag.Duration("node-deployment-max-delay", 60*time.Second, "Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding.")

featureGates map[string]bool
provisionController *controller.ProvisionController
version = "unknown"
@@ -116,6 +124,11 @@ func main() {
klog.Fatal(err)
}

node := os.Getenv("NODE_NAME")
if *enableNodeDeployment && node == "" {
klog.Fatal("The NODE_NAME environment variable must be set when using --enable-node-deployment.")
}

if *showVersion {
fmt.Println(os.Args[0], version)
os.Exit(0)
@@ -214,6 +227,9 @@ func main() {
// Generate a unique ID for this provisioner
timeStamp := time.Now().UnixNano() / int64(time.Millisecond)
identity := strconv.FormatInt(timeStamp, 10) + "-" + strconv.Itoa(rand.Intn(10000)) + "-" + provisionerName
if *enableNodeDeployment {
identity = identity + "-" + node
}

factory := informers.NewSharedInformerFactory(clientset, ctrl.ResyncPeriodOfCsiNodeInformer)
var factoryForNamespace informers.SharedInformerFactory // usually nil, only used for CSIStorageCapacity
@@ -224,18 +240,76 @@ func main() {
scLister := factory.Storage().V1().StorageClasses().Lister()
claimLister := factory.Core().V1().PersistentVolumeClaims().Lister()

var csiNodeLister storagelistersv1.CSINodeLister
var vaLister storagelistersv1.VolumeAttachmentLister
if controllerCapabilities[csi.ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME] {
klog.Info("CSI driver supports PUBLISH_UNPUBLISH_VOLUME, watching VolumeAttachments")
vaLister = factory.Storage().V1().VolumeAttachments().Lister()
} else {
klog.Info("CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments")
}
var nodeLister v1.NodeLister

var nodeDeployment *ctrl.NodeDeployment
if *enableNodeDeployment {
nodeDeployment = &ctrl.NodeDeployment{
NodeName: node,
ClaimInformer: factory.Core().V1().PersistentVolumeClaims(),
ImmediateBinding: *nodeDeploymentImmediateBinding,
BaseDelay: *nodeDeploymentBaseDelay,
MaxDelay: *nodeDeploymentMaxDelay,
}
nodeInfo, err := ctrl.GetNodeInfo(grpcClient, *operationTimeout)
if err != nil {
klog.Fatalf("Failed to get node info from CSI driver: %v", err)
}
nodeDeployment.NodeInfo = *nodeInfo
}

var nodeLister listersv1.NodeLister
var csiNodeLister storagelistersv1.CSINodeLister
if ctrl.SupportsTopology(pluginCapabilities) {
csiNodeLister = factory.Storage().V1().CSINodes().Lister()
nodeLister = factory.Core().V1().Nodes().Lister()
if nodeDeployment != nil {
// Avoid watching in favor of fake, static objects. This is particularly relevant for
// Node objects, which can generate significant traffic.
csiNode := &storagev1.CSINode{
ObjectMeta: metav1.ObjectMeta{
Name: nodeDeployment.NodeName,
},
Spec: storagev1.CSINodeSpec{
Drivers: []storagev1.CSINodeDriver{
{
Name: provisionerName,
NodeID: nodeDeployment.NodeInfo.NodeId,
},
},
},
}
node := &v1.Node{
ObjectMeta: metav1.ObjectMeta{
Name: nodeDeployment.NodeName,
},
}
if nodeDeployment.NodeInfo.AccessibleTopology != nil {
for key := range nodeDeployment.NodeInfo.AccessibleTopology.Segments {
csiNode.Spec.Drivers[0].TopologyKeys = append(csiNode.Spec.Drivers[0].TopologyKeys, key)
}
node.Labels = nodeDeployment.NodeInfo.AccessibleTopology.Segments
}
klog.Infof("using local topology with Node = %+v and CSINode = %+v", node, csiNode)

// We make those fake objects available to the topology code via informers which
// never change.
stoppedFactory := informers.NewSharedInformerFactory(clientset, 1000*time.Hour)
csiNodes := stoppedFactory.Storage().V1().CSINodes()
nodes := stoppedFactory.Core().V1().Nodes()
csiNodes.Informer().GetStore().Add(csiNode)
nodes.Informer().GetStore().Add(node)
csiNodeLister = csiNodes.Lister()
nodeLister = nodes.Lister()

} else {
csiNodeLister = factory.Storage().V1().CSINodes().Lister()
nodeLister = factory.Core().V1().Nodes().Lister()
}
}

// -------------------------------
@@ -292,6 +366,7 @@ func main() {
vaLister,
*extraCreateMetadata,
*defaultFSType,
nodeDeployment,
)

provisionController = controller.NewProvisionController(
@@ -311,7 +386,8 @@ func main() {
)

var capacityController *capacity.Controller
if *capacityMode == capacity.DeploymentModeCentral {
if *capacityMode == capacity.DeploymentModeCentral ||
*capacityMode == capacity.DeploymentModeLocal {
podName := os.Getenv("POD_NAME")
namespace := os.Getenv("POD_NAMESPACE")
if podName == "" || namespace == "" {
@@ -328,13 +404,28 @@ }
}
klog.Infof("using %s/%s %s as owner of CSIStorageCapacity objects", controller.APIVersion, controller.Kind, controller.Name)

topologyInformer := topology.NewNodeTopology(
provisionerName,
clientset,
factory.Core().V1().Nodes(),
factory.Storage().V1().CSINodes(),
workqueue.NewNamedRateLimitingQueue(rateLimiter, "csitopology"),
)
var topologyInformer topology.Informer
if *capacityMode == capacity.DeploymentModeCentral {
topologyInformer = topology.NewNodeTopology(
provisionerName,
clientset,
factory.Core().V1().Nodes(),
factory.Storage().V1().CSINodes(),
workqueue.NewNamedRateLimitingQueue(rateLimiter, "csitopology"),
)
} else {
var segment topology.Segment
if nodeDeployment == nil {
klog.Fatal("--capacity-controller-deployment-mode=local is only valid in combination with --node-deployment")
}
if nodeDeployment.NodeInfo.AccessibleTopology != nil {
for key, value := range nodeDeployment.NodeInfo.AccessibleTopology.Segments {
segment = append(segment, topology.SegmentEntry{Key: key, Value: value})
}
}
klog.Infof("producing CSIStorageCapacity objects with fixed topology segment %s", segment)
topologyInformer = topology.NewFixedNodeTopology(&segment)
}

// We only need objects from our own namespace. The normal factory would give
// us an informer for the entire cluster.
1 change: 1 addition & 0 deletions go.mod
@@ -31,6 +31,7 @@ require (
k8s.io/apiserver v0.20.0
k8s.io/client-go v0.20.0
k8s.io/component-base v0.20.0
k8s.io/component-helpers v0.20.0
k8s.io/csi-translation-lib v0.20.0
k8s.io/klog/v2 v2.4.0
k8s.io/kubernetes v1.20.0