distributed provisioning #524

Merged: 7 commits, Dec 16, 2020
145 changes: 141 additions & 4 deletions README.md
@@ -74,7 +74,7 @@ Note that the external-provisioner does not scale with more replicas. Only one e

See the [storage capacity section](#capacity-support) below for details.

* `--capacity-controller-deployment-mode=central`: Setting this enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. 'central' is currently the only supported mode. Use it when there is just one active provisioner in the cluster. The default is to not produce CSIStorageCapacity objects.
* `--capacity-controller-deployment-mode=central|local`: Setting this enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. Use `central` when there is just one active external-provisioner in the cluster. Use `local` when deploying external-provisioner on each node with distributed provisioning. The default is to not produce CSIStorageCapacity objects.

* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet).

@@ -84,6 +84,16 @@ See the [storage capacity section](#capacity-support) below for details.

* `--capacity-for-immediate-binding <bool>`: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging. Defaults to `false`.

##### Distributed provisioning

* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.

**Collaborator:** Can this be consolidated with `--capacity-controller-deployment-mode=local`?

**Contributor Author:** We need some way to do local deployment without enabling storage capacity because that might not be supported by the cluster or not desired, so we cannot have just one flag that enables both.

**Collaborator:** I'm wondering if we can avoid flag skew issues like `--node-deployment=true` and `--capacity-controller-deployment-mode=central`. Can we change the capacity flag to be a boolean, and then determine which mode to use based on node deployment?

**Contributor Author:** In retrospect that would be nicer, but it is a bit late to change the semantics of `--capacity-controller-deployment-mode` that way - we would have to do a major update of external-provisioner for that because it is a breaking change.

**Collaborator:** It's an alpha feature, I think we can make breaking changes to it.

**Contributor Author:** So `--enable-capacity=true/false` instead of `--capacity-controller-deployment-mode=central`, and the actual mode of operation (central vs. local) determined by `--node-deployment`? Works for me, I just won't get to it today.

**Collaborator:** That sounds good to me! Do you think `enable-capacity` will always be a permanent flag, or would it fit a feature gate better?

**Contributor Author:** I think a permanent flag is better. The feature causes additional work and might not be that relevant for some drivers (like those where all volumes are available everywhere), so being able to keep it turned off will remain useful.

* `--node-deployment-immediate-binding`: Determines whether immediate binding is supported when deployed on each node. Enabled by default; use `--node-deployment-immediate-binding=false` if not desired. Disabling it may be useful, for example, when a custom controller will select nodes for PVCs.

* `--node-deployment-base-delay`: Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding. Defaults to 20 seconds.

* `--node-deployment-max-delay`: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds.

#### Other recognized arguments
* `--feature-gates <gates>`: A set of comma separated `<feature-name>=<true|false>` pairs that describe feature gates for alpha/experimental features. See [list of features](#feature-status) or `--help` output for list of recognized features. Example: `--feature-gates Topology=true` to enable Topology feature that's disabled by default.

@@ -139,7 +149,7 @@ determine with the `POD_NAME/POD_NAMESPACE` environment variables and
the `--capacity-ownerref-level` parameter. Other solutions will be
added in the future.

To enable this feature in a driver deployment (see also the
To enable this feature in a driver deployment with a central controller (see also the
[`deploy/kubernetes/storage-capacity.yaml`](deploy/kubernetes/storage-capacity.yaml)
example):

@@ -155,7 +165,7 @@ example):
fieldRef:
fieldPath: metadata.name
```
- Add `--enable-capacity=central` to the command line flags.
- Add `--capacity-controller-deployment-mode=central` to the command line flags.
- Add `StorageCapacity: true` to the CSIDriver information object.
Without it, external-provisioner will publish information, but the
Kubernetes scheduler will ignore it. This can be used to first
@@ -170,7 +180,7 @@ example):
with `--capacity-threads`.
- Optional: enable producing information also for storage classes that
use immediate volume binding with
`--enable-capacity=immediate-binding`. This is usually not needed
`--capacity-for-immediate-binding`. This is usually not needed
because such volumes are created by the driver without involving the
Kubernetes scheduler and thus the published information would just
be ignored.
@@ -220,6 +230,14 @@ CSIStorageCapacity objects, so in theory a malfunctioning or malicious
driver deployment could also publish incorrect information about some
other driver.

The deployment with [distributed
provisioning](#distributed-provisioning) is almost the same as above,
with some minor changes:
- Add `--capacity-controller-deployment-mode=local` to the command line flags.
- Use `--capacity-ownerref-level=0` and the `POD_NAMESPACE/POD_NAME`
variables to make the pod that contains the external-provisioner
the owner of CSIStorageCapacity objects for the node.

### CSI error and timeout handling
The external-provisioner invokes all gRPC calls to CSI driver with timeout provided by `--timeout` command line argument (15 seconds by default).

@@ -242,6 +260,125 @@ The external-provisioner optionally exposes an HTTP endpoint at address:port spe
* Metrics path, as set by `--metrics-path` argument (default is `/metrics`).
* Leader election health check at `/healthz/leader-election`. It is recommended to run a liveness probe against this endpoint when leader election is used to kill external-provisioner leader that fails to connect to the API server to renew its leadership. See https://github.com/kubernetes-csi/csi-lib-utils/issues/66 for details.

### Deployment on each node

Normally, external-provisioner is deployed once in a cluster and
communicates with a control instance of the CSI driver which then
provisions volumes via some kind of storage backend API. CSI drivers
which manage local storage on a node don't have such an API that a
central controller could use.

To support this case, external-provisioner can be deployed alongside
each CSI driver on different nodes. The CSI driver deployment must:
- support topology, usually with one topology key
("csi.example.org/node") and the Kubernetes node name as value
- use a service account that has the same RBAC rules as for a normal
deployment
- invoke external-provisioner with `--node-deployment`
- tweak `--node-deployment-base-delay` and `--node-deployment-max-delay`
to match the expected cluster size and desired response times
(only relevant when there are storage classes with immediate binding,
see below for details)
- set the `NODE_NAME` environment variable to the name of the Kubernetes node
- implement `GetCapacity`

Usage of `--strict-topology` and `--immediate-topology=false` is
recommended because it makes the `CreateVolume` invocations simpler.
Topology information is always derived exclusively from the
information returned by the CSI driver that runs on the same node,
without combining that with information stored for other nodes. This
works as long as each node is in its own topology segment,
i.e. usually with a single topology key and one unique value for each
node.

Volume provisioning with late binding works as before, except that
each external-provisioner instance checks the "selected node"
annotation and only creates volumes if that node is the one it runs
on. It also only deletes volumes on its own node.

Immediate binding is also supported, but not recommended. It is
implemented by letting the external-provisioner instances assign a PVC
to one of them: when they see a new PVC with immediate binding, they
all attempt to set the "selected node" annotation with their own node
name as value. Only one update request can succeed; all others get a
"conflict" error and then know that some other instance was faster. To
avoid the thundering herd problem, each instance waits for a random
period before issuing an update request.

When a `CreateVolume` call fails with `ResourcesExhausted`, the normal
recovery mechanism is used, i.e. the external-provisioner instance
removes the "selected node" annotation and the process repeats. But
this triggers events for the PVC and delays volume creation, in
particular when storage is exhausted on most nodes. Therefore the
external-provisioner checks with `GetCapacity` *before* attempting to
own a PVC whether the currently available capacity is sufficient for
the volume. If it isn't, the PVC is ignored and some other instance
can own it.
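
The flow described in the previous paragraphs for immediate binding can be
summarized in code. The following is only a minimal sketch, assuming the
standard `volume.kubernetes.io/selected-node` annotation and a hypothetical
`becomeOwner` helper; the actual implementation in this PR is structured
differently:

```go
package sketch

import (
	"context"
	"math/rand"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// annSelectedNode is the annotation that records which node a PVC has been
// assigned to (normally set by the scheduler for late binding, here set by
// the winning external-provisioner instance for immediate binding).
const annSelectedNode = "volume.kubernetes.io/selected-node"

// becomeOwner checks the local capacity first and then races for the PVC by
// setting the "selected node" annotation. It returns true if this instance
// won the race and should provision the volume.
func becomeOwner(ctx context.Context, kubeClient kubernetes.Interface, csiClient csi.ControllerClient,
	claim *v1.PersistentVolumeClaim, nodeName string, delay time.Duration) (bool, error) {
	// Step 1: ask the local CSI driver whether it has enough space at all.
	// A real implementation would fill in Parameters, VolumeCapabilities and
	// AccessibleTopology from the storage class and the node's topology segment.
	capacity, err := csiClient.GetCapacity(ctx, &csi.GetCapacityRequest{})
	if err != nil {
		return false, err
	}
	requested := claim.Spec.Resources.Requests[v1.ResourceStorage]
	if capacity.AvailableCapacity < requested.Value() {
		// Not enough space on this node, let some other instance own the PVC.
		return false, nil
	}

	// Step 2: sleep a random period to avoid a thundering herd of updates.
	if delay > 0 {
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
	}

	// Step 3: try to set the annotation; only one update can succeed.
	clone := claim.DeepCopy()
	if clone.Annotations == nil {
		clone.Annotations = map[string]string{}
	}
	if _, taken := clone.Annotations[annSelectedNode]; taken {
		return false, nil // someone else already owns it
	}
	clone.Annotations[annSelectedNode] = nodeName
	_, err = kubeClient.CoreV1().PersistentVolumeClaims(clone.Namespace).Update(ctx, clone, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return false, nil // another instance was faster
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

A losing instance simply goes back to watching; the winner proceeds with
`CreateVolume` as usual.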

The `--node-deployment-base-delay` parameter determines the initial
wait period. It also sets the jitter, so in practice the initial wait period will be
in the range from zero to the base delay. If the value is high,
volumes with immediate binding get created more slowly. If it is low,
then the risk of conflicts while setting the "selected node"
annotation increases and the apiserver load will be higher.

There is an exponential backoff per PVC which is used for unexpected
problems. Normally, an owner for a PVC is chosen during the first
attempt, so most PVCs will use the base delay. A maximum can be set
with `--node-deployment-max-delay` to avoid very long delays when
something goes wrong repeatedly.
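
Continuing the sketch above, the per-PVC delay could be derived roughly like
this (again only an illustration; the exact formula used by
external-provisioner may differ):

```go
// delayForAttempt sketches one way to combine --node-deployment-base-delay
// and --node-deployment-max-delay: exponential growth per failed attempt,
// capped at the maximum. The caller then sleeps a random duration in
// [0, delay), as becomeOwner does in the sketch above.
func delayForAttempt(baseDelay, maxDelay time.Duration, attempt int) time.Duration {
	delay := baseDelay
	for i := 0; i < attempt && delay < maxDelay; i++ {
		delay *= 2 // exponential backoff per PVC for repeated problems
	}
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}
```

The result would be passed as the `delay` parameter of `becomeOwner`.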

During scale testing with 100 external-provisioner instances, a base
delay of 20 seconds worked well. When provisioning 3000 volumes, there
were only 500 conflicts which the apiserver handled without getting
overwhelmed. The average provisioning rate of around 40 volumes/second
was the same as with a delay of 10 seconds. The worst-case latency per
volume was probably higher, but that wasn't measured.

**Collaborator:** Can we point to links to results anywhere?

**Contributor Author:** Not at the moment. I intend to write up my experience with scale testing and will publish it as a .md file in perf-tests/log, but I haven't gotten around to it yet. I can link to that once it is available.

**Contributor Author:** I got a chance to run on 1000 nodes. Distributed provisioning was almost as fast as central provisioning in terms of volumes/second for immediate binding, with no overload of the apiserver - see the report in kubernetes/perf-tests#1676.

Note that the QPS settings of kube-controller-manager and
external-provisioner have to be increased at the moment (Kubernetes
1.19) to provision volumes faster than around 4 volumes/second. Those
settings will eventually get replaced with [flow control in the API
server itself](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/).

**Collaborator:** Point to bug tracking api fairness effort?

**Contributor Author:** Added "Those settings will eventually get replaced with flow control in the API server itself."

Beware that if *no* node has sufficient storage available, then no
`CreateVolume` call is attempted at all and thus no events are
generated for the PVC, i.e. some other means of tracking remaining
storage capacity must be used to detect when the cluster runs out of
storage.

Because PVCs with immediate binding get distributed randomly among
nodes, they get spread evenly. If that is not desirable, then it is
possible to disable support for immediate binding in distributed
provisioning with `--node-deployment-immediate-binding=false` and
instead implement a custom policy in a separate controller which sets
the "selected node" annotation to trigger local provisioning on the
desired node.
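
Such a controller only needs to set the same annotation that the
external-provisioner instances otherwise race for. As a minimal,
hypothetical illustration (assuming the `volume.kubernetes.io/selected-node`
key, the imports from the sketch above, plus `fmt` and
`k8s.io/apimachinery/pkg/types`):

```go
// assignClaimToNode sketches how a custom policy controller could trigger
// local provisioning on a specific node by setting the "selected node"
// annotation itself instead of leaving the choice to random assignment.
func assignClaimToNode(ctx context.Context, client kubernetes.Interface, claim *v1.PersistentVolumeClaim, nodeName string) error {
	patch := fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, annSelectedNode, nodeName)
	_, err := client.CoreV1().PersistentVolumeClaims(claim.Namespace).Patch(
		ctx, claim.Name, types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}
```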

### Deleting local volumes after a node failure or removal

When a node with local volumes gets removed from a cluster before
deleting those volumes, the PV and PVC objects may still exist. It may
be possible to remove the PVC normally if the volume was not in use by
any pod on the node, but normal deletion of the volume and thus
deletion of the PV is not possible anymore because the CSI driver
instance on the node is not available or reachable anymore and therefore
Kubernetes cannot be sure that it is okay to remove the PV.

When an administrator is sure that the node is never going to come
back, then the local volumes can be removed manually:
- force-delete objects: `kubectl delete pv <pv> --wait=false --grace-period=0 --force`
- remove all finalizers: `kubectl patch pv <pv> -p '{"metadata":{"finalizers":null}}'`

It may also be necessary to scrub disks before reusing them because
the CSI driver had no chance to do that.

**Collaborator:** Note that they also should make sure the data on the disks are deleted before bringing the disk back into service.

**Contributor Author:** Added "It may also be necessary to scrub disks before reusing them because the CSI driver had no chance to do that."

If there was still a PVC bound to that PV, it will then be moved to
phase "Lost". It has to be deleted and re-created if still needed
because no new volume will be created for it. Editing the PVC to
revert it to phase "Unbound" is not allowed by the Kubernetes
API server.

**Collaborator:** I don't think we need to mention the last sentence. There's a lot of things we can't do :)

**Contributor Author:** Yes, but in this case it is something that might occur to a reader. It certainly occurred to me, so I tried it - unsuccessfully 😅 By saying upfront that it won't work we save someone who is in that situation the time to try that out.

## Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the [community page](http://kubernetes.io/community/).
117 changes: 104 additions & 13 deletions cmd/csi-provisioner/csi-provisioner.go
@@ -29,11 +29,14 @@ import (

"github.com/container-storage-interface/spec/lib/go/csi"
flag "github.com/spf13/pflag"
v1 "k8s.io/api/core/v1"
storagev1 "k8s.io/api/storage/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime/schema"
utilfeature "k8s.io/apiserver/pkg/util/feature"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
v1 "k8s.io/client-go/listers/core/v1"
listersv1 "k8s.io/client-go/listers/core/v1"
storagelistersv1 "k8s.io/client-go/listers/storage/v1"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
@@ -90,6 +93,11 @@ var (
capacityPollInterval = flag.Duration("capacity-poll-interval", time.Minute, "How long the external-provisioner waits before checking for storage capacity changes.")
capacityOwnerrefLevel = flag.Int("capacity-ownerref-level", 1, "The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc.")

enableNodeDeployment = flag.Bool("node-deployment", false, "Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes.")
nodeDeploymentImmediateBinding = flag.Bool("node-deployment-immediate-binding", true, "Determines whether immediate binding is supported when deployed on each node.")
nodeDeploymentBaseDelay = flag.Duration("node-deployment-base-delay", 20*time.Second, "Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding.")
nodeDeploymentMaxDelay = flag.Duration("node-deployment-max-delay", 60*time.Second, "Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding.")

featureGates map[string]bool
provisionController *controller.ProvisionController
version = "unknown"
@@ -116,6 +124,11 @@ func main() {
klog.Fatal(err)
}

node := os.Getenv("NODE_NAME")
if *enableNodeDeployment && node == "" {
klog.Fatal("The NODE_NAME environment variable must be set when using --enable-node-deployment.")
}

if *showVersion {
fmt.Println(os.Args[0], version)
os.Exit(0)
@@ -214,6 +227,9 @@ func main() {
// Generate a unique ID for this provisioner
timeStamp := time.Now().UnixNano() / int64(time.Millisecond)
identity := strconv.FormatInt(timeStamp, 10) + "-" + strconv.Itoa(rand.Intn(10000)) + "-" + provisionerName
if *enableNodeDeployment {
identity = identity + "-" + node
}

factory := informers.NewSharedInformerFactory(clientset, ctrl.ResyncPeriodOfCsiNodeInformer)
var factoryForNamespace informers.SharedInformerFactory // usually nil, only used for CSIStorageCapacity
@@ -224,18 +240,76 @@ func main() {
scLister := factory.Storage().V1().StorageClasses().Lister()
claimLister := factory.Core().V1().PersistentVolumeClaims().Lister()

var csiNodeLister storagelistersv1.CSINodeLister
var vaLister storagelistersv1.VolumeAttachmentLister
if controllerCapabilities[csi.ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME] {
klog.Info("CSI driver supports PUBLISH_UNPUBLISH_VOLUME, watching VolumeAttachments")
vaLister = factory.Storage().V1().VolumeAttachments().Lister()
} else {
klog.Info("CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments")
}
var nodeLister v1.NodeLister

var nodeDeployment *ctrl.NodeDeployment
if *enableNodeDeployment {
nodeDeployment = &ctrl.NodeDeployment{
NodeName: node,
ClaimInformer: factory.Core().V1().PersistentVolumeClaims(),
ImmediateBinding: *nodeDeploymentImmediateBinding,
BaseDelay: *nodeDeploymentBaseDelay,
MaxDelay: *nodeDeploymentMaxDelay,
}
nodeInfo, err := ctrl.GetNodeInfo(grpcClient, *operationTimeout)
if err != nil {
klog.Fatalf("Failed to get node info from CSI driver: %v", err)
}
nodeDeployment.NodeInfo = *nodeInfo
}

var nodeLister listersv1.NodeLister
var csiNodeLister storagelistersv1.CSINodeLister
if ctrl.SupportsTopology(pluginCapabilities) {
csiNodeLister = factory.Storage().V1().CSINodes().Lister()
nodeLister = factory.Core().V1().Nodes().Lister()
if nodeDeployment != nil {
// Avoid watching in favor of fake, static objects. This is particularly relevant for
// Node objects, which can generate significant traffic.
csiNode := &storagev1.CSINode{
ObjectMeta: metav1.ObjectMeta{
Name: nodeDeployment.NodeName,
},
Spec: storagev1.CSINodeSpec{
Drivers: []storagev1.CSINodeDriver{
{
Name: provisionerName,
NodeID: nodeDeployment.NodeInfo.NodeId,
},
},
},
}
node := &v1.Node{
ObjectMeta: metav1.ObjectMeta{
Name: nodeDeployment.NodeName,
},
}
if nodeDeployment.NodeInfo.AccessibleTopology != nil {
for key := range nodeDeployment.NodeInfo.AccessibleTopology.Segments {
csiNode.Spec.Drivers[0].TopologyKeys = append(csiNode.Spec.Drivers[0].TopologyKeys, key)
}
node.Labels = nodeDeployment.NodeInfo.AccessibleTopology.Segments
}
klog.Infof("using local topology with Node = %+v and CSINode = %+v", node, csiNode)

// We make those fake objects available to the topology code via informers which
// never change.
stoppedFactory := informers.NewSharedInformerFactory(clientset, 1000*time.Hour)
csiNodes := stoppedFactory.Storage().V1().CSINodes()
nodes := stoppedFactory.Core().V1().Nodes()
csiNodes.Informer().GetStore().Add(csiNode)
nodes.Informer().GetStore().Add(node)
csiNodeLister = csiNodes.Lister()
nodeLister = nodes.Lister()

} else {
csiNodeLister = factory.Storage().V1().CSINodes().Lister()
nodeLister = factory.Core().V1().Nodes().Lister()
}
}

// -------------------------------
@@ -292,6 +366,7 @@ func main() {
vaLister,
*extraCreateMetadata,
*defaultFSType,
nodeDeployment,
)

provisionController = controller.NewProvisionController(
@@ -311,7 +386,8 @@ func main() {
)

var capacityController *capacity.Controller
if *capacityMode == capacity.DeploymentModeCentral {
if *capacityMode == capacity.DeploymentModeCentral ||
*capacityMode == capacity.DeploymentModeLocal {
podName := os.Getenv("POD_NAME")
namespace := os.Getenv("POD_NAMESPACE")
if podName == "" || namespace == "" {
@@ -328,13 +404,28 @@ }
}
klog.Infof("using %s/%s %s as owner of CSIStorageCapacity objects", controller.APIVersion, controller.Kind, controller.Name)

topologyInformer := topology.NewNodeTopology(
provisionerName,
clientset,
factory.Core().V1().Nodes(),
factory.Storage().V1().CSINodes(),
workqueue.NewNamedRateLimitingQueue(rateLimiter, "csitopology"),
)
var topologyInformer topology.Informer
if *capacityMode == capacity.DeploymentModeCentral {
topologyInformer = topology.NewNodeTopology(
provisionerName,
clientset,
factory.Core().V1().Nodes(),
factory.Storage().V1().CSINodes(),
workqueue.NewNamedRateLimitingQueue(rateLimiter, "csitopology"),
)
} else {
var segment topology.Segment
if nodeDeployment == nil {
klog.Fatal("--capacity-controller-deployment-mode=local is only valid in combination with --node-deployment")
}
if nodeDeployment.NodeInfo.AccessibleTopology != nil {
for key, value := range nodeDeployment.NodeInfo.AccessibleTopology.Segments {
segment = append(segment, topology.SegmentEntry{Key: key, Value: value})
}
}
klog.Infof("producing CSIStorageCapacity objects with fixed topology segment %s", segment)
topologyInformer = topology.NewFixedNodeTopology(&segment)
}

// We only need objects from our own namespace. The normal factory would give
// us an informer for the entire cluster.
1 change: 1 addition & 0 deletions go.mod
@@ -31,6 +31,7 @@ require (
k8s.io/apiserver v0.20.0
k8s.io/client-go v0.20.0
k8s.io/component-base v0.20.0
k8s.io/component-helpers v0.20.0
k8s.io/csi-translation-lib v0.20.0
k8s.io/klog/v2 v2.4.0
k8s.io/kubernetes v1.20.0