distributed provisioning #524
Conversation
Before I proceed with this feature, I'd like to get feedback on whether the approach of letting the "selected node" annotation be set by the external-provisioner is acceptable for volumes with immediate binding. I have tested this on a 50 node cluster and it worked nicely. My argument why it is compatible with the current design is that the state of the PVC is indistinguishable from a PVC that was assigned to a node while the storage class used late binding. This can happen, and if the current code cannot handle it, then it needs to be fixed. However, I don't see why it should fail (famous last words...). @msau42, @jsafrane: what do you think about the UML diagram? That commit itself is unrelated to this PR; I could also submit it separately. I added it here because it makes it easier to understand how the node-local deployment fits into the overall architecture.
Hmm, looks like my recent rebasing broke the unit tests. Will fix that...
Force-pushed 06d26a9 to b89078f
cc @jingxu97
pkg/controller/controller.go
Outdated
klog.V(5).Infof("will try to become owner of PVC %s/%s in %s, attempt #%d", claim.Namespace, claim.Name, delay, attempts)
sleep, cancel := context.WithTimeout(ctx, delay)
defer cancel()
ticker := time.NewTicker(10 * time.Millisecond)
Did you consider using claimInformer to get the events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of polling? Yes, I had been thinking about that, but couldn't come up with a good way to hook up the informer callback with this function here.
Callbacks can only be added, but not removed. So I would have to install some generic callback that gets told at runtime who is interested in which PVC - that felt rather complicated for a small improvement in latency.
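For context, a rough sketch of what such a generic, install-once callback could look like; the `pvcWaiters` type and its methods are invented for this example and are not code from this PR:

```go
package example

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// pvcWaiters lets a single, permanently registered informer event handler
// wake up goroutines that are waiting for updates of specific PVCs. The
// handler is added once; interest in individual PVCs comes and goes at runtime.
type pvcWaiters struct {
	mutex   sync.Mutex
	waiting map[string]chan struct{} // key: "namespace/name"
}

func newPVCWaiters(informer cache.SharedInformer) *pvcWaiters {
	w := &pvcWaiters{waiting: map[string]chan struct{}{}}
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if pvc, ok := newObj.(*v1.PersistentVolumeClaim); ok {
				w.notify(pvc.Namespace + "/" + pvc.Name)
			}
		},
	})
	return w
}

// wait returns a channel that gets closed on the next update of the PVC.
func (w *pvcWaiters) wait(key string) <-chan struct{} {
	w.mutex.Lock()
	defer w.mutex.Unlock()
	ch, ok := w.waiting[key]
	if !ok {
		ch = make(chan struct{})
		w.waiting[key] = ch
	}
	return ch
}

// notify wakes up the waiter for this PVC, if there is one.
func (w *pvcWaiters) notify(key string) {
	w.mutex.Lock()
	defer w.mutex.Unlock()
	if ch, ok := w.waiting[key]; ok {
		close(ch)
		delete(w.waiting, key)
	}
}
```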
Do we need the polling period to be so frequent considering that the delays are in seconds?
Probably not. I've replaced it with 'base delay / 100', so now it scales together with the base delay and (for the default) is at 300ms.
I don't want to make it too large either because then one go routine would be blocked for potentially quite a while before it becomes available for some other PVC work item. With many provisioners running in parallel the actual smallest delay can be smaller than the maximum of 30 seconds.
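A minimal sketch of that change, under the assumption that a lister is used for the periodic check; the function name, the lister parameter, and the error handling are illustrative, not the exact code of this PR. The annotation key is the one the Kubernetes scheduler uses for late binding, which this PR reuses:

```go
package example

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// waitBeforeOwning sleeps for the chosen delay while polling at 1/100th of
// the base delay whether some other instance has already become the owner of
// the PVC. It returns true if this instance should still try to set the
// "selected node" annotation.
func waitBeforeOwning(ctx context.Context, claims corelisters.PersistentVolumeClaimLister, claim *v1.PersistentVolumeClaim, baseDelay, delay time.Duration) (bool, error) {
	sleep, cancel := context.WithTimeout(ctx, delay)
	defer cancel()

	// The poll interval scales with the base delay, e.g. a 30 second base
	// delay results in polling every 300ms.
	ticker := time.NewTicker(baseDelay / 100)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case <-sleep.Done():
			// Our random delay is over, try to become the owner now.
			return true, nil
		case <-ticker.C:
			current, err := claims.PersistentVolumeClaims(claim.Namespace).Get(claim.Name)
			if err != nil {
				return false, err
			}
			if current.Annotations["volume.kubernetes.io/selected-node"] != "" {
				// Another instance won the race for this PVC.
				return false, nil
			}
		}
	}
}
```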
I think there will be some changes necessary for scalability, but I'm fine with merging this and fixing them later after we get some numbers.
One thing troubles me more: PV deletion.
What should happen when a node is removed from the cluster? All its PVs can't be deleted - there is no provisioner that would do that. Is it just a documentation issue? Or shall there be a cluster-level provisioner that deletes PVs from missing nodes? What if the node rejoins the cluster?
pkg/controller/controller.go
Outdated
// themselves.
//
// With a value of 10 seconds, when creating 5000
// volumes on a cluster with 50 instances only ~300
Just to be sure, the number of conflicts does not grow linearly with the number of nodes, right? 5000 nodes != 30 000 conflicts, but much more. How much?
I'll try to test that.
pkg/controller/controller.go
Outdated
// It might make sense to make this value configurable so
// that CSI driver deployments can tweak it depending
// on their needs.
baseDelay := 10 * time.Second
This IMO needs to be configurable or grow with the cluster size (as a separate PR)
This got me thinking. The current implementation does exponential backoff per PVC. This is almost never going to make a difference in practice because after the initial conflict, the PVC has a selected node and there's no need to do another update attempt for it. The only exception is when provisioning on that selected node fails and there has to be a global retry - that should be rare.
In this approach, we can't make this grow with the cluster size either: we cannot assume that the CSI driver runs on all nodes. It might manage specialized resources that are only available on some nodes. With deployment handled by Kubernetes, the individual instances have no information about which other instances exist.
I had considered making this independent of the PVC. The issue with that is the question of when to reduce the delay and how much. When the "winning" instance resets it to zero, then it is also likely to win for the next PVC until its space is exhausted and it stops trying to own a PVC (because of the capacity check). I preferred even spreading across the cluster. But perhaps filling up one node is also fine: if users want a certain policy for volume scheduling, they should use late binding and influence volume placement indirectly through that.
I think I will give that a try...
It ended up working worse. The main problem was that the instance with the smallest backoff delay ends up grabbing more volumes for provisioning than it has space for, so once it is the owner it fails for at least some of them. Those are then made available to other instances, but the same pattern just repeats.
It could be made to work if the provisioner could model remaining capacity, considering how many volumes have already been assigned, but that is driver specific and complex.
This is more of a problem under load (= many pending PVCs) and with a full cluster (= capacity exhausted on some nodes), but even so, spreading volumes evenly based purely on statistics worked better.
So I'll stick with the current approach and perhaps tweak it a bit: when tracking the number of collisions per number of PVCs, it should be possible to scale the base delay.
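A minimal sketch of the kind of randomized, per-PVC backoff described in this thread; the function and parameter names are assumptions, not the actual implementation:

```go
package example

import (
	"math/rand"
	"time"
)

// provisionDelay picks how long this instance waits before trying to set the
// "selected node" annotation on a PVC with immediate binding. Every instance
// chooses a random delay below the same upper bound, so PVCs get spread
// across the cluster statistically. The upper bound doubles per failed
// attempt for the same PVC, capped at maxDelay; in practice this only
// matters when provisioning on the previously selected node failed and the
// PVC has to be retried globally.
func provisionDelay(baseDelay, maxDelay time.Duration, attempts int) time.Duration {
	upper := baseDelay
	for i := 0; i < attempts; i++ {
		upper *= 2
		if upper >= maxDelay {
			upper = maxDelay
			break
		}
	}
	// Random value in [0, upper).
	return time.Duration(rand.Int63n(int64(upper)))
}
```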
> So I'll stick with the current approach and perhaps tweak it a bit: when tracking the number of collisions per number of PVCs, it should be possible to scale the base delay.
That has the same problem. When using a running average of the failure rate as indicator for "delay is too low", I saw increasing numbers of failed provisioning attempts because of full nodes.
TLDR: cluster admins will have to adjust the delay for the expected cluster size and number of volumes.
I just had a call today regarding that: I will get access to a 100 node cluster from SIG-Scalability that I can run scalability tests on. We don't need to merge for that, I'll simply use a custom image. If it works on 100 nodes, I'll try to run with more.
I think we need to document how to deal with this: if the admin is sure that the node is gone, they will have to force-delete the orphaned PVs from that node, including removal of finalizers. The same problem occurs when the CSI driver handles communication with nodes internally - then it is up to the driver to handle the issue. In PMEM-CSI we have an open bug around that; we erred towards allowing PV removal when uncertain, with the result that we sometimes don't delete a volume on a node.
That's exactly the problem. Without further information, Kubernetes cannot know whether the node is truly gone. I'm hoping that the work on better node shutdown handling may lead to a better solution, but that'll take time.
Force-pushed 89655f7 to 36e96f4
/assign @verult
Force-pushed aeb9002 to d9ae323
I've pushed updated commits that reflect the current status. I included a temporary commit for using kubernetes-sigs/sig-storage-lib-external-provisioner#100, but there is some issue with the replace statement. @verult IMHO it makes sense to review anyway. Let's do a new sig-storage-lib-external-provisioner release with that pending PR and then I can use that cleanly here.
Force-pushed 0538807 to 14e474f
I've added one commit with documentation for that.
I took out that commit to make this PR ready for merging now.
Force-pushed 14e474f to 1aaef5f
/retest
/retest
/retest
/retest
Immediate binding is not recommended, but is needed for the sake of feature parity. With immediate binding also supported, the code no longer just passively checks the selected node, so a different name seems more appropriate. Besides implementing immediate binding support, the original implementation also gets fixed: DeleteVolume was called by all external-provisioner instances. On most nodes it then looked as if the volume had already been removed, and the PV got removed before the node that actually held the volume had a chance to finish deleting it.
When deploying external-provisioner on each node, the topology information that it needs is most likely just the values reported by the local CSI driver instance. We can avoid the extra work for watching Node and CSINode in that case.
This is intentionally a separate section because although it applies to distributed provisioning, the same problem also arises when a CSI driver handles provisioning of local volumes differently.
Producing CSIStorageCapacity objects for a node uses the same code; the only difference is that there is just a single topology segment that the external-provisioner needs to iterate over. Also, that segment is fixed. Therefore we can use the simple mock informer that previously was only used for testing.
Force-pushed 3e7ea65 to b9301ee
Rebased to resolve the conflict in pkg/controller/controller.go.
README.md
Outdated
* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.
* `--node-deployment-immediate-binding`: Determines whether immediate binding is supported when deployed on each node. Enabled by default; use `--node-deployment-immediate-binding=false` if not desired.
I'm not sure we need to have this option. While we should encourage everyone to use delayed binding, immediate binding is still an available option that users can set independently of a driver.
Ok I saw this before seeing the explanation below for a custom policy. Maybe worth mentioning that you should set it to false if you want to implement your own custom algorithm for immediate binding.
Okay. Added "Disabling it may be useful for example when a custom controller will select nodes for PVCs."
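For readers wondering what such a custom controller would do: it would pick a node with its own policy and then set the selected-node annotation on the PVC, the same annotation the Kubernetes scheduler uses for late binding. A hedged sketch; the function name and clientset usage are illustrative, not part of this PR:

```go
package example

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// selectNodeForPVC assigns a PVC to a node by patching the selected-node
// annotation. A node-local external-provisioner instance then only
// provisions the volume if the annotation matches its own node name.
func selectNodeForPVC(ctx context.Context, client kubernetes.Interface, pvc *v1.PersistentVolumeClaim, nodeName string) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"volume.kubernetes.io/selected-node":%q}}}`,
		nodeName))
	_, err := client.CoreV1().PersistentVolumeClaims(pvc.Namespace).Patch(
		ctx, pvc.Name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```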
README.md
Outdated
* `--node-deployment-max-delay`: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds.
* `--local-topology`: Instead of watching Node and CSINode objects, use only the topology provided by the CSI driver. Only valid in combination with `--node-deployment`. Disabled by default, but recommended for drivers which have a single topology key with different values for each node (i.e. local volumes).
Do you see a reason why someone would want local-topology false and node-deployment? I would prefer not supporting non-local topology in node mode until a use case comes up that needs it since per-node informers can be expensive.
> Do you see a reason why someone would want local-topology false and node-deployment?
I could not quite convince myself that there really is no situation where a CSI driver wants local deployment and has more complex topology. I don't have a specific example, it is just a feeling.
I don't mind removing the command line option and always doing "local topology".
Yeah I think that would simplify things initially.
Done.
I kept the removal in a separate commit, in case we want to bring it back.
@@ -84,6 +84,18 @@ See the [storage capacity section](#capacity-support) below for details.
* `--capacity-for-immediate-binding <bool>`: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging. Defaults to `false`.

##### Distributed provisioning

* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.
Can this be consolidated with `--capacity-controller-deployment-mode=local`?
We need some way to do local deployment without enabling storage capacity because that might not be supported by the cluster or not desired, so we cannot have just one flag that enables both.
I'm wondering if we can avoid flag skew issues like `--node-deployment=true` and `--capacity-controller-deployment-mode=central`.
Can we change the capacity flag to be a boolean, and then determine which mode to use based on node deployment?
> Can we change the capacity flag to be a boolean, and then determine which mode to use based on node deployment?

In retrospect that would be nicer, but it is a bit late to change the semantics of `--capacity-controller-deployment-mode` that way - we would have to do a major update of external-provisioner for that because it is a breaking change.
It's an alpha feature, I think we can make breaking changes to it.
So `--enable-capacity=true/false` instead of `--capacity-controller-deployment-mode=central`, and the actual mode of operation (central vs. local) determined by `--node-deployment`?
Works for me, I just won't get to it today.
That sounds good to me! Do you think enable-capacity will always be a permanent flag, or would it fit a feature gate better?
I think a permanent flag is better. The feature causes additional work and might not be that relevant for some drivers (like those where all volumes are available everywhere), so being able to keep it turned off will remain useful.
with `--node-deployment-max-delay` anyway, to avoid very long delays
when something went wrong repeatedly.

During scale testing with 100 external-provisioner instances, a base
Can we point to links to results anywhere?
Not at the moment. I intend to write up my experience with scale testing and will publish it as a .md file in perf-tests/log, but haven't gotten around to it yet. I can link to that once it is available.
I got a chance to run on 1000 nodes. Distributed provisioning was almost as fast as central provisioning in terms of volumes/second for immediate binding, with no overload of the apiserver - see the report in kubernetes/perf-tests#1676.
was the same as with a delay of 10 seconds. The worst-case latency per
volume was probably higher, but that wasn't measured.

Note that the QPS settings of kube-controller-manager and
Point to bug tracking api fairness effort?
Added " Those
settings will eventually get replaced with flow control in the API
server
itself."
If there still was a PVC which was bound to that PV, it then will be
moved to phase "Lost". It has to be deleted and re-created if still
needed because no new volume will be created for it. Editing the PVC
to revert it to phase "Unbound" is not allowed by the Kubernetes
I don't think we need to mention the last sentence. There's a lot of things we can't do :)
Yes, but in this case it is something that might occur to a reader. It certainly occurred to me, so I tried it - unsuccessfully 😅
By saying upfront that it won't work we save someone who is in that situation the time to try that out.
Kubernetes cannot be sure that it is okay to remove the PV.

When an administrator is sure that the node is never going to come
back, then the local volumes can be removed manually:
Note that they should also make sure the data on the disks is deleted before bringing the disks back into service.
Added "It may also be necessary to scrub disks before reusing them because
the CSI driver had no chance to do that."
pkg/controller/controller.go
Outdated
// NodeDeployment contains additional parameters for running external-provisioner alongside a
// CSI driver on one or more nodes in the cluster.
type NodeDeployment struct {
	NodeName string
Can you add some comments on what each of these fields do?
Done.
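For readers of this thread, the commented struct might end up looking roughly like this; only `NodeName` is visible in the diff above, the remaining fields and their comments are assumptions based on the flags discussed in this PR:

```go
package example

import "time"

// NodeDeployment contains additional parameters for running external-provisioner alongside a
// CSI driver on one or more nodes in the cluster.
type NodeDeployment struct {
	// NodeName is the name of the node where external-provisioner and the CSI
	// driver run. Only PVCs assigned to this node get provisioned here.
	NodeName string
	// ImmediateBinding enables trying to become the owner of PVCs in storage
	// classes with immediate binding.
	ImmediateBinding bool
	// BaseDelay is the initial upper bound for the random delay before trying
	// to own such a PVC.
	BaseDelay time.Duration
	// MaxDelay limits how far the delay can grow through exponential backoff.
	MaxDelay time.Duration
}
```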
switch selectedNode {
case "":
	logger := klog.V(5)
	if logger.Enabled() {
Any reason why we do this method of checking log level?
Because then the check only needs to be done once and the overhead for `defer` can be avoided entirely in most cases. It's not important; I can also use `klog.V(5).Info` instead if you prefer that.
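For anyone unfamiliar with the idiom under discussion, a standalone example; the wrapper function is invented for illustration:

```go
package example

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// logBecomeOwner demonstrates the pattern: evaluate the verbosity check once
// and skip all formatting work when -v=5 is not enabled.
func logBecomeOwner(claim *v1.PersistentVolumeClaim, delay time.Duration, attempts int) {
	logger := klog.V(5)
	if logger.Enabled() {
		logger.Infof("will try to become owner of PVC %s/%s in %s, attempt #%d",
			claim.Namespace, claim.Name, delay, attempts)
	}
}
```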
Force-pushed 685a7a3 to 28257ef
@msau42: please take another look.
It is uncertain whether that option is needed. Removing it simplifies the code.
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: msau42, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
A CSI driver which manages local resources typically has no control plane and thus (currently) can only provision volumes on a single node. With this PR, external-provisioner gets extended to support deployment alongside a local CSI driver on each node.
The csi-driver-host-path deployment could be extended to use this.
Which issue(s) this PR fixes:
Fixes #487
Special notes for your reviewer:
This is based on #367. Out of courtesy to @jsanda, the original commit is the one from that PR although it gets changed substantially later on. I could also squash more aggressively, if that is desired.
TODO:
Does this PR introduce a user-facing change?: