distributed provisioning #524

Merged
7 commits merged into kubernetes-csi:master from the distributed-provisioning branch on Dec 16, 2020

Conversation

pohly
Contributor

@pohly pohly commented Nov 3, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:

A CSI driver which manages local resources typically has no control plane and thus (currently) can only provision volumes on a single node. With this PR, external-provisioner gets extended to support deployment alongside a local CSI driver on each node.

The csi-driver-host-path deployment could be extended to use this.

Which issue(s) this PR fixes:
Fixes #487

Special notes for your reviewer:

This is based on #367. Out of courtesy to @jsanda, the original commit is the one from that PR although it gets changed substantially later on. I could also squash more aggressively, if that is desired.

TODO:

  • make the UML diagram nicer (text formatting, potentially layout)
  • scale testing: done on 100 nodes, maybe try something higher
  • automated tests for the new code paths
  • storage capacity tracking support also for distributed provisioning

Does this PR introduce a user-facing change?:

external-provisioner can be deployed alongside a CSI driver on each node to manage local volumes.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 3, 2020
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 3, 2020
@pohly
Contributor Author

pohly commented Nov 3, 2020

Before I proceed with this feature, I'd like to get feedback on whether the approach of letting the "selected node" annotation be set by the external-provisioner is acceptable for volumes with immediate binding. I have tested this on a 50-node cluster and it worked nicely.

My argument for why it is compatible with the current design is that the state of the PVC is indistinguishable from that of a PVC which was assigned to a node while the storage class used late binding. That can already happen, and if the current code cannot handle it, then it needs to be fixed. However, I don't see why it should fail (famous last words...).

@msau42, @jsafrane: what do you think about the UML diagram? That commit itself is unrelated to this PR, I could also submit it separately. I added it here because it makes it easier to understand how the node-local deployment fits into the overall architecture.

@pohly
Contributor Author

pohly commented Nov 3, 2020

Hmm, looks like my recent rebasing broke the unit tests. Will fix that...

@pohly pohly force-pushed the distributed-provisioning branch from 06d26a9 to b89078f Compare November 4, 2020 12:33
@jingxu97
Contributor

cc @jingxu97

README.md Outdated
pkg/controller/ratelimiter.go Outdated
klog.V(5).Infof("will try to become owner of PVC %s/%s in %s, attempt #%d", claim.Namespace, claim.Name, delay, attempts)
sleep, cancel := context.WithTimeout(ctx, delay)
defer cancel()
ticker := time.NewTicker(10 * time.Millisecond)
Contributor

Did you consider using claimInformer to get the events?

Contributor Author

Instead of polling? Yes, I had been thinking about that, but couldn't come up with a good way to hook up the informer callback with this function here.

Callbacks can only be added, but not removed. So I would have to install some generic callback that gets told at runtime who is interested in which PVC - felt rather complicated for a rather small improvement in latency.

Collaborator

Do we need the polling period to be so frequent considering that the delays are in seconds?

Contributor Author

Probably not. I've replaced it with 'base delay / 100', so now it scales together with the base delay and (for the default) is at 300ms.

I don't want to make it too large either because then one go routine would be blocked for potentially quite a while before it becomes available for some other PVC work item. With many provisioners running in parallel the actual smallest delay can be smaller than the maximum of 30 seconds.
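For readers following this thread, here is a minimal sketch of the poll-and-wait pattern being discussed: sleep for the computed delay while periodically checking, through the shared informer's lister, whether another instance has already set the selected-node annotation. The function name, the lister parameter, and the annotation key are illustrative assumptions, not the merged code; the poll interval would be derived from the base delay (e.g. baseDelay/100) as described above.

```go
package provisioner

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// annSelectedNode is the annotation that records which node a PVC was
// assigned to (key assumed here for illustration).
const annSelectedNode = "volume.kubernetes.io/selected-node"

// waitBeforeOwning sleeps for the backoff delay computed for this PVC while
// periodically checking whether some other external-provisioner instance
// already became the owner. It returns true if this instance should now try
// to set the selected-node annotation itself, false if another instance won
// or the surrounding context was cancelled.
func waitBeforeOwning(ctx context.Context, lister corelisters.PersistentVolumeClaimLister, claim *v1.PersistentVolumeClaim, delay, pollInterval time.Duration) bool {
	sleep, cancel := context.WithTimeout(ctx, delay)
	defer cancel()
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-sleep.Done():
			// Delay elapsed without anyone else taking the PVC,
			// unless the parent context was cancelled (shutdown).
			return ctx.Err() == nil
		case <-ticker.C:
			current, err := lister.PersistentVolumeClaims(claim.Namespace).Get(claim.Name)
			if err == nil && current.Annotations[annSelectedNode] != "" {
				// Another instance became the owner in the meantime.
				return false
			}
		}
	}
}
```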

doc/provisioning.puml Outdated
Contributor

@jsafrane jsafrane left a comment

I think there will be some changes necessary for scalability, but I'm fine with merging this and fixing them later after we get some numbers.

One thing troubles me more: PV deletion.
What should happen when a node is removed from the cluster? None of its PVs can be deleted - there is no provisioner left that would do that. Is it just a documentation issue? Or shall there be a cluster-level provisioner that deletes PVs from missing nodes? What if the node rejoins the cluster?

// themselves.
//
// With a value of 10 seconds, when creating 5000
// volumes on a cluster with 50 instances only ~300
Contributor

Just to be sure, nr. of conflicts does not grow linearly with nr. of nodes, right? 5000 nodes != 30 000 conflicts, but much more. How much?

Contributor Author

I'll try to test that.

// It might make sense to make this value configurable so
// that CSI driver deployments can tweak it depending
// on their needs.
baseDelay := 10 * time.Second
Contributor

This IMO needs to be configurable or grow with the cluster size (as a separate PR)

Contributor Author

This got me thinking. The current implementation does exponential backoff per PVC. This is almost never going to make a difference in practice because after the initial conflict, the PVC has a selected node and there's no need to do another update attempt for it. The only exception is when provisioning on that selected node fails and there has to be a global retry - that should be rare.

In this approach, we can't make this grow with the cluster size either: we cannot assume that the CSI driver runs on all nodes. It might manage specialized resources that are only available on some nodes. With deployment handled by Kubernetes, the individual instances have no information about which other instances exist.

I had considered making this independent of the PVC. The issue with that is the question of when to reduce the delay and how much. When the "winning" instance resets it to zero, then it is also likely to win for the next PVC until its space is exhausted and it stops trying to own a PVC (because of the capacity check). I preferred even spreading across the cluster. But perhaps filling up one node is also fine: if users want a certain policy for volume scheduling, they should use late binding and influence volume placement indirectly through that.

I think I will give that a try...

Contributor Author

It ended up working worse. The main problem was that the instance with the smallest backoff delay ends up grabbing more volumes for provisioning than what it has space for, so once it is the owner it fails for at least some of them. Those then are made available to other instances, but the same pattern just repeats.

It could be made to work if the provisioner could model the remaining capacity while taking into account how many volumes have already been assigned, but that is driver-specific and complex.

This is more of a problem under load (= many pending PVCs) and a full cluster (= capacity exhausted on some nodes), but still, when volumes are spread evenly purely based on statistics, it worked better.

So I'll stick with the current approach and perhaps tweak it a bit: when tracking the number of collisions per number of PVCs, it should be possible to scale the base delay.

Contributor Author

> So I'll stick with the current approach and perhaps tweak it a bit: when tracking the number of collisions per number of PVCs, it should be possible to scale the base delay.

That has the same problem. When using a running average of the failure rate as an indicator for "delay is too low", I saw increasing numbers of failed provisioning attempts because of full nodes.

TLDR: cluster admins will have to adjust the delay for the expected cluster size and number of volumes.
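As a rough sketch of the per-PVC exponential backoff with random spreading that this thread settles on (the helper name, the use of a workqueue rate limiter, and the jitter are assumptions, not the merged code):

```go
package provisioner

import (
	"math/rand"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// ownershipDelay returns how long this instance should wait before trying to
// become the owner of the given PVC. The rate limiter tracks failures per
// PVC, so the upper bound starts at the base delay and doubles after each
// conflict up to the configured maximum; the actual delay is randomized so
// that the instances spread out instead of colliding again.
func ownershipDelay(limiter workqueue.RateLimiter, claim *v1.PersistentVolumeClaim) time.Duration {
	// e.g. limiter := workqueue.NewItemExponentialFailureRateLimiter(baseDelay, maxDelay)
	upper := limiter.When(string(claim.UID))
	// Pick a random point in [0, upper] so that not all instances wake up at once.
	return time.Duration(rand.Int63n(int64(upper) + 1))
}
```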

@pohly
Contributor Author

pohly commented Nov 25, 2020

> I think there will be some changes necessary for scalability, but I'm fine with merging this and fixing them later after we get some numbers.

I just had a call today regarding that: I will get access to a 100 node cluster from SIG-Scalability that I can run scalability tests on. We don't need to merge for that, I'll simply use a custom image. If it works on 100 nodes, I'll try to run with more.

> One thing troubles me more: PV deletion.
> What should happen when a node is removed from the cluster? None of its PVs can be deleted - there is no provisioner left that would do that. Is it just a documentation issue?

I think we need to document how to deal with this: if the admin is sure that the node is gone, they will have to force-delete orphan PVs from the node, including removal of finalizers.

The same problem occurs when the CSI driver handles communication with nodes internally - if it handles the issue at all. In PMEM-CSI we have an open bug around that; we erred towards allowing PV removal when uncertain, with the result that we sometimes don't delete a volume on a node.

> Or shall there be a cluster-level provisioner that deletes PVs from missing nodes? What if the node rejoins the cluster?

That's exactly the problem. Without further information, Kubernetes cannot know whether the node is truly gone. I'm hoping that the work on better node shutdown handling may lead to a better solution, but that'll take time.
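To make the manual cleanup mentioned above concrete, here is a hedged client-go sketch of what force-deleting an orphaned PV amounts to: strip the finalizers that the missing node-local provisioner would normally clear, then delete the object. The function is illustrative, not code from the PR; an admin would typically do the equivalent with kubectl.

```go
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// forceDeleteOrphanPV removes the finalizers that the (now gone) node-local
// external-provisioner would normally clear and then deletes the PV object.
// Only do this when it is certain that the node and its volumes are gone for
// good.
func forceDeleteOrphanPV(ctx context.Context, client kubernetes.Interface, pvName string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	if _, err := client.CoreV1().PersistentVolumes().Patch(ctx, pvName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}
	return client.CoreV1().PersistentVolumes().Delete(ctx, pvName, metav1.DeleteOptions{})
}
```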

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 25, 2020
@pohly pohly force-pushed the distributed-provisioning branch from 89655f7 to 36e96f4 Compare November 30, 2020 10:15
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 30, 2020
@msau42
Collaborator

msau42 commented Dec 3, 2020

/assign @verult

@kubernetes-csi kubernetes-csi deleted a comment from k8s-ci-robot Dec 3, 2020
@pohly pohly force-pushed the distributed-provisioning branch 2 times, most recently from aeb9002 to d9ae323 Compare December 3, 2020 11:14
@pohly
Contributor Author

pohly commented Dec 3, 2020

I've pushed updated commits that reflect the current status.

I included a temporary commit for using kubernetes-sigs/sig-storage-lib-external-provisioner#100 but there is some issue with the replace statement.

@verult IMHO it makes sense to review anyway. Let's do a new sig-storage-lib-external-provisioner release with that pending PR and then I can use that cleanly here.

@pohly pohly force-pushed the distributed-provisioning branch 2 times, most recently from 0538807 to 14e474f Compare December 3, 2020 16:29
@pohly
Contributor Author

pohly commented Dec 3, 2020

> I think we need to document how to deal with this: if the admin is sure that the node is gone, they will have to force-delete orphan PVs from the node, including removal of finalizers.

I've added one commit with documentation for that.

> I included a temporary commit for using kubernetes-sigs/sig-storage-lib-external-provisioner#100 but there is some issue with the replace statement.

I took out that commit to make this PR ready for merging now.

The only remaining gap is support for storage capacity tracking in distributed provisioning mode. I'll work on that next, but the PR is already useful without it. Edit: done.

@pohly pohly changed the title RFC: distributed provisioning distributed provisioning Dec 3, 2020
@pohly pohly force-pushed the distributed-provisioning branch from 14e474f to 1aaef5f Compare December 4, 2020 07:22
@pohly
Contributor Author

pohly commented Dec 15, 2020

/retest

1 similar comment
@pohly
Contributor Author

pohly commented Dec 15, 2020

/retest

@msau42
Collaborator

msau42 commented Dec 15, 2020

/rete

@msau42
Collaborator

msau42 commented Dec 15, 2020

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2020
jsanda and others added 5 commits December 15, 2020 20:16
Immediate binding is not recommended, but is needed for the sake of
feature parity. With immediate binding also supported, the code no
longer just passively checks the selected node, so a different name
seems more appropriate.

Besides implementing immediate binding support, the original
implementation also gets fixed: DeleteVolume was called by all
external-provisioner instances. On most nodes it then looked as if the
volume had already been removed, and the PV got removed before the node
that actually held the volume had a chance to finish deleting it.
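A hedged sketch of the kind of ownership check that this fix implies: a node-local instance only acts on PVs whose node affinity matches its own topology, so DeleteVolume gets called exactly once, on the node that actually holds the volume. The function name, the topology-key parameter, and the matching rules are assumptions for illustration.

```go
package provisioner

import (
	v1 "k8s.io/api/core/v1"
)

// ownsVolume reports whether the local node (identified by its value for the
// driver's topology key) is the one that should handle deletion of this PV.
func ownsVolume(pv *v1.PersistentVolume, topologyKey, localValue string) bool {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return false
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key != topologyKey || expr.Operator != v1.NodeSelectorOpIn {
				continue
			}
			for _, value := range expr.Values {
				if value == localValue {
					return true
				}
			}
		}
	}
	return false
}
```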
When deploying external-provisioner on each node, the topology
information that it needs is most likely just the values reported by
the local CSI driver instance. We can avoid the extra work for
watching Node and CSINode in that case.
This is intentionally a separate section because although it applies
to distributed provisioning, the same problem also arises when a CSI
driver handles provisioning of local volumes differently.
Producing CSIStorageCapacity objects for a node uses the same code,
the only difference is that there is just a single topology segment
that the external-provisioner needs to iterate over.

Also, that segment is fixed. Therefore we can use the simple mock
informer that previously was only used for testing.
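For illustration, the object published per node and storage class might look like the following sketch, using the v1alpha1 CSIStorageCapacity API that existed at the time; the helper and its parameters are assumptions.

```go
package provisioner

import (
	storagev1alpha1 "k8s.io/api/storage/v1alpha1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// capacityForLocalSegment builds the CSIStorageCapacity object that a
// node-local external-provisioner would publish: one object per storage
// class, all referring to the same fixed topology segment of the local node.
func capacityForLocalSegment(namespace, storageClass, topologyKey, nodeValue string, available resource.Quantity) *storagev1alpha1.CSIStorageCapacity {
	return &storagev1alpha1.CSIStorageCapacity{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "csisc-",
			Namespace:    namespace,
		},
		StorageClassName: storageClass,
		NodeTopology: &metav1.LabelSelector{
			MatchLabels: map[string]string{topologyKey: nodeValue},
		},
		Capacity: &available,
	}
}
```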
@pohly pohly force-pushed the distributed-provisioning branch from 3e7ea65 to b9301ee Compare December 15, 2020 19:21
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2020
@pohly
Contributor Author

pohly commented Dec 15, 2020

Rebased to resolve the conflict in pkg/controller/controller.go.

README.md Outdated

* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.

* `--node-deployment-immediate-binding`: Determines whether immediate binding is supported when deployed on each node. Enabled by default, use `--node-deployment-immediate-binding=false` if not desired.
Collaborator

I'm not sure we need to have this option. While we should encourage everyone to use delayed binding, immediate binding is still an available option that users can set independently of a driver.

Collaborator

Ok I saw this before seeing the explanation below for a custom policy. Maybe worth mentioning that you should set it to false if you want to implement your own custom algorithm for immediate binding.

Contributor Author

Okay. Added "Disabling it may be useful for example when a custom controller will select nodes for PVCs."

README.md Outdated

* `--node-deployment-max-delay`: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds.

* `--local-topology`: Instead of watching Node and CSINode objects, use only the topology provided by the CSI driver. Only valid in combination with `--node-deployment`. Disabled by default, but recommended for drivers which have a single topology key with different values for each node (i.e. local volumes).
Collaborator

Do you see a reason why someone would want local-topology false and node-deployment? I would prefer not supporting non-local topology in node mode until a use case comes up that needs it since per-node informers can be expensive.

Contributor Author

> Do you see a reason why someone would want local-topology false and node-deployment?

I could not quite convince myself that there really is no situation where a CSI driver wants local deployment and has more complex topology. I don't have a specific example, it is just a feeling.

I don't mind removing the command line option and always doing "local topology".

Collaborator

Yeah I think that would simplify things initially.

Contributor Author

Done.

I kept the removal in a separate commit, in case we want to bring it back.
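A sketch of what "local topology" means in this mode: instead of watching Node and CSINode objects, ask the local CSI driver once for its accessible topology via NodeGetInfo and use that single segment. Connection handling and the helper name are simplified assumptions.

```go
package provisioner

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// localTopology retrieves the single topology segment of the local node from
// the CSI driver instead of deriving it from Node/CSINode objects.
func localTopology(ctx context.Context, conn *grpc.ClientConn) (map[string]string, error) {
	client := csi.NewNodeClient(conn)
	resp, err := client.NodeGetInfo(ctx, &csi.NodeGetInfoRequest{})
	if err != nil {
		return nil, err
	}
	if resp.AccessibleTopology == nil {
		// Driver does not report topology; nothing to filter on.
		return nil, nil
	}
	return resp.AccessibleTopology.Segments, nil
}
```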

@@ -84,6 +84,18 @@ See the [storage capacity section](#capacity-support) below for details.

* `--capacity-for-immediate-binding <bool>`: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging. Defaults to `false`.

##### Distributed provisioning

* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.
Collaborator

Can this be consolidated with --capacity-controller-deployment-mode=local?

Contributor Author

We need some way to do local deployment without enabling storage capacity because that might not be supported by the cluster or not desired, so we cannot have just one flag that enables both.

Collaborator

I'm wondering if we can avoid flag skew issues like --node-deployment=true and --capacity-controller-deployment-mode=central.

Can we change the capacity flag to be a boolean and then determine which mode to use based on node deployment?

Contributor Author

> Can we change the capacity flag to be a boolean and then determine which mode to use based on node deployment?

In retrospect that would be nicer, but it is a bit late to change the semantics of --capacity-controller-deployment-mode that way - we would have to do a major update of external-provisioner because it is a breaking change.

Collaborator

It's an alpha feature, I think we can make breaking changes to it.

Contributor Author

So --enable-capacity=true/false instead of --capacity-controller-deployment-mode=central and the actual mode of operation (central vs. local) determined by --node-deployment?

Works for me, I just won't get to it today.

Collaborator

That sounds good to me! Do you think enable-capacity will always be a permanent flag, or would it fit a feature gate better?

Contributor Author

I think a permanent flag is better. The feature causes additional work and might not be that relevant for some drivers (like those where all volumes are available everywhere), so being able to keep it turned off will remain useful.
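A small sketch of the flag combination agreed on above; the flag names follow this discussion and the README excerpts in this thread, while the help texts and defaults are illustrative:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

var (
	nodeDeployment = flag.Bool("node-deployment", false,
		"deploy external-provisioner together with the CSI driver on each node to manage node-local volumes")
	enableCapacity = flag.Bool("enable-capacity", false,
		"produce CSIStorageCapacity objects for the driver")
	maxDelay = flag.Duration("node-deployment-max-delay", 60*time.Second,
		"maximum wait before trying to own a PVC with immediate binding")
)

func main() {
	flag.Parse()

	// The central vs. local mode for capacity tracking is no longer a
	// separate flag value; it follows from whether node deployment is enabled.
	capacityMode := "central"
	if *nodeDeployment {
		capacityMode = "local"
	}
	fmt.Printf("capacity enabled: %v, mode: %s, max delay: %s\n",
		*enableCapacity, capacityMode, *maxDelay)
}
```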

with `--node-deployment-max-delay` anyway, to avoid very long delays
when something went wrong repeatedly.

During scale testing with 100 external-provisioner instances, a base
Collaborator

Can we link to the results anywhere?

Contributor Author

Not at the moment. I intend to write up my experience with scale testing and will publish it as a .md file in perf-tests/log but haven't gotten around to it yet. I can link to that once it is available.

Contributor Author

I got a chance to run on 1000 nodes. Distributed provisioning was almost as fast as central provisioning in terms of volumes/second for immediate binding with no overload of the apiserver - see report in kubernetes/perf-tests#1676

was the same as with a delay of 10 seconds. The worst-case latency per
volume was probably higher, but that wasn't measured.

Note that the QPS settings of kube-controller-manager and
Collaborator

Point to the bug tracking the API fairness effort?

Contributor Author

Added " Those
settings will eventually get replaced with flow control in the API
server
itself
."

If there still was a PVC which was bound to that PV, it then will be
moved to phase "Lost". It has to be deleted and re-created if still
needed because no new volume will be created for it. Editing the PVC
to revert it to phase "Unbound" is not allowed by the Kubernetes
Collaborator

I don't think we need to mention the last sentence. There's a lot of things we can't do :)

Contributor Author

Yes, but in this case it is something that might occur to a reader. It certainly occurred to me, so I tried it - unsuccessfully 😅

By saying upfront that it won't work, we save someone in that situation the time of trying it out.

Kubernetes cannot be sure that it is okay to remove the PV.

When an administrator is sure that the node is never going to come
back, then the local volumes can be removed manually:
Collaborator

Note that they should also make sure the data on the disks is deleted before bringing the disks back into service.

Contributor Author

Added "It may also be necessary to scrub disks before reusing them because
the CSI driver had no chance to do that."

// NodeDeployment contains additional parameters for running external-provisioner alongside a
// CSI driver on one or more nodes in the cluster.
type NodeDeployment struct {
NodeName string
Collaborator

Can you add some comments on what each of these fields does?

Contributor Author

Done.
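For context, the commented struct might look roughly like this; the field set and wording are assumptions based on this thread, not necessarily the merged code.

```go
package controller

import (
	"time"

	coreinformers "k8s.io/client-go/informers/core/v1"
)

// NodeDeployment contains additional parameters for running external-provisioner alongside a
// CSI driver on one or more nodes in the cluster.
type NodeDeployment struct {
	// NodeName is the name of the node on which this external-provisioner instance runs.
	NodeName string
	// ClaimInformer is used to check whether another instance already became
	// the owner of a PVC while this one was still waiting.
	ClaimInformer coreinformers.PersistentVolumeClaimInformer
	// ImmediateBinding enables deciding ownership of PVCs with immediate binding.
	ImmediateBinding bool
	// BaseDelay is the initial wait before trying to own a PVC; it grows
	// exponentially per PVC after conflicts.
	BaseDelay time.Duration
	// MaxDelay caps the wait, no matter how often owning a PVC failed.
	MaxDelay time.Duration
}
```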

switch selectedNode {
case "":
logger := klog.V(5)
if logger.Enabled() {
Collaborator

Any reason why we do this method of checking log level?

Contributor Author

Because then the check only needs to be done once and the overhead for defer can be avoided entirely in most cases. It's not important, I can also use klog.V(5).Info instead if you prefer that.
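A minimal sketch of the pattern being discussed, so the trade-off is visible; the function and log messages are made up.

```go
package provisioner

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// logOwnershipCheck evaluates the verbosity level once; when level 5 is
// disabled, neither the immediate log call nor the deferred one is set up,
// which avoids the defer overhead on the common path.
func logOwnershipCheck(claim *v1.PersistentVolumeClaim) {
	logger := klog.V(5)
	if logger.Enabled() {
		logger.Infof("checking ownership of PVC %s/%s", claim.Namespace, claim.Name)
		defer logger.Infof("finished checking PVC %s/%s", claim.Namespace, claim.Name)
	}
	// ... actual ownership logic would go here ...
}
```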

klog.V(5).Infof("will try to become owner of PVC %s/%s in %s, attempt #%d", claim.Namespace, claim.Name, delay, attempts)
sleep, cancel := context.WithTimeout(ctx, delay)
defer cancel()
ticker := time.NewTicker(10 * time.Millisecond)
Collaborator

Do we need the polling period to be so frequent considering that the delays are in seconds?

@pohly pohly force-pushed the distributed-provisioning branch from 685a7a3 to 28257ef Compare December 15, 2020 20:14
@pohly
Contributor Author

pohly commented Dec 15, 2020

@msau42 : please take another look.

It is uncertain whether that option is needed. Removing it simplifies
the code.
@pohly
Contributor Author

pohly commented Dec 15, 2020

/retest

@verult
Contributor

verult commented Dec 15, 2020

/lgtm

LGTM on my end, leaving final approval to @msau42 / @jsafrane

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2020
@msau42
Collaborator

msau42 commented Dec 16, 2020

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7ad1cd2 into kubernetes-csi:master Dec 16, 2020
kbsonlong pushed a commit to kbsonlong/external-provisioner that referenced this pull request Dec 29, 2023
Successfully merging this pull request may close these issues.

deployment without central controller
8 participants