distributed provisioning #524
@@ -74,7 +74,7 @@ Note that the external-provisioner does not scale with more replicas. Only one e

See the [storage capacity section](#capacity-support) below for details.

* `--capacity-controller-deployment-mode=central`: Setting this enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. 'central' is currently the only supported mode. Use it when there is just one active provisioner in the cluster. The default is to not produce CSIStorageCapacity objects.
* `--capacity-controller-deployment-mode=central|local`: Setting this enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. Use `central` when there is just one active external-provisioner in the cluster. Use `local` when deploying external-provisioner on each node with distributed provisioning. The default is to not produce CSIStorageCapacity objects.

* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet).
@@ -84,6 +84,16 @@ See the [storage capacity section](#capacity-support) below for details.

* `--capacity-for-immediate-binding <bool>`: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging. Defaults to `false`.

##### Distributed provisioning

* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.

* `--node-deployment-immediate-binding`: Determines whether immediate binding is supported when deployed on each node. Enabled by default; use `--node-deployment-immediate-binding=false` if not desired. Disabling it may be useful, for example, when a custom controller will select nodes for PVCs.

* `--node-deployment-base-delay`: Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding. Defaults to 20 seconds.

* `--node-deployment-max-delay`: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds.
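As a sketch, these flags might be wired into the external-provisioner sidecar container of a per-node deployment roughly as follows; the image tag, socket path, and delay values are illustrative assumptions rather than required settings:

```yaml
# Hypothetical container entry in a DaemonSet pod template.
containers:
  - name: csi-provisioner
    image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0   # illustrative tag
    args:
      - --csi-address=/csi/csi.sock            # illustrative socket path
      - --node-deployment                      # one active instance per node
      - --node-deployment-base-delay=20s       # initial wait before trying to own a PVC
      - --node-deployment-max-delay=60s        # upper bound for the per-PVC backoff
    volumeMounts:
      - name: socket-dir
        mountPath: /csi
```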
#### Other recognized arguments
* `--feature-gates <gates>`: A set of comma separated `<feature-name>=<true|false>` pairs that describe feature gates for alpha/experimental features. See [list of features](#feature-status) or `--help` output for list of recognized features. Example: `--feature-gates Topology=true` to enable Topology feature that's disabled by default.
@@ -139,7 +149,7 @@ determine with the `POD_NAME/POD_NAMESPACE` environment variables and
the `--capacity-ownerref-level` parameter. Other solutions will be
added in the future.

To enable this feature in a driver deployment (see also the
To enable this feature in a driver deployment with a central controller (see also the
[`deploy/kubernetes/storage-capacity.yaml`](deploy/kubernetes/storage-capacity.yaml)
example):
@@ -155,7 +165,7 @@ example):
        fieldRef:
          fieldPath: metadata.name
  ```
- Add `--enable-capacity=central` to the command line flags.
- Add `--capacity-controller-deployment-mode=central` to the command line flags.
- Add `StorageCapacity: true` to the CSIDriver information object.
  Without it, external-provisioner will publish information, but the
  Kubernetes scheduler will ignore it. This can be used to first
@@ -170,7 +180,7 @@ example):
  with `--capacity-threads`.
- Optional: enable producing information also for storage classes that
  use immediate volume binding with
  `--enable-capacity=immediate-binding`. This is usually not needed
  `--capacity-for-immediate-binding`. This is usually not needed
  because such volumes are created by the driver without involving the
  Kubernetes scheduler and thus the published information would just
  be ignored.
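For the `StorageCapacity: true` step listed above, a CSIDriver object could look roughly like the sketch below. The driver name and the other spec fields are illustrative assumptions, and older clusters may need the `storage.k8s.io/v1beta1` API plus the corresponding feature gate instead:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.example.org          # illustrative driver name
spec:
  attachRequired: false          # assumption: typical for node-local storage
  podInfoOnMount: true           # assumption
  storageCapacity: true          # without this the scheduler ignores CSIStorageCapacity objects
  volumeLifecycleModes:
    - Persistent
```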
@@ -220,6 +230,14 @@ CSIStorageCapacity objects, so in theory a malfunctioning or malicious
driver deployment could also publish incorrect information about some
other driver.

The deployment with [distributed
provisioning](#distributed-provisioning) is almost the same as above,
with some minor changes:
- Add `--capacity-controller-deployment-mode=local` to the command line flags.
- Use `--capacity-ownerref-level=0` and the `POD_NAMESPACE/POD_NAME`
  variables to make the pod that contains the external-provisioner
  the owner of CSIStorageCapacity objects for the node.
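A minimal sketch of those two changes as pod spec fragments; the surrounding container definition is assumed, and the environment variable names simply have to match whatever the deployment already passes to external-provisioner:

```yaml
args:
  - --capacity-controller-deployment-mode=local
  - --capacity-ownerref-level=0        # the pod itself owns its CSIStorageCapacity objects
env:
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```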
### CSI error and timeout handling
The external-provisioner invokes all gRPC calls to CSI driver with timeout provided by `--timeout` command line argument (15 seconds by default).
@@ -242,6 +260,125 @@ The external-provisioner optionally exposes an HTTP endpoint at address:port spe
* Metrics path, as set by `--metrics-path` argument (default is `/metrics`).
* Leader election health check at `/healthz/leader-election`. It is recommended to run a liveness probe against this endpoint when leader election is used to kill external-provisioner leader that fails to connect to the API server to renew its leadership. See https://github.com/kubernetes-csi/csi-lib-utils/issues/66 for details.
### Deployment on each node

Normally, external-provisioner is deployed once in a cluster and
communicates with a control instance of the CSI driver which then
provisions volumes via some kind of storage backend API. CSI drivers
which manage local storage on a node don't have such an API that a
central controller could use.

To support this case, external-provisioner can be deployed alongside
each CSI driver on different nodes. The CSI driver deployment must:
- support topology, usually with one topology key
  ("csi.example.org/node") and the Kubernetes node name as value
- use a service account that has the same RBAC rules as for a normal
  deployment
- invoke external-provisioner with `--node-deployment`
- tweak `--node-deployment-base-delay` and `--node-deployment-max-delay`
  to match the expected cluster size and desired response times
  (only relevant when there are storage classes with immediate binding,
  see below for details)
- set the `NODE_NAME` environment variable to the name of the Kubernetes
  node, as in the sketch after this list
- implement `GetCapacity`
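A minimal sketch of the `NODE_NAME` requirement via the downward API (the surrounding container spec is assumed):

```yaml
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName   # the name of the node this pod runs on
```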
Usage of `--strict-topology` and `--immediate-topology=false` is
recommended because it makes the `CreateVolume` invocations simpler.
Topology information is always derived exclusively from the
information returned by the CSI driver that runs on the same node,
without combining that with information stored for other nodes. This
works as long as each node is in its own topology segment,
i.e. usually with a single topology key and one unique value for each
node.
Volume provisioning with late binding works as before, except that
each external-provisioner instance checks the "selected node"
annotation and only creates volumes if that node is the one it runs
on. It also only deletes volumes on its own node.
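For reference, late binding is requested through the storage class. A minimal sketch, with illustrative names:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-local-storage          # illustrative name
provisioner: csi.example.org           # illustrative driver name
volumeBindingMode: WaitForFirstConsumer   # late binding: wait for a pod before provisioning
```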
Immediate binding is also supported, but not recommended. It is
implemented by letting the external-provisioner instances assign a PVC
to one of them: when they see a new PVC with immediate binding, they
all attempt to set the "selected node" annotation with their own node
name as value. Only one update request can succeed; all others get a
"conflict" error and then know that some other instance was faster. To
avoid the thundering herd problem, each instance waits for a random
period before issuing an update request.
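Assuming the usual `volume.kubernetes.io/selected-node` annotation key, a PVC that has been claimed by the instance on `node-1` would look roughly like this (names and sizes are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
  annotations:
    # assumed annotation key; node-1 won the race to own this PVC
    volume.kubernetes.io/selected-node: node-1
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: example-local-storage   # illustrative, see the sketch above
```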
When a `CreateVolume` call fails with `ResourcesExhausted`, the normal
recovery mechanism is used, i.e. the external-provisioner instance
removes the "selected node" annotation and the process repeats. But
this triggers events for the PVC and delays volume creation, in
particular when storage is exhausted on most nodes. Therefore
external-provisioner checks with `GetCapacity` *before* attempting to
own a PVC whether the currently available capacity is sufficient for
the volume. When it isn't, the PVC is ignored and some other instance
can own it.
The `--node-deployment-base-delay` parameter determines the initial
wait period. It also sets the jitter, so in practice the initial wait period will be
in the range from zero to the base delay. If the value is high,
volumes with immediate binding get created more slowly. If it is low,
then the risk of conflicts while setting the "selected node"
annotation increases and the apiserver load will be higher.
There is an exponential backoff per PVC which is used for unexpected
problems. Normally, an owner for a PVC is chosen during the first
attempt, so most PVCs will use the base delays. A maximum can be set
with `--node-deployment-max-delay` anyway, to avoid very long delays
when something went wrong repeatedly.
During scale testing with 100 external-provisioner instances, a base
delay of 20 seconds worked well. When provisioning 3000 volumes, there
were only 500 conflicts which the apiserver handled without getting
overwhelmed. The average provisioning rate of around 40 volumes/second
was the same as with a delay of 10 seconds. The worst-case latency per
volume was probably higher, but that wasn't measured.

> **Reviewer:** Can we point to links to results anywhere?
>
> **Author:** Not at the moment. I intend to write up my experience with scale testing and will publish it as a .md file in
>
> **Author:** I got a chance to run on 1000 nodes. Distributed provisioning was almost as fast as central provisioning in terms of volumes/second for immediate binding with no overload of the apiserver - see report in kubernetes/perf-tests#1676.
Note that the QPS settings of kube-controller-manager and
external-provisioner have to be increased at the moment (Kubernetes
1.19) to provision volumes faster than around 4 volumes/second. Those
settings will eventually get replaced with [flow control in the API
server
itself](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/).

> **Reviewer:** Point to bug tracking api fairness effort?
>
> **Author:** Added "Those settings will eventually get replaced with [flow control in the API server itself](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)."
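If those limits need raising, this is typically done via the client-side rate-limit flags; the flag names and values in the following sketch are assumptions about the usual options, not something introduced by this change:

```yaml
# Hypothetical args for the external-provisioner container; kube-controller-manager
# has similarly named flags that may need raising as well.
args:
  - --kube-api-qps=50      # assumed client QPS limit flag
  - --kube-api-burst=100   # assumed client burst limit flag
```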
Beware that if *no* node has sufficient storage available, then also
no `CreateVolume` call is attempted and thus no events are generated
for the PVC, i.e. some other means of tracking remaining storage
capacity must be used to detect when the cluster runs out of storage.
Because PVCs with immediate binding get distributed randomly among
nodes, they get spread evenly. If that is not desirable, then it is
possible to disable support for immediate binding in distributed
provisioning with `--node-deployment-immediate-binding=false` and
instead implement a custom policy in a separate controller which sets
the "selected node" annotation to trigger local provisioning on the
desired node.
### Deleting local volumes after a node failure or removal
When a node with local volumes gets removed from a cluster before
deleting those volumes, the PV and PVC objects may still exist. It may
be possible to remove the PVC normally if the volume was not in use by
any pod on the node, but normal deletion of the volume and thus
deletion of the PV is not possible anymore because the CSI driver
instance on the node is not available or reachable anymore and therefore
Kubernetes cannot be sure that it is okay to remove the PV.
When an administrator is sure that the node is never going to come
back, then the local volumes can be removed manually:

- force-delete objects: `kubectl delete pv <pv> --wait=false --grace-period=0 --force`
- remove all finalizers: `kubectl patch pv <pv> -p '{"metadata":{"finalizers":null}}'`

> **Reviewer:** Note that they also should make sure the data on the disks are deleted before bringing the disk back into service.
>
> **Author:** Added "It may also be necessary to scrub disks before reusing them because the CSI driver had no chance to do that."
It may also be necessary to scrub disks before reusing them because
the CSI driver had no chance to do that.
If there still was a PVC which was bound to that PV, it then will be
moved to phase "Lost". It has to be deleted and re-created if still
needed because no new volume will be created for it. Editing the PVC
to revert it to phase "Unbound" is not allowed by the Kubernetes
API server.

> **Reviewer:** I don't think we need to mention the last sentence. There's a lot of things we can't do :)
>
> **Author:** Yes, but in this case it is something that might occur to a reader. It certainly occurred to me, so I tried it - unsuccessfully 😅 By saying upfront that it won't work we save someone who is in that situation the time to try that out.
## Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the [community page](http://kubernetes.io/community/).
Review thread on the `--node-deployment` flag:

> **Reviewer:** Can this be consolidated with `--capacity-controller-deployment-mode=local`?
>
> **Author:** We need some way to do local deployment without enabling storage capacity because that might not be supported by the cluster or not desired, so we cannot have just one flag that enables both.
>
> **Reviewer:** I'm wondering if we can avoid flag skew issues like `--node-deployment=true` and `--capacity-controller-deployment-mode=central`. Can we change the capacity flag to be a boolean, and then determine which mode to use based on node deployment?
>
> **Author:** In retrospect that would be nicer, but it is a bit late to change the semantics of `--capacity-controller-deployment-mode` that way - we would have to do a major update of external-provisioner for that because it is a breaking change.
>
> **Reviewer:** It's an alpha feature, I think we can make breaking changes to it.
>
> **Author:** So `--enable-capacity=true/false` instead of `--capacity-controller-deployment-mode=central`, and the actual mode of operation (central vs. local) determined by `--node-deployment`? Works for me, I just won't get to it today.
>
> **Reviewer:** That sounds good to me! Do you think enable-capacity will always be a permanent flag, or would it fit a feature gate better?
>
> **Author:** I think a permanent flag is better. The feature causes additional work and might not be that relevant for some drivers (like those where all volumes are available everywhere), so being able to keep it turned off will remain useful.