[Proposal] Improve Local Storage Management #306
Conversation
6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy.
9. Once Alice decides to delete the database, the PVCs are expected to get deleted by the StatefulSet. PVs will then get recycled and deleted, and the addon adds them back to the cluster.
Will this be global for all PVCs for StatefulSet going forward? Also, we will be depending on reasonable collection timeouts to ensure that users have time to collect data from Volumes after deletion (assuming they have a need to do so)?
By default, the PVC will need to be deleted by the user to retain similar behavior as today. We are looking into an "inline" PVC feature that can automatically delete the PVCs when the StatefulSet gets destroyed. I'll update this to clarify that.
Regarding the retention policy, the PV can be changed to use the "Retain" policy if users need to collect data after deletion.
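For reference, `persistentVolumeReclaimPolicy` is an existing PV field; below is a minimal sketch of a local PV using "Retain", with the labels and path modeled on the examples elsewhere in this proposal (both illustrative):

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-1
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  # "Retain" keeps the volume and its data after the claim is released,
  # so a user can collect data before the admin recycles the partition.
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /var/lib/kubelet/local-disks/ssd-1  # illustrative mount point
```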
local-pv-2 10Gi Bound log-local-pvc-3 node-3
```

6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
So we are depending on priority and preemption to be implemented prior to this?
It seems like what's described in this section also relies on some variant of #7562 / #30044 being implemented, as today there is no notion of a local PV (beyond the experimental HostPath volume type which doesn't do what's needed here).
Yes, this proposal is also covering this new local PV type.
@kow3ns regarding priority and preemption, it is not a strict requirement for this feature, but will make the workflow smoother. There are also plans to implement this soon.
* Provide flexibility for users/vendors to utilize various types of storage devices
* Define a standard partitioning scheme for storage drives for all Kubernetes nodes
* Provide storage usage isolation for shared partitions
* Support random access storage devices only
Can you elaborate on what support for "random access storage devices only" means? Does this mean using RAM as storage?
Good question. I took this to mean DASD (i.e. this will not work for Tape drives or other sequential access storage media). Is this not the case?
Yes, it means not supporting tape.
* Support random access storage devices only

# Non Goals
* Provide isolation for all partitions. Isolation will not be of concern for most partitions since they are not expected to be shared.
This could be written more concisely as "Provide usage isolation for non-shared partitions" which would also make it more parallel with the Goal "Provide storage usage isolation for shared partitions"
* Pods do not know how much local storage is available to them.
* Pods cannot request “guaranteed” local storage.
* Local storage is a “best-effort” resource
* Pods can get evicted due to other pods filling up the local storage during which time no new pods will be admitted, until sufficient storage has been reclaimed
s/during which/after which/
Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are:

### Root
This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IO for example) from this partition.
s/IO/IOPs/
3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi.
4. Alice’s pod is not provided any IO guarantees
5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi
6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation.
s/foo/fooc/
foo is correct because it's referring to the pod, and not the container.
Quota feature assumes an appropriate supporting file system is being used. A large part of the distributed storage systems require raw (no file system) storage. How would that be managed? Would a raw partition be created by a logical volume manager?
We don't plan to support raw partitions as a primary partition. Secondary partitions can have block level support though.
emptyDir:
```

2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify bust ranges and a host of other features supported by LimitRange for local storage.
Add a link to the LimitRange user guide that explains this.
(BTW did you mean "burst" rather than "bust"?)
capacity: 2Gi
```

6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intent to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods.
s/intent/intend/
BTW I'm not clear about the connection between prohibiting a minimum storage requirement in LimitRange and overcommit. Won't the scheduler prohibit overcommit regardless of what storage requirement you give for the EmptyDir (regardless of whether it's set manually or via LimitRange)?
I assume this means we will just not allow a request, only a limit, for volumes.
local-pv-1 100Gi RWO Delete Available node-3
local-pv-2 10Gi RWO Delete Available node-3
```
3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices become unhealthy.
There is currently no notion of tainting PVs, only nodes. Can you say more about what semantics you are expecting for tainting a PV?
We would like to evict pods that are using tainted PVs, unbind the PVC, and reschedule the pod so that it can bind to a different PV. I think everything after the eviction could be handled by a separate controller.
### Alice manages a Database which needs access to “durable” and fast scratch space

1. Cluster administrator provisions machines with local SSDs and brings up the cluster
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium.
So there is always a 1:1 correspondence between PV and partition on a secondary device?
Yes, this will let the cluster administrator decide how they want to provision local storage. They can have one partition per disk for IOPS isolation, or if sharing is ok, then create multiple partitions on a device.
I'm guessing this is based on the technology of the underlying filesystem. If not, then I think this depends a lot on some type of logical volume manager; otherwise only two things can happen: 1. a secondary partition is the entire disk, or 2. a lot of disk fragmentation. I think more information on how number 1 is done may shed more light on this model.
Yes, it is up to the administrator to partition and create the filesystem first. And how that is done will depend on the partitioning tools (parted, mdadm, lvm, etc) available and which filesystems the administrator decides to use. From Kubernetes point of view, we will not handle creating partitions or filesystems.
> it is up to the administrator to partition and create the filesystem first
That's very inconvenient for an admin. Also, when such PV gets Released, who/how removes the data there and puts it back to Available? We'd like to deprecate recycler as soon as possible.
IMO, some sort of simple dynamic provisioning would be very helpful and it's super simple with LVM. It should be pluggable though to work on all other platforms.
The current thought for the PV deletion workflow is to set the PV to Released phase, delete all the files (similar to how EmptyDir cleans up), delete the PV object, and then the addon daemonset will detect the partition and create a PV for it again.
So from an admin's point of view, the partitioning and fs setup is just a one time step whenever new disks are added. And for the use case that we are targeting, which requires whole disks for IOPs guarantees, the setup is simple: one partition across the whole disk, and create the filesystem on that partition.
As for LVM, I agree it is a simpler user model, but we cannot get IOPs guarantees from it, which is what customers we've talked to want the most. I don't think this design will prevent supporting an LVM-based model in the future though. I can imagine there can be a "storageType = lvm" option as part of the PV spec, and a dynamic provisioner can be written to use that to carve out LVs from a single VG. The scheduling changes that we have to make to support local PVs can still apply to a lvm-based volume. We're just not prioritizing it right now based on user requirements.
I agree with @jsafrane that we should have some default out-of-the-box local disk PV provisioner, and for default cases we don't have to do an addon or some such thing. 90% of use cases might be just simple use of local disks.
Based on feedback we have gotten from customers and workloads team, it's the opposite. Most of the use cases require dedicated disks. We have not seen many requests for dynamic provisioning of shared disks. If you see some real use cases where an app wants to use persistent local storage (and all its semantics), but doesn't need performance guarantees, then I would be interested in hearing about them as well.
I do want to make sure that nothing in this proposal would prevent LVM and dynamic provisioning from being supported in the future. And that it will be able to take advantage of the scheduling and failure handling features we will be adding.
In terms of admin work, my hope is that the default addon will require a similar amount of admin intervention as the LVM model (configure the disk once in the beginning, the system takes care of the rest).
```

6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
This should occur only when the PV backing the PVC is permanently unavailable. If a controller creates a new PVC and relaunches the Pod with that PVC, it will never be able to reuse the data on the old PV anyway. To simplify this for controller developers, when some policy is applied to indicate that K8s should "give up" on recovering a PV, can we just delete the PV, and set the status of the PVC to pending? This would reduce the complexity of the interaction with DaemonSet, StatefulSet, and any other controllers and local persistent storage.
This situation could also occur if the node has failed or can no longer fulfill other requested resources, for example, if other pods got scheduled and took up the cpu or memory needed.
The main concern with deleting the PV and keeping the PVC, is that it may not follow the retention policy. The user may want to recover data from the PV, but won't have the pod->PVC->PV binding anymore. As another alternative, we could remove the PVC->PV binding, and if the PV policy is retain, also add an annotation with the old pod, PVC information so the user can figure out which PV had their data.
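As a rough illustration of that alternative (the annotation keys and values here are hypothetical, not an existing API):

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-2
  annotations:
    # Hypothetical annotations recording the binding that was given up,
    # so a user can still locate their data on a retained volume.
    storage.kubernetes.io/released-from-pvc: myns/log-local-pvc-3
    storage.kubernetes.io/released-from-pod: myns/my-db-0
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /var/lib/kubelet/local-disks/ssd-2  # illustrative mount point
```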
I like the idea of keeping the PVC and just removing the PVC->PV binding. If we expect the StatefulSet controller to modify the Pod to use a new PVC, that essentially means only the StatefulSet controller can perform the task of unblocking its unschedulable Pods. That in turn means that every controller needs to separately implement this behavior. For example, what if I have "stateless" Deployment Pods that want this behavior for their large caches on local PV?
If unblocking can be done without modifying the Pods to use a different PVC, then it leaves the door open to write a generic "local PV unbinding" controller that implements this behavior once for everyone who requests it via some annotation or field.
The generic PVC unbinding controller can monitor for this error condition, unbind the PVC, clean up the PV according to the reclaim policy, and then evict and reschedule the pods to force them to obtain a new PV.
3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi.
4. Alice’s pod is not provided any IO guarantees
5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi
How does this interact with kubectl logs? Right now we are aggregating and rolling stdout and stderr. Are you proposing that we use local storage instead of, or in addition to, the current K8s logging infra?
It should have no impact to kubectl logs. It's only changing the log rotation mechanism to be on a per container basis instead of on a node basis.
6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy.
If you are targeting DBs and DFSs, and if a "taint" is really pertaining to a problem with the underlying storage media, I don't think anything in your target set will tolerate a taint. @davidopp shouldn't this be expressed by the controller in terms of declarative tolerations against node taints in any case? That is, don't I have to explicitly declare the taint as tolerated?
Instead of having every controller/operator watch for the appearance of taints on a node and delete Pods, should we consider the following approach?
- DBs and DFS should include a health check that causes the Pod to fail when the contained application can't write to storage media (most, if not all, storage applications will fail on errors returned from fsync/sync).
- When the application monitoring the storage device decides that the mounted PVs are unrecoverable, it should delete the PVs and mark the Bound PVCs as pending. The policy deciding when to do this can be applied here. Note that this is no scarier than having the controller make the decision to delete the PVC. In either case, once the Pod is disassociated from its local volume and launched with another, it can never be safely re-associated with the prior volume. Both cases also need a good story around snapshots and backup. I think that, as the device monitoring application is a node local agent, it can make a better decision about when to "give up" trying to bind a Pod to a local mount.
- As the volumes are deleted, we need not be concerned with the PVCs being fulfilled by this node unless it has volumes mounted on another, functional device.
- When controllers/operators recreate the Pods, their existing PVCs must be Bound to volumes provided by another node.
If we take an approach that is closer to this, we don't have to duplicate the watch logic in every controller/operator.
If I understand correctly, are you suggesting to leave it up to the application to handle local storage failures, since each application may have its own unique requirements and policies?
Sorry if I was not clear. I am saying the opposite. The "application monitoring the storage device" referred to above is based on the design statement that " Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints."
Rather than have every controller/operator attempt to heuristically guess when it should delete a PVC, might it not be better to have the "Node Problem Detector", kubelet, or another local agent make a decision that the volume is no longer usable due to device failure, and to set the associated PVC back to pending? Perhaps using your suggestion above to retain the volume for data recovery purposes. I can't think of a distributed storage application that will want to re-balance or re-replicate its data due to a temporary network partition or intermittent node failure. The only time, IMO, that we'd want a controller/operator to move a Pod with local PVs to a new node is if the storage device failed, or if the MTTR is so high that it might as well have. In the former case it might be best if a node local agent made the decision that the storage device is failed. In the latter case, we should at least consider having a global controller with a policy that can unbind local PVs from PVCs, rather than having every controller/operator have to implement its own policy.
I don't think the controller (e.g. StatefulSet) should be responsible for deleting Pods. I think @kow3ns is also saying that if I'm reading him correctly.
My understanding is that regular Node taints are noticed and enforced by kubelet, which may evict the Pod if it doesn't tolerate the taint. Wouldn't it make sense for kubelet to also evict the Pod if it does not tolerate a taint on one of its local PVs?
If recreated with the same PVC, the Pod would remain unschedulable due to the taint on the PV. At this point, the problem is reduced to being the same as (7) above. In this way, both (7) and (8) can be handled without necessarily requiring any changes to StatefulSet or other controllers (if a generic controller can be implemented as suggested above).
Currently taints are only at the node level, but I think it could be worth looking into expanding, as it already has a flexible interface for specifying per pod the tolerations and forgiveness for each taint. This workflow could also work for the case when the node fails or becomes unavailable. @davidopp
Then, when the pod gets evicted due to the taint, it reduces the problem to (7), as mentioned above.
It is also possible to implement this without taints, and instead add an error state to the PV, and have a controller monitor for the error state and evict pods that way. But using taints may be nice as a future enhancement to unify the API.
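To make that concrete, here is a purely speculative sketch of what a taint carried on a PV might look like if the node taint API were extended to volumes (no `taints` field exists on PVs today):

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-2
spec:
  # Hypothetical field mirroring node taints: pods whose claims are
  # bound to this PV would be evicted unless they declare a matching
  # toleration.
  taints:
    - key: storage.kubernetes.io/unhealthy
      value: io-errors
      effect: NoExecute
```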
capacity: 2Gi
```

6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intent to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods.
intent=intend
Storage-overlay: 200Mi
type: Container
- default:
storage: 1Gi
to clarify, each empty dir backed volume will pick up this default capacity? so if a user had multiple empty dirs for some reason, each would get 1Gi?
Yes, it's the limit per emptydir. You bring up a good point though, that the "type: Pod" implies that it's for the whole pod. We can change it to "type: EmptyDir"
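A hedged sketch of the LimitRange shape under discussion: the `storage-*` resource names come from this proposal, and `type: EmptyDir` is the change proposed in this thread, not an existing API value:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: storage-limits
  namespace: myns
spec:
  limits:
    - type: Container
      default:
        storage-logs: 200Mi
        storage-overlay: 200Mi
    # Proposed: scope the default to each emptyDir volume rather than
    # the whole pod, so every emptyDir independently defaults to 1Gi.
    - type: EmptyDir
      default:
        storage: 1Gi
```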
volumes:
name: myEmptyDir
emptyDir:
capacity: 1Gi
do you worry users will get confused with this field as only being meaningful when the medium is disk and not memory?
It can be used for memory-backed emptydir too.
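For example, the proposed `capacity` field would cap a memory-backed emptyDir the same way (a sketch; `capacity` on emptyDir is the field proposed in this document, and the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-pod
spec:
  containers:
    - name: app
      image: example/app:latest  # placeholder image
      volumeMounts:
        - name: cache
          mountPath: /cache
  volumes:
    - name: cache
      emptyDir:
        medium: Memory  # tmpfs-backed
        capacity: 1Gi   # proposed field; would also bound the tmpfs size
```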
name: foo
spec:
containers:
name: fooc
I think there's a missing `-` here.
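For reference, the list marker the snippet appears to be missing, assuming the usual pod spec layout:

```yaml
metadata:
  name: foo
spec:
  containers:
    - name: fooc  # "-" starts each entry in the containers list
```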
storage-logs: 500Mi
storage-overlay: 1Gi
volumes:
name: myEmptyDir
Missing `-`.
name: fooc
resources:
limits:
storage-logs: 500Mi
Is there a reason for doing it this way rather than the following?

    limits:
      storage:
        logs: 500Mi
        overlay: 1Gi
It's a limitation in the LimitRange design. It doesn't support nesting of limits.
labels:
storage.kubernetes.io/medium: ssd
spec:
volume-type: local
Wouldn't the convention be `volumeType`?
storage.kubernetes.io/medium: ssd
spec:
volume-type: local
storage-type: block
Should this be `storageType`? Also, having both `volumeType` and `storageType` seems confusing. Not sure what else these could be called though.
Would storageLevel be better?
# Design Overview

A node’s local storage can be broken into primary and secondary partitions.
Were other options considered? Especially LVM would allow us to add / remove devices on hosts and RAID 0/1/5 per volume with very little overhead. With partitions, you must do all this manually.
In the case of persistent local storage, most of the use cases we have heard about prioritize performance and being able to use dedicated disks.
In addition, LVM is only available on Linux, so it could be difficult to use as a generic solution.
I'm assuming Primary and Secondary partitions are logical objects which can be implemented in multiple ways. Do you mind elaborating on possible implementations?
Is this the kind of information you were looking for?
- Using an entire disk (this is the primary use case for persistent local storage)
- Adding multiple disks into a RAID volume
- Using LVM to carve out multiple logical partitions (if you don't need IOPs guarantees)
## Persistent Local Storage
Distributed filesystems and databases are the primary use cases for persistent local storage due to the following factors:

* Performance: On cloud providers, local SSDs give better performance than remote disks.
Will performance include a QoS IOPS requirement for distributed storage systems?
The PVs will have to be created by the admin/addon that utilizes the entire disk to guarantee IOPs for performance use cases.
Not sure I understand correctly, but why do PVs have to be full-disk? Why not a properly aligned partition?
We're not requiring that the PV has to use the whole disk (the volume is created on a partition), but if you need IOPS guarantees, then it should be a dedicated disk. Especially for rotational disks, the IO will still end up being on a shared path at the device layer.
SSDs may offer high enough IOPS that you can share them.
As @msau42 mentioned, the API that kubernetes would consume is a logical partition. It can map to any storage configuration (RAID, JBOD, etc.). We recommend not sharing spinning disks unless either the storage configuration or the IOPS requirements permit sharing them.
This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition.

## Secondary Partitions
All other partitions are exposed as persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details from the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage.
When they are exposed as PVs, are they created and available as a pool? Do you mind elaborating a little more on Secondary Partitions? Who creates them, how are they managed, what sizes, etc.
Yes, the workflow mentions this a little bit. I will also add it here.
There will be an addon daemonset that can discover all the secondary partitions on the local node and create PV objects for them. The capacity of the PV will be the size of the entire partition. We can provide a default daemonset that will look for the partitions under a known directory, and it is also possible to write your own addons for your own environment.
So the PVs will be statically created and not dynamically provisioned, but the addon can be long running and create new PVs as disks are added to the node.
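Putting that together, a hedged sketch of the kind of PV such an addon might create for one discovered partition; `volume-type` and the labels follow this proposal's examples, while the node annotation and path are illustrative:

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-1
  labels:
    storage.kubernetes.io/medium: ssd
  annotations:
    # Illustrative annotation tying the volume to the node
    # that hosts the partition.
    storage.kubernetes.io/node: node-3
spec:
  volume-type: local  # field name as used in this proposal
  capacity:
    storage: 100Gi    # the size of the entire partition
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  hostPath:
    path: /var/lib/kubelet/local-disks/ssd-1  # well-known mount location (illustrative)
```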
capacity: 20Gi
```

3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi.
How are the guarantees provided by the system? The FS? A logical volume?
For primary partitions, the node's local storage capacity will be exposed so that the scheduler can take into account a pod's storage limits and what nodes can satisfy that limit.
Then kubelet will monitor the storage usage of the emptydir volumes and containers so that they stay within their limits. If quota is supported, then it will use that.
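As a sketch of the inputs involved: a pod carrying the proposed storage limits that the scheduler would count against the node's primary-partition capacity and kubelet would then enforce (`storage-logs`, `storage-overlay`, and the emptyDir `capacity` field are this proposal's names; the image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: fooc
      image: example/app:latest  # placeholder image
      resources:
        limits:
          storage-logs: 500Mi   # kubelet rotates logs to stay under this
          storage-overlay: 1Gi  # writable-layer cap, hard-enforced if quota is available
  volumes:
    - name: myEmptyDir
      emptyDir:
        capacity: 20Gi          # counted against the node's primary partition
```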
4. Alice’s pod is not provided any IO guarantees
5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi
6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation.
7. With hard limits, containers will receive an ENOSPACE error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet.
Assuming the FS supports such a feature.
Yes, I will mention otherwise kubelet can only enforce soft limits.
6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation.
7. With hard limits, containers will receive an ENOSPACE error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet.
8. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints.
9. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes.
I know number 8 showed there will be a Health monitor, but how would it detect that the primary partition is unhealthy on number 9? What does it mean to be unhealthy?
Now that I think about it more, health monitoring is dependent on the environment and configuration, so an external monitor may be needed for both primary and secondary.
It can monitor at various layers depending on how the partitions are configured:
- disk layer: look at SMART data
- raid layer: look for complete raid failure (non-recoverable)
How to do disk health monitoring if the Node is a VM and disk is a virtual disk? The smartctl or raid tools may not return correct data.
That's a good point. Because the partition configuration is very dependent on the environment, I think we cannot do any monitoring ourselves. Instead, we can define a method for external monitors to report errors, and also define how kubernetes will react to those errors.
Does our proposal/design require this health monitor? Let's say in the default configuration, when there is no external health monitor, what is the behavior?
The health monitor is not required. In that case, it will behave the same way that it does today, which is undefined.
emptyDir:
```

2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify bust ranges and a host of other features supported by LimitRange for local storage.
This part is a little bit confusing "His cluster administrator being aware....". Does that mean that this solution would require the administrator to take action or things may be incorrectly allocated?
In order to solve today's local storage isolation problem, pods should specify limits for their local storage usage. In the absence of that, the administrator has the option to specify defaults for the namespace. If neither of those two occur, then you just have the same issue today.
```

4. Bob’s “foo” pod can use up to “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume.
5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher “capacity” for his EmptyDir volumes.
The concern I have here is that it requires a lot of interaction with an administrator and the user. If I am "Bob", I'm just going to keep asking for more storage (1, then 2, then ...). That would move the Pod from node to node satisfying the storage size request. I'm guessing... How different is this from the current model?
Yes there is a little bit of a trial and error going on here for Bob. But as an application developer, you will have to do this in order to size your apps appropriately. One goal that we're trying to achieve here is provide pods better isolation from other pods running on that node through storage isolation.
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium.

```yaml
kind: PersistentVolume
```
Just to clear my confusion, these all are created by hand?
They could be created by hand, or if you put the partitions in a known directory, then the addon daemonset can discover the partitions and automatically create the PVs.
@msau42 FYI, I appreciate the quick turnaround! 👏
@msau42 here is a related use case for local storage handling from @fabiand
4. Bob creates a specialized controller (Operator) for his distributed filesystem and deploys it.
5. The operator will identify all the nodes that it can schedule pods onto and discovers the PVs available on each of those nodes. The operator has a label selector that identifies the specific PVs that it can use (this helps preserve fast PVs for Databases for example).
6. The operator will then create PVCs and manually bind to individual local PVs across all its nodes.
7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority and have them use all the PVCs create by the Operator on those nodes.
*created
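A hedged sketch of how such an operator-created PVC might pre-bind to one specific local PV; `selector` and `volumeName` are existing PVC fields, while the label value and names are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dfs-data-node-3
  namespace: myns
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  # Restrict matching to the PVs this operator is allowed to consume,
  # preserving fast media for other workloads.
  selector:
    matchLabels:
      storage.kubernetes.io/medium: hdd
  # Manually bind to a specific local PV on the target node.
  volumeName: local-pv-1
```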
* Local Persistent Volume bindings happening in the scheduler vs in PV controller
* Should the PV controller fold into the scheduler
* Supporting dedicated partitions for logs and volumes in Kubelet in addition to runtime overlay filesystem
* This complicates kubelet. Not sure what value it adds to end users.
Logs should usually not accumulate as they should be collected to a central location.
--> no need for separation
Overlay FS data can be used, but for heavy use or increased storage needs we do recommend and provide emptyDirs and the new local PVs.
--> no need for separation
As emptyDirs might be used for caches and heavy IO, it might make sense to let this be separated from the planned root PV.
Complicating the Kubelet for logs and overlay doesn't seem to make sense. We should definitely think about the usage pattern of emptyDirs after local PVs are available.
Would we recommend local PV usage for heavy IO caches instead of emptyDirs? If yes, then we might leave emptyDirs inside the root PV and let the user know that for anything serious he might need to migrate away from emptyDirs.
Definitely needs to be clearly documented what use-cases each one solves.
Yes, since we don't plan to provide IOPS isolation for emptydir, local PV should be used instead for those use cases. One question we have: are there use cases that need ephemeral IOPS guarantees that cannot be adapted to use local PVs? Do we need to look into an "inline" local PV feature where the PV gets created and destroyed with the pod?
Depends on the automation of creating and using local PVs.
EmptyDirs work great without having to involve the cluster admin.
Local PVs most likely need cluster admin intervention. Maybe not always, but it's not 100% automated.
The path I see as reasonable would be:
- Leave emptyDir as "best-effort" scratch pad.
- Recommend local PVs for guaranteed IOPS.
- First iteration having to use manual cluster admin action
- Iterate on automating local PVs to bring them closer to emptyDir and PDs aka provide local PVs via dynamic provisioning
This would lead to no huge complexity additions in the kubelet as root, emptyDir, log and overlay FS are kept on the primary partition in the first iteration.
As an additional note: Persistent Volume as a name seems confusing, especially when we recommend it as an IOPS-guaranteed scratch pad. (Maybe: Local Disk?)
Good plan! LocalDisk as the actual volume plugin name sounds good.
For clarification. We would have:
- PersistentDisk (networked "unfailable" disk)
- emptyDir (shared temporary volume without guarantees)
- LocalDisk (local volume with guarantees, which might have some persistence)
- hostPath (local volume for testing)
- all the provider specific stuff, flexVolume, gitRepo and k8s API backed volumes.
We are planning to recommend using LocalDisk only through the PV/PVC interfaces for the following reasons:
- In failure scenarios, like the node failing, you may want to give up on the local disk and find a new one to use. You can do that by unbinding the PVC from the PV, instead of having to change the volume in the pod spec
- If you use local disk directly, it would be very similar to HostPath volumes, and have all its problems, where you have to specify the path, understand the storage layout of the node, and understand whether that particular volume can satisfy the pod's capacity needs. The PV interface hides those details.
- The PV interface gives a way to pool all the local volumes across the entire cluster and easily query for them, and find ones that will fit a pod's requirements.
Thanks for the clarification. Always great to have that documented.
So the recommendation is:
- PD + LD, used via PVC
- emptyDir + hostPath, used directly
Small addition:
Using PV/PVCs for LD projects the notion that it is persistent, which could create some confusion.
I will update this doc to clarify that, thanks! I agree the PV name could be misleading, since a local disk can only offer semi-persistence and has different semantics than normal PVs. I can add a section about the different semantics. Also, because of the different behavior and its very targeted use cases, I want to make sure that in the API layer the user explicitly selects this option, and that they cannot use a local PV "accidentally".
/lgtm
This proposal has gotten amazing feedback. However, it has reached a point where it's too large to continue discussing further. I'm merging this PR based on @thockin's approval. As we implement the features in this proposal, some aspects of it can (and possibly will) change. Expect this proposal to evolve over the next year as local storage features get added to kubernetes. If someone feels strongly about any pending comments, I'm happy to revert the merge and continue discussing if necessary.
Automatic merge from submit-queue

LocalStorage api

**What this PR does / why we need it**: API changes to support persistent local volumes, as described [here](kubernetes/community#306)

**Which issue this PR fixes**: Part of kubernetes#43640

**Special notes for your reviewer**: There were a few items I was concerned about. Will add review comments in those places.

**Release note**: NONE. A note will be added in a subsequent PR with the volume plugin changes.
This PR adds the new APIs to support storage capacity isolation as described in the proposal kubernetes/community#306:
1. Add SizeLimit for emptyDir volumes
2. Add scratch and overlay storage types used at the container or node level
Automatic merge from submit-queue

Add Local Storage Capacity Isolation API

This PR adds the new APIs to support storage capacity isolation as described in the proposal [kubernetes/community#306](https://github.com/kubernetes/community/pull/306):
1. Add SizeLimit for emptyDir volumes
2. Add scratch and overlay storage types used at the container or node level

**Release note**:
```release-note
Alpha feature: Local volume Storage Capacity Isolation allows users to set storage limit to isolate EmptyDir volumes, container storage overlay, and also supports allocatable storage for shared root file system.
```
Automatic merge from submit-queue

Add local storage (scratch space) allocatable support

This PR adds support for allocatable local storage (scratch space). This feature covers only the root file system, which is shared by kubernetes components, users' containers, and/or images. Users can use the --kube-reserved flag to reserve storage for kube system components. If the allocatable storage for users' pods is used up, some pods will be evicted to free the storage resource. This feature is part of local storage capacity isolation and is described in the proposal kubernetes/community#306

**Release note**:
```release-note
This feature exposes local storage capacity for the primary partitions, and supports & enforces storage reservation in Node Allocatable
```
Automatic merge from submit-queue

Add EmptyDir volume capacity isolation

This PR adds support for isolating emptyDir volume usage. If a user sets a size limit for an emptyDir volume, kubelet's eviction manager monitors its usage and evicts the pod if the usage exceeds the limit. This feature is part of local storage capacity isolation and is described in the proposal kubernetes/community#306

**Release note**:
```release-note
Alpha feature: allows users to set storage limit to isolate EmptyDir volumes. It enforces the limit by evicting pods that exceed their storage limits
```
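As a small illustration of the emptyDir size limit described above, a minimal pod sketch; the pod and volume names are illustrative, and the field placement follows the alpha API this PR adds:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 1Gi  # kubelet evicts the pod if usage exceeds this limit
```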
# Open Questions & Discussion points
* Single vs split “limit” for storage across writable layer and logs
* Local Persistent Volume bindings happening in the scheduler vs in the PV controller
* Should the PV controller fold into the scheduler
At present, local PVs have the problem described above: the PV and PVC are bound before scheduling, and once binding completes the scheduler selects the node given by the PV's node affinity. But if that node does not have enough CPU, memory, and so on, the pod fails to schedule every time. Is there a plan to solve this?
Yes, that is the limitation in the first phase. We hope to solve it in the next release, but we have no concrete ideas yet; we're just prototyping at this point. At a high level, the PVC binding needs to be delayed until a pod is scheduled, so that it can take into account all the other scheduling requirements of the pod.
It would be great to solve delayed binding for local volume PVs/PVCs. Right now my project team worries about this issue, and we don't dare use the local volume plugin because pods can so easily fail to schedule.
Yes, it will not work well right now in general-purpose situations. But if you use the critical pod feature, or you run the workloads that need local storage first, that may work better. Still, the PVs may not get spread very well, because the PV controller doesn't know that all these PVCs are replicas in the same workload. You may be able to work around the issue by labeling the PVs per workload, as sketched below.
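A hedged sketch of that labeling workaround: the admin labels each pre-created local PV for one workload, and the workload's PVCs select on that label. The label key `workload` is illustrative, and the local volume source and node-affinity fields are omitted because their exact shape is still under design in this proposal.

```yaml
# Admin-created local PV, labeled for a specific workload.
# The local volume source and node-affinity fields are omitted here
# because their final API shape is still being designed.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-1
  labels:
    workload: my-db   # illustrative label
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
---
# PVC that binds only to PVs carrying the workload label.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-db-data
spec:
  accessModes:
  - ReadWriteOnce
  selector:
    matchLabels:
      workload: my-db
  resources:
    requests:
      storage: 100Gi
```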
OK, thanks. I'll pay close attention to v1.8 (Scheduler predicate for prebound local PVCs, #43640).
Initial incomplete release notes draft for 1.7
[Proposal] Improve Local Storage Management
A note to reviewers: A detailed design proposal will be posted once the overall design in this proposal has been accepted by the community. So kindly hold off on specific design questions or suggestions that are not relevant to the overall high-level design for local storage.
cc @kubernetes/sig-storage-proposals @kubernetes/sig-node-proposals @kubernetes/sig-apps-proposals @kubernetes/sig-scheduling-proposals
cc @msau42
For kubernetes/enhancements#121
TODO:
* runtime
* primary partition capacity