CreateVolumeRequest - support for affinity/anti-affinity with other volumes #44

Open
Jgoldick opened this issue Jun 19, 2017 · 16 comments

Comments

@Jgoldick

Many/most clustered systems need to specify anti-affinity for new storage. Imagine a MySQL cluster where all volumes end up on the same external storage device. This may leverage the concept in https://github.com/container-storage-interface/spec/issues/7

Alternatively, if a Container needs to access multiple Volumes, then they must be created in such a way that they can all be mounted by the same node at the same time. See https://github.com/container-storage-interface/spec/issues/43
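To make the request concrete, here is a minimal sketch (Go; every type and field name below is hypothetical and not part of the CSI spec) of what volume-to-volume affinity/anti-affinity hints on a create call might look like:

```go
package main

import "fmt"

// Hypothetical placement hints that a CO could attach to a create-volume
// call. None of these types or fields exist in the CSI spec today.
type VolumePlacement struct {
	// Volumes this new volume should share a failure domain with.
	AffinityVolumeIDs []string
	// Volumes this new volume must NOT share a failure domain with
	// (e.g. other replicas of the same MySQL cluster).
	AntiAffinityVolumeIDs []string
}

// Illustrative create request carrying the hints alongside the usual name.
type CreateVolumeRequest struct {
	Name      string
	Placement *VolumePlacement
}

func main() {
	// Second replica of a MySQL cluster: keep it off whatever storage
	// backs the first replica's volume.
	req := CreateVolumeRequest{
		Name: "mysql-replica-1",
		Placement: &VolumePlacement{
			AntiAffinityVolumeIDs: []string{"mysql-replica-0-vol"},
		},
	}
	fmt.Printf("%+v\n", req)
}
```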

@jdef
Member

jdef commented Jun 26, 2017 via email

@Jgoldick
Author

The anti-affinity use case is pretty common: not wanting replicas to land on the same storage system or failure domain. That covers pretty much all databases.

At present I see no way the CO can request the desired behavior via CreateVolume, since it has no failure-domain knowledge available to it. The plugins cannot do it either, because there's no way to supply them with a list of domains or volumes to avoid.

@Jgoldick
Author

A quick link to how we dealt with this in OpenStack may be useful: https://wiki.openstack.org/wiki/VolumeAffinity, which is a qualifier on the more general topic of volume type schedulers, https://wiki.openstack.org/wiki/VolumeTypeScheduler.

To be clear, I'm not pushing this as a solution, but the problem is real. I'd prefer not to have to pre-create all fault-tolerant replica sets outside of CSI and then publish them.

@jdef
Member

jdef commented Jun 27, 2017 via email

@Jgoldick
Author

Not unless we assume that each storage failure domain matches that of the nodes that can access it. This assumption would hold if the volumes were actually inside nodes, like internal drives. It breaks down if the volumes are external (think NetApp/Dell/etc.), or if the cluster mixes storage-centric nodes with compute-centric ones. This is the reason I separated this issue out from https://github.com/container-storage-interface/spec/issues/43

If we want to model V1 on node-internal storage only and leave support for external storage systems to a future version, then node-level domains would likely be sufficient.

@jdef
Member

jdef commented Jun 28, 2017 via email

@Jgoldick
Author

I'm struggling a bit here to figure out what would be considered sufficiently concrete. I can upload pictures showing common external storage topologies that don't fit within the node failure-domain model described elsewhere; pretty much every NFS/SAN deployment picture would suffice. Alternatively (or additionally), I can track down links to fault-tolerant MySQL/DB best-practices guides that point out that replicas MUST NOT be on the same storage device, but that seems like noise. If you're willing to take this off-thread, I'd be happy to nail this down. Reachable at jgoldick@alumni.cmu.edu

@Jgoldick
Author

Here's a set of slides describing how node failure domain != volume failure domain with external storage systems.

CSI_Storage_Topology.pdf

@jdef
Member

jdef commented Jun 30, 2017 via email

@Jgoldick
Author

If the CO knows all that, then it could implement affinity/anti-affinity rules at its level without needing it to be part of CSI. That said, my instinct is to move smarts/expertise down the stack rather than keep it all in the CO. My reasoning is that storage fault domains are inherently complex and potentially dynamic, so I'd prefer to hide that complexity as much as possible from the upper layers. From a separation-of-responsibilities point of view, I'd rather the CO only know that volumes a-c are to be kept apart because an admin told it so. Giving the CO the ability to ask a CSI plugin to ensure that the domain for volume b doesn't match volume a, and that volume c doesn't match that of a-b, is a much cleaner API. The plugin is free to make that happen if it can, and to maintain that relationship over time under dynamic reconfigurations that would normally be invisible to the CO.
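A rough sketch of that division of responsibility from the plugin's side (Go; all names and the domain bookkeeping are hypothetical, assuming the SP can map a volume ID to whatever internal failure domain backs it):

```go
package main

import (
	"errors"
	"fmt"
)

// domainOf stands in for the SP's internal knowledge: which failure
// domain (array, pool, rack, ...) currently backs a given volume.
var domainOf = map[string]string{
	"vol-a": "array-1",
	"vol-b": "array-2",
}

// availableDomains is the set of failure domains the SP can provision into.
var availableDomains = []string{"array-1", "array-2", "array-3"}

// pickDomain chooses a failure domain that none of the listed volumes
// currently live in. The CO never sees the domain names; it only supplied
// the "keep this new volume apart from these volumes" constraint.
func pickDomain(antiAffinityVols []string) (string, error) {
	excluded := map[string]bool{}
	for _, v := range antiAffinityVols {
		if d, ok := domainOf[v]; ok {
			excluded[d] = true
		}
	}
	for _, d := range availableDomains {
		if !excluded[d] {
			return d, nil
		}
	}
	return "", errors.New("no failure domain satisfies the anti-affinity constraint")
}

func main() {
	d, err := pickDomain([]string{"vol-a", "vol-b"})
	if err != nil {
		panic(err)
	}
	fmt.Println("provisioning new volume in", d) // array-3
}
```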

@jdef
Member

jdef commented Jul 1, 2017 via email

@saad-ali
Member

saad-ali commented Nov 1, 2018

Exposing storage-specific topology that does not affect the accessibility of a volume is on the roadmap, but not a blocker for v1.0.

@julian-hj
Contributor

I am wondering if we still need something explicit in the spec for this, now that we have topology constraints baked in.

Assuming that the CO can refer to topology constraints to place work & storage where they can reach each other, and that the remaining concern is with things like storage clustering and failure domains, I guess I don't see why the CO should need to know or care about that type of affinity or anti-affinity.

In other words, is there some reason that the SP cannot expose storage affinity/anti-affinity through parameters in CreateVolumeRequest?
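For illustration, that approach might look roughly like the following (Go; the parameter key is invented and would be SP-specific, passed through the opaque `parameters` map that CreateVolumeRequest already carries):

```go
package main

import (
	"fmt"
	"strings"
)

// The CO (or an admin, via something like a storage class) passes opaque,
// SP-defined keys through CreateVolumeRequest.parameters. This key name is
// invented for illustration; the spec does not define it.
var params = map[string]string{
	"acme.example/anti-affinity-volumes": "mysql-0-vol,mysql-1-vol",
}

// SP-side parsing of the hint. The CO treats the value as an opaque string.
func antiAffinityVolumes(p map[string]string) []string {
	v, ok := p["acme.example/anti-affinity-volumes"]
	if !ok || v == "" {
		return nil
	}
	return strings.Split(v, ",")
}

func main() {
	fmt.Println(antiAffinityVolumes(params))
}
```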

@jdef
Member

jdef commented Nov 1, 2018 via email

@julian-hj
Contributor

I see. My opinion is that, in & of itself, that's not really a good enough reason to expand the spec. There are all sorts of storage-specific things that might fall into the same general category of being opaque to the CO but interesting to the end user, and so far we've tried to hold the line against adding those things in favor of having the SP document them as options.

@saad-ali
Member

CSI's existing support for topology is centered around "accessibility" -- the idea that a volume is only accessible from certain nodes in a cluster. Sharing this information between SP and CO allows the CO to operate intelligently (e.g. not scheduling a workload to a node from which its volume is inaccessible).

However, CSI does not provide a way to express storage-specific topology: imagine a case where a volume is equally accessible from all nodes but has some internal storage system topology that can influence application performance.

So the need here is for an SP to be able to share this information with a CO, and for the CO to be able to use that information to programmatically influence volume provisioning.

An example for clarity: a storage system is broken into 3 racks, and volumes provisioned on any of the 3 racks can be accessed by any node in the CO cluster, but some nodes would be more performant with volumes from a specific rack. If the SP could express this internal topology to the CO, the CO could decide to, for example, spread volumes across the racks during provisioning, or influence where a workload using a given volume is scheduled.
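A rough CO-side sketch of that spreading decision (Go; it assumes a hypothetical way for the SP to report its internal rack segments, which CSI does not define today):

```go
package main

import "fmt"

// Internal storage topology hypothetically reported by the SP. Every rack
// is accessible from every node, so this is not "accessibility" topology.
var racks = []string{"storage-rack-1", "storage-rack-2", "storage-rack-3"}

// volumesPerRack tracks how many volumes the CO has already placed per rack.
var volumesPerRack = map[string]int{}

// pickRack spreads new volumes across the SP's internal racks by choosing
// the least-loaded one at provisioning time.
func pickRack() string {
	best := racks[0]
	for _, r := range racks[1:] {
		if volumesPerRack[r] < volumesPerRack[best] {
			best = r
		}
	}
	volumesPerRack[best]++
	return best
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Printf("volume-%d -> %s\n", i, pickRack())
	}
}
```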
