CreateVolumeRequest - support for affinity/anti-affinity with other volumes #44

Open
Jgoldick opened this issue Jun 19, 2017 · 16 comments

Comments

@Jgoldick

Many/most clustered systems need to specify anti-affinity for new storage. Imagine a MySQL cluster where all volumes end up on the same external storage device. This may leverage the concept in https://github.com/container-storage-interface/spec/issues/7

Alternatively, if a Container needs to access multiple Volumes, then they must be created in such a way that they can all be mounted by the same node at the same time. See https://github.com/container-storage-interface/spec/issues/43
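To make the request concrete, here is a minimal sketch (Go; every type and field name below is hypothetical and not part of the CSI spec) of what volume-to-volume affinity/anti-affinity hints on a create call might look like:

```go
package main

import "fmt"

// Hypothetical placement hints that a CO could attach to a create-volume
// call. None of these types or fields exist in the CSI spec today.
type VolumePlacement struct {
	// Volumes this new volume should share a failure domain with.
	AffinityVolumeIDs []string
	// Volumes this new volume must NOT share a failure domain with
	// (e.g. other replicas of the same MySQL cluster).
	AntiAffinityVolumeIDs []string
}

// Illustrative create request carrying the hints alongside the usual name.
type CreateVolumeRequest struct {
	Name      string
	Placement *VolumePlacement
}

func main() {
	// Second replica of a MySQL cluster: keep it off whatever storage
	// backs the first replica's volume.
	req := CreateVolumeRequest{
		Name: "mysql-replica-1",
		Placement: &VolumePlacement{
			AntiAffinityVolumeIDs: []string{"mysql-replica-0-vol"},
		},
	}
	fmt.Printf("%+v\n", req)
}
```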

@jdef
Member

jdef commented Jun 26, 2017 via email

@Jgoldick
Author

The anti-affinity use case is pretty common: not wanting replicas to land on the same storage system or failure domain. That covers pretty much all databases.

At present I see no way the CO can request the desired behavior via CreateVolume, since it has no failure-domain knowledge available to it. The plugins cannot do it either, because there's no way to supply them with a list of domains or volumes to avoid.

@Jgoldick
Author

A quick link to how we dealt with this in OpenStack may be useful: https://wiki.openstack.org/wiki/VolumeAffinity, which is a qualifier on the more general topic of volume type schedulers, https://wiki.openstack.org/wiki/VolumeTypeScheduler.

To be clear, I'm not pushing this as a solution, but the problem is real. I'd prefer not to have to pre-create all fault-tolerant replica sets outside of CSI and then publish them.

@jdef
Member

jdef commented Jun 27, 2017 via email

@Jgoldick
Author

Not unless we assume that each storage failure domain matches that of the nodes that can access it. This assumption would hold if the volumes were actually inside nodes, like internal drives. It breaks down if the volumes are external (think NetApp/Dell/etc.), or if the cluster mixes storage-centric nodes with compute-centric ones. This is the reason I separated this issue out from https://github.com/container-storage-interface/spec/issues/43

If we want to model V1 on node-internal storage only and leave support for external storage systems to a future version, then node-level domains would likely be sufficient.

@jdef
Member

jdef commented Jun 28, 2017 via email

@Jgoldick
Author

I'm struggling a bit here to figure out what would be considered sufficiently concrete. I can upload pictures showing common external storage topologies that don't fit within the node failure-domain model described elsewhere; pretty much every NFS/SAN deployment picture would suffice. Alternatively (or additionally), I can track down links to fault-tolerant MySQL/DB best-practices guides that point out that replicas MUST NOT be on the same storage device, but that seems like noise. If you're willing to take this off-thread, I'd be happy to nail this down. Reachable at jgoldick@alumni.cmu.edu

@Jgoldick
Author

Here's a set of slides describing how node failure domain != volume failure domain with external storage systems.

CSI_Storage_Topology.pdf

@jdef
Member

jdef commented Jun 30, 2017 via email

@Jgoldick
Author

If the CO knows all that, then it could implement affinity/anti-affinity rules at its level without needing it to be part of CSI. That said, my instinct is to move smarts/expertise down the stack rather than keep it all in the CO. My reasoning is that storage fault domains are inherently complex and potentially dynamic, so I'd prefer to hide that complexity as much as possible from the upper layers. From a separation-of-responsibilities point of view, I'd rather the CO only know that volumes a-c are to be kept apart because an admin told it so. Giving the CO the ability to ask a CSI plugin to ensure that the domain for volume b doesn't match volume a, and that volume c doesn't match that of a-b, is a much cleaner API. The plugin is free to make that happen if it can, and to maintain that relationship over time under dynamic reconfigurations that would normally be invisible to the CO.
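A rough sketch of that division of responsibility from the plugin's side (Go; all names and the domain bookkeeping are hypothetical, assuming the SP can map a volume ID to whatever internal failure domain backs it):

```go
package main

import (
	"errors"
	"fmt"
)

// domainOf stands in for the SP's internal knowledge: which failure
// domain (array, pool, rack, ...) currently backs a given volume.
var domainOf = map[string]string{
	"vol-a": "array-1",
	"vol-b": "array-2",
}

// availableDomains is the set of failure domains the SP can provision into.
var availableDomains = []string{"array-1", "array-2", "array-3"}

// pickDomain chooses a failure domain that none of the listed volumes
// currently live in. The CO never sees the domain names; it only supplied
// the "keep this new volume apart from these volumes" constraint.
func pickDomain(antiAffinityVols []string) (string, error) {
	excluded := map[string]bool{}
	for _, v := range antiAffinityVols {
		if d, ok := domainOf[v]; ok {
			excluded[d] = true
		}
	}
	for _, d := range availableDomains {
		if !excluded[d] {
			return d, nil
		}
	}
	return "", errors.New("no failure domain satisfies the anti-affinity constraint")
}

func main() {
	d, err := pickDomain([]string{"vol-a", "vol-b"})
	if err != nil {
		panic(err)
	}
	fmt.Println("provisioning new volume in", d) // array-3
}
```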

@jdef
Member

jdef commented Jul 1, 2017 via email

@saad-ali
Member

saad-ali commented Nov 1, 2018

Exposing storage-specific topology that does not affect the accessibility of a volume is on the roadmap, but not a blocker for v1.0.

@julian-hj
Contributor

I am wondering if we still need something explicit in the spec for this, now that we have topology constraints baked in.

Assuming that the CO can refer to topology constraints to place work & storage where they can reach each other, and that the remaining concern is with things like storage clustering and failure domains, I guess I don't see why the CO should need to know or care about that type of affinity or anti-affinity.

In other words, is there some reason that the SP cannot expose storage affinity/anti-affinity through parameters in CreateVolumeRequest?
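For illustration, that approach might look roughly like the following (Go; the parameter key is invented and would be SP-specific, passed through the opaque `parameters` map that CreateVolumeRequest already carries):

```go
package main

import (
	"fmt"
	"strings"
)

// The CO (or an admin, via something like a storage class) passes opaque,
// SP-defined keys through CreateVolumeRequest.parameters. This key name is
// invented for illustration; the spec does not define it.
var params = map[string]string{
	"acme.example/anti-affinity-volumes": "mysql-0-vol,mysql-1-vol",
}

// SP-side parsing of the hint. The CO treats the value as an opaque string.
func antiAffinityVolumes(p map[string]string) []string {
	v, ok := p["acme.example/anti-affinity-volumes"]
	if !ok || v == "" {
		return nil
	}
	return strings.Split(v, ",")
}

func main() {
	fmt.Println(antiAffinityVolumes(params))
}
```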

@jdef
Member

jdef commented Nov 1, 2018 via email

@julian-hj
Contributor

I see. My opinion is that, in & of itself, that's not really a good enough reason to expand the spec. There are all sorts of storage-specific things that might fall into the same general category of being opaque to the CO but interesting to the end user, and so far we've tried to hold the line against adding those things in favor of having the SP document them as options.

@saad-ali
Member

CSI's existing support for topology is centered around "accessibility" -- the idea that a volume is only accessible from certain nodes in a cluster. Sharing this information between SP and CO allows the CO to operate intelligently (e.g. not scheduling a workload to a node from which its volume is inaccessible).

However, CSI does not provide a way to express storage-specific topology: imagine a case where a volume is equally accessible from all nodes but has some internal storage system topology that can influence application performance.

So the need here is for an SP to be able to share this information with a CO, and for the CO to be able to use that information to programmatically influence volume provisioning.

An example for clarity: a storage system is broken into 3 racks, and volumes provisioned on any of the 3 racks can be accessed by any node in the CO cluster, but some nodes would be more performant with volumes from a specific rack. If the SP could express this internal topology to the CO, the CO could decide to, for example, spread volumes across the racks during provisioning, or influence where a workload using a given volume is scheduled.
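A rough CO-side sketch of that spreading decision (Go; it assumes a hypothetical way for the SP to report its internal rack segments, which CSI does not define today):

```go
package main

import "fmt"

// Internal storage topology hypothetically reported by the SP. Every rack
// is accessible from every node, so this is not "accessibility" topology.
var racks = []string{"storage-rack-1", "storage-rack-2", "storage-rack-3"}

// volumesPerRack tracks how many volumes the CO has already placed per rack.
var volumesPerRack = map[string]int{}

// pickRack spreads new volumes across the SP's internal racks by choosing
// the least-loaded one at provisioning time.
func pickRack() string {
	best := racks[0]
	for _, r := range racks[1:] {
		if volumesPerRack[r] < volumesPerRack[best] {
			best = r
		}
	}
	volumesPerRack[best]++
	return best
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Printf("volume-%d -> %s\n", i, pickRack())
	}
}
```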
