CreateVolumeRequest - support for affinity/anti-affinity with other volumes #44
Many/most clustered systems need to specify anti-affinity for new storage. Imagine a MySQL cluster where all volumes end up on the same external storage device. May leverage concept in https://github.com/container-storage-interface/spec/issues/7

Alternatively, if a Container needs to access multiple Volumes then they must be created in such a way that they can both be mounted by the same node at the same time. See https://github.com/container-storage-interface/spec/issues/43
Affinity (and anti-affinity) for volume placement sounds like a problem for a CO and/or the plugin's backend system (based on some opaque create-vol param), and not something for CSI to tackle. If you disagree, can you provide a compelling argument and use case for why this belongs in CSI?
The anti-affinity use case is pretty common: not wanting replicas to be on the same storage system or failure domain. This includes pretty much all databases. At present I see no way the CO can achieve the desired behavior via CreateVolume, since it has no failure domain knowledge available to it. The plugins cannot do it either, because there's no way to supply them with a list of domains or volumes to avoid.
A quick link to how we dealt with this in OpenStack may be useful: https://wiki.openstack.org/wiki/VolumeAffinity, as a qualifier on the general topic of VolumeTypeSchedulers (https://wiki.openstack.org/wiki/VolumeTypeScheduler). To be clear, I'm not pushing this as a solution, but the problem is real. I'd prefer not to have to pre-create all fault-tolerant replica sets outside of CSI and then publish them.
If we had support for `Domain` such that a CO could invoke CreateVolume for
N volumes across X domains (and `Domain` was granular enough, e.g. down to
the "node" level), wouldn't that let the CO tackle placement issues and
avoid baking affinity/anti-affinity APIs into the CSI spec?
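Purely as an illustration of that `Domain` idea (nothing like this exists in CSI today; the field, types, and round-robin policy below are hypothetical stand-ins, not spec messages), a CO-side sketch in Go might look like this:

```go
package main

import "fmt"

// hypotheticalCreateVolumeRequest stands in for what a create request might
// carry if a placement Domain were first-class; it is NOT the real CSI message.
type hypotheticalCreateVolumeRequest struct {
	Name       string
	CapacityGB int64
	Domain     string // e.g. a storage array, a rack, or even a single node
}

// spreadAcrossDomains round-robins n volumes over the domains the plugin
// reported, so replicas land in distinct failure domains whenever n <= len(domains).
func spreadAcrossDomains(prefix string, n int, domains []string) []hypotheticalCreateVolumeRequest {
	reqs := make([]hypotheticalCreateVolumeRequest, 0, n)
	for i := 0; i < n; i++ {
		reqs = append(reqs, hypotheticalCreateVolumeRequest{
			Name:       fmt.Sprintf("%s-%d", prefix, i),
			CapacityGB: 100,
			Domain:     domains[i%len(domains)],
		})
	}
	return reqs
}

func main() {
	// Three MySQL replica volumes spread across two external arrays.
	for _, r := range spreadAcrossDomains("mysql-data", 3, []string{"array-A", "array-B"}) {
		fmt.Printf("CreateVolume(name=%s, domain=%s)\n", r.Name, r.Domain)
	}
}
```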
Not unless we assume that each storage failure domain matches that of the nodes that can access it. This assumption would hold if the volumes were actually inside nodes, like internal drives. The assumption breaks down if the volumes are external (think NetApp/Dell/etc.) or if there are storage-centric nodes in the cluster and compute-centric ones. This is the reason I separated this issue out from https://github.com/container-storage-interface/spec/issues/43. If we want to model V1 on a node-internal-only storage model and leave support for external storage systems to a future version, then node-level Domains would likely be sufficient.
My assumption was that the failure domain of the node was related to that of the volume, but not necessarily equivalent. Supporting only local/internal storage is probably not sufficient for v1; we need to consider external storage systems.

This still feels very, very strongly related to the Domain conversation. There's been a bit of back and forth at the WG level about what constitutes a sufficient failure domain model for v1; it's very easy to start sliding down the slope of complex topologies, and not all COs should necessarily have to understand the complexity of every storage topology.

I think what would help crystallize things here for me is a set of concrete use cases, perhaps starting with the MySQL cluster you mentioned, in conjunction with various storage backends and the fault-domain support implicit within those backends (should probably include local + external backends). It's not clear to me that affinity is a CSI problem at this point, but support for Domain almost surely is, and that could go a long way toward supporting strategic CO-driven volume/workload placement methods.
I'm struggling a bit here to figure out what would be considered sufficiently concrete. I can upload pictures showing external storage topologies that are common and don't fit within the node failure domain model described elsewhere; pretty much every NFS/SAN deployment picture would suffice. Alternatively/additionally, I can track down links to fault-tolerant MySQL/DB best-practices guides that point out that replicas MUST NOT be on the same storage device, but that seems like noise. If you would be willing to take this off-thread I'd be happy to nail this down. Reachable at jgoldick@alumni.cmu.edu
Here's a set of slides that describe how Node failure domain != Volume failure domain with external storage systems: CSI_Storage_Topology.pdf (https://github.com/container-storage-interface/spec/files/1113773/CSI_Storage_Topology.pdf)
Thanks for the diagrams, very helpful in terms of clarifying the discussion. If the CSI spec were to (a) clearly distinguish between Node and Volume fault domains; (b) parameterize the create/publish/list calls appropriately; and (c) perhaps integrate fault domain knowledge into node/node-capabilities, then it seems possible that a CO could reason about distributing workloads AND storage for fault tolerance across **both** node and volume fault domains.

In the above scenario, do you still see a need for affinity/anti-affinity as first-class primitives in CSI? Is there a concern that fault domains for backend storage are too complex to model in CSI API objects, and for COs to reason about?
If the CO knows all that then it could implement affinity/anti-affinity rules at its level without needing it to be part of CSI. That said, my instinct is to move smarts/expertise down the stack rather than keep it all in the CO. My reasoning is that storage fault domains are inherently complex and potentially dynamic; I'd prefer to hide that complexity as much as possible from the upper layers. From a separation-of-responsibilities point of view, I'd rather the CO only know that Volumes a-c are to be kept apart because it was told that by an admin. Giving the CO the ability to ask a CSI plugin to ensure that the domain for Volume b doesn't match Volume a, and that Volume c not match that of a-b, is a lot cleaner API. The plugin is free to make that happen if it can, and to maintain that relationship over time under dynamic reconfigurations that would normally be invisible to the CO.
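As a minimal sketch of that division of labor (again, not part of the spec: the `anti-affinity-volume-ids` key is an invented, SP-specific opaque parameter, and a real CO would pass the volume IDs returned by earlier CreateVolume calls rather than the names used here):

```go
package main

import (
	"fmt"
	"strings"
)

// createVolumeParams mimics the opaque parameters map a CO would pass through
// CreateVolume without interpreting it.
type createVolumeParams struct {
	Name       string
	Parameters map[string]string
}

// buildAntiAffineSet builds requests where each volume asks the plugin to
// avoid the failure domains of every volume requested before it. The CO never
// learns what those domains are; keeping the promise is the plugin's job.
func buildAntiAffineSet(names []string) []createVolumeParams {
	var placed []string
	reqs := make([]createVolumeParams, 0, len(names))
	for _, name := range names {
		reqs = append(reqs, createVolumeParams{
			Name: name,
			Parameters: map[string]string{
				"anti-affinity-volume-ids": strings.Join(placed, ","),
			},
		})
		placed = append(placed, name)
	}
	return reqs
}

func main() {
	for _, r := range buildAntiAffineSet([]string{"vol-a", "vol-b", "vol-c"}) {
		fmt.Printf("%s: keep out of the domains of [%s]\n", r.Name, r.Parameters["anti-affinity-volume-ids"])
	}
}
```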
OK, thanks. This is great feedback.
Exposing storage-specific topology that does not affect the accessibility of a volume is on the roadmap, but not a blocker for v1.0.
I am wondering if we still need something explicit in the spec for this, now that we have topology constraints baked in. Assuming that the CO can refer to topology constraints to place work & storage where they can reach each other, and that the remaining concern is with things like storage clustering and failure domains, I guess I don't see why the CO should need to know or care about that type of affinity or anti-affinity. In other words, is there some reason that the SP cannot expose storage affinity/anti-affinity through `parameters` in `CreateVolumeRequest`?
No, there is no reason that such constraints cannot be exposed through
parameters of a create volume request. I think the ask here is to consider
first-classing the concept in CSI because we might not want to "burden" the
user with the need to express such constraints in SP-specific terms.
I see. My opinion is that, in & of itself, that's not really a good enough reason to expand the spec. There's all sorts of storage-specific stuff that might fall into the same general category of being opaque to the CO but interesting to the end user, and so far we've tried to hold the line against adding those things, in favor of having the SP document them as options.
CSI's existing support for topology is centered around "accessibility" -- the idea that a volume is only accessible by certain nodes in a cluster. Sharing this information between SP and CO allows the CO to operate intelligently (e.g. not scheduling workloads onto nodes from which their volume is inaccessible). However, CSI does not provide a way to express storage-specific topology: imagine a case where a volume is equally accessible by all nodes but has some internal storage system topology that can influence application performance. So the need here is for an SP to be able to share this information with a CO, and to enable a CO to use that information to programmatically influence volume provisioning.

An example for clarity: a storage system is broken into 3 racks, and volumes provisioned on any of the 3 racks can be accessed by any node in the CO cluster, but some nodes would be more performant with volumes from a specific rack. If the SP could express this internal topology to the CO, the CO could decide to, for example, spread volumes across the racks during provisioning, or influence where a workload using a given volume is scheduled.
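For reference, here is a rough Go sketch of the closest mechanism the spec offers today: an SP surfacing its racks as topology segments and a CO spreading replicas via `preferred` in `accessibility_requirements`. Field names are assumed from the v1.x Go bindings (github.com/container-storage-interface/spec/lib/go/csi), the `example.com/rack` segment key is invented, and the volume capabilities a real request needs are elided; note this also illustrates the conflation described above, since the racks here don't actually restrict accessibility.

```go
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// rackPreferredRequest builds a CreateVolumeRequest whose preferred topology
// points at one rack while keeping every rack requisite, so a plugin that
// reports racks as topology segments can favor that rack but still fall back.
func rackPreferredRequest(name, rack string, allRacks []string) *csi.CreateVolumeRequest {
	requisite := make([]*csi.Topology, 0, len(allRacks))
	for _, r := range allRacks {
		requisite = append(requisite, &csi.Topology{
			Segments: map[string]string{"example.com/rack": r},
		})
	}
	return &csi.CreateVolumeRequest{
		Name: name,
		AccessibilityRequirements: &csi.TopologyRequirement{
			Requisite: requisite, // any rack is acceptable...
			Preferred: []*csi.Topology{ // ...but this one is preferred
				{Segments: map[string]string{"example.com/rack": rack}},
			},
		},
	}
}

func main() {
	racks := []string{"rack-1", "rack-2", "rack-3"}
	// Spread three replica volumes across the three racks during provisioning.
	for i, rack := range racks {
		req := rackPreferredRequest(fmt.Sprintf("db-replica-%d", i), rack, racks)
		fmt.Println(req.GetName(), "->", req.GetAccessibilityRequirements().GetPreferred()[0].GetSegments())
	}
}
```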