
[Proposal] Maintain a stateless CSI implementation #224

Closed

Conversation

ShyamsundarR
Contributor

Introduce CSI plugin configuration, per Ceph cluster that is
used for provisioning storage, in lieu of the existing config
map based volume store.

Signed-off-by: ShyamsundarR <srangana@redhat.com>

@ShyamsundarR
Contributor Author

Related to discussions around #135, #205 and #206

Requesting review comments on approach: @gman0 @travisn @rootfs @humblec @Madhu-1

Also tagging folks who helped brainstorm, for review: @phlogistonjohn @raghavendra-talur @JohnStrunk

Signed-off-by: ShyamsundarR <srangana@redhat.com>
docs/design/proposals/stateless-csi-plugin.md
- We cannot deal with multiple Ceph clusters backed by a single CSI plugin
instance, due to the strict 1:1 mapping between them, and thus cannot do any
form of intelligent provisioning across Ceph clusters for a request
- **Needs Investigation** Does this hinder Topology based volume provisioning?


Is a Ceph cluster per topology how we would advocate deployment, or would the Ceph cluster span all topologies and use the CRUSH map to direct allocations?

Contributor Author


I really do not have an answer to that question yet. I have referred to the same in @liewegas's comment as well. Once I learn more about what Ceph can do, I can attempt to answer it.

My assumption is that there is one Ceph cluster per Topology.

form of intelligent provisioning across Ceph clusters for a request
- **Needs Investigation** Does this hinder Topology based volume provisioning?
As each instance is tied to a single Ceph cluster, we cannot provision from
different Topologies (each with their own Ceph cluster)
Member


What is a Topology and how does it relate to a StorageClass? Each StorageClass maps to a particular pool or fs in a particular ceph cluster, right?

Contributor Author


This surely needs more investigation from my end, so the answers may be sketchy.

First up, IMO even the CSI k8s implementation is sketchy in this regard (it is alpha) (reference).

An interesting point to note is point (2) in the above link, which reads:

Ability for Kubernetes (users or components) to influence where a volume is
provisioned (e.g. provision new volume in either "zone 1" or "zone 2").

I read the above as: a request may come in with a Topology requirement. If the StorageClass already has the cluster and pool specified, then the StorageClass is tied to a Topology (in my understanding), and if so we cannot decide where to provision from; it is already decided by the StorageClass. Which means we lose the ability to provision from a specific Topology, and each pod has to use a StorageClass that ties it to a Topology.

Unless, as @JohnStrunk asks (and I have questions around the same as well), a Ceph cluster can span these so-called Topologies. IOW, let's call them data centers (DCs) in a federated cluster: can we have some pools in DC1 and different pools in DC2, but all part of the same Ceph cluster?

  • If yes, then specifying the cluster in the StorageClass would suffice; the request would come in with a Topology requirement, and pool selection should be done based on it
  • If no, then the CSI driver would have to choose the cluster to provision from itself, and hence even that information should not come from the StorageClass

All this is speculation at best. Currently, the StorageClass specifies the cluster and pool that the volume should be provisioned from. In the future, based on the above proposal and a better understanding of what Topology really entails, we can omit these and add the logic within the CSI driver if needed.

@gman0
Contributor

gman0 commented Feb 26, 2019

What are the consequences for NS? What if the image/volume is not provisioned by CS but rather from outside (OpenStack Cinder/Manila)?

@ShyamsundarR
Contributor Author

What are the consequences for NS? What if the image/volume is not provisioned by CS but rather from outside (OpenStack Cinder/Manila)?

The NS receives information from volume_context and/or publish_context. Secrets were previously taken from the StorageClass, and would now shift to being read based on the volume_id and hence the respective Ceph cluster.

For non-CS volumes, I assume the PV (in k8s terms) would contain enough information to pass along to the NS. We need the NS verbs to attempt to get information from the passed-in parameters and hence detect the cluster and secrets, or fall back to assuming that this is a CS-created volume and act on it as mentioned.
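
To make the above concrete, here is a minimal sketch (in Go) of the fallback order described: prefer information passed in via the context parameters, and fall back to decoding the volume ID for CS-created volumes. The helper names, context keys, and config layout are illustrative assumptions, not actual ceph-csi code or CSI-spec fields.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterConfig is a stand-in for the per-Ceph-cluster configuration the
// plugin would maintain (monitors, secret reference, and so on).
type clusterConfig struct {
	Monitors   string
	SecretName string
}

// clusters models the per-cluster configuration store; hypothetical data.
var clusters = map[string]clusterConfig{
	"cluster-a": {Monitors: "mon1:6789,mon2:6789", SecretName: "cluster-a-secret"},
}

// decodeClusterID is a hypothetical helper that extracts the cluster
// identifier encoded into the volume ID by the CS during provisioning.
func decodeClusterID(volID string) (string, error) {
	if volID == "" {
		return "", errors.New("empty volume ID")
	}
	return "cluster-a", nil // a real decoder would parse the encoded fields
}

// resolveNodeConfig prefers details passed in through volume_context (as
// an externally provisioned volume could supply), and otherwise assumes a
// CS-created volume and derives the cluster from the volume ID.
func resolveNodeConfig(volID string, volumeContext map[string]string) (clusterConfig, error) {
	if mons, ok := volumeContext["monitors"]; ok {
		return clusterConfig{Monitors: mons, SecretName: volumeContext["secretName"]}, nil
	}
	id, err := decodeClusterID(volID)
	if err != nil {
		return clusterConfig{}, err
	}
	cfg, ok := clusters[id]
	if !ok {
		return clusterConfig{}, fmt.Errorf("unknown cluster %q in volume ID", id)
	}
	return cfg, nil
}

func main() {
	cfg, err := resolveNodeConfig("0009-cluster-a-8a3f9c1e", nil)
	fmt.Println(cfg.Monitors, err)
}
```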

Can you help me with an annotation of a PV, or an example of a non-CS-generated storage endpoint that reaches the NS? There are parts of the picture that I am extrapolating and want to be sure of.

@ShyamsundarR (Contributor Author) left a comment


The initial PR implementing the configuration changes has been put up here.
Noting some changes to the constructs detailed in this documentation, as a part of the implementation.

@gman0
Contributor

gman0 commented Mar 12, 2019

A question about the usage of FSID: are there any technical requirements for this to be an actual FSID as opposed to just some identifier of a particular ceph cluster? IMHO in some situations a more user-friendly name would be more apt.

@ShyamsundarR
Contributor Author

A question about the usage of FSID: are there any technical requirements for this to be an actual FSID as opposed to just some identifier of a particular ceph cluster? IMHO in some situations a more user-friendly name would be more apt.

The fsid can be any string that identifies the secret (in k8s) OR the directory name to query for a particular Ceph cluster's details. Further, it is encoded into the VolID. It is not used to validate that said Ceph cluster actually has the same fsid, etc.

Hence, it can be an arbitrary name, unique to each Ceph cluster that a CO uses, length restricted and immutable, as it is encoded into the VolID. The fsid meets these requirements nicely, and hence was chosen rather than having the user come up with a name; but from a user-friendliness POV it can be any name meeting the above constraints.

I would have to rework the fsid encoding into the VolID as a length-prefixed string, to accommodate variable-length names, but otherwise nothing else changes.
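
For illustration, a minimal sketch (in Go) of what such a length-prefixed encoding could look like; the exact field layout here is an assumption for the sketch, not the encoding used by the implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// composeVolID prefixes the cluster identifier with its length so that a
// decoder can recover it regardless of how long the chosen name is, and
// appends a per-volume UUID (or similar unique suffix).
func composeVolID(clusterID, objectUUID string) string {
	return fmt.Sprintf("%04d-%s-%s", len(clusterID), clusterID, objectUUID)
}

// decomposeVolID reverses composeVolID, recovering the cluster identifier
// and the per-volume suffix from a volume ID.
func decomposeVolID(volID string) (clusterID, objectUUID string, err error) {
	parts := strings.SplitN(volID, "-", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("malformed volume ID %q", volID)
	}
	n, err := strconv.Atoi(parts[0])
	if err != nil || len(parts[1]) < n+2 {
		return "", "", fmt.Errorf("malformed volume ID %q", volID)
	}
	return parts[1][:n], parts[1][n+1:], nil
}

func main() {
	volID := composeVolID("ceph-cluster-east", "8a3f9c1e")
	fmt.Println(volID) // 0017-ceph-cluster-east-8a3f9c1e
	cluster, suffix, _ := decomposeVolID(volID)
	fmt.Println(cluster, suffix)
}
```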

If the verbiage needs to change to state that clusterid can be a unique name with the above restrictions, or for example can be the fsid, in the spec (and resulting PRs), I can do the same. Let me know.

@gman0
Contributor

gman0 commented Mar 12, 2019

@ShyamsundarR in that case please do, also in your PR (and others)

@ShyamsundarR
Contributor Author

@ShyamsundarR in that case please do, also in your PR (and others)

Done in the code PR; changing this doc is pending (as are other doc changes per the implementation) and will be done subsequently, as other PRs land that may require further doc changes. @gman0 if you need the doc to reflect the state of the submitted PRs as soon as possible, let me know, so that I can reprioritize between the code and the doc.

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@ShyamsundarR
Contributor Author

Updated the section on how image names and volume_name idempotence would be maintained, using Ceph omaps instead of volume_name hashes and image instance IDs encoded into the image and volume IDs.

Thanks to @dillaman for course-correcting and suggesting omaps instead of the older scheme.

The older scheme was unwieldy to implement and required that an in-memory cache of all image hashes and their instances be created by the CSI plugin during startup, to ensure that newer volume name hashes use different instance IDs when there are collisions.

@liewegas @rootfs @gman0 @Madhu-1 @dillaman requesting a re-look at the new scheme to catch any potential issues. I have a working version of the new scheme for create/delete/mount of rbd, and am working on snapshots and cloning to reach feature parity with the current implementation before posting it.
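
To illustrate the omap-based idempotence described above, here is a minimal sketch (in Go) of the lookup-or-reserve flow. The `omapStore` interface is a stand-in for RADOS omap operations (setting/getting omap keys on named objects), and the object/key names are assumptions for the sketch, not necessarily those used by the implementation.

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

// omapStore stands in for RADOS omap operations on named objects; a real
// implementation would talk to the Ceph cluster.
type omapStore interface {
	Get(object, key string) (string, bool)
	Set(object, key, value string)
}

// memStore is an in-memory stand-in used only to make the sketch runnable.
type memStore map[string]map[string]string

func (m memStore) Get(object, key string) (string, bool) {
	v, ok := m[object][key]
	return v, ok
}

func (m memStore) Set(object, key, value string) {
	if m[object] == nil {
		m[object] = map[string]string{}
	}
	m[object][key] = value
}

// reserveVolume returns the image UUID for a CSI volume name, creating a
// new mapping only if one does not already exist: a directory object keeps
// volume-name -> UUID keys, and a per-image object keeps the reverse
// pointer back to the CSI volume name.
func reserveVolume(store omapStore, csiVolName string) string {
	const directory = "csi.volumes.directory" // assumed object name
	if id, ok := store.Get(directory, "csi.volume."+csiVolName); ok {
		return id // idempotent: the same request name maps to the same image
	}
	id := uuid.New().String()
	store.Set(directory, "csi.volume."+csiVolName, id)
	store.Set("csi.volume."+id, "csi.volname", csiVolName)
	return id
}

func main() {
	store := memStore{}
	first := reserveVolume(store, "pvc-1234")
	second := reserveVolume(store, "pvc-1234")
	fmt.Println(first == second) // true: CreateVolume retries stay idempotent
}
```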

@gman0
Contributor

gman0 commented Mar 18, 2019

@ShyamsundarR nice! Is there anything else we could offload to Ceph omaps instead of storing it in k8s ConfigMaps?

@ShyamsundarR
Contributor Author

@ShyamsundarR nice! Is there anything else we could offload to Ceph omaps instead of storing it in k8s ConfigMaps?

Currently nothing that a combination of image-meta keys and rbd info does not get us. But if we needed to stash away additional data, we could extend the scheme to create per-image omaps for attribute storage in the future.

I am working through the snapshot code changes to use this scheme; maybe that will reveal information that we want to stash in addition to what we store now, although my first inclination would be to store these as image-meta keys rather than in a new omap.

@ShyamsundarR
Contributor Author


Hmmm... looks like image-meta keys are shared across clones and cannot be set for snapshots, so I guess we will be using a per image/snap/clone omap to store the csi.volume.name and also if required other csi keys.

@dillaman

Hmmm... looks like image-meta keys are shared across clones and cannot be set for snapshots, so I guess we will be using a per image/snap/clone omap to store the csi.volume.name and also if required other csi keys.

They aren't "shared" across clones, but they are copied upon creation of the clone. WRT snapshots, if each snapshot volume were to be stored as its own (cloned) image, they again would have their own image-meta.

@ShyamsundarR
Contributor Author

Hmmm... looks like image-meta keys are shared across clones and cannot be set for snapshots, so I guess we will be using a per image/snap/clone omap to store the csi.volume.name and also if required other csi keys.

They aren't "shared" across clones, but they are copied upon creation of the clone. WRT snapshots, if each snapshot volume were to be stored as its own (cloned) image, they again would have their own image-meta.

Thanks, makes sense. Tried changing it for the cloned image and things work as expected.

For snapshots, currently we do not want to lose the parent relation, possibly even when cloned, and also, I do not want to tie the design to either possibility.

Hence, instead of the backpointer to the k8s volname being an image-meta value on the image, I think a per image/snap/clone omap to store the same sounds better (I think this is what you were referring to as well in our off-list conversation, about how rbd itself keeps an "rbd_id." object per image, if I am not mistaken).

@dillaman

For snapshots, currently we do not want to lose the parent relation, possibly even when cloned, and also, I do not want to tie the design to either possibility.

I think it's important to not let the current RBD implementation details drive the design of the CSI. As discussed in issue #227, if the CSI spec seems to prefer that snapshots be independent, shouldn't that be the goal? You would still need to maintain metadata about which snapshots are actually owned by the CO and be able to efficiently list them. i.e. you would also want to create a directory listing for snapshots to RBD images (or an attribute on the csi-vol-[uuid] object omap keys to indicate that it's a PV or snapshot).

Also, thinking ahead to the "cloning PVs from a golden image (i.e. a snapshot)", since there is an effective limit for how many parents an image can have, having the ability to flatten a snapshot upon its first clone usage might be a nice optimization to avoid potentially needing to flatten every derived PV clone.
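
A rough sketch of the flatten-on-first-clone idea, using a toy in-memory model of an image with a parent chain. This is illustrative only: `flatten` here merely stands in for `rbd flatten`, and the decision point is hypothetical, not an existing ceph-csi or librbd API.

```go
package main

import "fmt"

// image is a toy model of an RBD image (or a snapshot stored as its own
// cloned image) with a parent chain; flatten stands in for `rbd flatten`.
type image struct {
	name       string
	parent     *image
	cloneCount int
}

func (i *image) flatten() { i.parent = nil }

func (i *image) depth() int {
	d := 0
	for p := i.parent; p != nil; p = p.parent {
		d++
	}
	return d
}

// cloneFrom models the optimization above: flatten the snapshot image upon
// its first clone usage, so derived PV clones do not each inherit (and
// potentially need to flatten) the full parent chain themselves.
func cloneFrom(snap *image, name string) *image {
	if snap.cloneCount == 0 && snap.parent != nil {
		snap.flatten()
	}
	snap.cloneCount++
	return &image{name: name, parent: snap}
}

func main() {
	golden := &image{name: "golden-image"}
	snap := &image{name: "golden-snap", parent: golden} // snapshot kept as a clone
	pv1 := cloneFrom(snap, "pvc-clone-1")
	pv2 := cloneFrom(snap, "pvc-clone-2")
	fmt.Println(pv1.depth(), pv2.depth()) // both stay at depth 1
}
```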

@ShyamsundarR
Contributor Author

For snapshots, currently we do not want to lose the parent relation, possibly even when cloned, and also, I do not want to tie the design to either possibility.

I think it's important to not let the current RBD implementation details drive the design of the CSI. As discussed in issue #227, if the CSI spec seems to prefer that snapshots be independent, shouldn't that be the goal?

Agreed. I have not yet chased down the snapshot-to-clone and CSI/RBD perspective enough to make further observations at this point in time.

You would still need to maintain metadata about which snapshots are actually owned by the CO and be able to efficiently list them. i.e. you would also want to create a directory listing for snapshots to RBD images (or an attribute on the csi-vol-[uuid] object omap keys to indicate that it's a PV or snapshot).

I agree that the metadata around snapshots is required; my comment was ill-formed. What I really meant was that I would not pursue image-meta based tags (even though these are copied to a clone and can be set to different values), but rather use a per-image omap to store additional attributes as required.

Also, as stated above, snapshot metadata would also need to be preserved.

Also, thinking ahead to the "cloning PVs from a golden image (i.e. a snapshot)", since there is an effective limit for how many parents an image can have, having the ability to flatten a snapshot upon its first clone usage might be a nice optimization to avoid potentially needing to flatten every derived PV clone.

- Added fsname based provisioning
- Added fscid usage instead of pool ID in VolID encoding
- Added usage of the CephFS metadata pool and a namespace for omap maintenance

Additionally, as the solution has changed to adopt omaps and no longer use
secrets mapped into the CSI plugins, the solution is elaborated with the
current scheme.

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@ShyamsundarR
Contributor Author

Updated the design for CephFS-related changes. In short, the changes are:

  • Use fsname in the CephFS StorageClass to direct provisioning to a specific fsname instance
  • Use the fsname-metadata pool to store the omaps that retain the volume name to image name and volume ID mapping
  • Store the CSI objects and omaps in a RADOS namespace unique to CSI, to safeguard against collisions with other objects in the fsname-metadata pool
  • The VolumeID would be encoded with the fscid of the fsname for CephFS, the same as the pool ID in the case of RBD

The omap scheme remains the same as implemented for RBD.
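
For illustration, a small sketch (in Go) of how the CephFS variant described above could locate its omap objects and fold the fscid into the volume ID. The pool-name convention, the "csi" namespace string, and the field layout are assumptions for this sketch, not the exact implementation.

```go
package main

import "fmt"

// omapLocation identifies where the CSI omap objects for a backend live:
// for CephFS, the filesystem's metadata pool confined to a CSI-specific
// RADOS namespace; for RBD, the data pool itself.
type omapLocation struct {
	Pool      string
	Namespace string
}

// cephFSOmapLocation assumes a "<fsname>-metadata" pool naming and a
// dedicated "csi" namespace (both assumptions), keeping CSI objects away
// from the filesystem's own metadata objects.
func cephFSOmapLocation(fsName string) omapLocation {
	return omapLocation{Pool: fsName + "-metadata", Namespace: "csi"}
}

// composeScopedVolID encodes the cluster ID plus the filesystem's fscid
// (for CephFS) or the pool ID (for RBD) alongside the per-volume UUID, so
// the node/controller servers can locate the volume from the ID alone.
func composeScopedVolID(clusterID string, scopeID int64, objectUUID string) string {
	return fmt.Sprintf("%04d-%s-%016x-%s", len(clusterID), clusterID, scopeID, objectUUID)
}

func main() {
	loc := cephFSOmapLocation("cephfs")
	fmt.Println(loc.Pool, loc.Namespace)
	fmt.Println(composeScopedVolID("cluster-a", 2, "5c3e9a10"))
}
```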

Sharding the omap remains a future concern, with the ability to shard the objects into buckets based on the CSI generated volume name.
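
A possible shape for that future sharding, purely as a sketch; the bucket count and the directory-object naming are assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// directoryObjectFor shards the volume-name directory across a fixed set
// of omap objects by hashing the CSI-generated volume name, so no single
// RADOS object accumulates every key.
func directoryObjectFor(csiVolName string, buckets uint32) string {
	h := fnv.New32a()
	h.Write([]byte(csiVolName))
	return fmt.Sprintf("csi.volumes.directory.%d", h.Sum32()%buckets)
}

func main() {
	fmt.Println(directoryObjectFor("pvc-1234", 16))
	fmt.Println(directoryObjectFor("pvc-5678", 16))
}
```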

NOTE: Using ceph volume based commands to manage subvolume creation and deletion would be done after the current implementation.

@batrick @ajarr Please check if the above addresses the concerns around implementing the same scheme with CephFS, as per conversations over mail. If so, please drop a comment so that we can unfreeze the RBD implementation (#312) from its "do not merge" state to a review-pending state, as it is waiting on design acceptance of the CephFS changes as one of the criteria to be merged. (Tagging @dillaman as well for comments provided.)

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@humblec humblec added the release-1.1.0 Track release 1.1 items label May 9, 2019
@humblec humblec added release-1.2.0 and removed release-1.1.0 Track release 1.1 items labels Jul 13, 2019
@Madhu-1 (Collaborator) left a comment


LGTM

@Madhu-1
Collaborator

Madhu-1 commented May 14, 2020

We can update the design doc if something is needed, or if we have enhanced the design, maybe in a follow-up PR.

@humblec
Collaborator

humblec commented May 14, 2020

@ShyamsundarR does this proposal match the current code or implementation in place? Just making sure both are in parity. If you can confirm, we can get this merged.

@stale

stale bot commented Sep 26, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in a month if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 26, 2020
@stale

stale bot commented Oct 31, 2020

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

@stale stale bot closed this Oct 31, 2020
nixpanic pushed a commit to nixpanic/ceph-csi that referenced this pull request Mar 4, 2024
Syncing latest changes from upstream devel for ceph-csi