
[Proposal] Maintain a stateless CSI implementation #224

Closed

Conversation

ShyamsundarR
Contributor

Introduce CSI plugin configuration, per Ceph cluster that is
used for provisioning storage, in lieu of the existing config
map based volume store.

Signed-off-by: ShyamsundarR <srangana@redhat.com>

@ShyamsundarR
Contributor Author

Related to discussions around #135, #205 and #206

Requesting review comments on approach: @gman0 @travisn @rootfs @humblec @Madhu-1

Also tagging folks who helped brainstorm, for review: @phlogistonjohn @raghavendra-talur @JohnStrunk

Signed-off-by: ShyamsundarR <srangana@redhat.com>
docs/design/proposals/stateless-csi-plugin.md
- We cannot deal with multiple Ceph clusters backed by a single CSI plugin
instance, due to the strict 1:1 mapping between them, and thus cannot do any
form of intelligent provisioning across Ceph clusters for a request
- **Needs Investigation** Does this hinder Topology based volume provisioning?


Is a Ceph cluster per topology how we would advocate deployment, or would the Ceph cluster span all topologies and use the CRUSH map to direct allocations?

Contributor Author


I really do not have an answer to that question yet. I have referred to the same in @liewegas's comment as well. Once I learn more about what Ceph can do, I can attempt to answer it.

My assumption is that there is one Ceph cluster per Topology.

form of intelligent provisioning across Ceph clusters for a request
- **Needs Investigation** Does this hinder Topology based volume provisioning?
As each instance is tied to a single Ceph cluster, we cannot provision from
different Topologies (each with their own Ceph cluster)
Member


What is a Topology and how does it relate to a StorageClass? Each StorageClass maps to a particular pool or fs in a particular ceph cluster, right?

Contributor Author


This surely needs more investigation from my end, so the answers may be sketchy.

First up, IMO even the CSI k8s implementation is sketchy in this regard (it is alpha) (reference).

An interesting point to note is point (2) in the above link, which reads:

Ability for Kubernetes (users or components) to influence where a volume is
provisioned (e.g. provision new volume in either "zone 1" or "zone 2").

I read the above as: a request may come in with a Topology requirement. If the StorageClass already has the cluster and pool specified, then the StorageClass is tied to a Topology (in my understanding), and if so we cannot decide where to provision from; it is already decided by the StorageClass. Which means we lose the ability to provision from a specific Topology, and each pod has to use a StorageClass that ties it to a Topology.

Unless, as @JohnStrunk asks (and I have questions around the same as well), a Ceph cluster can span these so-called Topologies. IOW, let's call them data centers (DCs) in a federated cluster: can we have some pools in DC1 and different pools in DC2, but all part of the same Ceph cluster?

  • If yes, then specifying the cluster in the StorageClass would suffice; the request would come in with a Topology requirement, and pool selection should be done based on it
  • If no, then the CSI driver would have to choose the cluster to provision from itself, and hence even that information should not come from the StorageClass

All this is speculation at best. Currently, the StorageClass specifies the cluster and pool that the volume should be provisioned from. In the future, based on the above proposal and a better understanding of what Topology really entails, we can omit these and add the logic within the CSI driver if needed.

@gman0
Contributor

gman0 commented Feb 26, 2019

What are the consequences for NS? What if the image/volume is not provisioned by CS but rather from outside (OpenStack Cinder/Manila)?

@ShyamsundarR
Contributor Author

What are the consequences for NS? What if the image/volume is not provisioned by CS but rather from outside (OpenStack Cinder/Manila)?

The NS receives information from volume_context and/or publish_context. Secrets were previously taken from the StorageClass, and would now shift to being read based on the volume_id and hence the respective Ceph cluster.

For non-CS volumes, I assume the PV (in k8s terms) would contain enough information to pass along to the NS. We need the NS verbs to attempt to get information from the passed-in parameters and hence detect the cluster and secrets, or fall back to assuming that this is a CS-created volume and act on it as mentioned.
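
To make the above concrete, here is a minimal sketch (in Go) of the fallback order described: prefer information passed in via the context parameters, and fall back to decoding the volume ID for CS-created volumes. The helper names, context keys, and config layout are illustrative assumptions, not actual ceph-csi code or CSI-spec fields.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterConfig is a stand-in for the per-Ceph-cluster configuration the
// plugin would maintain (monitors, secret reference, and so on).
type clusterConfig struct {
	Monitors   string
	SecretName string
}

// clusters models the per-cluster configuration store; hypothetical data.
var clusters = map[string]clusterConfig{
	"cluster-a": {Monitors: "mon1:6789,mon2:6789", SecretName: "cluster-a-secret"},
}

// decodeClusterID is a hypothetical helper that extracts the cluster
// identifier encoded into the volume ID by the CS during provisioning.
func decodeClusterID(volID string) (string, error) {
	if volID == "" {
		return "", errors.New("empty volume ID")
	}
	return "cluster-a", nil // a real decoder would parse the encoded fields
}

// resolveNodeConfig prefers details passed in through volume_context (as
// an externally provisioned volume could supply), and otherwise assumes a
// CS-created volume and derives the cluster from the volume ID.
func resolveNodeConfig(volID string, volumeContext map[string]string) (clusterConfig, error) {
	if mons, ok := volumeContext["monitors"]; ok {
		return clusterConfig{Monitors: mons, SecretName: volumeContext["secretName"]}, nil
	}
	id, err := decodeClusterID(volID)
	if err != nil {
		return clusterConfig{}, err
	}
	cfg, ok := clusters[id]
	if !ok {
		return clusterConfig{}, fmt.Errorf("unknown cluster %q in volume ID", id)
	}
	return cfg, nil
}

func main() {
	cfg, err := resolveNodeConfig("0009-cluster-a-8a3f9c1e", nil)
	fmt.Println(cfg.Monitors, err)
}
```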

Can you help me with an annotation of a PV, or an example of a non-CS-generated storage endpoint that reaches the NS? There are parts of the picture that I am extrapolating and want to be sure of.

@ShyamsundarR (Contributor Author) left a comment


The initial PR implementing the configuration changes has been put up here.
Noting some changes to the constructs detailed in this documentation, as a part of the implementation.

@gman0
Contributor

gman0 commented Mar 12, 2019

A question about the usage of FSID: are there any technical requirements for this to be an actual FSID as opposed to just some identifier of a particular ceph cluster? IMHO in some situations a more user-friendly name would be more apt.

@ShyamsundarR
Contributor Author

A question about the usage of FSID: are there any technical requirements for this to be an actual FSID as opposed to just some identifier of a particular ceph cluster? IMHO in some situations a more user-friendly name would be more apt.

The fsid can be any string that identifies the secret (in k8s) OR the directory name to query for a particular Ceph cluster's details. Further, it is encoded into the VolID. It is not used to validate that said Ceph cluster actually has the same fsid, etc.

Hence, it can be an arbitrary name, unique to each Ceph cluster that a CO uses, length restricted and immutable, as it is encoded into the VolID. The fsid meets these requirements nicely, and hence was chosen rather than having the user come up with a name; but from a user-friendliness POV it can be any name meeting the above constraints.

I would have to rework the fsid encoding into the VolID as a length-prefixed string, to accommodate variable-length names, but otherwise nothing else changes.
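
For illustration, a minimal sketch (in Go) of what such a length-prefixed encoding could look like; the exact field layout here is an assumption for the sketch, not the encoding used by the implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// composeVolID prefixes the cluster identifier with its length so that a
// decoder can recover it regardless of how long the chosen name is, and
// appends a per-volume UUID (or similar unique suffix).
func composeVolID(clusterID, objectUUID string) string {
	return fmt.Sprintf("%04d-%s-%s", len(clusterID), clusterID, objectUUID)
}

// decomposeVolID reverses composeVolID, recovering the cluster identifier
// and the per-volume suffix from a volume ID.
func decomposeVolID(volID string) (clusterID, objectUUID string, err error) {
	parts := strings.SplitN(volID, "-", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("malformed volume ID %q", volID)
	}
	n, err := strconv.Atoi(parts[0])
	if err != nil || len(parts[1]) < n+2 {
		return "", "", fmt.Errorf("malformed volume ID %q", volID)
	}
	return parts[1][:n], parts[1][n+1:], nil
}

func main() {
	volID := composeVolID("ceph-cluster-east", "8a3f9c1e")
	fmt.Println(volID) // 0017-ceph-cluster-east-8a3f9c1e
	cluster, suffix, _ := decomposeVolID(volID)
	fmt.Println(cluster, suffix)
}
```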

If the verbiage needs to change to state that clusterid can be a unique name with the above restrictions, or for example can be the fsid, in the spec (and resulting PRs), I can do the same. Let me know.

@gman0
Contributor

gman0 commented Mar 12, 2019

@ShyamsundarR in that case please do, also in your PR (and others)

@ShyamsundarR
Contributor Author

@ShyamsundarR in that case please do, also in your PR (and others)

Done in the code PR; changing this doc is pending (as are other doc changes per the implementation) and will be done subsequently, as other PRs land that may require further doc changes. @gman0 if you need the doc to reflect the state of the submitted PRs as soon as possible, let me know, so that I can reprioritize between the code and the doc.

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@ShyamsundarR
Contributor Author

Updated the section on how image names and volume_name idempotence would be maintained, using Ceph omaps instead of volume_name hashes and image instance IDs encoded into the image and volume IDs.

Thanks to @dillaman for course-correcting and suggesting omaps instead of the older scheme.

The older scheme was unwieldy to implement and required that an in-memory cache of all image hashes and their instances be created by the CSI plugin during startup, to ensure that newer volume name hashes use different instance IDs when there are collisions.

@liewegas @rootfs @gman0 @Madhu-1 @dillaman requesting a re-look at the new scheme to catch any potential issues. I have a working version of the new scheme for create/delete/mount of rbd, and am working on snapshots and cloning to reach feature parity with the current implementation before posting it.
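
To illustrate the omap-based idempotence described above, here is a minimal sketch (in Go) of the lookup-or-reserve flow. The `omapStore` interface is a stand-in for RADOS omap operations (setting/getting omap keys on named objects), and the object/key names are assumptions for the sketch, not necessarily those used by the implementation.

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

// omapStore stands in for RADOS omap operations on named objects; a real
// implementation would talk to the Ceph cluster.
type omapStore interface {
	Get(object, key string) (string, bool)
	Set(object, key, value string)
}

// memStore is an in-memory stand-in used only to make the sketch runnable.
type memStore map[string]map[string]string

func (m memStore) Get(object, key string) (string, bool) {
	v, ok := m[object][key]
	return v, ok
}

func (m memStore) Set(object, key, value string) {
	if m[object] == nil {
		m[object] = map[string]string{}
	}
	m[object][key] = value
}

// reserveVolume returns the image UUID for a CSI volume name, creating a
// new mapping only if one does not already exist: a directory object keeps
// volume-name -> UUID keys, and a per-image object keeps the reverse
// pointer back to the CSI volume name.
func reserveVolume(store omapStore, csiVolName string) string {
	const directory = "csi.volumes.directory" // assumed object name
	if id, ok := store.Get(directory, "csi.volume."+csiVolName); ok {
		return id // idempotent: the same request name maps to the same image
	}
	id := uuid.New().String()
	store.Set(directory, "csi.volume."+csiVolName, id)
	store.Set("csi.volume."+id, "csi.volname", csiVolName)
	return id
}

func main() {
	store := memStore{}
	first := reserveVolume(store, "pvc-1234")
	second := reserveVolume(store, "pvc-1234")
	fmt.Println(first == second) // true: CreateVolume retries stay idempotent
}
```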

@gman0
Contributor

gman0 commented Mar 18, 2019

@ShyamsundarR nice! Is there anything else we could offload to Ceph omaps instead of storing it in k8s ConfigMaps?

@ShyamsundarR
Contributor Author

@ShyamsundarR nice! Is there anything else we could offload to Ceph omaps instead of storing it in k8s ConfigMaps?

Currently nothing that a combination of image-meta keys and rbd info does not get us. But if we needed to stash away additional data, we could extend the scheme to create per-image omaps for attribute storage in the future.

I am working through the snapshot code changes to use this scheme; maybe that will reveal information that we want to stash in addition to what we store now, although my first inclination would be to store these as image-meta keys rather than in a new omap.

@ShyamsundarR
Contributor Author


Hmmm... looks like image-meta keys are shared across clones and cannot be set for snapshots, so I guess we will be using a per image/snap/clone omap to store the csi.volume.name and also if required other csi keys.

@dillaman

Hmmm... looks like image-meta keys are shared across clones and cannot be set for snapshots, so I guess we will be using a per image/snap/clone omap to store the csi.volume.name and also if required other csi keys.

They aren't "shared" across clones, but they are copied upon creation of the clone. WRT snapshots, if each snapshot volume were to be stored as its own (cloned) image, they again would have their own image-meta.

@ShyamsundarR
Contributor Author

Hmmm... looks like image-meta keys are shared across clones and cannot be set for snapshots, so I guess we will be using a per image/snap/clone omap to store the csi.volume.name and also if required other csi keys.

They aren't "shared" across clones, but they are copied upon creation of the clone. WRT snapshots, if each snapshot volume were to be stored as its own (cloned) image, they again would have their own image-meta.

Thanks, makes sense. Tried changing it for the cloned image and things work as expected.

For snapshots, currently we do not want to lose the parent relation, possibly even when cloned, and also, I do not want to tie the design to either possibility.

Hence, instead of the backpointer to the k8s volname being an image-meta value on the image, I think a per image/snap/clone omap to store the same sounds better (I think this is what you were referring to as well in our off-list conversation, about how rbd itself keeps an "rbd_id." object per image, if I am not mistaken).

@dillaman

For snapshots, currently we do not want to lose the parent relation, possibly even when cloned, and also, I do not want to tie the design to either possibility.

I think it's important to not let the current RBD implementation details drive the design of the CSI. As discussed in issue #227, if the CSI spec seems to prefer that snapshots be independent, shouldn't that be the goal? You would still need to maintain metadata about which snapshots are actually owned by the CO and be able to efficiently list them. i.e. you would also want to create a directory listing for snapshots to RBD images (or an attribute on the csi-vol-[uuid] object omap keys to indicate that it's a PV or snapshot).

Also, thinking ahead to the "cloning PVs from a golden image (i.e. a snapshot)", since there is an effective limit for how many parents an image can have, having the ability to flatten a snapshot upon its first clone usage might be a nice optimization to avoid potentially needing to flatten every derived PV clone.
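
A rough sketch of the flatten-on-first-clone idea, using a toy in-memory model of an image with a parent chain. This is illustrative only: `flatten` here merely stands in for `rbd flatten`, and the decision point is hypothetical, not an existing ceph-csi or librbd API.

```go
package main

import "fmt"

// image is a toy model of an RBD image (or a snapshot stored as its own
// cloned image) with a parent chain; flatten stands in for `rbd flatten`.
type image struct {
	name       string
	parent     *image
	cloneCount int
}

func (i *image) flatten() { i.parent = nil }

func (i *image) depth() int {
	d := 0
	for p := i.parent; p != nil; p = p.parent {
		d++
	}
	return d
}

// cloneFrom models the optimization above: flatten the snapshot image upon
// its first clone usage, so derived PV clones do not each inherit (and
// potentially need to flatten) the full parent chain themselves.
func cloneFrom(snap *image, name string) *image {
	if snap.cloneCount == 0 && snap.parent != nil {
		snap.flatten()
	}
	snap.cloneCount++
	return &image{name: name, parent: snap}
}

func main() {
	golden := &image{name: "golden-image"}
	snap := &image{name: "golden-snap", parent: golden} // snapshot kept as a clone
	pv1 := cloneFrom(snap, "pvc-clone-1")
	pv2 := cloneFrom(snap, "pvc-clone-2")
	fmt.Println(pv1.depth(), pv2.depth()) // both stay at depth 1
}
```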

@ShyamsundarR
Contributor Author

For snapshots, currently we do not want to lose the parent relation, possibly even when cloned, and also, I do not want to tie the design to either possibility.

I think it's important to not let the current RBD implementation details drive the design of the CSI. As discussed in issue #227, if the CSI spec seems to prefer that snapshots be independent, shouldn't that be the goal?

Agreed. I have not yet chased down the snapshot-to-clone and CSI/RBD perspective enough to make further observations at this point in time.

You would still need to maintain metadata about which snapshots are actually owned by the CO and be able to efficiently list them. i.e. you would also want to create a directory listing for snapshots to RBD images (or an attribute on the csi-vol-[uuid] object omap keys to indicate that it's a PV or snapshot).

I agree that the metadata around snapshots is required; my comment was ill-formed. What I really meant was that I would not pursue image-meta based tags (even though these are copied to a clone and can be set to different values), but rather use a per-image omap to store additional attributes as required.

Also, as stated above, snapshot metadata would also need to be preserved.

Also, thinking ahead to the "cloning PVs from a golden image (i.e. a snapshot)", since there is an effective limit for how many parents an image can have, having the ability to flatten a snapshot upon its first clone usage might be a nice optimization to avoid potentially needing to flatten every derived PV clone.

- Added fsname based provisioning
- Added fscid usage instead of pool ID in VolID encoding
- Added usage of the CephFS metadata pool and a namespace for omap maintenance

Additionally, as the solution has changed to adopt omaps and no longer use
secrets mapped into the CSI plugins, the solution is elaborated with the
current scheme.

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@ShyamsundarR
Contributor Author

Updated the design for CephFS-related changes. In short, the changes are:

  • Use fsname in the CephFS StorageClass to direct provisioning to a specific fsname instance
  • Use the fsname-metadata pool to store the omaps that retain the volume name to image name and volume ID mapping
  • Store the CSI objects and omaps in a RADOS namespace unique to CSI, to safeguard against collisions with other objects in the fsname-metadata pool
  • The VolumeID would be encoded with the fscid of the fsname for CephFS, the same as the pool ID in the case of RBD

The omap scheme remains the same as implemented for RBD.
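
For illustration, a small sketch (in Go) of how the CephFS variant described above could locate its omap objects and fold the fscid into the volume ID. The pool-name convention, the "csi" namespace string, and the field layout are assumptions for this sketch, not the exact implementation.

```go
package main

import "fmt"

// omapLocation identifies where the CSI omap objects for a backend live:
// for CephFS, the filesystem's metadata pool confined to a CSI-specific
// RADOS namespace; for RBD, the data pool itself.
type omapLocation struct {
	Pool      string
	Namespace string
}

// cephFSOmapLocation assumes a "<fsname>-metadata" pool naming and a
// dedicated "csi" namespace (both assumptions), keeping CSI objects away
// from the filesystem's own metadata objects.
func cephFSOmapLocation(fsName string) omapLocation {
	return omapLocation{Pool: fsName + "-metadata", Namespace: "csi"}
}

// composeScopedVolID encodes the cluster ID plus the filesystem's fscid
// (for CephFS) or the pool ID (for RBD) alongside the per-volume UUID, so
// the node/controller servers can locate the volume from the ID alone.
func composeScopedVolID(clusterID string, scopeID int64, objectUUID string) string {
	return fmt.Sprintf("%04d-%s-%016x-%s", len(clusterID), clusterID, scopeID, objectUUID)
}

func main() {
	loc := cephFSOmapLocation("cephfs")
	fmt.Println(loc.Pool, loc.Namespace)
	fmt.Println(composeScopedVolID("cluster-a", 2, "5c3e9a10"))
}
```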

Sharding the omap remains a future concern, with the ability to shard the objects into buckets based on the CSI generated volume name.
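
A possible shape for that future sharding, purely as a sketch; the bucket count and the directory-object naming are assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// directoryObjectFor shards the volume-name directory across a fixed set
// of omap objects by hashing the CSI-generated volume name, so no single
// RADOS object accumulates every key.
func directoryObjectFor(csiVolName string, buckets uint32) string {
	h := fnv.New32a()
	h.Write([]byte(csiVolName))
	return fmt.Sprintf("csi.volumes.directory.%d", h.Sum32()%buckets)
}

func main() {
	fmt.Println(directoryObjectFor("pvc-1234", 16))
	fmt.Println(directoryObjectFor("pvc-5678", 16))
}
```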

NOTE: Using ceph volume based commands to manage subvolume creation and deletion would be done after the current implementation.

@batrick @ajarr Please check if the above addresses the concerns around implementing the same scheme with CephFS, as per conversations over mail. If so, please drop a comment so that we can unfreeze the RBD implementation (#312) from its "do not merge" state to a review-pending state, as it is waiting on design acceptance of the CephFS changes as one of the criteria to be merged. (Tagging @dillaman as well for comments provided.)

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@humblec humblec added the release-1.1.0 Track release 1.1 items label May 9, 2019
@humblec humblec added release-1.2.0 and removed release-1.1.0 Track release 1.1 items labels Jul 13, 2019
@Madhu-1 (Collaborator) left a comment


LGTM

@Madhu-1
Collaborator

Madhu-1 commented May 14, 2020

We can update the design doc if something is needed, or if we have enhanced the design, maybe in a follow-up PR.

@humblec
Collaborator

humblec commented May 14, 2020

@ShyamsundarR does this proposal match the current code or implementation in place? Just making sure both are in parity. If you can confirm, we can get this merged.

@stale

stale bot commented Sep 26, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in a month if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 26, 2020
@stale

stale bot commented Oct 31, 2020

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

@stale stale bot closed this Oct 31, 2020
nixpanic pushed a commit to nixpanic/ceph-csi that referenced this pull request Mar 4, 2024
Syncing latest changes from upstream devel for ceph-csi