[Proposal] Maintain a stateless CSI implementation #224
Conversation
Introduce CSI plugin configuration per Ceph cluster that is used for provisioning storage, in lieu of the existing config map based volume store. Signed-off-by: ShyamsundarR <srangana@redhat.com>
Related to discussions around #135, #205 and #206. Requesting review comments on the approach: @gman0 @travisn @rootfs @humblec @Madhu-1. Tagging folks who helped brainstorm, for a review as well: @phlogistonjohn @raghavendra-talur @JohnStrunk
Signed-off-by: ShyamsundarR <srangana@redhat.com>
- We cannot deal with multiple Ceph clusters backed by a single CSI plugin
  instance, due to the strict 1:1 mapping between the two, and thus cannot do any
  form of intelligent provisioning across Ceph clusters for a request
- **Needs Investigation** Does this hinder Topology based volume provisioning?
Is a Ceph cluster per topology how we would advocate deployment, or would the Ceph cluster span all topologies and use the CRUSH map to direct allocations?
I really do not have an answer to the question; I have referred to this in the response to @liewegas' comment as well. Once I learn more about what Ceph can do, I can attempt to answer it.
My assumption is that a Ceph cluster is per Topology.
- **Needs Investigation** Does this hinder Topology based volume provisioning?
  As each instance is tied to a single Ceph cluster, we cannot provision from
  different Topologies (each with their own Ceph clusters)
What is a Topology and how does it relate to a StorageClass? Each StorageClass maps to a particular pool or fs in a particular Ceph cluster, right?
This surely needs more investigation from my end, so the answers may be sketchy.
First up, IMO even the CSI k8s implementation is sketchy in this regard (it is alpha) (reference).
An interesting point to note is point (2) in the above link, which reads,
> Ability for Kubernetes (users or components) to influence where a volume is
> provisioned (e.g. provision new volume in either "zone 1" or "zone 2").
I read the above as: a request may come in with a Topology requirement. If the StorageClass already has the cluster and pool specified, then the StorageClass is tied to a Topology in my understanding, and we cannot decide where to provision from; that is already decided by the StorageClass. This means we lose the ability to provision from a specific Topology, and each pod has to use a StorageClass that ties it to a Topology.
Unless, as @JohnStrunk asks (and I have questions around the same), a Ceph cluster can span so-called Topologies. IOW, let's call them data centers (DC) in a federated cluster: can we have pools in DC1 and different pools in DC2, but all part of the same Ceph cluster?
- If yes, then specifying the cluster in the StorageClass would do; the request would come in with a Topology requirement, and pool selection should be done based on it
- If no, then the CSI driver would have to choose the cluster itself to provision from, and hence even that information should not come from the StorageClass
All this is speculation at best; currently the StorageClass will specify the cluster and pool that the volume should be provisioned from. In the future, based on the above proposal and a better understanding of what Topology really entails, we can omit these and add the logic within the CSI driver if needed.
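For illustration, here is a minimal sketch of the "yes" case above, written against the CSI spec Go protobufs; the zone label key and the zone-to-pool map are assumptions for the example, not something this proposal defines:

```go
package driver

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// poolForZone is a hypothetical lookup from a topology segment (a zone) to a
// pool in the single, topology-spanning Ceph cluster.
var poolForZone = map[string]string{
	"zone-1": "rbd-zone-1",
	"zone-2": "rbd-zone-2",
}

// choosePool picks a pool from the CreateVolume accessibility requirements,
// falling back to a default pool when the CO sends no topology requirement.
func choosePool(req *csi.CreateVolumeRequest, defaultPool string) (string, error) {
	topReq := req.GetAccessibilityRequirements()
	if topReq == nil {
		return defaultPool, nil
	}
	// Honor the preferred topologies first, then the requisite ones.
	for _, t := range append(topReq.GetPreferred(), topReq.GetRequisite()...) {
		if zone, ok := t.GetSegments()["topology.example.com/zone"]; ok {
			if pool, ok := poolForZone[zone]; ok {
				return pool, nil
			}
		}
	}
	return "", fmt.Errorf("no pool satisfies the requested topology")
}
```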
What are the consequences for the NS? What if the image/volume is not provisioned by the CS but rather from outside (OpenStack Cinder/Manila)?
The NS receives information either from volume_context and/or publish_context. Secrets were from the StorageClass, and would now shift to being read based on the volume_id and hence the respective Ceph cluster. For non-CS volumes, I assume the PV (in k8s terms) would contain enough information to pass along to the NS; we need the NS verbs to attempt getting information from the passed-in parameters, and hence detect the cluster and secrets, or fall back to assuming that this is a CS created volume and act on it as mentioned. Can you help me with an annotation of a PV, or an example of a non-CS generated storage end point that reaches the NS? There are parts of the picture that I am extrapolating and want to be sure of.
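To illustrate that fallback order, a rough sketch (the helper names and the "clusterID" volume_context key are assumptions, not part of this proposal):

```go
package driver

import (
	"fmt"
)

// clusterInfo stands in for whatever the per-cluster configuration
// (monitors, credential references) introduced by this proposal resolves to.
type clusterInfo struct {
	ClusterID string
	Monitors  []string
}

// clusterIDFromVolID is a placeholder for decoding the cluster identifier
// out of a CSI generated VolID (see the encoding sketch further below).
func clusterIDFromVolID(volID string) (string, error) {
	return "", fmt.Errorf("decoding elided in this sketch")
}

// loadClusterInfo is a hypothetical helper that reads the per-cluster
// configuration (secret or config directory) by its identifier.
func loadClusterInfo(clusterID string) (*clusterInfo, error) {
	return &clusterInfo{ClusterID: clusterID}, nil
}

// resolveCluster first tries the CSI generated volume_id, then falls back to
// an explicit key in volume_context for volumes provisioned outside the CS.
func resolveCluster(volumeID string, volumeContext map[string]string) (*clusterInfo, error) {
	if id, err := clusterIDFromVolID(volumeID); err == nil {
		return loadClusterInfo(id)
	}
	if id, ok := volumeContext["clusterID"]; ok {
		return loadClusterInfo(id)
	}
	return nil, fmt.Errorf("unable to determine the Ceph cluster for volume %q", volumeID)
}
```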
The initial PR implementing the configuration changes has been put up here.
Noting some changes, relative to the constructs detailed in this documentation, made as part of the implementation.
A question about the usage of FSID: are there any technical requirements for this to be an actual FSID, as opposed to just some identifier of a particular Ceph cluster? IMHO, in some situations a more user-friendly name would be more apt.
The fsid can be any string that identifies the secret (in k8s) OR the directory name to query for a particular Ceph cluster's details. Further, it is used to encode the same into the VolID. It is not used to validate whether said Ceph cluster has the same fsid, etc. Hence, it can be an arbitrary name, unique to each Ceph cluster that a CO uses, length restricted and immutable, as it is encoded into the VolID. The fsid meets these requirements nicely, hence it was chosen rather than having the user come up with a name; but from a user-friendliness POV it can be any name meeting the above constraints. I would have to rework the fsid encoding into the VolID, as a length-preceded string, to accommodate variable-length name strings, but otherwise nothing else changes. If the verbiage needs to change to state this, I can update it.
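A minimal sketch of such a length-preceded encoding (the exact layout is illustrative only, not the final VolID format):

```go
package driver

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeClusterID prefixes the cluster identifier with its length, so that an
// arbitrary-length, user-chosen name (not only a fixed-size fsid) can be
// embedded in, and later recovered from, the VolID.
func encodeClusterID(clusterID string) string {
	return fmt.Sprintf("%d-%s", len(clusterID), clusterID)
}

// decodeClusterID recovers the cluster identifier and returns whatever part
// of the VolID follows it.
func decodeClusterID(volID string) (clusterID, rest string, err error) {
	parts := strings.SplitN(volID, "-", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("malformed VolID %q", volID)
	}
	n, err := strconv.Atoi(parts[0])
	if err != nil || n <= 0 || n > len(parts[1]) {
		return "", "", fmt.Errorf("malformed VolID %q", volID)
	}
	return parts[1][:n], parts[1][n:], nil
}
```

For example, `encodeClusterID("my-cluster")` yields `10-my-cluster`, and decoding remains unambiguous even though the name itself contains a `-`.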
@ShyamsundarR in that case please do, also in your PR (and others)
Done in the code PR; changing this doc is pending (as are other changes to the doc as per the implementation), and will be done subsequently as other PRs land that may require further doc changes. @gman0 if you require that the doc reflect the state of submitted PRs as soon as possible, let me know, so that I can reorder my priorities between the code and the doc.
Signed-off-by: ShyamsundarR <srangana@redhat.com>
Updated the section on how image names and volume_name idempotence would be maintained, using Ceph omaps instead of volume_name hashes and image instance IDs and encoding the same into the image and volume IDs. Thanks to @dillaman for course correcting and suggesting omaps instead of the older scheme. The older scheme was unwieldy in implementation and required that an in-memory cache of all image hashes and their instances be created during startup by the CSI plugin, to ensure that newer volume name hashes use different instance IDs when there are collisions. @liewegas @rootfs @gman0 @Madhu-1 @dillaman request a re-look at the new scheme to catch any potential issues. I have a working version of the new scheme for create/delete/mount of rbd, and am working on snapshots and cloning to reach feature parity with the current implementation before posting it.
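For reference, a sketch of what the omap bookkeeping could look like using go-ceph (the object and key names here are illustrative; the actual naming is defined by the implementation PR):

```go
package driver

import (
	"github.com/ceph/go-ceph/rados"
)

const csiDirectory = "csi.volumes.default" // illustrative directory object name

// reserveVolumeName records the CO-generated volume name against a driver
// generated uuid in a per-pool "directory" omap, giving idempotent
// CreateVolume: a repeated request finds the existing uuid instead of
// creating a second image.
func reserveVolumeName(ioctx *rados.IOContext, reqName, uuid string) error {
	// Ensure the directory object exists (idempotent create).
	if err := ioctx.Create(csiDirectory, rados.CreateIdempotent); err != nil {
		return err
	}
	// csi.volume.<reqName> -> uuid in the directory object.
	if err := ioctx.SetOmap(csiDirectory, map[string][]byte{
		"csi.volume." + reqName: []byte(uuid),
	}); err != nil {
		return err
	}
	// Per-volume object holding the reverse mapping, for DeleteVolume by VolID.
	return ioctx.SetOmap("csi.volume."+uuid, map[string][]byte{
		"csi.volname": []byte(reqName),
	})
}

// lookupVolumeName returns the uuid previously reserved for reqName, or ""
// when the name is unknown (i.e. this is a fresh CreateVolume).
func lookupVolumeName(ioctx *rados.IOContext, reqName string) (string, error) {
	var uuid string
	err := ioctx.GetOmapValues(csiDirectory, "", "csi.volume."+reqName, 1,
		func(key string, value []byte) {
			uuid = string(value)
		})
	if err == rados.ErrNotFound {
		return "", nil
	}
	return uuid, err
}
```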
@ShyamsundarR nice! Is there anything else we could offload to Ceph omaps instead of storing it in k8s ConfigMaps?
Currently nothing that a combination of image-meta keys and rbd info does not get us. BUT, if we needed to stash away additional per-image data, we could extend this to create such omaps for attribute storage in the future. I am working through the snapshot code changes to use this scheme; maybe that will reveal information we want to stash in addition to what we store now, although my first inclination would be to store these as image-meta keys rather than a new omap.
Hmmm... looks like image-meta keys are shared across clones and cannot be set for snapshots, so I guess we will be using a per image/snap/clone omap to store the csi.volume.name and also, if required, other CSI keys.
They aren't "shared" across clones, but they are copied upon creation of the clone. WRT snapshots, if each snapshot volume were to be stored as its own (cloned) image, it would again have its own image-meta.
I think it's important to not let the current RBD implementation details drive the design of the CSI. As discussed in issue #227, if the CSI spec seems to prefer that snapshots be independent, shouldn't that be the goal? You would still need to maintain metadata about which snapshots are actually owned by the CO and be able to efficiently list them, i.e. you would also want to create a directory listing for snapshots to RBD images (or an attribute on the ...).
Also, thinking ahead to "cloning PVs from a golden image (i.e. a snapshot)": since there is an effective limit on how many parents an image can have, having the ability to flatten a snapshot upon its first clone usage might be a nice optimization, to avoid potentially needing to flatten every derived PV clone.
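A rough illustration of that optimization (assuming a recent go-ceph; the depth threshold and the depth helper are placeholders, not something the proposal specifies):

```go
package driver

import (
	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// maxCloneDepth is an assumed soft limit on the parent chain length.
const maxCloneDepth = 4

// parentChainDepth is a hypothetical helper that would walk the clone's
// parent chain and return its length; the walk itself is elided here.
func parentChainDepth(ioctx *rados.IOContext, name string) (int, error) {
	return 0, nil // placeholder
}

// maybeFlattenClone flattens a freshly cloned image when the chain gets deep,
// so that later PVs cloned from the same golden snapshot do not all inherit
// (and each have to flatten) a long ancestry.
func maybeFlattenClone(ioctx *rados.IOContext, name string) error {
	depth, err := parentChainDepth(ioctx, name)
	if err != nil || depth < maxCloneDepth {
		return err
	}
	img, err := rbd.OpenImage(ioctx, name, "") // "" == open the image head, not a snapshot
	if err != nil {
		return err
	}
	defer img.Close()
	return img.Flatten()
}
```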
Agree; I have not yet chased down the snapshot-to-clone and CSI/RBD perspective, to make further observations at this point in time.
I agree the metadata around snapshots is required, and my comment was ill formed and stated. What I really meant was: I would not attempt to pursue image-meta based tags (even though these are copied to a clone, and can be set to different values), but rather use a per-image omap to store additional attributes as required. Also, as stated above, snapshot metadata would also need to be preserved.
- Added fsname based provisioning
- Added fscid usage instead of pool ID in VolID encoding
- Added usage of the CephFS metadata pool and a namespace for omap maintenance

Additionally, as the solution has changed to adopt omaps and no longer uses secrets mapped into the CSI plugins, the solution is elaborated with the current scheme.

Signed-off-by: ShyamsundarR <srangana@redhat.com>
Updated the design for the CephFS related changes; the changes, in short, are as listed in the commit above.
The omap scheme remains the same as implemented for RBD. Sharding the omap remains a future concern, with the ability to shard the objects into buckets based on the CSI generated volume name. NOTE: Using ...
@batrick @ajarr Please check if the above addresses the concerns around implementing the same scheme with CephFS, as per conversations over mail. If so, please drop a comment, so that we can unfreeze the RBD implementation (#312) from a "do not merge" state to a review pending state, as it is waiting on design acceptance of the CephFS changes as one of the criteria to be merged. (Tagging @dillaman as well, for the comments provided.)
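As a small illustration of keeping the omaps in the CephFS metadata pool under a RADOS namespace (the namespace name is an assumption for the sketch):

```go
package driver

import (
	"github.com/ceph/go-ceph/rados"
)

// openCephFSOmapContext opens an IO context on the filesystem's metadata pool
// and scopes all omap objects to a dedicated RADOS namespace, so the CSI
// bookkeeping never mixes with CephFS's own metadata objects.
func openCephFSOmapContext(conn *rados.Conn, metadataPool string) (*rados.IOContext, error) {
	ioctx, err := conn.OpenIOContext(metadataPool)
	if err != nil {
		return nil, err
	}
	// "csi" is an illustrative namespace name.
	ioctx.SetNamespace("csi")
	return ioctx, nil
}
```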
Signed-off-by: ShyamsundarR <srangana@redhat.com>
LGTM
We can update the design doc if something is needed, or if we have enhanced the design, maybe in a follow-up PR.
@ShyamsundarR does this proposal match the current code or implementation in place? Just making sure both are in parity. If you can confirm, we can get this merged.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in a month if no further activity occurs. Thank you for your contributions.
This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.