
What is the advantage of maintaining the rbdVolumes map? #135

Closed

aibarbetta opened this issue Jan 17, 2019 · 11 comments

@aibarbetta

Hi all, I've read the conversation on #66 and the cm fix, but I still don't understand the advantage of having maps like rbdVolumes and rbdSnapshots.
For example, the rbdVolumes map is used in just two places:

  • On CreateVolume, to check if the given volume name already exists; couldn't the same thing be done with a call to rbd info or rbd status? (see the sketch below)
  • On CreateSnapshot, it seems to be used to retrieve the volume name from the volume ID. I understand that according to the CSI spec volName and volID MUST be different, but couldn't we just use a hash of volName as the volID to avoid saving it?
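For illustration, here is a minimal sketch of what such an existence check could look like when shelling out to the rbd CLI. The helper name and error handling are invented, and it already hints at the problem discussed in the replies below: you need monitors, pool and credentials before you can even ask.

```go
package main

import (
	"fmt"
	"os/exec"
)

// rbdImageExists is a hypothetical helper: it runs `rbd info pool/image` and
// treats any failure as "image not found". Real code would distinguish a
// missing image from auth or connectivity errors.
func rbdImageExists(pool, image, monitors, user, keyfile string) bool {
	cmd := exec.Command("rbd", "info", fmt.Sprintf("%s/%s", pool, image),
		"-m", monitors, "--id", user, "--keyfile", keyfile)
	return cmd.Run() == nil
}

func main() {
	// Note: monitors, pool and credentials must already be known here,
	// which is exactly what the CSI volume handle alone does not carry.
	exists := rbdImageExists("rbd", "pvc-1234", "192.168.0.1:6789", "admin", "/etc/ceph/keyfile")
	fmt.Println("image exists:", exists)
}
```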
@rootfs
Member

rootfs commented Jan 17, 2019

When calling rbd info or rbd status, we have to supply mon/pool/user, but that info is not in the CSI volume handle. So we have to persist a volume map, which includes meta info such as mon/pool/user. This is crucial when the compute node on which the provisioner runs goes down.
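To make that concrete, this is roughly the kind of metadata the map has to carry per volume; the field names below are approximations for illustration, not the exact ceph-csi struct.

```go
package rbdcache

// rbdVolumeEntry approximates the per-volume metadata the driver keeps.
type rbdVolumeEntry struct {
	VolName  string // CSI volume name from CreateVolume
	VolID    string // CSI volume handle returned to the CO
	Monitors string // Ceph monitor addresses
	Pool     string // RBD pool the image lives in
	AdminID  string // Ceph user used for provisioning
	VolSize  int64  // requested size in bytes
}

// rbdVolumes maps volume ID to the metadata needed to reach the image later,
// e.g. during DeleteVolume, which carries only the volume ID and secrets.
var rbdVolumes = map[string]rbdVolumeEntry{}
```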

@aibarbetta
Author

But both CreateVolumeRequest and CreateSnapshotRequest have a parameters map that includes that info (mon/pool/user), right? Why not use it from there?

@rootfs
Member

rootfs commented Jan 18, 2019

Yes, but that info is not present in the delete and stage requests. So CSI drivers have to persist it when volumes are created, and retrieve it from persistent storage during delete and stage.
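For reference, a simplified paraphrase of the relevant request shapes from the CSI spec (not the generated Go types) shows the asymmetry:

```go
package csisketch

// CreateVolumeRequest carries the StorageClass parameters (mon, pool, user, ...),
// so provisioning has everything it needs.
type CreateVolumeRequest struct {
	Name       string
	Parameters map[string]string
	Secrets    map[string]string
}

// DeleteVolumeRequest carries only the handle and secrets: no StorageClass
// parameters, hence the need to persist or otherwise recover mon/pool/user.
type DeleteVolumeRequest struct {
	VolumeId string
	Secrets  map[string]string
}
```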

@aibarbetta
Author

Ohh, I see that DeleteVolumeRequest does not have that info.
Another question then: I don't see the map getting updated between nodes at runtime (aside from the call to LoadExDataFromMetadataStore in the initialization). If a volume gets created on node A and a snapshot for that volume is requested on node B, how does B have the vol info in its map?

@rootfs
Member

rootfs commented Jan 19, 2019

Vol info should be retrieved from the configmaps if it is not in the local cache.
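A minimal sketch of that lookup order, assuming a persistent store backed by config maps; the interface and names here are illustrative only.

```go
package rbdcache

import "fmt"

// metadataStore stands in for the driver's persistent store, e.g. one backed
// by Kubernetes ConfigMaps.
type metadataStore interface {
	Get(volID string) (map[string]string, bool, error)
}

// lookupVolume checks the in-memory map first and falls back to the
// persistent store, repopulating the local cache on a hit.
func lookupVolume(volID string, local map[string]map[string]string, store metadataStore) (map[string]string, error) {
	if meta, ok := local[volID]; ok {
		return meta, nil // served from the in-memory map
	}
	meta, found, err := store.Get(volID)
	if err != nil {
		return nil, err
	}
	if !found {
		return nil, fmt.Errorf("volume %s not found in local cache or config maps", volID)
	}
	local[volID] = meta
	return meta, nil
}
```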

@gman0
Contributor

gman0 commented Feb 1, 2019

If a volume gets created on node A and a snapshot for that volume is requested on node B, how does B have the vol info in its map?

If I'm not mistaken, such a situation shouldn't occur, because both the provisioner and the snapshotter are deployed within the same pod.

@ShyamsundarR
Contributor

I had the same observation as the initial reporter when reading the code: the additional maps are not actually required, and they are a concern.

Thanks to @JohnStrunk for help with understanding k8s plumbing in this regard, and disclaimer ahead that I am a Ceph noob!

The problems, as I see it, with storing these config maps are:

  1. The maps can go out of sync with reality, as it really is a duplicate of config maintained elsewhere: one copy at Ceph and the other in the config maps
  • Say an rbd image was created but the config map update failed; there is a chance that this image would be leaked at the backend
  • This requires some form of reconciliation with the actual state as known by Ceph under such situations
  2. Any change to a global in the map (say the monitor list or secrets) requires that the entire map be updated and stored again
    NOTE: This is under the assumption that such events are possible under Ceph, i.e. changes to monitors or such

The solution rather seems to lie in how we abstract the various configuration elements into either the StorageClass (SC) or the CSI configuration. If the required parameters to reach the Ceph pool/cluster are part of the CSI configuration, they need not be stored in the said maps. Further, the StorageClass can now carry options that help create different rbd images with differing properties, as part of its "parameters", rather than information about the pool/cluster.

The Node Service in any case receives the information required to mount (and similar) from the RPC payload itself, so nothing changes for it.

This leads to a few things:
a) A single CSI driver per Ceph pool/cluster (for now let's keep FS and RBD separate in the discussion)
b) Hence its provisioner name would distinguish the multiple such CSI provisioners for each Ceph pool/cluster in the deployment
c) The ControllerService should have all the information for operations like Delete and Snapshot as part of the CSI configuration; the volumeID (and, if needed, the name) should suffice to operate on them
d) There can be multiple StorageClasses using the same CSI instance, but with differing parameters for rbd creation. This also makes it easier to manage these instances, as the Ceph pool configuration, if changed, need not be reflected in every SC that refers to it.

If the abstraction is changed as above, it seems better aligned with the overall purpose and avoids the maps. Also, it makes the CSI driver talk to a Ceph cluster based on its configuration, rather than the CSI driver being the funnel to talk to any Ceph cluster. Further, it makes the CSI driver nearly stateless: it just acts as a pass-through based on the configuration, and hence no reconciliation is needed (other than for the core cluster parameters).
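A rough sketch of what that abstraction could look like, with all names invented for illustration: cluster connection details come from the driver's own configuration, keyed by an identifier that can be recovered from the volume handle.

```go
package rbdconfig

import "fmt"

// ClusterInfo holds the connection details for one Ceph cluster; credentials
// would still come from a Secret rather than this struct.
type ClusterInfo struct {
	Monitors []string
	Pool     string
}

// clusters is populated from the CSI driver's own configuration at startup,
// keyed by a cluster identifier.
var clusters = map[string]ClusterInfo{}

// clusterFor resolves the connection details for a cluster ID carried
// (directly or encoded) in the volume handle, so no per-volume map is needed.
func clusterFor(clusterID string) (ClusterInfo, error) {
	ci, ok := clusters[clusterID]
	if !ok {
		return ClusterInfo{}, fmt.Errorf("no configuration for cluster %q", clusterID)
	}
	return ci, nil
}
```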

Thoughts?

@rootfs
Member

rootfs commented Feb 7, 2019

  1. The maps can go out of sync with reality, as it really is a duplicate of config maintained elsewhere: one copy at Ceph and the other in the config maps

Agree. Split brain is a constant theme in storage orchestration sadly.

  2. Any change to a global in the map (say the monitor list or secrets) requires that the entire map be updated and stored again
    NOTE: This is under the assumption that such events are possible under Ceph, i.e. changes to monitors or such

This is being addressed. The drivers can store both monitors and keyrings in the secret. When the monitors or keyring change, the controller only needs to update the secret that is used by the PVs, rather than every PV.
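As a minimal sketch of that flow, the controller would read the monitors and key out of the secret handed over with each request; the key names "monitors" and "key" below are illustrative, not necessarily the ones ceph-csi uses.

```go
package rbdsecrets

import "fmt"

// credsFromSecrets pulls the monitor list and key out of the secret map that
// the CSI sidecars pass with each request. A monitor change then means
// updating one Secret instead of every PV.
func credsFromSecrets(secrets map[string]string) (string, string, error) {
	monitors := secrets["monitors"]
	if monitors == "" {
		return "", "", fmt.Errorf("secret is missing the monitor list")
	}
	key := secrets["key"]
	if key == "" {
		return "", "", fmt.Errorf("secret is missing the keyring/key")
	}
	return monitors, key, nil
}
```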

The solution rather seems to lie in how we abstract the various configuration elements into either the StorageClass (SC) or the CSI configuration. If the required parameters to reach the Ceph pool/cluster are part of the CSI configuration, they need not be stored in the said maps. Further, the StorageClass can now carry options that help create different rbd images with differing properties, as part of its "parameters", rather than information about the pool/cluster.

If there are multiple Ceph clusters, we have to create one StorageClass per Ceph cluster.
The CSI spec doesn't limit how volumes are resolved, whether through a configmap or the volume handle itself. But in the Kubernetes CSI implementation, the backend connection info is populated at run time through external controllers.

@ShyamsundarR
Contributor

Had an offline discussion with @rootfs on the topic.

What it came down to was:

  1. Remove the existing rbdVolumes map, both in memory and as a config map store
  • This is to keep the CSI driver content lean, and make it more of a pass-thru than something that retains state that may go stale
  • Further, when volume counts increase, this map is also going to grow, and stashing it as a config map may prove challenging at scale
    • Even if the map contents were thinned out, at scale the number of entries would be large
  • Removing the map leaves us with 2 problems:
    • Where to persist the VolumeID to rbd relation for future reference (i.e. other operations like delete and snapshot)
    • Where to persist the required credentials and other Ceph cluster related data for delete and snapshot operations
  2. We were divided on a CSI driver per Ceph cluster vs. a single CSI driver with cluster information stored in a map with indexes (say) and the index encoded into the volumeID for reference on related operations (IOW a VolID encoded with enough information to find the required cluster information from a lookup table of sorts); a rough sketch of such an encoding follows at the end of this comment
  • The multiple CSI drivers option (i.e. one per Ceph cluster) needs analysis to understand if it is an anti-pattern, and how it would work
    • We also need to determine if we need as many NodeServices as CSI Controller instances; one aspect to consider is remaining at a single NodeService instance (so that we do not end up with a NodeService per Controller, occupying more resources on each node)
  • The config store for the different Ceph clusters, and the VolID encoding the index (say), needs further nailing down of the specifics
  3. Irrespective of the above, the VolID will need to be persisted on Ceph, either as part of the image name or as an attribute
  • An attribute makes finding an rbd image given a VolID tedious(?)
  • Naming the rbd image based on the VolID (and other fuzz) makes it more directly addressable given a VolID for most other operations
  • Further thought and clarity is required to close out the specifics around the same

@rootfs add any details that I may have missed out, thanks.
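As mentioned in point 2 above, here is a rough, purely hypothetical sketch of how a cluster index and a per-volume identifier could be encoded into the volume handle so that later operations need no volumes map; the format and function names are invented for illustration.

```go
package rbdvolid

import (
	"fmt"
	"strings"
)

// composeVolID builds a handle like "csi-rbd-<clusterIdx>-<uuid>". The UUID
// (or a derivative of it) would also name the rbd image, so the image is
// directly addressable given only the volume ID. clusterIdx must not contain
// '-' for this toy encoding to round-trip.
func composeVolID(clusterIdx, uuid string) string {
	return fmt.Sprintf("csi-rbd-%s-%s", clusterIdx, uuid)
}

// parseVolID recovers the cluster index and UUID from the handle; the cluster
// index is then used to look up monitors/credentials in the driver config.
func parseVolID(volID string) (clusterIdx, uuid string, err error) {
	parts := strings.SplitN(volID, "-", 4)
	if len(parts) != 4 || parts[0] != "csi" || parts[1] != "rbd" {
		return "", "", fmt.Errorf("malformed volume ID %q", volID)
	}
	return parts[2], parts[3], nil
}
```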

@rootfs
Member

rootfs commented Feb 8, 2019

@ShyamsundarR looking forward to your PR 👍

@Madhu-1
Collaborator

Madhu-1 commented Jul 16, 2019

This has been fixed in several PRs. Thanks @ShyamsundarR for fixing this.

Closing this one.

@Madhu-1 Madhu-1 closed this as completed Jul 16, 2019