
What is the advantage of maintaining the rbdVolumes map? #135

Closed

aibarbetta opened this issue Jan 17, 2019 · 11 comments

@aibarbetta

Hi all, I've read the conversation on #66 and the cm fix, but I still don't understand the advantage of having maps like rbdVolumes and rbdSnapshots.
For example, the rbdVolumes map is used in just two places:

  • On CreateVolume, to check if the given volume name already exists; couldn't the same thing be done with a call to rbd info or rbd status? (see the sketch below)
  • On CreateSnapshot, it seems to be used to retrieve the volume name from the volume ID. I understand that according to the CSI spec volName and volID MUST be different, but couldn't we just use a hash of volName as the volID to avoid saving it?
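For illustration, here is a minimal sketch of what such an existence check could look like when shelling out to the rbd CLI. The helper name and error handling are invented, and it already hints at the problem discussed in the replies below: you need monitors, pool and credentials before you can even ask.

```go
package main

import (
	"fmt"
	"os/exec"
)

// rbdImageExists is a hypothetical helper: it runs `rbd info pool/image` and
// treats any failure as "image not found". Real code would distinguish a
// missing image from auth or connectivity errors.
func rbdImageExists(pool, image, monitors, user, keyfile string) bool {
	cmd := exec.Command("rbd", "info", fmt.Sprintf("%s/%s", pool, image),
		"-m", monitors, "--id", user, "--keyfile", keyfile)
	return cmd.Run() == nil
}

func main() {
	// Note: monitors, pool and credentials must already be known here,
	// which is exactly what the CSI volume handle alone does not carry.
	exists := rbdImageExists("rbd", "pvc-1234", "192.168.0.1:6789", "admin", "/etc/ceph/keyfile")
	fmt.Println("image exists:", exists)
}
```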
@rootfs
Member

rootfs commented Jan 17, 2019

When calling rbd info or rbd status, we have to supply mon/pool/user, but that info is not in the CSI volume handle. So we have to persist a volume map, which includes meta info such as mon/pool/user. This is crucial when the compute node on which the provisioner runs goes down.
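To make that concrete, this is roughly the kind of metadata the map has to carry per volume; the field names below are approximations for illustration, not the exact ceph-csi struct.

```go
package rbdcache

// rbdVolumeEntry approximates the per-volume metadata the driver keeps.
type rbdVolumeEntry struct {
	VolName  string // CSI volume name from CreateVolume
	VolID    string // CSI volume handle returned to the CO
	Monitors string // Ceph monitor addresses
	Pool     string // RBD pool the image lives in
	AdminID  string // Ceph user used for provisioning
	VolSize  int64  // requested size in bytes
}

// rbdVolumes maps volume ID to the metadata needed to reach the image later,
// e.g. during DeleteVolume, which carries only the volume ID and secrets.
var rbdVolumes = map[string]rbdVolumeEntry{}
```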

@aibarbetta
Author

But both CreateVolumeRequest and CreateSnapshotRequest have a parameters map that includes that info (mon/pool/user), right? Why not use it from there?

@rootfs
Member

rootfs commented Jan 18, 2019

Yes, but that info is not present in the delete and stage requests. So CSI drivers have to persist it when volumes are created, and retrieve it from persistent storage during delete and stage.
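For reference, a simplified paraphrase of the relevant request shapes from the CSI spec (not the generated Go types) shows the asymmetry:

```go
package csisketch

// CreateVolumeRequest carries the StorageClass parameters (mon, pool, user, ...),
// so provisioning has everything it needs.
type CreateVolumeRequest struct {
	Name       string
	Parameters map[string]string
	Secrets    map[string]string
}

// DeleteVolumeRequest carries only the handle and secrets: no StorageClass
// parameters, hence the need to persist or otherwise recover mon/pool/user.
type DeleteVolumeRequest struct {
	VolumeId string
	Secrets  map[string]string
}
```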

@aibarbetta
Author

Ohh, I see that DeleteVolumeRequest does not have that info.
Another question then: I don't see the map getting updated between nodes at runtime (aside from the call to LoadExDataFromMetadataStore in the initialization). If a volume gets created on node A and a snapshot for that volume is requested on node B, how does B have the vol info in its map?

@rootfs
Member

rootfs commented Jan 19, 2019

Vol info should be retrieved from the configmaps if it is not in the local cache.
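A minimal sketch of that lookup order, assuming a persistent store backed by config maps; the interface and names here are illustrative only.

```go
package rbdcache

import "fmt"

// metadataStore stands in for the driver's persistent store, e.g. one backed
// by Kubernetes ConfigMaps.
type metadataStore interface {
	Get(volID string) (map[string]string, bool, error)
}

// lookupVolume checks the in-memory map first and falls back to the
// persistent store, repopulating the local cache on a hit.
func lookupVolume(volID string, local map[string]map[string]string, store metadataStore) (map[string]string, error) {
	if meta, ok := local[volID]; ok {
		return meta, nil // served from the in-memory map
	}
	meta, found, err := store.Get(volID)
	if err != nil {
		return nil, err
	}
	if !found {
		return nil, fmt.Errorf("volume %s not found in local cache or config maps", volID)
	}
	local[volID] = meta
	return meta, nil
}
```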

@gman0
Contributor

gman0 commented Feb 1, 2019

If a volume gets created on node A and a snapshot for that volume is requested on node B, how does B have the vol info in its map?

If I'm not mistaken, such a situation shouldn't occur, because both the provisioner and the snapshotter are deployed within the same pod.

@ShyamsundarR
Contributor

I had the same observation as the initial reporter when reading the code: the additional maps are not actually required, and they are a concern.

Thanks to @JohnStrunk for help with understanding k8s plumbing in this regard, and disclaimer ahead that I am a Ceph noob!

The problems, as I see it, with storing these config maps are:

  1. The maps can go out of sync with reality, as it really is a duplicate of config maintained elsewhere: one copy at Ceph and the other in the config maps
  • Say an rbd image was created but the config map update failed; there is a chance that this image would be leaked at the backend
  • This requires some form of reconciliation with the actual state as known by Ceph under such situations
  2. Any change to a global in the map (say the monitor list or secrets) requires that the entire map be updated and stored again
    NOTE: This is under the assumption that such events are possible under Ceph, i.e. changes to monitors or such

The solution rather seems to lie in how we abstract the various configuration elements into either the StorageClass (SC) or the CSI configuration. If the required parameters to reach the Ceph pool/cluster are part of the CSI configuration, they need not be stored in the said maps. Further, the StorageClass can now carry options that help create different rbd images with differing properties, as part of its "parameters", rather than information about the pool/cluster.

The Node Service in any case receives the information required to mount (and similar) from the RPC payload itself, so nothing changes for it.

This leads to a few things:
a) A single CSI driver per Ceph pool/cluster (for now let's keep FS and RBD separate in the discussion)
b) Hence its provisioner name would distinguish the multiple such CSI provisioners for each Ceph pool/cluster in the deployment
c) The ControllerService should have all the information for operations like Delete and Snapshot as part of the CSI configuration; the volumeID (and, if needed, the name) should suffice to operate on them
d) There can be multiple StorageClasses using the same CSI instance, but with differing parameters for rbd creation. This also makes it easier to manage these instances, as the Ceph pool configuration, if changed, need not be reflected in every SC that refers to it.

If the abstraction is changed as above, it seems better aligned with the overall purpose and avoids the maps. Also, it makes the CSI driver talk to a Ceph cluster based on its configuration, rather than the CSI driver being the funnel to talk to any Ceph cluster. Further, it makes the CSI driver nearly stateless: it just acts as a pass-through based on the configuration, and hence no reconciliation is needed (other than for the core cluster parameters).
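A rough sketch of what that abstraction could look like, with all names invented for illustration: cluster connection details come from the driver's own configuration, keyed by an identifier that can be recovered from the volume handle.

```go
package rbdconfig

import "fmt"

// ClusterInfo holds the connection details for one Ceph cluster; credentials
// would still come from a Secret rather than this struct.
type ClusterInfo struct {
	Monitors []string
	Pool     string
}

// clusters is populated from the CSI driver's own configuration at startup,
// keyed by a cluster identifier.
var clusters = map[string]ClusterInfo{}

// clusterFor resolves the connection details for a cluster ID carried
// (directly or encoded) in the volume handle, so no per-volume map is needed.
func clusterFor(clusterID string) (ClusterInfo, error) {
	ci, ok := clusters[clusterID]
	if !ok {
		return ClusterInfo{}, fmt.Errorf("no configuration for cluster %q", clusterID)
	}
	return ci, nil
}
```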

Thoughts?

@rootfs
Member

rootfs commented Feb 7, 2019

  1. The maps can go out of sync with reality, as it really is a duplicate of config maintained elsewhere: one copy at Ceph and the other in the config maps

Agree. Split brain is a constant theme in storage orchestration sadly.

  2. Any change to a global in the map (say the monitor list or secrets) requires that the entire map be updated and stored again
    NOTE: This is under the assumption that such events are possible under Ceph, i.e. changes to monitors or such

This is being addressed. The drivers can store both monitors and keyrings in the secret. When the monitors or keyring change, the controller only needs to update the secret that is used by the PVs, rather than every PV.
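As a minimal sketch of that flow, the controller would read the monitors and key out of the secret handed over with each request; the key names "monitors" and "key" below are illustrative, not necessarily the ones ceph-csi uses.

```go
package rbdsecrets

import "fmt"

// credsFromSecrets pulls the monitor list and key out of the secret map that
// the CSI sidecars pass with each request. A monitor change then means
// updating one Secret instead of every PV.
func credsFromSecrets(secrets map[string]string) (string, string, error) {
	monitors := secrets["monitors"]
	if monitors == "" {
		return "", "", fmt.Errorf("secret is missing the monitor list")
	}
	key := secrets["key"]
	if key == "" {
		return "", "", fmt.Errorf("secret is missing the keyring/key")
	}
	return monitors, key, nil
}
```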

The solution rather seems to lie in how we abstract the various configuration elements into either the StorageClass (SC) or the CSI configuration. If the required parameters to reach the Ceph pool/cluster are part of the CSI configuration, they need not be stored in the said maps. Further, the StorageClass can now carry options that help create different rbd images with differing properties, as part of its "parameters", rather than information about the pool/cluster.

If there are multiple Ceph clusters, we have to create one StorageClass per Ceph cluster.
The CSI spec doesn't limit how volumes are resolved, whether through a configmap or the volume handle itself. But in the Kubernetes CSI implementation, the backend connection info is populated at run time through external controllers.

@ShyamsundarR
Contributor

Had an offline discussion with @rootfs on the topic.

What it came down to was:

  1. Remove the existing rbdVolumes map, both in memory and as a config map store
  • This is to keep the CSI driver content lean, and make it more of a pass-thru than something that retains state that may go stale
  • Further, when volume counts increase, this map is also going to grow, and stashing it as a config map may prove challenging at scale
    • Even if the map contents were thinned out, at scale the number of entries would be large
  • Removing the map leaves us with 2 problems:
    • Where to persist the VolumeID to rbd relation for future reference (i.e. other operations like delete and snapshot)
    • Where to persist the required credentials and other Ceph cluster related data for delete and snapshot operations
  2. We were divided on a CSI driver per Ceph cluster vs. a single CSI driver with cluster information stored in a map with indexes (say) and the index encoded into the volumeID for reference on related operations (IOW a VolID encoded with enough information to find the required cluster information from a lookup table of sorts); a rough sketch of such an encoding follows at the end of this comment
  • The multiple CSI drivers option (i.e. one per Ceph cluster) needs analysis to understand if it is an anti-pattern, and how it would work
    • We also need to determine if we need as many NodeServices as CSI Controller instances; one aspect to consider is remaining at a single NodeService instance (so that we do not end up with a NodeService per Controller, occupying more resources on each node)
  • The config store for the different Ceph clusters, and the VolID encoding the index (say), needs further nailing down of the specifics
  3. Irrespective of the above, the VolID will need to be persisted on Ceph, either as part of the image name or as an attribute
  • An attribute makes finding an rbd image given a VolID tedious(?)
  • Naming the rbd image based on the VolID (and other fuzz) makes it more directly addressable given a VolID for most other operations
  • Further thought and clarity is required to close out the specifics around the same

@rootfs add any details that I may have missed out, thanks.
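As mentioned in point 2 above, here is a rough, purely hypothetical sketch of how a cluster index and a per-volume identifier could be encoded into the volume handle so that later operations need no volumes map; the format and function names are invented for illustration.

```go
package rbdvolid

import (
	"fmt"
	"strings"
)

// composeVolID builds a handle like "csi-rbd-<clusterIdx>-<uuid>". The UUID
// (or a derivative of it) would also name the rbd image, so the image is
// directly addressable given only the volume ID. clusterIdx must not contain
// '-' for this toy encoding to round-trip.
func composeVolID(clusterIdx, uuid string) string {
	return fmt.Sprintf("csi-rbd-%s-%s", clusterIdx, uuid)
}

// parseVolID recovers the cluster index and UUID from the handle; the cluster
// index is then used to look up monitors/credentials in the driver config.
func parseVolID(volID string) (clusterIdx, uuid string, err error) {
	parts := strings.SplitN(volID, "-", 4)
	if len(parts) != 4 || parts[0] != "csi" || parts[1] != "rbd" {
		return "", "", fmt.Errorf("malformed volume ID %q", volID)
	}
	return parts[2], parts[3], nil
}
```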

@rootfs
Member

rootfs commented Feb 8, 2019

@ShyamsundarR looking forward to your PR 👍

@Madhu-1
Collaborator

Madhu-1 commented Jul 16, 2019

This has been fixed in several PRs. Thanks @ShyamsundarR for fixing this.

Closing this one.

@Madhu-1 Madhu-1 closed this as completed Jul 16, 2019