
OCPEDGE-1147: feat!: init kubesan csi driver integration proposal #1620

Merged (12 commits) on Jul 16, 2024

Conversation


@jakobmoellerdev (Contributor) commented May 2, 2024

KubeSAN is a CSI plugin for Kubernetes that enables you to provision Block volumes
backed by a single, cluster-wide, shared block device (e.g., a single big LUN on a SAN).

Logical Volume Manager Storage (LVMS) uses the TopoLVM CSI driver to dynamically provision local storage on the OpenShift Container Platform clusters.

This proposal is about integrating the Subprovisioner CSI driver into the LVMS operator to enable the provisioning of
shared block devices on the OpenShift Container Platform clusters.

This enhancement will significantly increase the scope of LVMS, but it allows LVMS to gain the unique value proposition
of serving as a layered operator that offers LUN synchronization and provisioning capabilities.

openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on May 2, 2024
openshift-ci bot commented May 2, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

[Subprovisioner](https://gitlab.com/subprovisioner/subprovisioner)
is a CSI plugin for Kubernetes that enables you to provision Block volumes
backed by a single, cluster-wide, shared block device (e.g., a single big LUN on a SAN).


Is it worth calling out that subprovisioner actually takes a shared Volume Group, and that it is therefore up to the cluster admin whether to create that VG on top of a single LUN or alternatively to create the VG by concatenating several smaller LUNs each serving as an lvm PV?

For now, subprovisioner does expect that the entire VG is visible from each node; I don't know if lvm supports a degraded-mode shared VG where only portions of the VG are usable from a given node provided that the shared global lockspace is still visible to all nodes, but if we ever need to get into that level of granularity, the complexity multiplies. So sticking with the simpler "big LUN on a SAN as a single VG" is good enough.

(Alas - the confusion between lvm Physical Volume and kubernetes Persistent Volume will make it harder to use the term PV without qualification)

Contributor Author

> Is it worth calling out that subprovisioner actually takes a shared Volume Group, and that it is therefore up to the cluster admin whether to create that VG on top of a single LUN or alternatively to create the VG by concatenating several smaller LUNs each serving as an lvm PV?

This abstraction is something we should take away from the admin. We will likely have a 90% case (e.g. one big LUN, but I don't know) that we should default to, and allow a manual reconfiguration to smaller LUNs. That's how most OpenShift APIs are designed.

This concurs with

> So sticking with the simpler "big LUN on a SAN as a single VG" is good enough.

> (Alas - the confusion between lvm Physical Volume and kubernetes Persistent Volume will make it harder to use the term PV without qualification)

For LVMS users, a PV is always a PersistentVolume (in our docs), whereas lvm2 physical volumes always get written out in full. This enhancement is kind of hacked together, so if you spot any inconsistencies please point them out for correction!

- Create an enhancement to the "LVMCluster" CRD that can differentiate a deviceClass into a new type of shared storage that can be provisioned side by side with, or as an alternative to, the regular LVMS device classes managed by TopoLVM (see the sketch below).
- Create a productization for a LUN-backed CSI driver alternative to TopoLVM that allows for shared VG usage, especially in the context of virtualization.
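
To make the first goal above concrete, here is a minimal sketch of how the deviceClass API could be differentiated; the field and constant names below are hypothetical placeholders, not the final API.

// DeviceAccessPolicy selects which CSI driver backs a deviceClass (placeholder name).
type DeviceAccessPolicy string

const (
    // DeviceAccessPolicyNodeLocal keeps today's TopoLVM-backed behavior.
    DeviceAccessPolicyNodeLocal DeviceAccessPolicy = "NodeLocal"
    // DeviceAccessPolicyShared provisions volumes from a cluster-wide shared VG.
    DeviceAccessPolicyShared DeviceAccessPolicy = "Shared"
)

type DeviceClass struct {
    // Name of the device class.
    Name string `json:"name,omitempty"`
    // DeviceAccessPolicy selects the backing driver for this device class.
    // Defaults to NodeLocal (TopoLVM) when omitted.
    // +optional
    DeviceAccessPolicy DeviceAccessPolicy `json:"deviceAccessPolicy,omitempty"`
}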


Hmm. While subprovisioner itself wants a VG, I could see it being useful at the LVMS layer to want a bare block device visible to all nodes in the cluster, which LVMS will then format into the shared VG that subprovisioner wants. That is, while subprovisioner doesn't want to create the VG, LVMS might be the right place to set it up so that the cluster admin doesn't have to do any work for the VG either because LVMS does it under the hood.
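
If LVMS were to take on that setup work, a rough sketch of what vgmanager could run under the hood might look as follows, assuming it keeps shelling out to lvm2 as it does for node-local VGs; the function and names are illustrative only, and --shared requires lvmlockd to be configured on the nodes.

package vgmanager

import (
    "fmt"
    "os/exec"
)

// CreateSharedVG initializes an lvm2 PV on a block device that is visible from
// every node and creates a shared-access VG on top of it.
func CreateSharedVG(vgName, devicePath string) error {
    if out, err := exec.Command("pvcreate", devicePath).CombinedOutput(); err != nil {
        return fmt.Errorf("pvcreate %s failed: %w: %s", devicePath, err, out)
    }
    if out, err := exec.Command("vgcreate", "--shared", vgName, devicePath).CombinedOutput(); err != nil {
        return fmt.Errorf("vgcreate %s failed: %w: %s", vgName, err, out)
    }
    return nil
}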

Contributor Author

LVMS itself consists of 3 layers

  1. The Operator handling the LVMCluster and deploying high-level resources
  2. The vgmanager, operating on each node and preparing the nodes for the CSI driver where necessary
  3. The CSI driver, which takes a configuration prepared by operator+vgmanager and reacts to CSI calls made

For this setup, subprovisioner would need to be able to take care of formatting calls, because Filesystem Formatting on Mount is handled in the CSI driver and the operator or vgmanager cannot know when formatting / wiping must be run.
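
For illustration only (this is not KubeSAN's or TopoLVM's actual code), a CSI node plugin typically performs format-on-mount during NodeStageVolume roughly like this, which is why the decision to format or wipe can only be made inside the driver:

package driver

import (
    mount "k8s.io/mount-utils"
    utilexec "k8s.io/utils/exec"
)

// stageVolume mounts the device at the staging path, formatting it first only
// if no existing filesystem is detected. Neither the operator nor vgmanager
// sees this call, so only the CSI driver knows when formatting happens.
func stageVolume(devicePath, stagingPath, fsType string) error {
    mounter := &mount.SafeFormatAndMount{
        Interface: mount.New(""),
        Exec:      utilexec.New(),
    }
    return mounter.FormatAndMount(devicePath, stagingPath, fsType, nil)
}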


type ThinPoolConfig struct {
// Name of the thin pool to be created. Will only be used for node-local storage,
// since shared volume groups will create a thin pool with the same name as the volume group.

Subprovisioner creates its own thin pools within the shared volume group, but it is more along the lines of one thin pool per PVC rather than one for the entire volume group. Is this comment trying to give too much internal detail about what subprovisioner will do, in relation to the higher-level notion that a client of LVMS merely needs to choose between local (existing TopoLVM) and shared (new subprovisioner) implementations?

Contributor Author

I think this would eventually also result in #1620 (comment)


// NodeSelector chooses nodes on which to create the deviceclass
// +optional
NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"`

Will node selection work properly with the Subprovisioner CSI plugin? My understanding is that TopoLVM has implemented support to make pod scheduling work when volumes are available only on specific nodes. Subprovisioner has no scheduler awareness and just runs a CSI Node Plugin DaemonSet on all nodes, expecting any node to be able to run pods that access any volume.

I see three possibilities:

  1. NodeSelector already works with Subprovisioner, somehow. There's no work to do.
  2. NodeSelector doesn't work with Subprovisioner and must be nil for now.
  3. NodeSelector doesn't work with Subprovisioner yet but we figure out how to make Subprovisioner support it, so this syntax is allowed with Subprovisioner.

Contributor Author

The way to implement this is to add a node accessibility constraint in the CSI driver, start supporting node topologies via CSI, and at the same time use the node selector to deploy only on the nodes we want.
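
As a sketch of what that could look like on the CSI side, using the standard CSI Go bindings; the topology key, types, and field names here are placeholders, not KubeSAN's actual implementation:

package driver

import (
    "context"

    "github.com/container-storage-interface/spec/lib/go/csi"
)

// topologyKey is a placeholder domain; a real driver would choose its own.
const topologyKey = "topology.kubesan.example.com/node"

type nodeServer struct {
    csi.UnimplementedNodeServer
    nodeName string
}

// NodeGetInfo advertises the topology segment of this node so that the
// external-provisioner can restrict volume accessibility to the nodes
// selected by the deviceClass NodeSelector.
func (d *nodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
    return &csi.NodeGetInfoResponse{
        NodeId: d.nodeName,
        AccessibleTopology: &csi.Topology{
            Segments: map[string]string{topologyKey: d.nodeName},
        },
    }, nil
}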


It's possible that users would want the ability to use a subset of nodes, so I guess Subprovisioner should implement this.

@stefanha

Looks good to me.

@jakobmoellerdev changed the title from "[WIP] feat!: init subprovisioner csi driver integration proposal" to "[WIP] OCPEDGE-1107: feat!: init subprovisioner csi driver integration proposal" on Jun 3, 2024
@openshift-ci-robot commented Jun 3, 2024

@jakobmoellerdev: This pull request references OCPEDGE-1107 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.17.0" version, but no target version was set.

In response to this:

Subprovisioner is a CSI plugin for Kubernetes that enables you to provision Block volumes
backed by a single, cluster-wide, shared block device (e.g., a single big LUN on a SAN).

Logical Volume Manager Storage (LVMS) uses the TopoLVM CSI driver to dynamically provision local storage on the OpenShift Container Platform clusters.

This proposal is about integrating the Subprovisioner CSI driver into the LVMS operator to enable the provisioning of
shared block devices on the OpenShift Container Platform clusters.

This enhancement will significantly increase scope of LVMS, but allows LVMS to gain the unique value proposition
of serving as a valid layered operator that offers LUN synchronization and provisioning capabilities.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Jun 3, 2024
@jakobmoellerdev changed the title from "[WIP] OCPEDGE-1107: feat!: init subprovisioner csi driver integration proposal" to "[WIP] OCPEDGE-1147: feat!: init subprovisioner csi driver integration proposal" on Jun 18, 2024
@openshift-ci-robot commented Jun 18, 2024

@jakobmoellerdev: This pull request references OCPEDGE-1147 which is a valid jira issue.

In response to this:

Subprovisioner is a CSI plugin for Kubernetes that enables you to provision Block volumes
backed by a single, cluster-wide, shared block device (e.g., a single big LUN on a SAN).

Logical Volume Manager Storage (LVMS) uses the TopoLVM CSI driver to dynamically provision local storage on the OpenShift Container Platform clusters.

This proposal is about integrating the Subprovisioner CSI driver into the LVMS operator to enable the provisioning of
shared block devices on the OpenShift Container Platform clusters.

This enhancement will significantly increase scope of LVMS, but allows LVMS to gain the unique value proposition
of serving as a valid layered operator that offers LUN synchronization and provisioning capabilities.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jakobmoellerdev changed the title from "[WIP] OCPEDGE-1147: feat!: init subprovisioner csi driver integration proposal" to "OCPEDGE-1147: feat!: init subprovisioner csi driver integration proposal" on Jun 19, 2024
@jakobmoellerdev marked this pull request as ready for review on June 19, 2024 07:34
openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Jun 19, 2024
@jakobmoellerdev changed the title from "OCPEDGE-1147: feat!: init subprovisioner csi driver integration proposal" to "OCPEDGE-1147: feat!: init kubesan csi driver integration proposal" on Jun 20, 2024
@openshift-ci-robot commented Jun 20, 2024

@jakobmoellerdev: This pull request references OCPEDGE-1147 which is a valid jira issue.

In response to this:

KubeSAN is a CSI plugin for Kubernetes that enables you to provision Block volumes
backed by a single, cluster-wide, shared block device (e.g., a single big LUN on a SAN).

Logical Volume Manager Storage (LVMS) uses the TopoLVM CSI driver to dynamically provision local storage on the OpenShift Container Platform clusters.

This proposal is about integrating the Subprovisioner CSI driver into the LVMS operator to enable the provisioning of
shared block devices on the OpenShift Container Platform clusters.

This enhancement will significantly increase scope of LVMS, but allows LVMS to gain the unique value proposition
of serving as a valid layered operator that offers LUN synchronization and provisioning capabilities.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@suleymanakbas91 (Contributor)

/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jun 25, 2024

- **Downgrade**:
- Allow safe downgrades by maintaining backward compatibility. Downgrading from a kubesan-enabled version to a purely topolvm-enabled version should be a no-break operation for the topolvm part. For the kubesan part, the operator should ensure that the shared VGs can be cleaned up manually.
- Provide rollback mechanisms and detailed instructions to revert to previous versions. Ensure that downgrades do not result in data loss or service interruptions.
Member

It would be nice if we had a more detailed description of how data loss during rollbacks will be avoided (this is just stating that we will do so). I don't think we necessarily need to have a complete design in this EP, but I think that having a complete story for rollback and DR should be part of the graduation criteria for GA.

Contributor Author

I added a rollback routine to the design details section, covering both the expected automated and manual rollback procedures and their impact on data.
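
As one example of the manual cleanup path, the expectation would be along these lines; this is an illustrative Go wrapper around lvm2 commands, the names are placeholders, and the shared VG's lvmlockd lockspace must still be active on the node performing the cleanup.

package vgmanager

import (
    "fmt"
    "os/exec"
)

// RemoveSharedVG deletes all remaining logical volumes in the shared VG and
// then removes the VG itself, so a downgraded cluster is not left with
// orphaned shared storage. Data on those volumes is lost, which is why the
// rollback procedure requires volumes to be migrated or backed up first.
func RemoveSharedVG(vgName string) error {
    if out, err := exec.Command("lvremove", "-y", vgName).CombinedOutput(); err != nil {
        return fmt.Errorf("lvremove %s failed: %w: %s", vgName, err, out)
    }
    if out, err := exec.Command("vgremove", vgName).CombinedOutput(); err != nil {
        return fmt.Errorf("vgremove %s failed: %w: %s", vgName, err, out)
    }
    return nil
}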


#### Hypershift / Hosted Control Planes

N/A
Member

I think that this technology might be useful for Hypershift. Aside from the additional complexity involved, is there any reason that it's excluded?

Contributor Author

I don't think I follow. Hypershift as of now uses LVMS for its etcd; however, the only way a shared VG would be useful is if Hypershift were deployed on a cluster where etcd is stored on a LUN, which is kind of counterintuitive to having a distributed database in the first place.

Could you elaborate on what you think would be useful for Hypershift?

Contributor Author

If you're talking about having local storage access on all virtualized nodes, this is true for any ReadWriteMany CSI driver; however, it doesn't really impact the design of Hypershift in any way.

Member

> where etcd would be stored on a LUN

Off the top of my head, a SAN might have performance, availability, or cost characteristics which make it more attractive than local storage. Local storage is probably going to win on performance if cost is no object, but if a customer already has a high-performance SAN, I can see that being an option they might be interested in considering if we made it available.

Contributor Author

At that point in time, Hypershift becomes yet another cluster for us, so I don't think this will change the design. Of course we can notify them that this is coming, but I don't think they need special consideration.

Contributor Author

I guess to narrow this down: What would you expect me to add to this section? Possible benefits for usage in Hypershift?

Member

I asked the hypershift team for their opinion on slack, but I haven't received a response. I don't think this enhancement should be blocked on that though.

openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) on Jul 11, 2024
@jakobmoellerdev (Contributor Author)

/test markdownlint

@suleymanakbas91 (Contributor)

/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jul 16, 2024
@jerpeter1 (Member)

/approve


openshift-ci bot commented Jul 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerpeter1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jul 16, 2024

openshift-ci bot commented Jul 16, 2024

@jakobmoellerdev: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-merge-bot merged commit 2d2b025 into openshift:master on Jul 16, 2024
2 checks passed