feature: zvol as block volume without attaching to a pod #502

aep · 2024-02-08T18:24:48Z

Describe the problem/challenge you have

we're building a multi node replicated storage system on top of zfs.
ideally, we'd be able to just reuse zfs-localpv for the zfs part,
since it already works well and we'd effectively just redo all the work.

Describe the solution you'd like

the simple idea is that zfs-localpv would create a zvol with no filesystem on top,
and our custom daemonset takes over from there.

that should probably just be a few lines of code changes to zfs-localpv, which i will gladly figure out myself and open a PR
if the feature is acceptable.

or maybe its already possible and i just dont understand how to force creating the actual volume without attaching it to a pod.

can i just create a PersistentVolume directly? should i just attach all PVCs to my custom daemonset?
or is there another way to coerce zfs-localpv into directly creating the zvols?

the second, much grander idea would be to integrate the feature directly into zfs-localpv somehow,
but i'm not sure if there's much room in openebs, given that maya is probably the same product (except we want zfs underneath, not spdk).

orville-wright · 2024-02-08T20:41:32Z

Hi @aep - I run Prod Mgmt for the OpenEBS team. - You seem to have a strong opinion that you want the underlying storage block mode kernel to be ZFS and that you do not want SPDK as the block allocator/LVol/LVstor layer. - I'd like to understand why you want FS underneath?

What is it in ZFS that you like?
What is it in SPDK that you do not like?
What problem does ZFS solve for you that SPDK does not solve?

I know both technology stacks and Storage Data/Mgmt capabilities very well. - So feel free to be as technical as you need to be.

Thanks.

aep · 2024-02-08T21:15:15Z

hey @orville-wright

on a high level, spdk is fairly young and unproven, while ZFS has decades of proven stability.
We do mission critical system and ZFS is a well understood system for disaster recovery. We can restore any ZFS cluster from offsite within minutes, even after a building flooding.

Neither maya nor SPDK itself have the necessary tooling yet, like online snapshots , send/recv for offsite backup and recovery, encryption, bitrot protection. Even if they where done tomorrow, they need to be proven first. We're keeping an eye on maya in parallel. It's clearly the future. just not yet.

We currently run a classic multi-path SAS cluster with active-passive ZFS, but it needs to scale out and replaced with NVME. ceph is not a match for performance reasons.

Specifically the architecture currently planned is to have zfs on each node, then expose 3 zvols per volume over NVMEoF (we have a 400G RDMA fabric), mirroring them with mdraid or something similar on the node that is currently accessing it

orville-wright · 2024-02-09T18:57:58Z

As you probably know, SPDK is a very modern but complex storage/Data Mgmt tech stack thet is designed to run in Userspace.
It was developed by a very mature and credible NVMe storage engineering team who have world class NVMe Storage expertise. - a 60-person team at INTEL (NVMe software group). They continue to actively co-maintain & develop it.

Additionally, there are other companies that actively support fund large Engineering team to it - such as Nvidia (acquired Mellanox and inherited their NVME & RDMA SPDK storage software team), Nutanix, Microsoft, Samsung, SUSE, Oracle, RedHat and IBM. - So there's about +100 hardcore storage engineers from hardcore storage companies actively directing & developing SPDK. - That's pretty impressive IMHO.

Yes, the ZFS community is more mature thank SPDK, as ZFS has been around for longer. ZFS was first released in 2006, whereas SPDK was first released in 2013.

Re: Tooling
Mayastor has Snapshots and Clones in the current product today. We added that in August 2023. Via our SPDK integrated stack.

We are currently adding the ability to enable users to choose what type of backend 'Block Datastore' you want to managed storage from. The new options will be...

SPDK - Very High performance NVME optimized - (in the product today)
ZFS - Modern kernel native Data Mgmt with ZFS Streaming snapshots, compress, dedupe, encrypt, Zpool
LVM - Traditional kernel native Data Mgmt with Shared-Cluster-VG

LVM mode is being coded at the moment and is nearing completion. ZFS will come soon after that (in a few months).
Both ZFS mode and LVM mode will allow Local-PV and/or Replicated Mayastor vols to be provisioned throughout the Mayastor cluster. This has already been prototyped and proven to work as of today (in-house).

We're doing this to allow users like yourself to have the choice as to what Block Mgmt Back-end you want to use. (which one you are more comfortable with). Some folks like SPDK, others want ZFS and come prefer LVM.

We have around 70,000 users that have deployed our current ZFS-LocalPV stack as of today. So its been Battle tested in PROD, and well used. We're comfortable integrating that code into our Mayastor Nexus Fabric. (not a big eng task for us).

Our LVM stack has about 30,000 users that have deployed it globally as LVM-LocalPV. LVM isn't as popular as ZFS, but its older and more mature and slightly more I/O optimized in kernel scsi-layer performance (RAID & md layers). We see that very conservative users have a slight preference for LVM, along with users that cant/dont want to enable ZFS in their distro build.

Hope this helps

aep · 2024-02-09T20:13:52Z

ZFS - Modern kernel native Data Mgmt with ZFS Streaming snapshots, compress, dedupe, encrypt, Zpool

wait what, how did i miss that? thats amazing.
we would essentially rebuild half of maya anyway, so it makes MUCH more sense to contribute to maya.
how do i make this happen?

orville-wright · 2024-02-13T22:35:34Z

Well... you didn't miss it.
Our team has been internally discussing & working on it quietly for a while. We haven't been very public about it as we've been dealing with a bunch of CNCF admin & governance stuff.
** We'd love you to contribute - If your team knows the Mayastor code and knows ZFS ... then we'd love to have to participate on the ZFS project.

This is all new Mayastor functionality and comes under our new DiskPool concept...

Today a Mayastor DiskPool can only have SPDK LVols/BDev devices as its backend storage media.
We recently made the decision to extend the DiskPool to allow ZFS and LVM managed disks. We've had a PR ready for a while that focuses on extending DiskPools with LVM. For the last few weeks... our Head Architect @tiagolobocastro has been working on integrating the LVM code into Mayastor. He's made great progress. - Hopefully he can chime in and chat about his recent progress on this. @tiagolobocastro ?

We also want to integrate ZFS ZPools into Mayastor DiskPools.

When a user creates a Mayastor DiskPool, they'll basically declare it as being one of '3 possible storage backend types (SPDK, ZFS or LVM).... and we'll expose / integrate as much of that native functionality as possible via our Mayastor NVMe-TCP Fabric and Nexus.
The user will also declare whether they want the DiskPool volumes to be Replicated or Non-Replicated. As of today... all Mayastor volumes are Replicated; But, we want to offer users the Non-Replicated option (i.e. leverage our wildly successful Local-PV code natively in Mayastor which +200,000 users).

LVM is easier as it more mature and in every LINUX Kernel. The ZFS option is more complicated as we're not sure how many LINUX distro's include ZFS in the kernel by default? or can have ZFS installed by the user as Kernel mode. - (our preference is Kernel mode ZFS and not User-Mode ZFS, because user-mode ZFS is known to be slow and resource hungry). - We're still making some final decisions on this, but @tiagolobocastro 's LVM project is helping us to figure things out.

aep · 2024-02-13T23:46:54Z

All major distros have zfs packages but it will of course never be as well adopted as in-tree options.
Iirc Alpine and Ubuntu even ship it by default now.

btrfs and bcachefs are both intending to replace zfs with in tree options. We are keeping a close eye on bcachefs development.

Personally I feel like LVM offers no benefits over spkd. It's essentially the same design from an operational perspective. But I understand people may prefer it over spkd due to familiarity and existing tooling. Also might be actually more power efficient for low utilization scenarios? In our testing, the in kernel nvmeof target performed marginally better than spdk, but that might just be lack of tuning.

We have low familiarity with the Maya code base, but that's a matter of investment, which will start this summer. We're mostly doing golang, but we'll handle rust just fine I hope. Since we're already heavily invested into zfs, I'd love to become a major contributor specifically in that area. My guess is that once the LVM part is done, it's just a matter of adopting it to zfs.

In roughly 2 weeks I hope I can find some time to dig deeper into Maya. Would be great to have some pointers into the diskpool architecture. Or we can wait for it to be done. Not really much time.presure here.

orville-wright · 2024-02-14T18:40:27Z

I'll chat with @tiagolobocastro and @avishnu about the schedule for starting the ZFS work.

BTWE... Tiago pinged me last night to say that he finished the Phase-1 integration coding for LVM and its now live in Mayastor !!

Some capabilities are not fully enabled yet in the initial code, but its very usable and deployable as a new 2nd DiskPool type and Back-end... so that we can start getting feedback from users.

It gives us a good feel for how heavy the work is to enhance the DiskPool with new storage mgmt back-ends and expose the features of LVM and ZFS through Mayastor.

We will start the internal eval work on the ZFS code. There's some additional issues we have to answer surrounding ZFS... and a few key tech issues that are a somewhat deep down in stack regarding CPU + Mem + Polling/Interrupt resource mgmt for volumes that are Local-PV (no replicated) and not SPDK managed (ZFS or LVM managed). - That stuff requires very high familiarity with Maystor code and the low level architecture.

We should have some decisions & direction on ZFS within the next 2 weeks.

BTW... are you attending KubeCon Paris? - Our team is.

avishnu · 2024-10-01T11:40:29Z

Describe the problem/challenge you have

we're building a multi node replicated storage system on top of zfs. ideally, we'd be able to just reuse zfs-localpv for the zfs part, since it already works well and we'd effectively just redo all the work.

Describe the solution you'd like

the simple idea is that zfs-localpv would create a zvol with no filesystem on top, and our custom daemonset takes over from there.

that should probably just be a few lines of code changes to zfs-localpv, which i will gladly figure out myself and open a PR if the feature is acceptable.

or maybe its already possible and i just dont understand how to force creating the actual volume without attaching it to a pod.

can i just create a PersistentVolume directly? should i just attach all PVCs to my custom daemonset? or is there another way to coerce zfs-localpv into directly creating the zvols?

the second, much grander idea would be to integrate the feature directly into zfs-localpv somehow, but i'm not sure if there's much room in openebs, given that maya is probably the same product (except we want zfs underneath, not spdk).

@aep will it work if you specify volume mode as 'block' in the PVC and use the PVC from your custom daemon-set?
In the longer run, you could still contribute to ZFS support for Mayastor pools !!

aep changed the title ~~feature: zvol as block volume~~ feature: zvol as block volume without attaching to a pod Feb 8, 2024

orville-wright added the feature label Feb 13, 2024

Abhinandan-Purkait added help wanted Need help from community contributors. New Feature Request for new feature. and removed feature labels Jun 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature: zvol as block volume without attaching to a pod #502

feature: zvol as block volume without attaching to a pod #502

aep commented Feb 8, 2024 •

edited

Loading

orville-wright commented Feb 8, 2024 •

edited

Loading

aep commented Feb 8, 2024

orville-wright commented Feb 9, 2024

aep commented Feb 9, 2024

orville-wright commented Feb 13, 2024

aep commented Feb 13, 2024

orville-wright commented Feb 14, 2024

avishnu commented Oct 1, 2024

feature: zvol as block volume without attaching to a pod #502

feature: zvol as block volume without attaching to a pod #502

Comments

aep commented Feb 8, 2024 • edited Loading

orville-wright commented Feb 8, 2024 • edited Loading

aep commented Feb 8, 2024

orville-wright commented Feb 9, 2024

aep commented Feb 9, 2024

orville-wright commented Feb 13, 2024

aep commented Feb 13, 2024

orville-wright commented Feb 14, 2024

avishnu commented Oct 1, 2024

aep commented Feb 8, 2024 •

edited

Loading

orville-wright commented Feb 8, 2024 •

edited

Loading