Using uZFS for storing cStor Volume Data
OpenEBS helps with orchestrating several data engines that are implemented using the Container Attached Storage (CAS) architectural pattern for delivering data services such as replication, snapshots, backup, restore, encryption, and so forth to Kubernetes stateful workloads. CAS storage is itself a Kubernetes-native solution and is delivered and deployed as containers.
In this document, we focus on the Replica aspects of the cStor Data Engine. For details on how to use the cStor Data Engine, refer to the documentation.
The cStor Data Engine comprises two core layers:
- A Target that receives Block IO and replicates it to one or more Replicas running on different nodes over the network.
- A Replica that receives the IO from the Target over the network and persists it to local storage.
A Target is typically associated with one or more Replicas. The Target maintains two connections to each of its Replicas - one for management and another for data. IO is sent over the data connection, while the management connection is used for registration commands, snapshot/resize operations, and so forth.
The Target adds metadata to each IO that helps in recovering from node failures (where a replica may be unavailable for a time) and in adding new replicas and helping them reconstruct data from existing replicas. It uses a modified RAFT algorithm to maintain quorum and to determine that the data returned to the client is in fact the last written data.
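The following is a minimal C sketch of the idea behind this per-IO metadata. The struct and field names (cstor_io_hdr_t, io_num, and so forth) are illustrative assumptions rather than the actual libcstor wire format; they only show how a monotonically increasing IO number stamped by the Target lets a Replica compare its blocks against its peers during recovery.

```c
#include <stdint.h>

/*
 * Illustrative per-IO header (names are assumptions, not the real
 * libcstor definitions). The Target stamps every write with a
 * monotonically increasing IO number before sending it over the
 * data connection.
 */
typedef struct cstor_io_hdr {
    uint64_t io_num;   /* sequence number assigned by the Target */
    uint64_t offset;   /* byte offset within the block volume    */
    uint64_t len;      /* length of the payload that follows     */
} cstor_io_hdr_t;

/*
 * During rebuild, a degraded replica can decide whether its copy of a
 * block is stale by comparing the IO number it has persisted for that
 * block with the IO number reported by a healthy peer.
 */
static int
block_is_stale(uint64_t local_io_num, uint64_t peer_io_num)
{
    return (peer_io_num > local_io_num);
}
```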
The Replica API is exposed via the libcstor library, which provides the required primitives for registering with a given target and implements the data services API. cStor is an implementation of libcstor with iSCSI as the Target and uZFS as the Replica backend.
The uZFS project was started with the intent of making the advanced file system capabilities of ZFS available to the cStor Data Engine.
ZFS is designed to run as a kernel module, locally on a single node. The cStor Data Engine makes it possible to run ZFS in user space and to use a collection of such ZFS instances running on multiple nodes to provide replicated storage that is resilient against node failures. uZFS refers to the modifications made to ZFS to enable it to run without kernel dependencies. This document presents some of the design aspects behind the implementation of uZFS.
The Replica supports the following API:
- Read/Write IO
- Create or Delete Snapshot
- Sync/Flush
- Resize the Block Volume Size
- Rebuild from peer Replica
- Replica/Target connection management along with setting the correct state for the replica.
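The sketch below summarizes this API surface as C prototypes. The names and signatures (replica_read, replica_rebuild_from_peer, and so forth) are illustrative assumptions for readability; the real entry points live in libcstor and uZFS and may be organized differently.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct replica replica_t;   /* opaque replica handle (assumption) */

/* Data path */
int replica_read(replica_t *r, uint64_t offset, void *buf, size_t len);
int replica_write(replica_t *r, uint64_t offset, const void *buf,
                  size_t len, uint64_t io_num);
int replica_sync(replica_t *r);                 /* sync/flush */

/* Management path */
int replica_snapshot_create(replica_t *r, const char *snap_name);
int replica_snapshot_destroy(replica_t *r, const char *snap_name);
int replica_resize(replica_t *r, uint64_t new_size);

/* Rebuild and connection management */
int replica_rebuild_from_peer(replica_t *r, const char *peer_addr);
int replica_register_with_target(replica_t *r, const char *target_ip,
                                 uint16_t port);
```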
The Replica supports the following states:
- Init
- Degraded
- Rebuilding
- Healthy
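Expressed as a C enum (an illustrative sketch; the actual identifiers in uZFS/libcstor may be named differently), a replica typically starts in Init, becomes Degraded once connected to its target but not yet in sync, moves through Rebuilding while reconstructing data from peers, and ends up Healthy.

```c
/*
 * Replica states as an illustrative enum; names are assumptions,
 * not the actual uZFS/libcstor symbols.
 */
typedef enum replica_state {
    REPLICA_STATE_INIT,        /* created, not yet registered with a target    */
    REPLICA_STATE_DEGRADED,    /* connected, but data is not yet in sync       */
    REPLICA_STATE_REBUILDING,  /* reconstructing missing data from peer replicas */
    REPLICA_STATE_HEALTHY      /* fully in sync and serving IO                 */
} replica_state_t;
```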
ZFS is a battle-tested software RAID filesystem that is highly resilient against underlying device failures and efficient at supporting snapshots and clones through its copy-on-write (COW) design. ZFS supports both filesystem and block targets, exposed via the kernel; the block targets are also called ZVOLs.
uZFS is the modified version of ZFS, with the changes required to make it work as the cStor Replica. uZFS comprises a zrepl binary - the modified ZFS with the cStor Replica API implementation - and the associated management binaries zfs and zpool. All the artifacts of uZFS - namely zrepl, zpool and zfs - are packaged into a container image called openebs/cstor-pool.
The following changes have been made to the forked ZFS repository in uZFS:
- Updated libzpool to include zvol related capabilities, so that zvol functionality can be invoked from user space.
- Modified the zpool and zfs management APIs (create/delete/update of pools and zvols) to accept commands via unix domain sockets instead of IOCTLs. Similarly, updated the zpool and zfs libraries and binaries to invoke these unix domain socket APIs for performing management operations.
- Added wrapper APIs around the ZFS DMU layer for handling read/write of Block IO along with the cStor metadata (see the sketch after this list).
- Added a new object to the ZVOL to store the IO number per block. Added ZIL support to store the IO number along with the data in the ZIL and to perform ZIL replay. This introduced an on-disk format change for ZIL data.
- Implemented the libcstor Replica API, which in turn calls the ZFS APIs. The changes include:
- IO operations received at the Replica (libcstor) layer over the network are mapped to the cStor DMU wrapper APIs mentioned above.
- Management APIs related to resize or snapshots are mapped to the corresponding ZFS APIs.
- Metadata received from the target or management layer, such as the target IP address, is stored in ZFS attributes.
- Added the ability to perform IO using AIO by introducing a new type of vdev IO layer.
- Implemented the rebuilding of Replica data by reading from peer Replicas. An IO block can be reconstructed by comparing the IO numbers from the available replicas. This also involves making sure that snapshots are created/deleted or the replica is resized if such operations were performed while the replica was down.
- Enhanced the zpool and zfs stats commands to include details such as replica status, the status of the management connection with the target, and so forth.
- Modified the zfs send/recv commands to work with the modified uZFS.
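The following C sketch illustrates the idea behind the DMU wrapper mentioned above: a write path that persists both the data block and the IO number assigned by the Target. All names here (uzfs_write_block, the stand-in helpers, and so on) are illustrative assumptions rather than the actual uZFS functions.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch; all names are assumptions, not the actual uZFS
 * code. A write carries both the data and the IO number assigned by
 * the Target: the data goes into the ZVOL's data object and the IO
 * number into a per-block metadata object, so that ZIL replay and
 * rebuild can tell how recent each block is.
 */
typedef struct uzfs_zvol uzfs_zvol_t;   /* opaque zvol handle (assumption) */

/* Stand-in for the real DMU write of the data block. */
static int
store_data(uzfs_zvol_t *zv, uint64_t offset, const void *buf, size_t len)
{
    (void)zv; (void)offset; (void)buf; (void)len;
    return (0);   /* in uZFS this maps to a DMU write inside a transaction */
}

/* Stand-in for updating the per-block IO-number metadata object. */
static int
store_io_num(uzfs_zvol_t *zv, uint64_t offset, size_t len, uint64_t io_num)
{
    (void)zv; (void)offset; (void)len; (void)io_num;
    return (0);
}

/* Hypothetical wrapper invoked by the libcstor layer for each write IO. */
int
uzfs_write_block(uzfs_zvol_t *zv, uint64_t offset, const void *buf,
                 size_t len, uint64_t io_num)
{
    int err;

    err = store_data(zv, offset, buf, len);
    if (err != 0)
        return (err);

    return (store_io_num(zv, offset, len, io_num));
}
```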
uZFS dependents and usage
uZFS is used by OpenEBS cStor volumes as explained here. The container image is mainly used as follows:
- A cStor Pool Instance Pod is created on a node by providing a set of block devices on which it can store the ZFS data - pools and zvols.
- Management sidecars (cstor-pool-mgmt and m-exporter) that are built on top of the binaries generated in this repository for:
- CRUD operations on pools and volumes - via the zpool and zfs binaries that interface with uZFS over unix domain sockets (a minimal client sketch follows this list)
- Extracting metrics and status and exporting them as Prometheus metrics
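The sketch below shows, in C, how a management client such as the zfs or zpool binary might hand a command to zrepl over a unix domain socket, as described above. The socket path and the command/response framing shown here are assumptions for illustration only; the actual protocol is defined in this repository.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical socket path; the real path used by zrepl may differ. */
#define UZFS_MGMT_SOCK "/tmp/uzfs.sock"

int
main(void)
{
    struct sockaddr_un addr;
    char reply[4096];
    /* Example management command; the real framing is defined by uZFS. */
    const char *cmd = "zpool list";
    ssize_t n;
    int fd;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return (1);
    }

    memset(&addr, 0, sizeof (addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, UZFS_MGMT_SOCK, sizeof (addr.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof (addr)) < 0) {
        perror("connect");
        close(fd);
        return (1);
    }

    /* Send the command and print whatever zrepl answers. */
    if (write(fd, cmd, strlen(cmd)) < 0) {
        perror("write");
        close(fd);
        return (1);
    }

    n = read(fd, reply, sizeof (reply) - 1);
    if (n > 0) {
        reply[n] = '\0';
        printf("%s\n", reply);
    }

    close(fd);
    return (0);
}
```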
In addition, this repository also contains changes related to:
- Docker images for production and test builds: Two kinds of docker images are generated - one for production usage and the other for testing.
- Automated test suites: Tests have been added to cover use cases and scenarios, including error cases, CLI, and protocol-related ones. Automated tests are built using scripts, a uzfs_test binary similar to ztest, the gtest framework, and mocked clients/servers. Tests exist at multiple levels, i.e.,
- At the API level, performing IO by calling functions with dummy data
- At the protocol level, with a mocked client that sends commands over the network
- CLI tests
- Checking the integrity of written data
- Error injection based test cases
- Continuous Integration with Travis: Every PR raised goes through all the above automated tests in Travis.
- Continuous Integration with K8S environment: Every merged PR triggers a pipeline that verifies the production images in a Kubernetes environment with applications deployed using cStor volumes.