Measure dm-integrity IO performance using Fio.
This research is a personal project for summer vacation 2023.
No one wants to lose their data. Yet a gigantic cloud provider may ruthlessly close your account one day and never respond to your complaints. Or, like many cloud storage services that were popular for a while, the service shuts down and takes your data with it before you notice. So everyone should have their own storage to prepare for that day.
Linux offers a variety of options to meet our needs. We can build a RAID for redundancy and take consistent backups from LVM snapshots. Our data and our families' data are safe. Or are they? dm-raid, the software RAID implemented in the Linux kernel, can detect bad sectors and repair them, but it cannot correct bad data that appears in otherwise good sectors for some reason. If left as is, a correct backup could be overwritten with bad data. Just thinking about it is scary.
This phenomenon is called silent data corruption (also known as bit rot). It is caused by incomplete insulation, high-temperature environments, cosmic radiation, and so on. Although the probability is low, it is a non-negligible failure mode for those of us who want to maintain data reliably over the long term.
dm-integrity is a Linux device mapper target that provides per-block integrity checking. When dm-integrity detects bad data, the read fails, and dm-raid sees that failure as a bad sector. This allows dm-raid to repair the bad data by rebuilding it from the rest of the array. In addition, dm-integrity provides tamper detection when combined with dm-crypt, although this feature is still experimental. In any case, you can choose any filesystem you like on top. Many people may choose ext4, but it can be anything else.
Some filesystems have data correction capability (scrubbing) built in. On Linux, btrfs and OpenZFS (the Linux port of ZFS) are the most famous. They have RAID and snapshot capabilities and are expected to be simpler to configure and faster than a stack of device mappers. However, btrfs has long been said to have stability and performance problems. ZFS, on the other hand, will never be merged into the Linux kernel due to licensing issues.
The long-term storage we are aiming for has the following characteristics:
- It is the source storage, not the backup storage (so that corrupt data is never backed up)
- It is read more frequently than it is written
- Its amount of data increases endlessly
We assume several storage configurations that can easily be deployed at home and measure their performance (especially throughput). Other non-functional requirements are out of scope, because what I could not find elsewhere was a performance comparison. That is why I started this study.
Of course, each configuration should not be evaluated based solely on throughput. What matters most for long-term storage is reliability and maintainability rather than throughput. For example, the maturity of each component can be estimated from the initial release. The table below shows that ZFS is 18 years old since its first release, while dm-integrity is only 6 years old. It goes without saying which is more "stable".
Component | Initial release |
---|---|
LVM2 | 2002 (merged into Linux 2.5.45) |
dm-crypt | 2004 (merged into Linux 2.6.4) |
ZFS | 2005 (with OpenSolaris) |
dm-raid | 2006 (merged into Linux 2.6.18) |
ext4 | 2008 (merged into Linux 2.6.28) |
btrfs | 2009 (merged into Linux 2.6.29) |
OpenZFS | 2013 (first stable on Linux) |
dm-integrity | 2017 (merged into Linux 4.12) |
See other studies for non-functional requirements other than performance. The following articles will be very helpful for reliability.
In this study, I use a Lenovo ThinkStation P500 workstation. It is a bit old, but it was a great bargain at ¥30,000 in 2020. The additional storage drives cost much more than that.
Component | Description | Test drive |
---|---|---|
CPU | 4 Cores 8 Threads (Intel Xeon E5-1620 v3) | - |
RAM | 48 GB DDR4 ECC RDIMM | ✅ (`/dev/ram0`) |
SATA SSD #1 | 500 GB SATA SSD (Transcend SSD370 TS512GSSD370S) | ❌ |
SATA HDD #1 | 12 TB SATA HDD (WD Red Plus WD120EFBX) | ✅ (`/dev/sdb`) |
SATA HDD #2 | 12 TB SATA HDD (WD Red Plus WD120EFBX) | ✅ (`/dev/sdc`) |
SATA HDD #3 | 12 TB SATA HDD (WD Red Plus WD120EFBX) | ❌ |
SATA HDD #4 | 12 TB SATA HDD (WD Red Plus WD120EFBX) | ❌ |
NVMe SSD #1 | 1 TB NVMe SSD (WD Red SN700) | ✅ (`/dev/nvme0n1`) |
OS | Linux 6.1.0 (Debian 12.1 Bookworm) | - |
Install additional packages.
$ sudo apt install nvme-cli zfs-dkms zfsutils-linux
...
The versions of the packages are as follows.
$ apt list --installed 2>/dev/null | grep -E '^(btrfs-progs|cryptsetup-bin|lvm2|mdadm|zfs*)/'
btrfs-progs/stable,now 6.2-1 amd64 [installed,automatic]
cryptsetup-bin/stable,now 2:2.6.1-4~deb12u1 amd64 [installed,automatic]
lvm2/stable,now 2.03.16-2 amd64 [installed,automatic]
mdadm/stable,now 4.2-5 amd64 [installed,automatic]
zfs-dkms/stable,now 2.1.11-1 all [installed]
zfs-zed/stable,now 2.1.11-1 amd64 [installed,automatic]
zfsutils-linux/stable,now 2.1.11-1 amd64 [installed]
Disable write cache for SATA HDD #1~2.
$ sudo hdparm -W0 /dev/{sdb,sdc}
/dev/sdb:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdc:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
Disable write cache for NVMe SSD #1.
$ sudo nvme get-feature /dev/nvme0 -f 6
get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
$ sudo nvme set-feature /dev/nvme0 -f 6 -v 0
set-feature:0x06 (Volatile Write Cache), value:00000000, cdw12:00000000, save:0
$ sudo nvme get-feature /dev/nvme0 -f 6
get-feature:0x06 (Volatile Write Cache), Current value:00000000
Change the sector size of the NVMe SSD #1 to 4096 bytes.
$ sudo nvme id-ns -H /dev/nvme0n1 | grep LBA
[6:5] : 0 Most significant 2 bits of Current LBA Format Selected
[3:0] : 0 Least significant 4 bits of Current LBA Format Selected
[0:0] : 0 Metadata as Part of Extended Data LBA Not Supported
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better
$ sudo nvme format --lbaf=1 /dev/nvme0n1
...
$ sudo nvme id-ns -H /dev/nvme0n1 | grep LBA
[6:5] : 0 Most significant 2 bits of Current LBA Format Selected
[3:0] : 0x1 Least significant 4 bits of Current LBA Format Selected
[0:0] : 0 Metadata as Part of Extended Data LBA Not Supported
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better (in use)
Set the IO scheduler to `none` for SATA HDD #1~2.
$ cat /sys/block/{sdb,sdc,nvme0n1,ram0}/queue/scheduler
none [mq-deadline]
none [mq-deadline]
[none] mq-deadline
none
$ echo none | sudo tee /sys/block/{sdb,sdc}/queue/scheduler
none
$ cat /sys/block/{sdb,sdc,nvme0n1,ram0}/queue/scheduler
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
none
To save time, create small partitions on SATA HDD #1~2 and NVMe SSD #1. Also create a RAM drive.
$ sudo fdisk /dev/sdb
...
$ sudo fdisk /dev/sdc
...
$ sudo fdisk /dev/nvme0n1
...
$ sudo modprobe brd rd_nr=1 rd_size=8388608
...
All drives are ready.
$ sudo fdisk -l /dev/{sdb,sdc,nvme0n1,ram0}
Disk /dev/sdb: 10.91 TiB, 12000138625024 bytes, 23437770752 sectors
Disk model: WDC WD120EFBX-68
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: E0E7AD31-257B-644E-9741-F36E908EE3F5
Device Start End Sectors Size Type
/dev/sdb1 2048 16779263 16777216 8G Linux filesystem
Disk /dev/sdc: 10.91 TiB, 12000138625024 bytes, 23437770752 sectors
Disk model: WDC WD120EFBX-68
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: E291F514-A310-0649-A398-42A8DF431E0E
Device Start End Sectors Size Type
/dev/sdc1 2048 16779263 16777216 8G Linux filesystem
Disk /dev/nvme0n1: 931.51 GiB, 1000204886016 bytes, 244190646 sectors
Disk model: WD Red SN700 1000GB
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 66216399-9372-9A4C-B17D-5173CAE18E07
Device Start End Sectors Size Type
/dev/nvme0n1p1 256 2097407 2097152 8G Linux filesystem
Disk /dev/ram0: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
The following are candidate configurations for long-term storage, combining device mappers and filesystems widely used in Linux.
# | Configuration | Encryption | Redundancy | Scrubbing | Snapshot |
---|---|---|---|---|---|
1 | ext4 | ❌ | ❌ | ❌ | ❌ |
2 | ext4 on dm-integrity | ❌ | ❌ | ❌ | ❌ |
3 | ext4 on dm-integrity (no journal) | ❌ | ❌ | ❌ | ❌ |
4 | ext4 on dm-integrity (bitmap mode) | ❌ | ❌ | ❌ | ❌ |
5 | ext4 on dm-crypt | ✅ | ❌ | ❌ | ❌ |
6 | ext4 on dm-crypt (with dm-integrity, HMAC-SHA256) | ✅ | ❌ | ❌ | ❌ |
7 | ext4 on dm-crypt on dm-integrity | ✅ | ❌ | ❌ | ❌ |
8 | ext4 on dm-raid (RAID 1) | ❌ | ✅ | ❌ | ❌ |
9 | ext4 on dm-raid (RAID 1) on dm-integrity | ❌ | ✅ | ✅ | ❌ |
10 | ext4 on LVM | ❌ | ❌ | ❌ | ✅ |
11 | ext4 on LVM on dm-raid (RAID 1) | ❌ | ✅ | ❌ | ✅ |
12 | ext4 on LVM on dm-raid (RAID 1) on dm-integrity | ❌ | ✅ | ✅ | ✅ |
13 | ext4 on LVM on dm-raid (RAID 1) on dm-integrity on dm-crypt | ✅ | ✅ | ✅ | ✅ |
14 | btrfs | ❌ | ❌ | ❌ | ✅ |
15 | btrfs (RAID 1) | ❌ | ✅ | ✅ | ✅ |
16 | btrfs (RAID 1) on dm-crypt | ✅ | ✅ | ✅ | ✅ |
17 | ZFS | ❌ | ❌ | ❌ | ✅ |
18 | ZFS (RAID 1) | ❌ | ✅ | ✅ | ✅ |
19 | ZFS (RAID 1) on dm-crypt | ✅ | ✅ | ✅ | ✅ |
ext4 is the most common filesystem in Linux. In this study, ext4 will be used as a benchmark against other configurations. The above list states that ext4 does not have encryption capability, but in fact, ext4 has per-file encryption capability using fscrypt. However, it is out of scope due to differences in granularity.
dm-integrity has three methods to guarantee write integrity (journal, no journal, and bitmap mode). By default, journal is enabled, which is the most reliable but slower. No-journal is not crash-tolerant, and bitmap mode is less reliable than journal. Since reliability is the primary concern in this study, the journal is used in combination with other device mappers.
The default sector size is 512 bytes, regardless of the drive. Change to 4096 bytes to match the drive.
The default journal commit interval is 10 seconds. Change to 5 seconds to match ext4.
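For illustration, configuration #2 could be assembled roughly as follows (a sketch, not the exact commands driven by the Makefile; the mapper name hdd1-int is arbitrary and the partition is the 8 GB test partition created above):
$ sudo integritysetup format --sector-size 4096 /dev/sdb1
$ sudo integritysetup open --journal-commit-time 5000 /dev/sdb1 hdd1-int # commit time in milliseconds
$ sudo mkfs.ext4 /dev/mapper/hdd1-int
Configurations #3 and #4 would instead open the device with --integrity-no-journal or --integrity-bitmap-mode.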
dm-crypt performance-related parameters are tuned based on the Cloudflare blog post. The default encryption algorithm is aes-xts-plain64, and hardware acceleration for it is available on many CPUs.
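A sketch of how such a device might be created and opened with those tunings, assuming LUKS2 and a cryptsetup recent enough to support the workqueue flags (the mapper name is arbitrary):
$ sudo cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/sdb1
$ sudo cryptsetup open --perf-no_read_workqueue --perf-no_write_workqueue --persistent /dev/sdb1 hdd1-crypt
$ sudo mkfs.ext4 /dev/mapper/hdd1-crypt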
There is a special combination of dm-crypt and dm-integrity (#6). It provides both encryption and tamper detection using an HMAC. However, the hash function used in the HMAC (usually SHA-256) is not hardware-accelerated and depends heavily on CPU performance. Furthermore, TRIM on SSDs is not available for security reasons. This configuration is considered experimental and is included only for reference in this study.
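For reference, configuration #6 could be formatted roughly like this, letting cryptsetup manage the underlying dm-integrity device that stores the HMAC tags (a sketch; key sizes and other options may need adjustment):
$ sudo cryptsetup luksFormat --type luks2 --cipher aes-xts-plain64 --integrity hmac-sha256 /dev/sdb1
$ sudo cryptsetup open /dev/sdb1 hdd1-auth
$ sudo mkfs.ext4 /dev/mapper/hdd1-auth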
dm-raid can be built on top of dm-integrity as described in the introduction to detect and correct silent data corruption. In this study, the RAID level is RAID 1 because we have two HDDs. Since there is only one SSD and one RAM drive, RAID performance is not measured.
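One way to assemble such a stack with mdadm, which is among the installed packages (a sketch; mapper and array names are arbitrary):
$ sudo integritysetup format --sector-size 4096 /dev/sdb1
$ sudo integritysetup format --sector-size 4096 /dev/sdc1
$ sudo integritysetup open /dev/sdb1 hdd1-int
$ sudo integritysetup open /dev/sdc1 hdd2-int
$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/hdd1-int /dev/mapper/hdd2-int
$ sudo mkfs.ext4 /dev/md0
With this layout, a checksum mismatch detected by dm-integrity surfaces as a read error, and the md layer can rewrite the affected blocks from the healthy mirror.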
LVM can be combined with dm-raid and dm-integrity to provide capabilities similar to btrfs and ZFS described below. LVM can also build RAID directly on logical volumes (see lvmraid(7)), but that kind of RAID configuration is out of scope in this study.
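A minimal LVM stack for configuration #10 might look like this (volume group and LV names are arbitrary):
$ sudo pvcreate /dev/sdb1
$ sudo vgcreate vg0 /dev/sdb1
$ sudo lvcreate --name lv0 --extents 100%FREE vg0
$ sudo mkfs.ext4 /dev/vg0/lv0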
btrfs's default commit interval is 30 seconds. This is longer than the other filesystems (5 seconds for both ext4 and ZFS). To be fair, change it to 5 seconds.
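For example (a sketch; the mount point is arbitrary):
$ sudo mkfs.btrfs /dev/sdb1
$ sudo mount -o commit=5 /dev/sdb1 /mnt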
ZFS actively uses RAM as a cache. I measured the throughput beforehand, and it was as fast as a RAM disk even on HDDs. However, this is not a realistic scenario for long-term storage. Therefore, only metadata is cached in the test configuration.
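A sketch of the single-disk ZFS setup with the cache limited to metadata (the pool name tank and ashift=12 are illustrative choices):
$ sudo zpool create -o ashift=12 tank /dev/sdb1
$ sudo zfs set primarycache=metadata tank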
Measure the following throughput performance for each drive and configuration using fio. Parameters are based on CrystalDiskMark's "Peak Performance" profile.
Test | Read/Write | Block Size | Queue Size | Threads |
---|---|---|---|---|
`seq-1m-q8-t1-read` | Sequential Read | 1 MiB | 8 | 1 |
`seq-1m-q8-t1-write` | Sequential Write | 1 MiB | 8 | 1 |
`rnd-4k-q32-t16-read` | Random Read | 4 KiB | 32 | 16 |
`rnd-4k-q32-t16-write` | Random Write | 4 KiB | 32 | 16 |
See `tools/test.fio` for details.
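For orientation, the first test in the table corresponds roughly to a fio invocation like the following (a sketch; the actual job definitions live in `tools/test.fio`):
$ sudo fio --name=seq-1m-q8-t1-read --filename=/dev/sdb1 --rw=read --bs=1M --iodepth=8 --numjobs=1 --ioengine=libaio --direct=1 --time_based --runtime=10 --group_reporting # WARNING: Change the filename to your environment.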
Install additional packages.
$ sudo apt install fio make
Run all tests.
$ export HDD1_DEV=/dev/sdb1 # WARNING: Change to your environment.
$ export HDD2_DEV=/dev/sdc1 # WARNING: Change to your environment.
$ export SSD1_DEV=/dev/nvme0n1p1 # WARNING: Change to your environment.
$ export RAM1_DEV=/dev/ram0 # WARNING: Change to your environment.
$ make test
...
Nothing special. All I do is plot the throughput per test result on a bar chart using R.
Install additional packages.
$ sudo apt install jq r-base r-cran-dplyr r-cran-ggplot2 r-cran-gridextra
Generate CSV and SVG from JSON of test results.
$ make report
...
For details, see the `out.sample/` directory.
HDDs are still the mainstream storage devices for bulk data. Let's take a quick look.
- Combined with dm-integrity, sequential writes are 60% slower. When using no journal or bitmap mode, sequential writes are 10% slower.
- dm-crypt does not affect performance.
- Combined with RAID 1, random reads are 100% faster.
- LVM does not affect performance.
- btrfs is as good as ext4 (at least on HDDs).
- ZFS is generally slower. Especially for random access, it is 40~60% slower than ext4.
The slowdown in dm-integrity due to journaling is to be expected.
Contrary to general reputation, ZFS is slow. One possible reason for this may be that the cache is limited to metadata in this test.
Fewer results were obtained because RAID was not tested on the SSD and RAM drives.
In contrast to HDDs, there are large performance differences between configurations. Also note the configurations that show little difference between the SSD and the RAM drive: although the two devices differ greatly in raw bandwidth, the results do not scale with that difference, which indicates that the bottleneck has shifted from IO to CPU.
- dm-integrity throughput hits a ceiling regardless of the underlying device.
- dm-crypt hits a ceiling too, at an even lower level than dm-integrity.
- btrfs is slower than ext4.
- ZFS is even slower than btrfs.
Fast IO requires a fast CPU. Resources at home are very limited, so there is a strong tradeoff.
Let's compare the differences between the dm-integrity options. For practical comparison, RAM drives are excluded.
- The performance difference between the options is seen only in sequential writes.
- Sequential writes are 100% faster when using no journal or bitmap mode.
The performance of dm-integrity is practical enough. Write throughput is not critical for our long-term storage, since reads matter more, so it is not worth sacrificing reliability for faster writes.
Let's compare filesystems with data correction capabilities. Again, RAM drives are excluded.
- On sequential writes, ext4 on dm-integrity reaches less than half the throughput of plain ext4, but the other tests are close.
- btrfs is about the same as ext4 except for random writes on SSDs.
- ZFS is slower than all the others. The improvement on SSDs is not worth the IO performance.
ext4 on dm-integrity and btrfs are good for our long-term storage. On HDDs, btrfs is the best. Its random writes are slow on SSDs, but could be improved by increasing the commit interval (we reduced it from the default of 30 seconds to 5 seconds in this test).
ZFS should be used for other workloads with high cache hit ratio.
We will call a configuration that has all the features required for our long-term storage (encryption, redundancy, scrubbing, and snapshots) "full-featured". The following three configurations qualify.
# | Configuration |
---|---|
13 | ext4 on LVM on dm-raid (RAID 1) on dm-integrity on dm-crypt |
16 | btrfs (RAID 1) on dm-crypt |
19 | ZFS (RAID 1) on dm-crypt |
As you can see, btrfs is the best.
In recent years, Internet connection speeds in excess of 1 Gbps have become common. In the near future, cloud storage services that are faster than home storage may appear. But for now, home storage is still faster and cheaper. And it's yours forever.
Performance is only one of the metrics to evaluate, but it is the metric you care about most during normal operations. Hopefully this study will be useful for someone building better long term storage.