
Highly inefficient use of space observed when using raidz2 with ashift=12 #548

Closed
ryao opened this issue Feb 2, 2012 · 25 comments
Labels: Type: Documentation, Type: Performance

Comments

@ryao
Contributor

ryao commented Feb 2, 2012

I am using the latest Git code, e29be02, on a VMware Player VM. I booted the VM using the Ubuntu Linux 11.10 LiveCD, with Linux 3.0. The VM contains 6x1GB disks in a raidz2 pool with ashift=12. The pool reports 4GB of usable space after creation. I unpacked a copy of the Portage tree, which requires about 672M on ext4 with a 4K block size, onto a ZFS dataset in the pool, where it required 1.5GB of space.
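For anyone who wants to reproduce a similar layout, here is a rough sketch using file-backed vdevs (paths and pool name are placeholders; my actual test used the VM's virtual disks):

for i in 1 2 3 4 5 6; do truncate -s 1G /tmp/disk$i; done
zpool create -o ashift=12 testpool raidz2 /tmp/disk1 /tmp/disk2 /tmp/disk3 /tmp/disk4 /tmp/disk5 /tmp/disk6
zfs create testpool/portage   # dataset that receives the Portage tree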

I then tried using zvols, and I had similar results. The storage requirements on the pool are consistently double the storage requirements of the actual hosted filesystems:

(df reported usage) - (zfs list reported size)/(size provided at creation time) - (filesystem and mkfs options)
743688 - 1.58G/1G - ext4 (normal, zvol) -E discard -N 1048576
688912 - 1.58G/1G - ext4 (extra options, zvol) -E discard -N 1048576 -I 128 -m 0 -O ^ext_attr,^resize_inode,^has_journal,^large_file,^huge_file,^dir_nlink,^extra_isize
687856 - 1.40G/1G - ext4 (extra options, zvol) -E discard -N 262144 -I 128 -m 0 -O ^ext_attr,^resize_inode,^has_journal,^large_file,^huge_file,^dir_nlink,^extra_isize
301684 - 607M/1G - reiserfs (default options, zvol)

You can obtain a snapshot of the Portage tree to verify my results from the following link:

http://mirrors.rit.edu/gentoo/snapshots/portage-latest.tar.xz

I am linking to the latest tarball rather than the current dated one, mostly because dated tarballs are not hosted for particularly long. I expect that others can reproduce my findings regardless of whether the exact same snapshot is used.

I also tried making a 512MB file, formatting it as reiserfs, and mounting it via the loopback device. I then extracted the Portage tree into it. Afterward, I examined the disk space used: 513MB.

Lastly, I tried this on my physical system with a 2GB sparse file on top of ext4, used as a single-disk ZFS pool with ashift=12 and without any raidz, mirroring, or striping. The occupied space reported by df was 830208 (1K blocks), which is a dramatic improvement over raidz2.

I thought I paid the price of parity at the beginning when 1/3 of my array's space was missing, but it seems that I am paying for it twice, even when using zvols which I would expect space-wise to be the equivalent of a giant file. I pay once at pool creation and then again when I do many small writes. Does anyone have any idea why?

@ryao
Contributor Author

ryao commented Feb 2, 2012

rlaager and I were able to narrow things down in IRC. It seems that a single disk pool is fine with ashift=9 and ashift=12. raidz2 is also fine when ashift=9, but when ashift=12, space requirements explode.

I did an unpack of the portage tree on a raidz2 ashift=9 pool that I made on my VM host. It used only 436MB:

rpool 437M 3.49G 436M /rpool

I also tried a 1GB zvol formatted as ext4 with 2^20 inodes:

/dev/zd0 786112 743000 0 100% /rpool/portage

That is consistent with the host usage and it seems that the 5% space reserved for root enabled the extraction to run to completion.

The explosion in disk usage seems to be caused by a bad interaction between ashift=12 and raidz.

@behlendorf
Contributor

You're absolutely right, this is something of a known, although not widely discussed, issue for ZFS, and it's one of the reasons why we've left ashift=9 as the default. Small files will balloon the storage requirements; for large files things should be much more reasonable.

@ryao
Contributor Author

ryao commented Feb 2, 2012

behlendorf, would you clarify why this affects not only small files on a ZFS dataset, but also small files stored on a zvol formatted with a completely different filesystem?

My understanding of a zvol was that it should reserve all of the space that it would ever use and never grow past that unless explicitly resized.

@behlendorf
Contributor

You see the impact when using a zvol because they default to an 8k block size. If you increase the zvol block size the overhead will decrease. It's analogous to creating files in a zfs filesystem with an 8k block size, which is what happens when you create a small file.

As for zvols, they don't reserve their space at creation time. While you do set a maximum volume size, they should only allocate space as they are written to, like any other object. If you want the behavior you're describing for your zvol, you need to set a reservation: zfs set reservation=N dataset.
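For example, something along these lines (pool and volume names are placeholders):

# Larger block size reduces the per-block raidz padding overhead
zfs create -V 1G -o volblocksize=64K tank/vol1
# Reserve the full logical size up front so other datasets cannot consume it
zfs set reservation=1G tank/vol1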

@ryao
Contributor Author

ryao commented Feb 2, 2012

That would explain why there appears to be a factor of 2 discrepancy between what the filesystem on the zvol reported and what the zvol actually used.

I suspect that when ashift=12, the zvol will allocate two blocks and only use one (i.e. zero pad it), as opposed to the typical unaligned write behavior where a 4KB logical sector would map to either the upper or lower half of an 8KB physical sector.

@rlaager
Member

rlaager commented Feb 2, 2012

behlendorf: When creating a zvol, a refreservation (a reservation on zpool version <= 8) is created by default. This is covered in the zfs man page and matches my experience with Solaris 11 Express as well as a quick test just now with ZFS on Linux. To get the behavior you describe, you need to add the -s option: zfs create -s -V <size> <dataset_name>
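For example (pool name is a placeholder), the difference shows up in the refreservation property:

# Non-sparse (default): a refreservation equal to the volume size is set
zfs create -V 1G tank/vol_thick
zfs get refreservation tank/vol_thick
# Sparse: -s skips the refreservation, so space is only allocated as it is written
zfs create -s -V 1G tank/vol_thin
zfs get refreservation tank/vol_thin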

@rlaager
Member

rlaager commented Feb 2, 2012

gentoofan: Are you seeing writes to a zvol (with the default refreservation) fail with ENOSPC? If so, that seems like a bug (i.e. the reservation isn't properly accounting for the worst-case overhead).

@ryao
Contributor Author

ryao commented Feb 2, 2012

I did some more tests. The following is on ashift=12, and I can't see any difference in terms of reported available space when setting a reservation:

localhost ~ # zfs create -V 400M -o reservation=400M rpool/test
localhost ~ # zfs list rpool/test
NAME USED AVAIL REFER MOUNTPOINT
rpool/test 413M 2.08G 144K -
localhost ~ # zfs destroy rpool/test
localhost ~ # zfs create -V 400M rpool/test
localhost ~ # zfs list rpool/test
NAME USED AVAIL REFER MOUNTPOINT
rpool/test 413M 2.08G 144K -

I also tested making a zvol on ashift=9 and I observed the 1.5GB space usage that I had seen on ashift=12. I then repeated with ext4 on top of 'zfs create -V 1G -o volblocksize=4K rpool/ROOT/portage' and the space usage of the zvol correlated with the space usage reported by ext4. Specifically:

root@ubuntu:# df /dev/zd0
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/zd0 917184 687852 229332 75% /mnt/gentoo/usr/portage
root@ubuntu:# zfs list rpool/ROOT/portage
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/portage 1.03G 1.35G 817M -

Accounting for FS overhead, 2^20 KB - 229332 KB = 819244 KB, or approximately 800M. Allowing for ZFS's internal bookkeeping, 17MB of overhead seems reasonable.

I will retest with ashift=12 soon, although given that I had reproduced the 1.5GB zvol usage with ashift=9, I suspect that toggling this switch will fix things. There is still the issue of why a 1GB zvol with ashift=9 uses 1.5GB to store an ext4 filesystem containing files that a ZFS dataset with ashift=9 only needs 347MB to store.

@ryao
Contributor Author

ryao commented Feb 3, 2012

Dagger2, rlaager, and dajhorn worked this out in IRC. The issue is that ashift=12 enforces a 4KB minimum allocation size. The two parity blocks required by raidz2 are therefore both 4KB in size.

The zvol has a default block size of 8KB, so 2x4KB are written as data along with 2x4KB of parity. Since the corresponding parity blocks have been consumed, the other two data blocks in the stripe are marked as in use, even though they aren't storing anything. The consequence is that the smallest amount of data that can be written to a raidz pool is (disks - raidz level) * 2^ashift, which in my situation is 16KB.

This explains why filesystems on the zvol and on a ZFS dataset would require roughly the same amount of space, despite requiring much less on a single physical disk.
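A quick back-of-the-envelope check of that formula for my layout (the numbers below are just this pool's geometry):

# smallest allocation = (disks - raidz level) * 2^ashift
disks=6; parity=2; ashift=12
echo $(( (disks - parity) * (1 << ashift) ))   # 16384 bytes, i.e. 16KB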

@ryao
Contributor Author

ryao commented Feb 3, 2012

rlaager, I have not observed any write failures, although I imagine one would occur if I made a 3GB zvol in my configuration with the default 8KB volblocksize and then proceeded to fill it.

By the way, I had been sitting on the comment I made immediately after yours while I was running tests, so I didn't see your comment until now.

@rlaager
Member

rlaager commented Feb 3, 2012

I just tested this scenario. I created a zpool with ashift=9 on six 300 GiB files. I ran zfs create -V 1G tank/foo. I filled it with dd if=/dev/zero of=/dev/zd0. This worked (I believe), as it showed me having 111M AVAIL left on tank/foo. I then repeated the process with ashift=12. Partway through, I started seeing errors like this:
[85659.830922] Buffer I/O error on device zd0, logical block 37960
[85659.830923] lost page write due to I/O error on zd0

dd completed without error. So, as far as I can tell, there are two bugs here:

  1. when this error occurs, it's not being relayed to the application as a failed write.
  2. reservations do not properly take into account the blocksize and ashift values under some conditions (this being an example). In the ashift=12 case, the reservation would have to be much larger with the default volblocksize of 8K, which means that the zvol creation should have failed.

@ryao
Contributor Author

ryao commented Feb 3, 2012

rlaager, I believe your issue is what I described earlier, which involves a bad configuration. Once a zvol with such a configuration exists, there is not much that the code can do about it. If the actual volblocksize < (disks - raidz level) * 2^ashift, things like this can happen when you try to fill the zvol.

It might be worthwhile to make the default volblocksize vary depending on the raidz level and number of disks to prevent users from hitting these configurations by default. It might also be worthwhile to refuse to make zvols that violate this constraint. Of course, that won't help people who have pre-existing zvols that were made by older modules or other implementations.

@ryao
Contributor Author

ryao commented Feb 3, 2012

After talking about this with rlaager in IRC for a bit, I would like to suggest that the zvol code be patched to accomplish 4 things, where formula = (disks - raidz_level) * 2^ashift:

  1. Set the default volblocksize to max(2^(4 + ashift), formula[vdev0], formula[vdev1]...).
  2. Refuse to create a zvol whose volblocksize < formula for any vdev in the pool.
  3. Make zvols whose volblocksize < formula for any vdev read-only devices at pool import time.
  4. Refuse to add a vdev if doing so would cause an existing zvol's volblocksize to fall below formula.

That would prevent the issues we encountered from occurring.
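A rough sketch of what the proposed default from item 1 would work out to for the pool discussed here (purely illustrative, not actual patch code):

# proposed default: max(2^(4 + ashift), (disks - raidz_level) * 2^ashift over all vdevs)
ashift=12
floor=$(( 1 << (4 + ashift) ))           # 2^(4+ashift) = 64K for ashift=12
vdev0=$(( (6 - 2) * (1 << ashift) ))     # single 6-disk raidz2 vdev -> 16K
default=$floor
[ "$vdev0" -gt "$default" ] && default=$vdev0
echo "$default"                          # 65536, i.e. 64K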

@rlaager
Member

rlaager commented Feb 3, 2012

Assuming that our idea of "formula" is correct (which probably needs more testing):

3: we should print a kernel message. Also, we should implement the "readonly" part by setting readonly=on on the zvol, which would allow the admin to override it. Imagine, "I upgraded ZFS on Linux and rebooted. All of my virtual machines failed because their zvols went readonly." They can continue at the same risk as before (though now it's known) until they have time to recreate the zvols with a different volblocksize and transfer the data.

4: If you want this to be constant-time (and not have to iterate over the zvols to check), then make the condition "if it would raise the default volblocksize". This is just as safe, but may have false positives (i.e. it may refuse the addition even though no zvols exist with a volblocksize less than the new value).

We might want to relax the "zvol" conditions to "non-sparse zvols", where "non-sparse" means "with a reservation or refreservation". If I have a 6-disk raidz ashift=12 pool with a volblocksize=4k zvol for a VM's swap or volblocksize=8k for a database, I might want to waste space in exchange for the performance advantage of avoiding read-modify-write at the zvol level. Sparse zvols are already subject to failure if the pool fills up (and thus discouraged by the man page), so while the increased disk consumption might be a surprise, it's not violating any guarantee that ZFS made.

@Rudd-O
Contributor

Rudd-O commented Apr 25, 2012

Is there going to be a bug fix for this issue? I am being affected by this, in such a way that in a pool with compression (1.7x) and dedupe (1.55x) enabled, the storage size is about THE SAME as it was on the old NetApp Filer (which is outrageously big).

Six-disk RAIDZ2 pool here.

@Rudd-O
Contributor

Rudd-O commented Apr 25, 2012

Also I see a discrepancy between the total pool size in zpool list and zfs list (used + avail).

@behlendorf
Contributor

No work is currently planned to address this issue.

@pyavdr
Contributor

pyavdr commented Dec 13, 2012

@behlendorf
4K disks are mainstream now, even at the enterprise level. Vanished disk space (see also issue #1089) is a costly problem that prevents the use of 4K disks with raidz2, even without zvols. Inefficient use of disk space is a serious weakness for a filesystem, even if it is ZFS. With larger disks (5TB and growing) this issue becomes severe, because small pools need a raidz2/raidz3 configuration and other filesystems can provide equivalent redundancy more cheaply, without wasting disk space. I would like to see an efficient ZFS, so please plan to solve this in the coming year.

@fa2k

fa2k commented Apr 4, 2013

I thought that if a process wrote a 16K buffer to a zvol with volblocksize=4K, it would be considered a single block and spread across 4 drives if available. My testing shows something else: it seems to split the data into 4K blocks and use short stripes even when writing 16K buffers.

E.g. on a 5 disk raidz, create 3 test volumes; one with volblocksize=16K and 2 with volblocksize=4K, then use dd to write in 4K or 16K blocks:

zfs create -V 40G -o volblocksize=16K ypool/test_vb16_dd16

dd if=/dev/urandom of=/dev/zvol/ypool/test_vb16_dd16 bs=16K

zfs create -V 40G -o volblocksize=4K ypool/test_vb4_dd16

dd if=/dev/urandom of=/dev/zvol/ypool/test_vb4_dd16 bs=16K

zfs create -V 40G -o volblocksize=4K ypool/test_vb4_dd4

dd if=/dev/urandom of=/dev/zvol/ypool/test_vb4_dd4 bs=4K

zfs list

NAME                  USED   AVAIL  REFER  MOUNTPOINT
ypool/test_vb16_dd16  48.4G  2.56T  48.4G  -
ypool/test_vb4_dd16   65.9G  2.56T  65.9G  -
ypool/test_vb4_dd4    65.9G  2.56T  65.9G  -

This may be what is expected, but then it seems bad to use the default volblocksize of 8K. I get 30% better random 4K read IOPS when using volblocksize=4K instead of 16K, so there is an argument for using a small volblocksize, but the hybrid approach of combining large write buffers seems better if possible.
(Sorry if this was incomprehensible.)
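(For reference, the IOPS comparison above could be run with something like the following fio invocation against each test volume; the exact command is illustrative, not necessarily what I used:)

fio --name=randread4k --filename=/dev/zvol/ypool/test_vb4_dd4 \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=16 --runtime=30 --time_based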

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Apr 12, 2013
Previous patches have allowed you to set an increased ashift to
avoid doing 512b IO with 4k sector devices.  However, it was not
possible to set the ashift lower than the reported physical sector
size even when a smaller logical size was supported.  In practice,
there are several cases where setting a lower ashift is useful:

* Most modern drives now correctly report their physical sector
  size as 4k.  This causes zfs to correctly default to using a 4k
  sector size (ashift=12).  However, for some usage models this
  new default ashift value causes an unacceptable increase in
  space usage.  Filesystems with many small files may see the
  total available space reduced to 30-40% which is unacceptable.

* When replacing a drive in an existing pool which was created
  with ashift=9 a modern 4k sector drive cannot be used.  The
  'zpool replace' command will issue an error that the new drive
  has an 'incompatible sector alignment'.  However, by allowing
  the ashift to be manual specified as smaller, non-optimal,
  value the device may still be safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#1381
Closes openzfs#1328
Issue openzfs#967
Issue openzfs#548
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 29, 2013
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 30, 2013
@barrkel

barrkel commented Nov 14, 2013

Last weekend I created a 12x4T raidz2 array and streamed across the contents of my old Nexenta fs to test the viability of zfs on Linux.

Imagine my surprise when I noticed fs size jumped from 3.27T to 5.73T in the transition! The combo of 128k blocks, ashift=12 and raidz2 meant a 75% space overhead, almost entirely eliminating any space savings from using raidz2 rather than mirroring.

This is an issue.

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
@byteharmony

Testing today on the latest ZFS release on CentOS 6 shows the same space usage problem on raidz3 with 4K drives.

FYI
BK

@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 6, 2014
@behlendorf
Contributor

This is just something people need to be aware of. ZoL behaves the same in this regard as all the other OpenZFS implementations. The only real difference is that ZoL is much more likely to default to ashift=12. However, the default ashift can always be overridden at pool creation time if this is an issue. Since no work is planned to change this behavior, I'm closing the issue.
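For example, something along these lines at creation time (pool and device names are placeholders):

# Force 512-byte allocation units instead of the auto-detected 4K sector size
zpool create -o ashift=9 tank raidz2 sda sdb sdc sdd sde sdf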

@NoAgendaIT

Even though this bug has been closed, could someone please share recommendations for creating an ext4 zvol on a pool that has either ashift=9 or ashift=12? Does the above inefficiency only affect pools that use raidz, or are regular mirrors also affected?

behlendorf pushed a commit to behlendorf/zfs that referenced this issue May 21, 2018
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#548
@isopix

isopix commented Nov 13, 2021

Is this still an issue for zfs 2.1.1?
Does compression fix that?
Is it only related to raidz2/2 and not original raid-z1?

@sabirovrinat85

Is this still an issue for zfs 2.1.1? Does compression fix that? Is it only related to raidz2/2 and not original raid-z1?

As I understand it, this has the same problem as raidz1. I have copied some comments about this from elsewhere (specifically about VMs in datasets and zvols):

"This is the problem, volblocksize=8K. When using 8K, you will have many padding blocks. At a time you will write a 8K block on your pool.For a 6 disk pool(raidz2), this will be 8K / 4 data disk = 0.5 K. But for each disk, you can write at minimum 4K(ashift 12), so in reality you will write 4 blocks x 4K =16 K(so it is dubble). So from this perspective(space usage), you will need at least volblocksize=16K"

"please add -o volblocksize= while creating the volume.
If you have x + parity HDDs then
blocksize = 2 ^ floor(log2(x)) * 2 ^ ashift

If you have 16 disks with RAIDZ3 and ashift=12 => x=(16-3)=13 =>
floor(log2(13)) = 3 =>
blocksize = 2 ^ 3 * 2 ^ 12 =>
blocksize = 32k"
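A small shell check of that quoted rule of thumb (illustrative only; the loop computes floor(log2(x))):

# blocksize = 2^floor(log2(data disks)) * 2^ashift
disks=16; parity=3; ashift=12
x=$(( disks - parity ))
p=0; while [ $(( x >> (p + 1) )) -gt 0 ]; do p=$(( p + 1 )); done
echo $(( (1 << p) * (1 << ashift) ))   # 32768, i.e. 32K for this example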

"I found this chart that showed that that the default 8k volblocksize was indeed a problem. For Raidz1 with ashift of 12 (4K LBA) you need atleast:
-for 3 discs a volblocksize of 4x LBA = 16K
-for 4 discs a volblocksize of 3x or 16x LBA = 12K (be aware, not 2^n) or 64K
-for 5 discs a volblocksize of 8x LBA = 32K
-for 6 discs a volblocksize of 5x or 8x LBA = 20K (be aware, not 2^n) or 32K"

"The key insight is that normally, datasets are used for files of varying sizes. As such, when you write a small 16k file, ZFS can use a small 16k record. recordsize is a limit on the max record size. Smaller records are allowed.
When storing large unitary VM images, all the data is in a few big files. Since those files are very large, they only use the defined max recordsize. Even when you only create a small 8k file in your VM, at the ZFS layer that's an 8k section of a 128k record of a 10+GB file. As such, you set recordsize smaller to prevent lots of read-modify-write overhead when editing those large records.
I usually use a 32k recordsize on my vm filesystems, as I want some ability to benefit from compression when using 4k and 8k sector sizes (ashift).

That said, to correctly compare zvols vs dataset, I suggest you to test the following three configurations:

zvol virtual machine: default zvol parameters, disk configured with cache=none (to bypass the pagecache, the hypervisor must issue O_DIRECT writes);

dataset virtual machine: set recordsize=8K, atime=off, xattr=off, use a raw file disk image with cache=writeback (note: datasets do not engage the Linux pagecache, nor do they support O_DIRECT - unless you are using zfs 0.8.x, where a "fake" support for direct writes was added)"
