16K volblocksize should be recommended in the ZFS documentation? #14771
You are incorrect about it being incorrect. The default was actually raised to 16k in 2021, shortly after 2.1 branched - 72f0521. The problem is that, while yes, you're avoiding the RMW overhead of allocations larger than the disk block size, you're also effectively disabling compression on everything but empty blocks, and bottlenecking in high-throughput scenarios on anything that has per-record performance costs. You can do it, as you've illustrated. But it's not the advised configuration, and the advice is working as intended.
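The point about a small volblocksize effectively disabling compression can be illustrated with a toy model. This is a sketch, not ZFS code: it assumes a 4K allocation sector (ashift=12), uses zlib in place of ZFS's compressors, and rounds each compressed block up to whole sectors the way an allocator must.

```python
import os
import zlib

SECTOR = 4096  # assumed on-disk allocation unit (ashift=12)

def allocated_bytes(data: bytes, volblocksize: int) -> int:
    """Compress each volblock independently and round the result
    up to whole sectors, as a block allocator must."""
    total = 0
    for off in range(0, len(data), volblocksize):
        block = data[off:off + volblocksize]
        compressed = zlib.compress(block)
        stored = min(len(compressed), len(block))  # store raw if compression loses
        sectors = -(-stored // SECTOR)             # ceiling division
        total += sectors * SECTOR
    return total

# Redundant data: the same random 4K page repeated 16 times (64K total).
page = os.urandom(4096)
data = page * 16

small = allocated_bytes(data, 4096)    # 4K volblocks: each block still occupies
                                       # a full sector, so nothing is saved
large = allocated_bytes(data, 16384)   # 16K volblocks: the redundancy inside
                                       # each block is now visible to compression
print(small, large)
```

With 4K volblocks, even compressible data cannot shrink below one sector per block, so only all-zero blocks would save space; with 16K volblocks the same data allocates far less.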
Yes, you are right: a 16K volblocksize gives me a very good compression ratio on a freshly installed Rocky Linux 9.1 in a near-minimal configuration. The refcompressratio grew from 1.07x to 1.94x, and the space used by the dataset decreased from 2.03G to 1.27G with an almost identical operating system configuration and installed software. This is a very good result.
P.S. The ZFS documentation recommends using a 4K volblocksize for virtual machines - is this a bug in the documentation? If 16K is the default volblocksize, does that mean 16K is the most optimal volblocksize for most usage scenarios? Then the ZFS documentation and zfs create should recommend the same volblocksize, 16K. But right now the ZFS documentation recommends a 4K volblocksize. Maybe all three sources of information should agree on the same recommended volblocksize, 16K, to prevent user confusion and misunderstanding?
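The space figures quoted above can be sanity-checked with a little arithmetic (a sketch; the 2.03G and 1.27G numbers come from the comment above):

```python
before_gib = 2.03   # space used with 4K volblocksize, from the comment above
after_gib = 1.27    # space used with 16K volblocksize

savings = 1 - after_gib / before_gib   # fraction of space saved
ratio = before_gib / after_gib         # effective on-disk gain

print(f"space saved: {savings:.0%}, effective gain: {ratio:.2f}x")
```

The on-disk gain (about 1.6x) need not equal the ratio of the two refcompressratio values, since refcompressratio measures only logical-to-physical compression of the data actually written.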
That is recommending specifically that you ensure the VM knows what the blocksize is, so that it doesn't make allocations smaller than it and get surprises from RMW - like if you told a VM the blocksize was 512 and stored it on 4k drives. The suggestion that it's typically 4k is potentially stale; I'm not actually sure, because... a complication is that on Linux, at least, the VFS limits you to block sizes <= PAGE_SIZE, unless you architect things to handle that more cleverly, and last I knew, most filesystems had not. (This also means you can't mount some ext4 filesystems on x86, since AArch64 or sparc64, for instance, are not 4k all the time. For example, a trivial ext4 image made with a larger block size mounts fine some places, but fails to mount on x86.) So for those cases, you won't have a choice but to lie to the VM or use 4k. The man page (and, indeed, a block comment in zfs_main.c, I believe) should be updated for the 16k change and were not, even in git.
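The PAGE_SIZE constraint mentioned above can be checked programmatically. This is a sketch of the classic rule of thumb; whether a given filesystem actually mounts also depends on kernel version and per-filesystem support:

```python
import os

def vfs_can_mount(fs_block_size: int, page_size: int) -> bool:
    """Classic Linux VFS rule of thumb: a filesystem block must be a
    power of two between 512 bytes and the CPU page size."""
    power_of_two = (fs_block_size & (fs_block_size - 1)) == 0
    return power_of_two and 512 <= fs_block_size <= page_size

page = os.sysconf("SC_PAGE_SIZE")   # 4096 on x86_64; often 16K/64K on AArch64
print(page, vfs_can_mount(16384, page))
```

So a 16K-block ext4 image is mountable on a 16K- or 64K-page AArch64 kernel but not on a 4K-page x86_64 one, which is exactly the portability trap described above.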
You are probably talking about the logical block size of a block device in Linux - that can't be greater than 4096 bytes on x86_64? But Linux normally understands a large physical_block_size for block devices, even greater than 4096 bytes. I can install Linux inside a virtual machine using /usr/bin/virt-install with --disk blockio.logical_block_size=512,blockio.physical_block_size=16384 - the system installs and works fine, and lsblk inside the virtual machine also sees that the physical block size is 16K. But in this case the operating system install took more than 3 minutes. When I installed the same system with the same kickstart using --disk blockio.logical_block_size=512,blockio.physical_block_size=4096, the install took only 2 min 35 sec - a difference of almost 30 seconds in my experiment. So from a performance point of view it is better to use an emulated block device with the blockio.physical_block_size=4096 --disk option inside the virtual machine. In both cases - reported 4K physical_block_size and reported 16K physical_block_size - the ext4 filesystem has the same properties.
But if I report to the Linux guest that the physical_block_size is 16K, tune2fs -l /dev/vda3 displays additional info.
On real hardware - a RAID array - this can probably help to improve performance (on rotating mechanical HDDs), but with a ZFS zvol as the block device it does not help; it only makes the situation worse. Why? I don't know. Maybe ZFS somehow uses the ARC to improve the performance of "partial writes" of 4K blocks to storage with a 16K volblocksize, but I am not sure which optimizations exist here on the ZFS side.
I am talking about the physical block size the filesystem uses in Linux. Not the bigalloc cluster size on ext4 or agsize on XFS or anything else. ARC is, as the name says, a read cache.
Write amplification works as RMW operations (Read/Modify/Write), and the ZFS ARC can help - by reading the full 16K zvol block from the ARC instead of from persistent storage on the NVMe / SSD? By the way, ARC means "Adaptive Replacement Cache", not "Adaptive Read Cache", as one might think from the abbreviation.
"Physical block size the filesystem uses in Linux" - filesystems in Linux address storage blocks using only logical block addressing (LBA). I know of at least three variants: 512 logical / 512 physical, 512 logical / 4096 physical, and 4096 logical / 4096 physical. Even if a block device reports a 16384-byte physical block size, the Linux filesystem still uses only 512-byte or 4096-byte logical sectors to address blocks on a storage device with an internal 16K physical block size. So I can't understand which "physical block size the filesystem uses in Linux" you are talking about?
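The LBA point can be made concrete: whatever physical block size the device reports, the filesystem addresses it in logical sectors. A toy mapping (a hypothetical helper for illustration, not kernel code):

```python
def byte_to_lba(offset: int, logical_block_size: int = 512) -> tuple[int, int]:
    """Map a byte offset to (logical sector number, offset within sector)."""
    return offset // logical_block_size, offset % logical_block_size

# The same byte lands on a different logical sector number depending on the
# logical block size -- the reported physical block size never enters into it.
print(byte_to_lba(10240, 512))    # (20, 0)
print(byte_to_lba(10240, 4096))   # (2, 2048)
```

The physical block size is only a hint that lets filesystems and partitioners align their structures; addressing itself stays in logical sectors.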
We already have a great description of block size in the virtual machine workload tuning documentation.
OK, your PR #430 was merged and the OpenZFS documentation about virtual machine workload tuning was updated, so I should probably close this issue as completed. @gmelikov, can you please clarify the situation: did I file my issue #11481 in the wrong bug tracker? Should I take issue #11481 from the zfs bug tracker and create a 1:1 copy of it in the openzfs-docs bug tracker?
From my point of view, this is critical information for production use of different versions of OpenZFS and for planning upgrades of older zfs versions on production servers, but I can't find in one place in the documentation a description of all critical ZFS bugs involving data corruption and so on - something like the nginx security advisories page. Issue #11481 has existed for more than two years without any progress on it. Am I doing something wrong? Did I use the wrong bug tracker to ask for OpenZFS documentation changes?

I don't know English very well - it is not my native language - so I probably can't update the OpenZFS documentation myself, and I also don't know all the OpenZFS critical bugs that can lead to data corruption or denial of service / kernel panics. This is why I ask the OpenZFS developers to update the documentation and clarify the situation with OpenZFS bugs that can lead to data corruption and denial of service / kernel panics. Or does such information already exist in one place in the OpenZFS documentation, and I just have not found it?

Or is OpenZFS right now just in a development/beta quality phase, not intended for production use, and when I want a rock-solid, bulletproof filesystem for production (like the nginx web server in the web server area) should I look for something else - for example, the well-known OpenZFS alternative, the BTRFS filesystem from the Oracle corporation? (It would be funny if it weren't so sad.)
What am I talking about in my issue #11481? A fresh example: the recently released zfs-2.1.11 contains a fix for a possible data corruption bug. This means that all production servers using OpenZFS 2.1.10 should be updated as soon as possible to the latest released 2.1.11. But which versions of OpenZFS are affected by this possible data corruption bug and which are not - I can't tell from just reading the changelog of the 2.1.11 release. And probably not only me, but all other OpenZFS users who try to use OpenZFS in production. OpenZFS is the best non-clustered filesystem in the entire world, but its documentation could be better, from my point of view.
System information
Describe the problem you're observing
Wrong warning about wasted space from the zfs create command. This warning/recommendation is wrong, because the NVMe is formatted with a 4K block size, the virtual machine uses this zvol for an ext4 filesystem with a 4K block size, and the zpool is just a mirror of two NVMe drives - what kind of wasted space is this warning talking about?
If I create a zvol with an 8K volblocksize it will be very inefficient, because the virtual machine reads and writes 4K blocks, and each write from the virtual machine causes write amplification: read the 8K block, modify 4K of data, write the 8K block back. So instead of just writing a 4K block once and forgetting about it, ZFS will read 8K, modify it, and write 8K - an extra 8K read and 8K written instead of 4K. Is this optimal?
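The read-modify-write cost described above can be modeled with a simplified sketch (it ignores caching, write aggregation, and ZFS's copy-on-write details, and assumes aligned guest writes):

```python
def rmw_cost(write_size: int, volblocksize: int) -> tuple[int, int]:
    """Bytes (read, written) for one aligned guest write of write_size
    bytes into a zvol with the given volblocksize."""
    if write_size >= volblocksize and write_size % volblocksize == 0:
        return 0, write_size          # whole-block writes need no read
    # A partial-block write forces the full volblock to be read,
    # modified, and written back.
    return volblocksize, volblocksize

print(rmw_cost(4096, 4096))    # (0, 4096)    -> no amplification
print(rmw_cost(4096, 8192))    # (8192, 8192) -> read 8K + write 8K for 4K of data
```

This is the trade-off the maintainer describes earlier in the thread: matching volblocksize to the guest's write size removes RMW, at the cost of compression and per-record throughput.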
A 4K volblocksize is the optimal, best-fit solution for a mirror zpool created on top of two NVMe drives formatted with a 4K block size.
So this warning/recommendation from the zfs create command is totally useless and wrong, and it should be removed from zfs. Only on Sun Microsystems hardware is the processor's physical page size 8K, so only on Sun Microsystems hardware does this recommendation of an 8K volblocksize make any sense. On x86_64 hardware with a 4K processor page size and a 4K NVMe page size, the recommendation to use an 8K volblocksize will lead to write amplification and does nothing good.
The zpool is just a mirror of two NVMe drives.
Describe how to reproduce the problem
1. Install ZFS (dnf install zfs).
2. Create a new zvol with the zfs create command.