
16K volblocksize should be recommended in the ZFS documentation? #14771

Closed
makhomed opened this issue Apr 19, 2023 · 9 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@makhomed

System information

Type Version/Name
Distribution Name Rocky Linux
Distribution Version 9.1
Kernel Version 5.14.0-162.23.1.el9_1.x86_64
Architecture x86_64
OpenZFS Version 2.1.10-1

Describe the problem you're observing

A wrong warning about wasted space from the zfs create command:

# zfs create -s -b 4K -V 51200G tank/kvm101-vm111-mysql
Warning: volblocksize (4096) is less than the default minimum block size (8192).
To reduce wasted space a volblocksize of 8192 is recommended.

This warning/recommendation is wrong: the NVMe drives are formatted with a 4K block size, the virtual machine uses this zvol for an ext4 filesystem with a 4K block size, and the zpool is just a mirror of two NVMe drives. What kind of wasted space is this warning talking about?

If I create the zvol with an 8K volblocksize it will be very inefficient, because the virtual machine reads and writes 4K blocks, and every such write from the virtual machine turns into write amplification: ZFS has to read the 8K block, modify 4K of data inside it, and write the 8K block back. Instead of just writing a 4K block once and forgetting about it, ZFS reads an extra 8K and writes 8K instead of 4K. Is this optimal?
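For anyone who wants to measure this effect rather than argue about it, here is a minimal sketch (the test zvol name and size are hypothetical, adjust for your pool; fio with the libaio engine is assumed to be installed): create a zvol with a given volblocksize, run 4K random writes with fio, and compare the pool-level write bandwidth reported by zpool iostat against the 4K write stream fio actually generates.

# zfs create -s -b 16K -V 10G tank/rmw-test    # hypothetical throwaway zvol
# fio --name=rand4k --filename=/dev/zvol/tank/rmw-test --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based --group_reporting
# zpool iostat -v tank 5    # run in a second terminal while fio is writing
# zfs destroy tank/rmw-test

If RMW amplification is happening, the pool writes noticeably more bytes than fio submits; repeating the test with -b 4K should bring the two numbers much closer together.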

A 4K volblocksize is the best-fit, optimal solution for a mirror zpool created on top of two NVMe drives formatted with a 4K block size.

# lsblk --nodeps --topology /dev/nvme{0,1}n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0 131072 131072    4096     512    0 none     1023 256    0B
nvme1n1         0 131072 131072    4096     512    0 none     1023 256    0B
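The formatted LBA size of each namespace can also be double-checked with nvme-cli (a sketch, assuming the nvme-cli package is installed); the LBA format currently in use is marked "(in use)" in the output:

# nvme id-ns /dev/nvme0n1 -H | grep -i 'lba format'    # lists the supported LBA formats and marks the one in use
# nvme id-ns /dev/nvme1n1 -H | grep -i 'lba format'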

So this warning/recommendation from the zfs create command is totally useless and wrong; it should be removed from ZFS.

Only on Sun Microsystems hardware is the processor's physical page size 8K, so only on Sun Microsystems hardware does the recommendation to use an 8K volblocksize make any sense.

On x86_64 hardware with a 4K processor page size and a 4K NVMe page size, the recommendation to use an 8K volblocksize will lead to write amplification and does nothing good.

The zpool is just a mirror of two NVMe drives:

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:00:09 with 0 errors on Wed Apr 19 02:40:31 2023
config:

        NAME                                                      STATE     READ WRITE CKSUM
        tank                                                      ONLINE       0     0     0
          mirror-0                                                ONLINE       0     0     0
            nvme-SAMSUNG_MZQL23T8HCLS-00A07_S64HNE0T406214-part4  ONLINE       0     0     0
            nvme-SAMSUNG_MZQL23T8HCLS-00A07_S64HNJ0T700965-part4  ONLINE       0     0     0

errors: No known data errors

Describe how to reproduce the problem

  1. Install Rocky Linux 9.1.
  2. Install the official ZFS repo and install ZFS using DKMS (dnf install zfs).
  3. Create a zpool from two NVMe devices:
# zpool create -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa -O acltype=posix tank mirror /dev/disk/by-id/nvme-SAMSUNG_MZQL23T8HCLS-00A07_S64HNE0T406214-part4 /dev/disk/by-id/nvme-SAMSUNG_MZQL23T8HCLS-00A07_S64HNJ0T700965-part4
  4. Create a 50 TiB zvol for a virtual machine:
# zfs create -s -b 4K -V 51200G tank/kvm101-vm111-mysql
Warning: volblocksize (4096) is less than the default minimum block size (8192).
To reduce wasted space a volblocksize of 8192 is recommended.
  5. Observe the wrong and unexpected warning about the 4K volblocksize from the zfs create command.
  6. See that the zvol with a 4K volblocksize was created successfully and works fine:
# zfs get volblocksize tank/kvm101-vm111-mysql
NAME                     PROPERTY      VALUE     SOURCE
tank/kvm101-vm111-mysql  volblocksize  4K        -
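For reference, the default volblocksize of the installed OpenZFS build can be checked by creating a throwaway zvol without -b (a sketch; the zvol name is hypothetical):

# zfs create -s -V 1G tank/volblocksize-default    # no -b given, so the build's default is used
# zfs get volblocksize tank/volblocksize-default
# zfs destroy tank/volblocksize-default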
@makhomed makhomed added the Type: Defect Incorrect behavior (e.g. crash, hang) label Apr 19, 2023
@makhomed makhomed changed the title to Wrong recommendation about wasted space in the zfs create command with 4K volblocksize for new zvol on Apr 19, 2023
@rincebrain
Contributor

rincebrain commented Apr 19, 2023

You are incorrect about it being incorrect.

The default was actually raised to 16k in 2021, shortly after 2.1 branched - 72f0521.

The problem is that, while yes, you're avoiding having the overhead of > disk block size allocations causing RMW, you're also effectively disabling compression on everything but empty blocks, and bottlenecking in high-throughput scenarios on things that have performance implications per-record.

You can do it, as you've illustrated. But it's not the advised configuration, and the advice is working as intended.
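A quick way to see the compression point is to write the same ordinary compressible data to two throwaway zvols (a sketch; the zvol names and sizes are hypothetical, /usr/share/doc is just an example source of compressible text, and the pool uses ashift=12 as above). With a 4K volblocksize on 4K sectors, a compressed record still occupies a whole 4K sector unless it compresses down to almost nothing, so the ratio stays near 1.00x, while the 16K zvol can actually realize the savings:

# zfs create -s -b 4K -V 1G tank/compress-4k
# zfs create -s -b 16K -V 1G tank/compress-16k
# tar -cf - /usr/share/doc 2>/dev/null | dd of=/dev/zvol/tank/compress-4k bs=1M count=512 iflag=fullblock oflag=direct
# tar -cf - /usr/share/doc 2>/dev/null | dd of=/dev/zvol/tank/compress-16k bs=1M count=512 iflag=fullblock oflag=direct
# zfs get compressratio,used,logicalused tank/compress-4k tank/compress-16k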

@makhomed
Author

Yes, you are right: a 16K volblocksize gives me a very good compression ratio on a freshly installed Rocky Linux 9.1 in an almost minimal configuration. The refcompressratio grew from 1.07x to 1.94x and the space used by the dataset decreased from 2.03G to 1.27G, with an almost identical operating system configuration and installed software. This is a very good result.

# zfs get all -t volume | grep -v swap | grep -P '(refcompressratio|usedbydataset|logicalused)'
tank/kvm101-vm001-router-16K-volblocksize      usedbydataset         1.27G                  -
tank/kvm101-vm001-router-16K-volblocksize      refcompressratio      1.94x                  -
tank/kvm101-vm001-router-16K-volblocksize      logicalused           2.25G                  -
tank/kvm101-vm001-router-4K-volblocksize       usedbydataset         2.03G                  -
tank/kvm101-vm001-router-4K-volblocksize       refcompressratio      1.07x                  -
tank/kvm101-vm001-router-4K-volblocksize       logicalused           2.21G                  -

P.S. The ZFS documentation recommends using a 4K volblocksize. Is this a bug in the documentation?

Virtual machines
Virtual machine images on ZFS should be stored using either zvols or raw files to avoid unnecessary overhead. The recordsize/volblocksize and guest filesystem should be configured to match to avoid overhead from partial record modification. This would typically be 4K.

If 16K is the default volblocksize, does this mean that 16K is the optimal volblocksize for most usage scenarios?

And should the ZFS documentation and the zfs create command recommend the same volblocksize, 16K?

But right now the ZFS documentation recommends a 4K volblocksize,
the zfs create command talks about an 8K "default minimum block size",
and the ZFS code says the default volblocksize is really 16K.

Maybe all three sources of information should talk about the same recommended volblocksize, 16K, to prevent user confusion and misunderstanding?

@makhomed makhomed changed the title from Wrong recommendation about wasted space in the zfs create command with 4K volblocksize for new zvol to 16K volblocksize should be recommended in the ZFS documentation, and also in the zfs create command warning message? on Apr 19, 2023
@makhomed makhomed changed the title from 16K volblocksize should be recommended in the ZFS documentation, and also in the zfs create command warning message? to 16K volblocksize should be recommended in the ZFS documentation? on Apr 19, 2023
@rincebrain
Contributor

rincebrain commented Apr 20, 2023

That is recommending specifically that you ensure the VM knows what the blocksize is, so that it doesn't make allocations smaller than it and have surprises from RMW, like if you told a VM the blocksize was 512 and stored it on 4k drives. The fact that it suggests it's typically 4k is potentially stale, I'm not actually sure, because...

A complication of that is that on Linux, at least, the VFS limits you to block sizes <= PAGE_SIZE, unless you architect things to more cleverly handle that, and last I knew, most FSes had not. (This also means you can't mount some ext4 filesystems on x86, since AArch64 or sparc64, for instance, are not 4k all the time. For example, this trivial ext4 image mounts fine some places, but on x86, we get EXT4-fs (loop6): bad block size 8192)
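That failure mode is easy to reproduce locally (a sketch; the image path is hypothetical, and mkfs.ext4 will warn that an 8192-byte block size is unusable on most systems):

# truncate -s 64M /tmp/ext4-8k.img
# mkfs.ext4 -F -b 8192 /tmp/ext4-8k.img    # -F proceeds past the 'blocksize not usable on most systems' warning
# mount -o loop /tmp/ext4-8k.img /mnt      # fails on x86; dmesg shows 'EXT4-fs (loopN): bad block size 8192'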

So for those cases, you won't have a choice but to lie to it or use 4k volblocksize. Womp womp.

The man page (and, indeed, a block comment in zfs_main.c, I believe) should be updated for the 16k change and were not, even in git.

@makhomed
Author

makhomed commented Apr 20, 2023

Probably you are talking about the logical block size of a block device in Linux, which on x86_64 can't be greater than 4096 bytes?

But Linux normally understands a large physical_block_size on block devices, even greater than 4096 bytes.

I can install Linux inside a virtual machine using /usr/bin/virt-install with --disk blockio.logical_block_size=512,blockio.physical_block_size=16384: the system installs and works fine. lsblk inside the virtual machine also sees that the physical block size is 16K. But in this case the operating system install time was more than 3 minutes.

When I installed the same system with the same kickstart using --disk blockio.logical_block_size=512,blockio.physical_block_size=4096, the install time was only 2 min 35 sec.

The difference was almost 30 seconds in my experiment.

From a performance point of view, it is better to use an emulated block device inside the virtual machine with the blockio.physical_block_size=4096 --disk option; an example invocation is sketched below.
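A sketch of such an invocation (the VM name, memory/CPU sizes, ISO path, and --os-variant value are hypothetical and depend on the local setup and osinfo-db; the blockio attributes on --disk are the point here):

# virt-install --name vm111-test --memory 4096 --vcpus 2 --cdrom /path/to/Rocky-9.1.iso --os-variant rocky9 --disk path=/dev/zvol/tank/kvm101-vm111-mysql,bus=virtio,blockio.logical_block_size=512,blockio.physical_block_size=4096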

In both cases, reported 4K physical_block_size and reported 16K physical_block_size, the ext4 filesystem has the same properties:

Block size:               4096
Fragment size:            4096

But if I report to the Linux inside the virtual machine that the physical_block_size is 16K, tune2fs -l /dev/vda3 displays additional info:

RAID stride:              4
RAID stripe width:        4

On a real hardware RAID array (with rotating mechanical HDDs) this can probably help improve performance, but with a ZFS zvol as the block device it does not help; it only makes the situation worse.

Why? I don't know.
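If those RAID hints are suspected of hurting performance on a zvol, they can be adjusted after the fact with tune2fs (a sketch; /dev/vda3 is taken from the output above, and it is an assumption that a value of 0 is accepted to clear the hints):

# tune2fs -E stride=0,stripe_width=0 /dev/vda3
# tune2fs -l /dev/vda3 | grep -iE 'stride|stripe'    # check the new values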

Maybe ZFS somehow uses the ARC to improve the performance of "partial writes" of 4K blocks to storage with a 16K volblocksize, but I am not sure which optimizations exist here on the ZFS side.

@rincebrain
Contributor

rincebrain commented Apr 20, 2023

I am talking about the physical block size the filesystem uses in Linux. Not the bigalloc cluster size on ext4 or agsize on XFS or anything else.

ARC is, as the name says, a read cache.

@makhomed
Author

makhomed commented Apr 20, 2023

ARC is, as the name says, a read cache.

Write amplification works as RMW (read/modify/write) operations, and the ZFS ARC can help here: the full 16K zvol block can be read from the ARC rather than from the persistent storage on the NVMe/SSD?

ARC means "Adaptive Replacement Cache", not "Adaptive Read Cache", as one might assume from the abbreviation.

I am talking about the physical block size the filesystem uses in Linux. Not the bigalloc cluster size on ext4 or agsize on XFS or anything else.

"physical block size the filesystem uses in Linux" - filesystems in Linux address storage blocks using only the logical block addressing (LBA).

I know at least three variants:

512 logical / 512 physical.

512 logical / 4096 physical.

4096 logical / 4096 physical.

Even if a block device reports a 16384-byte physical block size, a Linux filesystem uses only 512-byte or 4096-byte logical (LBA) addresses to address logical blocks/sectors on a storage device with an internal 16K physical block size.
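These reported sizes can be checked from inside the guest (a sketch; vda is the hypothetical virtio disk name):

# cat /sys/block/vda/queue/logical_block_size /sys/block/vda/queue/physical_block_size
# blockdev --getss --getpbsz /dev/vda    # the same two values via the blockdev utility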

So I can't understand: which "physical block size the filesystem uses in Linux" are you talking about?

@gmelikov
Member

gmelikov commented Apr 20, 2023

We already have a great description of block sizes for hardware workloads at https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#zvol-volblocksize , and I made a PR reusing it: openzfs/openzfs-docs#430. Thank you, good catch @makhomed!

@makhomed
Author

OK, your PR #430 is merged and the OpenZFS documentation about virtual machine workload tuning is updated, so I probably should close this issue as completed.

@gmelikov, can you please clarify the situation: did I file my issue #11481 in the wrong bug tracker? Should I take issue #11481 out of the zfs bug tracker and create a 1:1 copy of it in the openzfs-docs bug tracker?

Describe in documentation all known data corruption bugs and affected OpenZFS versions #11481

From my point of view, this is critical information for running different versions of OpenZFS in production and for planning upgrades of older ZFS versions on production servers, but I can't find in one place in the documentation a description of all critical ZFS bugs involving data corruption and so on, something like the nginx security advisories page.

Issue #11481 has existed for more than two years without any progress on it... Am I doing something wrong? Am I perhaps using the wrong bug tracker to ask for OpenZFS documentation changes?

I don't know English very well (it is not my native language), so I probably can't update the OpenZFS documentation by myself, and I also don't know all the critical OpenZFS bugs that can lead to data corruption or denial of service / kernel panics.

This is why I ask the OpenZFS developers to update the OpenZFS documentation and clarify the situation with OpenZFS bugs that can lead to data corruption or denial of service / kernel panic. Or does such information already exist in one place in the OpenZFS documentation, and I just have not found it?

Or is OpenZFS right now just in a development/beta quality phase and not intended for production use, so that when I want a rock-solid, bulletproof filesystem for production (like the nginx web server is in the web server area) I should look for something else, for example the well-known OpenZFS alternative, the BTRFS filesystem from the Oracle corporation? (It would be funny if it weren't so sad.)

@makhomed
Author

What am I talking about in my issue #11481?

A fresh example:

The recently released zfs-2.1.11 contains a fix for a possible data corruption bug; this means that all production servers which use OpenZFS 2.1.10 should be updated as soon as possible to the latest released OpenZFS 2.1.11.

But I can't tell which versions of OpenZFS are affected by this possible data corruption bug and which versions are not just from reading the list of changes in the 2.1.11 release.

Probably not only me, but also all other OpenZFS users who are trying to use the OpenZFS filesystem in production.

The OpenZFS filesystem is the best non-clustered filesystem in the entire world.

But the OpenZFS documentation could be better, from my point of view.
