
Increase default volblocksize from 8KB to 16KB. #12406

Merged: 1 commit merged into openzfs:master from the volblocksize branch on Aug 17, 2021

Conversation

@amotin (Member) commented Jul 21, 2021

Many things have changed since the previous default was set many years ago. Nowadays 8KB does not allow adequate compression or even decent space efficiency on many pools due to 4KB disk physical block rounding, especially on RAIDZ and DRAID. It effectively limits write throughput to only 2-3GB/s (250-350K blocks/s) due to sync thread, allocation, vdev queue and other block-rate bottlenecks. It keeps L2ARC expensive despite many optimizations, and makes dedup just unrealistic.
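
To make the rounding effect concrete, here is a minimal sketch of how it could be observed; the pool name, source image and exact figures are assumptions for illustration, not anything from this PR:

    # Two sparse, compressed test zvols that differ only in volblocksize
    # (pool "tank" and the source image path are placeholders).
    zfs create -s -o compression=lz4 -o volblocksize=8K  -V 10G tank/vol8k
    zfs create -s -o compression=lz4 -o volblocksize=16K -V 10G tank/vol16k

    # Write the same compressible data to both, then compare on-disk usage.
    dd if=/path/to/compressible.img of=/dev/zvol/tank/vol8k  bs=1M oflag=direct
    dd if=/path/to/compressible.img of=/dev/zvol/tank/vol16k bs=1M oflag=direct
    zfs get used,logicalused,compressratio tank/vol8k tank/vol16k

    # On 4KB-sector (ashift=12) vdevs, an 8KB block that compresses to ~5KB is
    # still rounded up to a full 8KB on disk, while a 16KB block compressing to
    # ~10KB occupies only 12KB.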

In FreeNAS/TrueNAS we have defaulted for years to at least 16KB volblocksize for mirror pools and even bigger (32-64KB) for RAIDZ, and so far we have found very few scenarios (outside synthetic benchmarks) where smaller blocks show sufficient benefits.

This was discussed at today's OpenZFS meeting and got no objections.

@amotin added the "Type: Performance" (performance improvement or performance problem) and "Status: Code Review Needed" (ready for review and testing) labels on Jul 21, 2021
@amotin requested review from behlendorf and ahrens on July 21, 2021 03:22
@amotin force-pushed the volblocksize branch 2 times, most recently from 6289205 to 4988fb6 on July 21, 2021 13:50
@behlendorf (Contributor) left a comment

Increasing this default makes good sense to me. Let's just review the logic in the default_volblocksize() function to confirm it still works as intended. That code was added as part of the dRAID changes to ensure a reasonable default volblocksize is used. It looks like it should be fine, but I didn't do any actual manual testing.

@amotin (Member, Author) commented Jul 22, 2021

@behlendorf I don't see how this change can make default_volblocksize() any worse than it already is. But I find it very confusing already:

  • The constants of 2 and 4 sectors for RAIDZs are based on losing half of capacity (a rough worked example follows this list). But people who are willing to lose half of their capacity should just use mirrors for much better performance and space efficiency. I am OK with this being used as a warning threshold, but the code then reports it as the "minimum allocation unit", whatever that may mean for RAIDZ, and calculates wasted space based on it, which is a very rough estimation.
  • The message reported in the most severe case (volblocksize < ZVOL_DEFAULT_BLOCKSIZE < tgt_volblocksize) looks less severe to me than the second one (ZVOL_DEFAULT_BLOCKSIZE < volblocksize < tgt_volblocksize). And to me they do not adequately represent all combinations of space waste, low performance, or both.
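
For context, a rough worked example of the "half of capacity" threshold, assuming 4K sectors (ashift=12) and raidz2; this is only an illustration, not the actual default_volblocksize() logic:

    # On raidz2, a small block carries 2 parity sectors (padding ignored here).
    # An 8K block is 2 data sectors, so parity already equals the data, i.e.
    # roughly half of the raw capacity goes to overhead; a 16K block fares
    # noticeably better.
    echo "8K block:  $((2 * 4))K data + $((2 * 4))K parity"
    echo "16K block: $((4 * 4))K data + $((2 * 4))K parity"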

@ahrens (Member) commented Jul 22, 2021

It makes sense to me that most zvols should use volblocksize=16K or more. As you probably know, we originally defaulted to volblocksize=8K, compression=off, and refreservation=~volsize to minimize surprises when using zvols as replacements for existing volume managers. volblocksize=8k matched the page size of the hardware (SPARC) and the default block size of the predominant filesystem (Solaris UFS) that might be used on zvols.

The most important use cases for zvols have changed a lot since then. My understanding is that they are now used primarily either for iSCSI (or FC?) targets or for (local) VM disk images, and the space is thin-provisioned. For this use case you probably want a sparse/thin-provisioned volume with compression (zfs create -s -o compression=on -o volblocksize=16K+ -V ...).

My only concern about this change is the potentially surprising impact of changing the default. I think that changing it as proposed here is OK, but maybe we could do even better by leaving zfs create -V ... alone and introducing a new kind of zvol, created with its own flag or subcommand, whose defaults make more sense for the modern use case. E.g. zfs create -L <volsize> <dataset> could create a "space-efficient, large-write-efficient" zvol with appropriate default properties (e.g. volblocksize=64K, compression=on, refquota=none), leaving zfs create -V for creating a "zvol of least surprise".
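
Spelled out with today's syntax, such a "modern" zvol might look like the following sketch; the dataset name and size are placeholders, and the -L flag itself does not exist yet:

    # A sparse, compressed, large-block zvol for a VM disk or iSCSI extent,
    # roughly matching the defaults suggested above for a hypothetical `zfs create -L`.
    zfs create -s -o compression=on -o volblocksize=64K -V 100G tank/vm-disk0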

@gmelikov (Member) commented Jul 22, 2021

@ahrens keeping the non-modern variant as the default has a major drawback: you need to know more than zfs create -V to use zvols effectively, so:

  • newbies will suffer
  • benchmarks will suffer
  • so ZFS for non-ZFSers will still be "slow"
  • my context switching will suffer too, with one more zvol type to keep in mind (one, two, many :) )

If someone really wants a small volblocksize, they should just set it. IMHO nowadays 8k doesn't have any pros over 16k, even on NVMe drives.

The best we can do is run bare benchmarks comparing 8k and 16k volblocksizes, for safety.
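
One possible shape for such a comparison, sketched with fio; the pool name, sizes and job parameters are assumptions, not anything from this PR:

    # Create identical sparse zvols that differ only in volblocksize, run the
    # same random-write workload against each, and compare the results.
    for vbs in 8K 16K; do
        zfs create -s -o compression=off -o volblocksize=${vbs} -V 20G tank/bench${vbs}
        udevadm settle    # wait for the /dev/zvol device node to appear
        fio --name=randwrite-${vbs} --filename=/dev/zvol/tank/bench${vbs} \
            --rw=randwrite --bs=${vbs} --ioengine=libaio --direct=1 \
            --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting
        zfs destroy tank/bench${vbs}
    done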

@amotin (Member, Author) commented Jul 22, 2021

@ahrens I agree with @gmelikov on all points that it would be overkill. Sure, we have 8KB mentioned in many materials published over the years, but I don't think that is a good enough reason to go that complicated and publish yet more materials. On top of that, I am not aware of any software in the mentioned iSCSI/FC/VM realm of TrueNAS that strongly depends on an 8KB "physical sector size" report and would not accept 16KB. Solaris on SPARC was pretty unique in its page size. There is some software that prefers 4KB (MS SQL, partially VMware), from which we have to hide the truth to keep it happy, but nothing I know of treats 8KB as more than an "optimization".

@mmaybee added the "Status: Accepted" (ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (ready for review and testing) label on Aug 17, 2021
@mmaybee merged commit 72f0521 into openzfs:master on Aug 17, 2021
@amotin deleted the volblocksize branch on August 24, 2021 20:17
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Sep 22, 2021

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12406
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Feb 10, 2022