Increase L2ARC write rate and headroom #15457

Merged: 1 commit merged from shodanshok:l2tune into openzfs:master on Nov 9, 2023

Conversation

@shodanshok (Contributor) commented Oct 26, 2023

Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).

These values were selected 15+ years ago based on then-current SSD
size, performance and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.

For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).
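
For reference, a quick back-of-the-envelope of what the old and new defaults mean per one-second feed pass (my own illustration of the numbers above; the warm-up boost is assumed to simply double both figures, matching the 16/32 MB example):

```python
# Rough arithmetic for the defaults discussed above (per one-second feed pass).
MiB = 1024 * 1024

for label, write_max, headroom in (("old", 8 * MiB, 2), ("new", 32 * MiB, 8)):
    scan = write_max * headroom          # ARC tail scanned per pass, steady state
    print(f"{label}: write up to {write_max // MiB} MiB/s, "
          f"scan {scan // MiB} MiB of ARC tail "
          f"({2 * write_max // MiB}/{2 * scan // MiB} MiB while warming)")
# old: write up to 8 MiB/s, scan 16 MiB of ARC tail (16/32 MiB while warming)
# new: write up to 32 MiB/s, scan 256 MiB of ARC tail (64/512 MiB while warming)
```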


@amotin (Member) commented Oct 26, 2023

  • I support the write size increase, it is definitely overdue. The question is only how high to set it. It would be good to recalculate it into a TBW value for a typical drive over its lifetime, considering worst-case 24x7 operation.
  • Completely disabling the headroom, though, I consider dangerous: we do not want to repeatedly scan a terabyte of ARC in 4KB blocks where almost nothing is L2ARC-eligible. There should remain some safety barrier.

@shodanshok (Contributor, Author)

This patch is in draft form not because it does anything complex (it just changes two constants used for default values), but because I would like to get feedback from others.

I deployed an increased l2arc_write_max and l2arc_headroom=0 on some KVM servers with very good results. Current enterprise TLC SSDs have endurance to spare as L2ARC devices, and even consumer SSDs are more than enough, so these changes should not pose any practical issue for device lifetime. L2ARC hit rate is very good, for example:

L2ARC breakdown:                                                    4.0M
        Hit ratio:                                     89.7 %       3.6M
        Miss ratio:                                    10.3 %     411.2k
        Feeds:                                                    780.2k

VMs deployed on such servers "feel" much more like they are SSD-backed, even after a host reboot.

However, these KVM hosts only have 64-192 GB of RAM, so I don't really know whether l2arc_headroom=0 is appropriate for bigger machines.

Thanks.

@amotin (Member) commented Oct 26, 2023

I think we've already discussed that before. L2ARC was designed to cache data that are going to be evicted from ARC. Headroom controls how much more data we expect to be evicted from ARC per second that L2ARC should care about. If the first in line for eviction are data of some other pool, or data that are not L2ARC-eligible (IIRC we discussed that during ARC warmup too much data was not eligible due to ongoing prefetch), then L2ARC does not need to write anything and should not look deeper; it should stop. This logic is still valid when ARC is warm. It can be discussed how good an idea it is to write to L2ARC while ARC is not full yet, and what we should do with prefetched data and headroom in that case, but the fix here would likely be not a blind headroom disable, but some changes to the code logic. That is your feedback from me.

@amotin (Member) commented Oct 26, 2023

Just on the level of ideas: in case persistent L2ARC is enabled, while ARC is still cold and L2ARC is not full, L2ARC could write only MFU buffers and without headroom. It would give persistent L2ARC a boost of the most useful data in case of reboot. After the ARC has warmed up, operation could return to the original algorithm, including headroom.

@shodanshok (Contributor, Author)

I support the write size increase, it is definitely overdue. The question is only how high to set it. It would be good to recalculate it into a TBW value for a typical drive over its lifetime, considering worst-case 24x7 operation.

I find a plain TBW value to be overly pessimistic, as the cache device is not going to write at full speed all the time. At the current 8 MB/s, a worst-case estimate is 8 * 86400 * 365 = 240 TB/year, while the SSDs of one KVM server (2x 500 GB Samsung 850 EVO) are 6 years old and each has written a total of ~60.5 TB (only ~10 TB/year). Since the last reboot, 9 days ago:

                                      capacity     operations     bandwidth 
pool                                alloc   free   read  write   read  write
cache                                   -      -      -      -      -      -
  pci-0000:01:00.1-ata-3.0-part5     261G   159G      2      2   166K   136K
  pci-0000:01:00.1-ata-4.0-part5     255G   165G      2      2   161K   136K

As a side note, this server has been running with l2arc_write_max=268435456 (256M) for at least 6 months.
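
To put the endurance question in numbers, here is the worst-case arithmetic above extended to the proposed 32M default (a rough sanity check; the exact figure depends on whether you count MB or MiB):

```python
# Worst-case yearly L2ARC write volume if the feed ran flat out 24x7.
MiB = 1024 * 1024
SECONDS_PER_YEAR = 86400 * 365

for rate_mib in (8, 32):
    bytes_per_year = rate_mib * MiB * SECONDS_PER_YEAR
    print(f"{rate_mib} MiB/s -> {bytes_per_year / 1e12:.0f} TB/year "
          f"({bytes_per_year / 2**40:.0f} TiB/year)")
# 8 MiB/s  -> 265 TB/year (241 TiB/year), in line with the ~240 TB/year above
# 32 MiB/s -> 1058 TB/year (962 TiB/year) absolute worst case; the ~10 TB/year
# actually observed on the 850 EVOs is two orders of magnitude lower.
```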

Completely disabling the headroom, though, I consider dangerous: we do not want to repeatedly scan a terabyte of ARC in 4KB blocks where almost nothing is L2ARC-eligible. There should remain some safety barrier.

I share that concern, even if on these 64-192 GB servers I did not see anything wrong. Maybe because I am using a 128K recordsize? Anyway, anything that keeps the feed-thread scan in the 1-4 GB range should be OK as l2arc_headroom.

I think we've already discussed that before.

Maybe in #15201?

If the first in line for eviction are data of some other pool, or data that are not L2ARC-eligible (IIRC we discussed that during ARC warmup too much data was not eligible due to ongoing prefetch), then L2ARC does not need to write anything and should not look deeper; it should stop. This logic is still valid when ARC is warm.

Is this the current logic? I don't remember the feed thread doing that (stopping after some ineligible buffers are found).

the fix here would likely be not a blind headroom disable

I agree. At the same time, I remember this very useful comment #15201 (comment) stating that the ARC sublists only contain eligible buffers, so the feed thread should not really scan the entire ARC.

Thanks.

@amotin (Member) commented Oct 26, 2023

If the first in line for eviction are data of some other pool, or data that are not L2ARC-eligible (IIRC we discussed that during ARC warmup too much data was not eligible due to ongoing prefetch), then L2ARC does not need to write anything and should not look deeper; it should stop. This logic is still valid when ARC is warm.

Is this the current logic? I don't remember the feed thread doing that (stopping after some ineligible buffers are found).

The feed thread scans up to headroom, but skips ineligible buffers. If none of the scanned buffers are eligible, nothing will be written.

the fix here would likely be not a blind headroom disable

I agree. At the same time, I remember this very useful comment #15201 (comment) stating that the ARC sublists only contain eligible buffers, so the feed thread should not really scan the entire ARC.

The sublists contain buffers eligible for eviction. It does not mean they are all eligible for L2ARC: some may already be in L2ARC, some may belong to a different pool, some are from a dataset with secondarycache disabled, some are prefetches.
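
To make the scan behavior described above concrete, here is a conceptual sketch in plain Python (not the actual arc.c code; the names and structure are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Hdr:
    size: int
    eligible: bool   # False if already in L2ARC, another pool, secondarycache=off, or prefetch

def feed_one_sublist(sublist, write_max, headroom_bytes):
    """Walk one ARC sublist from its eviction end, skip L2ARC-ineligible
    buffers, and stop after scanning `headroom_bytes` or queueing
    `write_max` bytes. If nothing scanned is eligible, nothing is written."""
    written = scanned = 0
    for hdr in sublist:                  # ordered from the eviction end
        if scanned >= headroom_bytes or written >= write_max:
            break
        scanned += hdr.size
        if not hdr.eligible:             # evictable, but not L2ARC-eligible
            continue
        written += hdr.size              # stand-in for issuing the L2ARC write
    return written

# A window full of ineligible buffers produces no L2ARC writes at all:
print(feed_one_sublist([Hdr(128 << 10, False)] * 1000, 32 << 20, 256 << 20))   # 0
```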

@shodanshok (Contributor, Author)

The feed thread scans up to headroom, but skips ineligible buffers. If none of the scanned buffers are eligible, nothing will be written.

Ok, sure, I misunderstood the previous post.

The sublists contain buffers eligible for eviction. It does not mean they are all eligible for L2ARC: some may already be in L2ARC, some may belong to a different pool, some are from a dataset with secondarycache disabled, some are prefetches.

You are right.

I agree that completely disabling the headroom limit can be too much. At the same time, I am somewhat surprised that I never saw the feed thread cause any significant load, even on servers with l2arc_headroom=0. This is probably due to limited memory and the default recordsize (128K).

What about setting l2arc_headroom=32? If you feel that reasonable, I can update this patch.

Thanks.

@amotin (Member) commented Oct 27, 2023

What about setting l2arc_headroom=32? If you feel that reasonable, I can update this patch.

With the new write limit it would mean up to 1 GB/s of scanned buffers, or up to 4 GB/s considering the boosts due to compressed and cold ARC, or up to 16 GB/s considering all traversed lists. Sure, such write speeds are reachable in real life, but not by every system. Also, not every system has that much ARC in general. This value would not be completely insane, but it feels quite aggressive.
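
Spelling that arithmetic out (my reading of the multipliers: a 2x compressed-ARC headroom boost, a 2x cold-ARC write boost, and the four MRU/MFU data/metadata sublist types the feed thread walks):

```python
MiB = 1 << 20
scan = 32 * MiB * 32            # l2arc_write_max * proposed l2arc_headroom
scan_boosted = scan * 2 * 2     # compressed-ARC boost and cold-ARC write boost
scan_total = scan_boosted * 4   # across all four traversed sublist types
print(scan // MiB, scan_boosted // MiB, scan_total // MiB)   # 1024 4096 16384
```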

But before that I would prefer some code review/cleanup to be done there. I don't see the point of l2arc_headroom_boost these days. I think in the case of compressed ARC we should just measure the headroom in terms of HDR_GET_PSIZE(), not HDR_GET_LSIZE(). That would match both how much we write to the L2ARC and how much we evict from ARC. With better math we could reduce the headroom by dropping the compression boost and only adjusting the general one.

@shodanshok (Contributor, Author)

With the new write limit it would mean up to 1 GB/s of scanned buffers, or up to 4 GB/s considering the boosts due to compressed and cold ARC, or up to 16 GB/s considering all traversed lists. Sure, such write speeds are reachable in real life, but not by every system. Also, not every system has that much ARC in general. This value would not be completely insane, but it feels quite aggressive.

Yes, it would remain quite aggressive. Maybe a safer approach is the simpler one: as I increased l2arc_write_max by 4x, let l2arc_headroom be increased by the same 4x (instead of the proposed 16x). This means 256 MB/s of scanned buffers per sublist in steady state, and up to 1 GB per sublist in case of a cold and compressed ARC.
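
And the same arithmetic for the 4x values actually proposed here (again my own check, per sublist and per feed pass):

```python
MiB = 1 << 20
scan = 32 * MiB * 8          # l2arc_write_max * l2arc_headroom = 256 MiB steady state
scan_cold = scan * 2 * 2     # cold (write boost) and compressed (headroom boost) ARC
print(scan // MiB, scan_cold // MiB)   # 256 1024
```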

But before that I would prefer some code review/cleanup to be done there. I don't see the point of l2arc_headroom_boost these days. I think in the case of compressed ARC we should just measure the headroom in terms of HDR_GET_PSIZE(), not HDR_GET_LSIZE(). That would match both how much we write to the L2ARC and how much we evict from ARC. With better math we could reduce the headroom by dropping the compression boost and only adjusting the general one.

I think the general idea was "if compression is enabled, consider a 2x data reduction rate". Better math would be fine, but as a hand-wave rule I find it quite reasonable.

As the current values are so undersized, I am updating this PR with l2arc_write_max=32M and l2arc_headroom=8, hoping they will be more appropriate for modern SSDs.

Thanks.

@amotin (Member) commented Oct 27, 2023

I think the general idea was "if compression is enabled, consider a 2x data reduction rate". Better math would be fine, but as a hand-wave rule I find it quite reasonable.

If ARC is compressed, then we write the data to L2ARC exactly as they are in ARC. We do not need to guess, we know the exact physical size.

As the current values are so undersized, I am updating this PR with l2arc_write_max=32M and l2arc_headroom=8, hoping they will be more appropriate for modern SSDs.

I have no objections.

@shodanshok marked this pull request as ready for review October 27, 2023 17:14
@shodanshok (Contributor, Author) commented Oct 27, 2023

I just updated the man page.

The above "cold and compressed ARC" calculation was done considering a 2x boot from a cold ARC, which is not actually true. Do you think I should set l2arc_write_boost the same as l2arc_write_max (32M) ?

EDIT: no, I'm wrong, l2arc_write_boost is defined the same as l2arc_write_max. I will re-update the man page to reflect the new values.

Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).

These values were selected 15+ years ago based on then-current SSD
size, performance and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.

For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).

Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
@shodanshok (Contributor, Author)

I see some CI tests failing... can the failures be related to this patch?

@behlendorf (Contributor)

I see some CI tests failing... can the failures be related to this patch?

It looks like it could be due to the pool layouts for some of the test cases. I do see the following warning in the CI console logs before the failures. Although, based on the log message it should have capped this to something safe.

[ 3254.330747] NOTICE: l2arc_write_max or l2arc_write_boost plus the overhead of
log blocks (persistent L2ARC, 0 bytes) exceeds the size of the cache device (guid 14312536273815924369),
resetting them to the default (33554432)

@shodanshok (Contributor, Author)

I see some CI tests failing... can the failures be related to this patch?

It looks like it could be due to the pool layouts for some of the test cases. I do see the following warning in the CI console logs before the failures. Although, based on the log message it should have capped this to something safe.

[ 3254.330747] NOTICE: l2arc_write_max or l2arc_write_boost plus the overhead of
log blocks (persistent L2ARC, 0 bytes) exceeds the size of the cache device (guid 14312536273815924369),
resetting them to the default (33554432)

Interesting. Do you think it is an issue with the test suite, or should I implement a cap for l2arc_write_max in the code itself?

Thanks.

@behlendorf added the Status: Code Review Needed label Nov 1, 2023
@behlendorf (Contributor)

It's surprising. I've resubmitted those CI runs, let's see how reproducible it is.

@behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Nov 7, 2023
@behlendorf merged commit 887a3c5 into openzfs:master Nov 9, 2023
19 of 20 checks passed
amotin added a commit to amotin/zfs that referenced this pull request Nov 13, 2023
behlendorf pushed a commit that referenced this pull request Nov 14, 2023
PR #15457 exposed weird logic in L2ARC write sizing. If it appeared
bigger than the device size, instead of limiting the write it reset all
the system-wide tunables to their defaults. Aside from being excessive,
it did not actually help with the problem, still allowing an infinite
loop to happen.

This patch removes the tunable-reverting logic and instead limits
L2ARC writes (or at least eviction/trim) to 1/4 of the capacity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15519
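
A minimal sketch of the limiting behavior that follow-up commit describes (my paraphrase, with invented names, not the actual arc.c change): instead of resetting the tunables, a single feed pass is capped at a quarter of the cache device.

```python
def clamp_l2arc_write(write_max: int, write_boost: int, dev_capacity: int,
                      arc_warm: bool) -> int:
    """Cap the per-pass L2ARC write target at 1/4 of the cache device,
    rather than resetting the system-wide tunables to their defaults."""
    target = write_max if arc_warm else write_max + write_boost
    return min(target, dev_capacity // 4)

# Example: a tiny 64 MiB test cache device with the new 32M defaults.
MiB = 1 << 20
print(clamp_l2arc_write(32 * MiB, 32 * MiB, 64 * MiB, arc_warm=False) // MiB)  # 16
```
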
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).

These values were selected 15+ years ago based on then-current SSD
size, performance and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.

For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes openzfs#15457
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
mmatuska pushed a commit to mmatuska/zfs that referenced this pull request Dec 27, 2023
behlendorf pushed a commit that referenced this pull request Jan 9, 2024
@shodanshok deleted the l2tune branch June 29, 2024 17:22
@shodanshok mentioned this pull request Nov 9, 2024
@adamdmoss (Contributor)

IMHO this is not only sensible but perhaps even still too conservative. But a definite start!

ptr1337 pushed a commit to CachyOS/zfs that referenced this pull request Nov 14, 2024
ptr1337 pushed a commit to CachyOS/zfs that referenced this pull request Nov 21, 2024