Improve log spacemap load time. #12789

amotin · 2021-11-24T03:22:56Z

The previous flushing algorithm limited only the total number of log
blocks, to the minimum of 256K and 4x the number of metaslabs in the
pool. As a result, a system with 1500 disks with 200 metaslabs each,
touching several new metaslabs each TXG, could grow the spacemap log to
a huge size without much benefit. We've observed one such system taking
about 45 minutes to import a pool after an unclean export.

This patch improves the situation in five ways:

- By limiting the maximum period before each metaslab is flushed to
  1000 TXGs, which effectively limits the maximum number of per-TXG
  spacemap logs to load after an unclean export to the same number.
- By making flushing smoother by accounting for the number of metaslabs
  that were touched after the last flush and actually need another
  flush, not just an ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs
  that were touched after the last flush, not to all metaslabs in the
  pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
  advance, making the log spacemap load process for a wide HDD pool
  CPU-bound and accelerating it many times over.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
  the inherently single-threaded log processing time from ~10 to
  ~5 minutes.
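As a rough illustration of the first and third points, the limits can be sketched as below. This is not the code from the patch: the helper names `log_block_limit` and `flush_due` are hypothetical, "256K" and "128K" are read as 256*1024 and 128*1024 blocks, and the "4x" factor is expressed as a percentage of 400.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: the block limit is a percentage of the metaslabs
 * being counted, clamped to a hard maximum. The old behaviour counts all
 * metaslabs in the pool; the new one counts only metaslabs touched since
 * their last flush.
 */
static uint64_t
log_block_limit(uint64_t metaslabs, uint64_t pct, uint64_t block_max)
{
	uint64_t limit = metaslabs * pct / 100;

	return (limit < block_max ? limit : block_max);
}

/*
 * Hypothetical helper: a flush is due when the log has outgrown its block
 * limit, or when the oldest unflushed metaslab is more than txg_max TXGs
 * behind the current TXG.
 */
static bool
flush_due(uint64_t log_blocks, uint64_t block_limit,
    uint64_t oldest_unflushed_txg, uint64_t cur_txg, uint64_t txg_max)
{
	assert(cur_txg >= oldest_unflushed_txg);
	return (log_blocks > block_limit ||
	    cur_txg - oldest_unflushed_txg > txg_max);
}
```

With 300,000 metaslabs in the pool the old rule always hits the hard maximum, while counting only, say, 1000 touched metaslabs gives a limit of 4000 blocks.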

As a further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring a new pool feature.

How Has This Been Tested?

Tested on FreeBSD with SSD and HDD pools with 5000 metaslabs per vdev (to simulate a bigger system), heavily thrashed by random rewrites for several hours to touch more metaslabs. The test included measurement of pool import time and, specifically, log spacemap load time.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin · 2021-12-03T02:17:55Z

Would anybody who implemented or reviewed the original log spacemap work like to take a look at these optimizations?

ahrens · 2021-12-03T02:54:46Z

@amotin Yes, but it will take some time for @sdimitro and I to get to it.

amotin · 2021-12-20T17:13:39Z

Happy holidays, everybody! Would anybody give me a present and at least look at the design this year?

amotin · 2022-01-28T14:50:19Z

Just another rebase.

ghost · 2022-03-14T20:02:45Z

The stable/13 build should be fixed by this: openzfs/zfs-buildbot#249

ahrens · 2022-03-17T05:17:07Z

> system with 1500 disks with 200 metaslabs each

That's a lot of disks! I assume each is their own vdev (so you have 1500*200=300,000 metaslabs total). When we designed this, we assumed that log spacemap loading time would be dominated by the i/o required to read the logs, and the i/o performance would scale (roughly) linearly with the number of disks. It sounds like that is not the case, at least with faster storage and the prefetching changes here. Instead the loading is dominated by the CPU time of processing each entry.

> By limiting maximum period for each metaslab to be flushed to 1000
> TXGs, that effectively limits maximum number of per-TXG spacemap logs
> to load after unclean export to the same number.

I think that the time to load the spacemap logs should mostly be proportional to the number of bytes/entries in the logs (especially since your new prefetching makes it CPU-bound). So I would think that the most effective way to limit the time to load the spacemaps would be to limit their total size (in bytes or entries), rather than the number of spacemaps. Since the loading CPU time is single threaded, a wide variety of systems would have roughly the same performance in terms of entries loaded per second. So we could limit the load time to a certain number of seconds by limiting the number of entries to some tunable value. I'd think that would be more effective than limiting the number of TXG's / logs.

I think that the zfs_unflushed_log_block_max essentially does this, assuming that all the blocks are full (zfs_log_sm_blksz = 128KB). Maybe instead of introducing zfs_unflushed_log_txg_max, we should lower zfs_unflushed_log_block_max? It's currently ~256,000, for a max of ~32GB total log size.
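The size bound mentioned here is simply the block limit multiplied by the block size. A minimal sketch, assuming every log block is full and reading "~256,000" and "128KB" as the powers of two 256*1024 and 128*1024 (the helper name `max_log_bytes` is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on total log size in bytes when every
 * one of block_max log blocks is completely full.
 */
static uint64_t
max_log_bytes(uint64_t block_max, uint64_t blksz)
{
	/* Guard against overflow of the 64-bit product. */
	assert(blksz != 0 && block_max <= UINT64_MAX / blksz);
	return (block_max * blksz);
}
```

Under those assumptions the current limit gives 32 GiB, and halving the block limit to 128*1024 gives 16 GiB.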

> By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
> advance, making log spacemap load process for wide HDD pool CPU-bound,
> accelerating it by many times.

This is great!

jasonbking · 2022-03-17T18:49:56Z

I don't have detailed data, but just to add some of our observations as well -- even pools with 80-100 disks can still take 20-30 minutes to import (with the import process waiting on the log spacemap processing to complete most of that time).

From what I can recall, what I saw during this was that a fraction of the disks in the pool would be busy at a time. I wasn't able to do a detailed analysis (live system and such), but it seemed like it could still be I/O bound, but just the single-threaded nature of the log processing meant that it was only ever using a fraction of the I/O available in the pool since it was only processing one log at a time (and AFAIK a single log isn't spread across all the disks in a pool) -- so the prefetching should be a big win since it should allow more parallel I/O, especially in larger pools.
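The effect described above can be pictured with a small simulation of a sliding prefetch window. It is only a sketch: the names `load_stats_t`, `start_prefetch`, `process_log` and `load_logs` are hypothetical, no real I/O is issued, and the actual patch may structure its loop differently. Reads are started for up to `window` logs beyond the one being replayed, so the disks stay busy while a single thread processes entries.

```c
#include <assert.h>
#include <stddef.h>

typedef struct load_stats {
	size_t ls_prefetched;	/* reads started so far */
	size_t ls_processed;	/* logs replayed so far */
	size_t ls_max_ahead;	/* peak number of logs read ahead */
} load_stats_t;

/* Stand-in for starting an asynchronous read of log idx. */
static void
start_prefetch(load_stats_t *ls, size_t idx)
{
	(void) idx;
	ls->ls_prefetched++;
}

/* Stand-in for the single-threaded replay of log idx. */
static void
process_log(load_stats_t *ls, size_t idx)
{
	/* The read for this log must already have been started. */
	assert(idx < ls->ls_prefetched);
	size_t ahead = ls->ls_prefetched - (idx + 1);
	if (ahead > ls->ls_max_ahead)
		ls->ls_max_ahead = ahead;
	ls->ls_processed++;
}

/* Replay nlogs logs in order, keeping up to window reads in advance. */
static load_stats_t
load_logs(size_t nlogs, size_t window)
{
	load_stats_t ls = { 0 };
	size_t next = 0;	/* first log whose read is not yet started */

	for (size_t i = 0; i < nlogs; i++) {
		/* Top up the window before replaying log i. */
		while (next < nlogs && next <= i + window)
			start_prefetch(&ls, next++);
		process_log(&ls, i);
	}
	return (ls);
}
```

With a window of 16, as in the PR description, the loop never runs more than 16 logs ahead of the one being processed, and a window of 0 degenerates to reading one log at a time.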

amotin · 2022-03-17T20:57:45Z

> > system with 1500 disks with 200 metaslabs each
>
> That's a lot of disks! I assume each is their own vdev (so you have 1500*200=300,000 metaslabs total). When we designed this, we assumed that log spacemap loading time would be dominated by the i/o required to read the logs, and the i/o performance would scale (roughly) linearly with the number of disks. It sounds like that is not the case, at least with faster storage and the prefetching changes here. Instead the loading is dominated by the CPU time of processing each entry.

1500 disks is indeed a somewhat extreme case, but it is real. People also report much longer pool import times on smaller pools once the log_spacemap feature is active.

I/O time was a huge problem until I implemented the prefetch part of the patch. It was the last part I implemented, after I found that just limiting the log length is good but not sufficient. It helped a lot, though certainly not completely by itself.

> > By limiting maximum period for each metaslab to be flushed to 1000
> > TXGs, that effectively limits maximum number of per-TXG spacemap logs
> > to load after unclean export to the same number.
>
> I think that the time to load the spacemap logs should mostly be proportional to the number of bytes/entries in the logs (especially since your new prefetching makes it CPU-bound). So I would think that the most effective way to limit the time to load the spacemaps would be to limit their total size (in bytes or entries), rather than the number of spacemaps. Since the loading CPU time is single threaded, a wide variety of systems would have roughly the same performance in terms of entries loaded per second. So we could limit the load time to a certain number of seconds by limiting the number of entries to some tunable value. I'd think that would be more effective than limiting the number of TXG's / logs.
>
> I think that the zfs_unflushed_log_block_max essentially does this, assuming that all the blocks are full (zfs_log_sm_blksz = 128KB). Maybe instead of introducing zfs_unflushed_log_txg_max, we should lower zfs_unflushed_log_block_max? It's currently ~256,000, for a max of ~32GB total log size.

I think you are right from the point of view of limiting import time, but I'm afraid that it may kill all the benefits of having a log if, on some very active fragmented pool, you have to flush the logs too soon or too often. You may end up paying double write cost without significant aggregation benefits. I think measuring the log length in TXGs, as I've done, should allow better write aggregation efficiency. But if you have good ideas for lower zfs_unflushed_log_block_max defaults, I am open to discussing that too.

ahrens · 2022-04-19T23:22:54Z

> I'm afraid that it may kill all benefits from having a log if on some very active fragmented pool you have to flush the logs too soon/often. You may end up paying double write cost but without significant aggregation benefits.

I agree that there's a trade-off between aggregation efficiency (larger log) and load time (smaller log).

If we think about the "per second" rates rather than "per txg", we can do the math like this: if we do 1 million frees/second, then if the log is max size 2 billion entries (32GB, the current default, with ~1 billion "non-obsolete" entries), we'll need to completely turn over the log once every 1000 seconds, so we will need to condense 1/1000th of all the metaslabs each second.
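The turnover arithmetic above can be written out as a small sketch. The 16-byte entry size is an assumption inferred from "32GB" corresponding to about 2 billion entries, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed entry size, inferred from 32GB holding ~2 billion entries. */
#define	LOG_ENTRY_BYTES	16

/* Number of entries that fit in a log of log_bytes. */
static uint64_t
log_entries(uint64_t log_bytes)
{
	return (log_bytes / LOG_ENTRY_BYTES);
}

/*
 * Seconds until the whole log has been replaced, if every free obsoletes
 * room for one live entry and frees arrive at frees_per_sec.
 */
static uint64_t
turnover_seconds(uint64_t live_entries, uint64_t frees_per_sec)
{
	assert(frees_per_sec != 0);
	return (live_entries / frees_per_sec);
}
```

With the round figure of 1 billion live entries and 1 million frees per second this gives 1000 seconds, so roughly 1/1000th of the metaslabs would need condensing each second. Using exactly half of a 32 GiB log instead gives about 1073 seconds.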

> I think measuring the log length in TXGs as I've done should allow better write aggregation efficiency.

I'm not sure why we would measure it in TXG's rather than size (bytes or blocks), since the latter determines load time. Are you thinking that when each TXG has a lot of frees, you would prefer to allow more aggregation at the cost of a longer load time?

> But if you have good ideas for some lower zfs_unflushed_log_block_max defaults, I am open to discuss it also.

It would be good to know how many log entries we can process per second, and what the target load time is. But here are some rough guesses:
If we can load at 10,000,000 entries/sec (this is just a guess), the current max load time is (32GiB/16B) / (10,000,000/sec) = 3.5 minutes. If we want to keep that under 1 minute, we could reduce the max log size to 8GB (zfs_unflushed_log_block_max = 64,000), for a load time of 53 seconds. In the earlier scenario of 1 million frees/sec, we'd now need to condense 1/250th of all the metaslabs each second. If we flush a txg once per 10 seconds, we'd be condensing 1/25th of all the metaslabs each txg, which is still a good improvement compared to no log spacemap.
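These guesses are easy to recompute for other log sizes or load rates. A minimal sketch, assuming 16-byte entries and the guessed rate of 10,000,000 entries per second from the comment above, with sizes taken as GiB; the helper names are hypothetical and results are truncated to whole seconds.

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries in a log of log_bytes, given a fixed entry size. */
static uint64_t
entries_in_log(uint64_t log_bytes, uint64_t entry_bytes)
{
	assert(entry_bytes != 0);
	return (log_bytes / entry_bytes);
}

/* Whole seconds needed to replay the log at a single-threaded rate. */
static uint64_t
load_seconds(uint64_t log_bytes, uint64_t entry_bytes,
    uint64_t entries_per_sec)
{
	assert(entries_per_sec != 0);
	return (entries_in_log(log_bytes, entry_bytes) / entries_per_sec);
}

/* Number of TXGs over which the whole log turns over once. */
static uint64_t
txgs_per_turnover(uint64_t turnover_sec, uint64_t txg_sec)
{
	assert(txg_sec != 0);
	return (turnover_sec / txg_sec);
}
```

A 32 GiB log comes out at about 214 seconds (roughly 3.5 minutes) and an 8 GiB log at 53 seconds. A 250-second turnover with one TXG every 10 seconds means about 1/25th of the metaslabs are condensed per TXG.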

It would be good to measure the actual number of entries that can be loaded each second. Since this is single threaded, and you've eliminated the disk bottleneck, this number should have low variability across different size servers.

Previous flushing algorithm limited only total number of log blocks to the minimum of 256K and 4x number of metaslabs in the pool. As result, system with 1500 disks with 1000 metaslabs each, touching several new metaslabs each TXG could grow spacemap log to huge size without much benefits. We've observed one of such systems importing pool for about 45 minutes.

This patch improves the situation from five sides:
- By limiting maximum period for each metaslab to be flushed to 1000 TXGs, that effectively limits maximum number of per-TXG spacemap logs to load to the same number.
- By making flushing more smooth via accounting number of metaslabs that were touched after the last flush and actually need another flush, not just ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs that were touched after the last flush, not all metaslabs in the pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in advance, making log spacemap load process for wide HDD pool CPU-bound, accelerating it by many times.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing single-threaded by nature log processing time from ~10 to ~5 minutes.

As further optimization we could skip bumping ms_unflushed_txg for metaslabs not touched since the last flush, but that would be an incompatible change, requiring new pool feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#12789

New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613, #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Previous flushing algorithm limited only total number of log blocks to the minimum of 256K and 4x number of metaslabs in the pool. As result, system with 1500 disks with 1000 metaslabs each, touching several new metaslabs each TXG could grow spacemap log to huge size without much benefits. We've observed one of such systems importing pool for about 45 minutes.

This patch improves the situation from five sides:
- By limiting maximum period for each metaslab to be flushed to 1000 TXGs, that effectively limits maximum number of per-TXG spacemap logs to load to the same number.
- By making flushing more smooth via accounting number of metaslabs that were touched after the last flush and actually need another flush, not just ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs that were touched after the last flush, not all metaslabs in the pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in advance, making log spacemap load process for wide HDD pool CPU-bound, accelerating it by many times.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing single-threaded by nature log processing time from ~10 to ~5 minutes.

As further optimization we could skip bumping ms_unflushed_txg for metaslabs not touched since the last flush, but that would be an incompatible change, requiring new pool feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#12789
(cherry picked from commit cbfe5cb849518dd8fb65bf94a72fd88a15093a67)

amotin requested review from sdimitro, ahrens, grwilson and behlendorf November 24, 2021 03:23

amotin added the "Status: Code Review Needed" label Nov 24, 2021

amotin force-pushed the logsm_limits branch 4 times, most recently from cce917e to f16e78c Compare November 24, 2021 21:32

ahrens assigned mmaybee Dec 3, 2021

amotin force-pushed the logsm_limits branch from f16e78c to 230b5a7 Compare December 20, 2021 17:13

nabijaczleweli reviewed Jan 7, 2022

amotin force-pushed the logsm_limits branch from 230b5a7 to 4959ed1 Compare January 7, 2022 03:41

amotin force-pushed the logsm_limits branch from 4959ed1 to 7312bfe Compare January 28, 2022 14:49

amotin force-pushed the logsm_limits branch from 7312bfe to c2b1bb1 Compare January 28, 2022 14:57

amotin force-pushed the logsm_limits branch from c2b1bb1 to cf859e6 Compare March 2, 2022 01:25

amotin force-pushed the logsm_limits branch from cf859e6 to 9165c43 Compare March 14, 2022 18:41

amotin closed this Mar 17, 2022

amotin reopened this Mar 17, 2022

amotin mentioned this pull request Apr 18, 2022

Slow import #11034

Closed

amotin force-pushed the logsm_limits branch from 9165c43 to 3a74e42 Compare April 18, 2022 22:59

behlendorf mentioned this pull request Nov 1, 2022

Pool Import/Export extremely slow #12693

Open
