Log Spacemap Project #8442

Merged (1 commit) on Jul 16, 2019

Conversation

@sdimitro (Contributor) commented Feb 21, 2019

Motivation

At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spent on I/Os that update on-disk space accounting metadata. Specifically, we've seen cases where 20% to 40% of sync time is spent after sync pass 1 and ~30% of the I/Os on the system are spent updating spacemaps.

The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K), we have 200 I/Os. Then, if we assume 2 levels of indirection, we need 400 additional I/Os, and since we are talking about metadata for which we keep 2 extra copies for redundancy, we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG.
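
For readers who want to sanity-check the arithmetic, here is a minimal sketch that simply multiplies out the assumptions stated above (200 metaslabs per VDEV, writes fitting in one spacemap block, 2 levels of indirection, 3 copies of metadata); the numbers are the example's assumptions, not values read from a real pool.

#include <stdio.h>

int
main(void)
{
    int metaslabs_per_vdev = 200;   /* one spacemap append per touched metaslab per TXG */
    int indirection_levels = 2;     /* extra indirect-block writes per append */
    int metadata_copies = 3;        /* metadata is kept with 2 extra copies for redundancy */

    int leaf_ios = metaslabs_per_vdev;                      /* 200 */
    int indirect_ios = leaf_ios * indirection_levels;       /* 400 */
    int total_ios = (leaf_ios + indirect_ios) * metadata_copies;  /* 1800 */

    printf("~%d spacemap I/Os per VDEV per TXG\n", total_ios);
    return (0);
}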

We could try to decrease the number of metaslabs so we have fewer I/Os per TXG, but then each metaslab would cover a wider range on disk and thus would take more time to be loaded into memory from disk. In addition, after it's loaded, its range tree would consume more memory.

Another idea would be to just increase the spacemap block size, which would allow us to fit more entries within an I/O block, resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is that this still doesn't deal with the number of I/Os going up as the number of metaslabs increases, nor with the fact that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we wouldn't be utilizing the bigger blocks.

About this patch

This patch introduces the Log Spacemap project, which provides a solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references section below and in the code (see the Big Theory Statement in spa_log_spacemap.c).
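
As a rough intuition for what "log metaslab changes on a single spacemap and flush them periodically" (the feature description shown by zpool) means, below is a purely illustrative, self-contained C sketch of the idea; the names, the flush bound, and the flushing policy here are hypothetical and are not the data structures or heuristics used in spa_log_spacemap.c.

/*
 * Illustration only: instead of appending every metaslab's changes to its
 * own space map each TXG, all changes for a TXG go into one pool-wide log
 * space map, and only a bounded number of metaslabs are flushed (their
 * accumulated changes rewritten into their own space maps) per TXG.
 */
#include <stdio.h>

#define NUM_METASLABS   200
#define FLUSHES_PER_TXG 10      /* hypothetical bound on flush work per TXG */

struct metaslab_sim {
    int unflushed_entries;      /* changes that exist only in the shared log */
};

static struct metaslab_sim ms[NUM_METASLABS];

/* Record an alloc/free for metaslab m: one append to the shared log object. */
static void
log_change(int m)
{
    ms[m].unflushed_entries++;
}

/* Each TXG, flush the few metaslabs with the most unflushed changes. */
static void
flush_some_metaslabs(void)
{
    for (int i = 0; i < FLUSHES_PER_TXG; i++) {
        int best = 0;
        for (int m = 1; m < NUM_METASLABS; m++)
            if (ms[m].unflushed_entries > ms[best].unflushed_entries)
                best = m;
        /* rewrite ms[best]'s own space map; its log entries become obsolete */
        ms[best].unflushed_entries = 0;
    }
}

int
main(void)
{
    for (int txg = 0; txg < 100; txg++) {
        for (int m = 0; m < NUM_METASLABS; m++)
            log_change(m);              /* a little to every metaslab */
        for (int i = 0; i < 50; i++)
            log_change(txg % 5);        /* a lot to a few metaslabs */
        flush_some_metaslabs();
    }

    int backlog = 0;
    for (int m = 0; m < NUM_METASLABS; m++)
        backlog += ms[m].unflushed_entries;
    printf("unflushed log entries after 100 TXGs: %d\n", backlog);
    return (0);
}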

Even though the change is fairly well contained within the metaslab and lower-level SPA codepaths, there is one user-facing side-change: VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this: when a log device is removed and its VDEV structure is replaced with a hole (or compacted, if it is at the end of the vdev array), its vdev_id could be reused by devices added after that. Now that the pool-wide space maps record the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things, the ID reuse behavior is gone, and vdev IDs for top-level vdevs are now truly unique within a pool.
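
To illustrate why the ID reuse would be a problem now, consider a hypothetical pool-wide log entry (the field names below are illustrative, not the actual on-disk encoding): each entry identifies a segment only by vdev ID and offset, so that pair must stay unambiguous for as long as the entry can appear in the log.

#include <stdint.h>

struct pool_log_entry_sim {
    uint64_t vdev_id;   /* which top-level vdev the segment belongs to */
    uint64_t offset;    /* offset of the segment within that vdev */
    uint64_t size;      /* length of the segment */
    int      is_alloc;  /* 1 = allocation, 0 = free */
};

/*
 * If the vdev_id of a removed log device were later handed out to a newly
 * added top-level vdev, an old log entry carrying that vdev_id could refer
 * to either device. Keeping top-level vdev IDs unique for the lifetime of
 * the pool removes that ambiguity.
 */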

Testing

The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems.

Per Matt's request, I also made sure that opening the pool read-only from older versions of ZFS is possible:
I created 2 VMs -
[A] latest bits but without the Log Spacemap feature,
[B] latest bits patched with the Log Spacemap feature.

$ # Create a file-based pool in [B] that has the feature activated
$ truncate -s 512m dsk
$ sudo zpool create testpool $(pwd)/dsk
$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool     69.5G  14.1G  55.4G        -         -      -    20%  1.00x    ONLINE  -
testpool   480M   129K   480M        -         -     0%     0%  1.00x    ONLINE  -

$ sudo zpool get feature@log_spacemap testpool
NAME      PROPERTY              VALUE                 SOURCE
testpool  feature@log_spacemap  active                local

$ # export the pool and ensure that there are log spacemaps (debug bits randomly keep logs after export)
$ sudo zpool export testpool
$ sudo zdb -m -e -p . testpool
....
Log Space Maps in Pool:
Log Spacemap object 65 txg 4
space map object 65:
  smp_length = 0x1e0
  smp_alloc = 0x16200
Log Spacemap object 76 txg 5
space map object 76:
  smp_length = 0x3a0
  smp_alloc = 0x2400
Log Spacemap object 77 txg 6
space map object 77:
  smp_length = 0x268
  smp_alloc = 0x1e00
Log Spacemap object 78 txg 17
space map object 78:
  smp_length = 0x208
  smp_alloc = 0x1800
Log Spacemap object 79 txg 26
space map object 79:
  smp_length = 0x2e0
  smp_alloc = 0x1e00

Log Space Map Obsolete Entry Statistics:
10       valid entries out of 20       - txg 4
30       valid entries out of 43       - txg 5
24       valid entries out of 28       - txg 6
20       valid entries out of 22       - txg 17
33       valid entries out of 33       - txg 26
117      valid entries out of 146      - total
$ # Copy the pool over to [A] and try to import it
$ scp <host B>:~/dsk .
dsk                                                                                                   100%  512MB 144.2MB/s   00:03
$ sudo zpool import -d . testpool
This pool uses the following feature(s) not supported by this system:
	com.delphix:log_spacemap (Log metaslab changes on a single spacemap and flush them periodically.)
All unsupported features are only required for writing to the pool.
The pool can be imported using '-o readonly=on'.
cannot import 'testpool': unsupported version or feature

$ # Import the pool read-only:
$ sudo zpool import -o readonly=on -d . testpool
$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool     69.5G  13.8G  55.7G        -         -      -    19%  1.00x    ONLINE  -
testpool   480M    36K   480M        -         -     0%     0%  1.00x    ONLINE  -
$ sudo zpool get all testpool
....
testpool  unsupported@com.delphix:log_spacemap  readonly

Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in the links in the references section. Redoing the same experiments on Linux gave similar results. Below are the specifics of the Linux run.

After the pool reached a stable state, the percentage of time spent in sync pass 1 per TXG was 64% on average for the stock bits, while the log spacemap bits stayed at 95% during the experiment.
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png)

Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits.
(graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png)

As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits.

Another interesting aspect in terms of TXG syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach sync pass 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 in 1%. This emphasizes that not only do we spend less time on metadata, but we also iterate fewer times to convergence in spa_sync() dirtying objects.
(graphs: stock - sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png, lsm - sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png)

Finally, the improvement in IOPS that userland gains from the change is approximately 40%. There is a consistent win in IOPS, as you can see from the graphs below, but the absolute amount of improvement that the log spacemap gives varies within each minute interval.
(graphs: sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png, sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png)

Porting to Other Platforms

For people who want to port this commit to other platforms, below is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent - db58794
Update vdev_is_spacemap_addressable() for new spacemap encoding - 419ba59
Simplify spa_sync by breaking it up to smaller functions - 8dc2197
Factor metaslab_load_wait() in metaslab_load() - b194fab
Rename range_tree_verify to range_tree_verify_not_present - df72b8b
Change target size of metaslabs from 256GB to 16GB - c853f38
zdb -L should skip leak detection altogether - 21e7cf5
vs_alloc can underflow in L2ARC vdevs - 7558997
Simplify log vdev removal code - 6c926f4
Get rid of space_map_update() for ms_synced_length - 425d323
Introduce auxiliary metaslab histograms - 928e8ad
Error path in metaslab_load_impl() forgets to drop ms_sync_lock - 8eef997

References

Background, Motivation, and Internals of the Feature

  • OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ
  • Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)

  • Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/
  • OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw
  • Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320, DLPX-63385

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

@sdimitro added the 'Status: Work in Progress' and 'Missing Template' labels Feb 21, 2019
@sdimitro force-pushed the personal-lsm-1 branch 2 times, most recently from 76ce679 to b14bc54 on February 21, 2019 17:44
@openzfs deleted a comment from Skaronator Feb 22, 2019
@sdimitro force-pushed the personal-lsm-1 branch 5 times, most recently from 7cd2a5f to b2c78ee on February 28, 2019 22:27
@sdimitro changed the title from 'LSM - WIP' to 'Log Spacemap Project' Feb 28, 2019
@sdimitro added the 'Status: Code Review Needed' label and removed the 'Status: Work in Progress' label Mar 18, 2019
@postwait

Hi there, I spent the last few days reviewing this code. It's of great interest to me at work as we land in this pathological case often. I don't know how this community works, would my PR approval be helpful at all? Code's quite clean.

@sdimitro (Contributor, Author)

Thanks for reading through the review and I'm glad to hear that it will help you!

I think we can count you as a reviewer - more reviewers are always welcome. Keep in mind though that since this is a large change and in the metaslab code, I'd personally feel most comfortable merging this once I get feedback from folks whose code I had to change or those who have recently made changes to related codepaths.

In the meantime, feel free to let me know if you have any specific suggestions/feedback from your code review.

Inline review threads (resolved): man/man5/zfs-module-parameters.5, module/zcommon/zfeature_common.c
@ahrens (Member) commented Apr 2, 2019

Regarding the zimport test failures, I think we need to figure out how to change the test such that these pass, and hopefully still get the intended test coverage. For example, we could create the pool without the log spacemap feature enabled. It seems like we would have the same problem for any new features (which are activated immediately), so it would be ideal to add a mechanism in the zimport test to handle this.

@sdimitro (Contributor, Author) commented Apr 3, 2019

Regarding the zimport test failures, I think we need to figure out how to change the test such that these pass, and hopefully still get the intended test coverage. For example, we could create the pool without the log spacemap feature enabled. It seems like we would have the same problem for any new features (which are activated immediately), so it would be ideal to add a mechanism in the zimport test to handle this.

I agree, although I'm not that familiar with the infrastructure changes that would be needed for this. @behlendorf thoughts?

@edillmann (Contributor)

Hi,

Would it be possible to rebase this beautiful piece of code onto master?

Thanks,

@sdimitro force-pushed the personal-lsm-1 branch 2 times, most recently from 95175e3 to 245e482 on April 9, 2019 16:21
@sdimitro (Contributor, Author) commented Apr 9, 2019

@edillmann rebased

@behlendorf (Contributor)

For example, we could create the pool without the log spacemap feature enabled. It seems like we would have the same problem for any new features (which are activated immediately), so it would be ideal to add a mechanism in the zimport test to handle this.

For exactly this reason, this behavior is supported by the existing infrastructure. Unfortunately, it's really not documented anywhere other than in the zimport commit itself, 133a5c6.

    zimport.sh: Allow custom pool create options
    
    Allow custom options to be passed to 'zpool create' when creating
    a new pool.
    
    Normally zimport.sh is intended to prevent accidentally introduced
    incompatibilities so we want the default behavior.  However, when
    introducing a known incompatibility with a feature flag we need a
    way to disable the feature.  By adding a line like the following
    to the commit message the feature can be disabled allowing the
    pool to be compatible with older versions.
    
    TEST_ZIMPORT_CREATE_OPTIONS="-o feature@encryption=disabled"

@sdimitro you're going to want to add the following line to the topmost commit in this PR.

TEST_ZIMPORT_CREATE_OPTIONS="-o feature@log_spacemap=disabled"

Assuming that works as intended, I think we should update the https://github.com/zfsonlinux/zfs/wiki/Buildbot-Options documentation on the wiki to more prominently describe how to use this functionality.

@behlendorf (Contributor)

@sdimitro can you please rebase this on master now that redacted send/recv has been merged.

@sdimitro (Contributor, Author) commented Jul 1, 2019

Just rebased on the latest master. @behlendorf can you take a look when you get the chance, now that redacted send/receive is in?

Inline review threads (resolved): cmd/zdb/zdb.c, man/man5/zfs-module-parameters.5, module/zfs/metaslab.c, module/zfs/spa.c
@behlendorf added the 'Status: Accepted' label and removed the 'Status: Code Review Needed' label Jul 15, 2019
@codecov bot commented Jul 16, 2019

Codecov Report

Merging #8442 into master will increase coverage by 0.07%.
The diff coverage is 96.87%.


@@            Coverage Diff             @@
##           master    #8442      +/-   ##
==========================================
+ Coverage   78.63%    78.7%   +0.07%     
==========================================
  Files         401      402       +1     
  Lines      120157   120977     +820     
==========================================
+ Hits        94481    95211     +730     
- Misses      25676    25766      +90
Flag Coverage Δ
#kernel 79.53% <96.41%> (+0.15%) ⬆️
#user 66.42% <91.48%> (-0.11%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ff9630d...2ca5ef8.

@behlendorf merged commit 93e28d6 into openzfs:master Jul 16, 2019
@satmandu (Contributor)

Will this make it into a 0.8.x point release?

@behlendorf (Contributor)

@satmandu this new feature will be part of the next major release. There's no plan to include it in a 0.8.x point release.

@postwait

Ouch.. so another year or so before we can escape from wicked fragmentation issues? That's a big bummer.

sdimitro added a commit to sdimitro/zfs that referenced this pull request Jul 16, 2019
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes openzfs#8442
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
uqs pushed a commit to freebsd/freebsd-ports that referenced this pull request Jul 20, 2019
- Unbreak on CURRENT

Includes the new Log Spacemap functionality:
openzfs/zfs#8442

PR: 239342
Sponsored by: iXsystems


git-svn-id: svn+ssh://svn.freebsd.org/ports/head@507006 35697150-7ecd-e111-bb59-0022644237b5
Jehops pushed a commit to Jehops/freebsd-ports-legacy that referenced this pull request Jul 21, 2019
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request Apr 28, 2020

Signed-off-by: Bryant G. Ly <bly@catalogicsoftware.com>

Conflicts:
	include/zfeature_common.h
	man/man5/zfs-module-parameters.5
	module/zfs/dsl_pool.c
	module/zfs/spa.c
	tests/zfs-tests/tests/functional/cli_root/zpool_get/zpool_get.cfg