Log Spacemap Project #8442

Merged (1 commit) on Jul 16, 2019

Conversation

@sdimitro (Contributor) commented Feb 21, 2019

Motivation

At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spent on I/Os that update on-disk space accounting metadata. Specifically, we've seen cases where 20% to 40% of sync time is spent after sync pass 1 and ~30% of the I/Os on the system are spent updating spacemaps.

The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K), we have 200 I/Os. Then, if we assume 2 levels of indirection, we need 400 additional I/Os, and since we are talking about metadata for which we keep 2 extra copies for redundancy, we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG.
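
For readers who want to sanity-check the arithmetic, here is a minimal sketch that simply multiplies out the assumptions stated above (200 metaslabs per VDEV, writes fitting in one spacemap block, 2 levels of indirection, 3 copies of metadata); the numbers are the example's assumptions, not values read from a real pool.

#include <stdio.h>

int
main(void)
{
    int metaslabs_per_vdev = 200;   /* one spacemap append per touched metaslab per TXG */
    int indirection_levels = 2;     /* extra indirect-block writes per append */
    int metadata_copies = 3;        /* metadata is kept with 2 extra copies for redundancy */

    int leaf_ios = metaslabs_per_vdev;                      /* 200 */
    int indirect_ios = leaf_ios * indirection_levels;       /* 400 */
    int total_ios = (leaf_ios + indirect_ios) * metadata_copies;  /* 1800 */

    printf("~%d spacemap I/Os per VDEV per TXG\n", total_ios);
    return (0);
}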

We could try to decrease the number of metaslabs so we have fewer I/Os per TXG, but then each metaslab would cover a wider range on disk and thus would take more time to be loaded into memory from disk. In addition, after it's loaded, its range tree would consume more memory.

Another idea would be to just increase the spacemap block size, which would allow us to fit more entries within an I/O block, resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is that this still doesn't deal with the number of I/Os going up as the number of metaslabs increases, nor with the fact that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we wouldn't be utilizing the bigger blocks.

About this patch

This patch introduces the Log Spacemap project, which provides a solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references section below and in the code (see the Big Theory Statement in spa_log_spacemap.c).
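
As a rough intuition for what "log metaslab changes on a single spacemap and flush them periodically" (the feature description shown by zpool) means, below is a purely illustrative, self-contained C sketch of the idea; the names, the flush bound, and the flushing policy here are hypothetical and are not the data structures or heuristics used in spa_log_spacemap.c.

/*
 * Illustration only: instead of appending every metaslab's changes to its
 * own space map each TXG, all changes for a TXG go into one pool-wide log
 * space map, and only a bounded number of metaslabs are flushed (their
 * accumulated changes rewritten into their own space maps) per TXG.
 */
#include <stdio.h>

#define NUM_METASLABS   200
#define FLUSHES_PER_TXG 10      /* hypothetical bound on flush work per TXG */

struct metaslab_sim {
    int unflushed_entries;      /* changes that exist only in the shared log */
};

static struct metaslab_sim ms[NUM_METASLABS];

/* Record an alloc/free for metaslab m: one append to the shared log object. */
static void
log_change(int m)
{
    ms[m].unflushed_entries++;
}

/* Each TXG, flush the few metaslabs with the most unflushed changes. */
static void
flush_some_metaslabs(void)
{
    for (int i = 0; i < FLUSHES_PER_TXG; i++) {
        int best = 0;
        for (int m = 1; m < NUM_METASLABS; m++)
            if (ms[m].unflushed_entries > ms[best].unflushed_entries)
                best = m;
        /* rewrite ms[best]'s own space map; its log entries become obsolete */
        ms[best].unflushed_entries = 0;
    }
}

int
main(void)
{
    for (int txg = 0; txg < 100; txg++) {
        for (int m = 0; m < NUM_METASLABS; m++)
            log_change(m);              /* a little to every metaslab */
        for (int i = 0; i < 50; i++)
            log_change(txg % 5);        /* a lot to a few metaslabs */
        flush_some_metaslabs();
    }

    int backlog = 0;
    for (int m = 0; m < NUM_METASLABS; m++)
        backlog += ms[m].unflushed_entries;
    printf("unflushed log entries after 100 TXGs: %d\n", backlog);
    return (0);
}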

Even though the change is fairly well contained within the metaslab and lower-level SPA codepaths, there is one user-facing side-change: VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this: when a log device is removed and its VDEV structure is replaced with a hole (or compacted, if it is at the end of the vdev array), its vdev_id could be reused by devices added after that. Now that the pool-wide space maps record the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things, the ID reuse behavior is gone, and vdev IDs for top-level vdevs are now truly unique within a pool.
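
To illustrate why the ID reuse would be a problem now, consider a hypothetical pool-wide log entry (the field names below are illustrative, not the actual on-disk encoding): each entry identifies a segment only by vdev ID and offset, so that pair must stay unambiguous for as long as the entry can appear in the log.

#include <stdint.h>

struct pool_log_entry_sim {
    uint64_t vdev_id;   /* which top-level vdev the segment belongs to */
    uint64_t offset;    /* offset of the segment within that vdev */
    uint64_t size;      /* length of the segment */
    int      is_alloc;  /* 1 = allocation, 0 = free */
};

/*
 * If the vdev_id of a removed log device were later handed out to a newly
 * added top-level vdev, an old log entry carrying that vdev_id could refer
 * to either device. Keeping top-level vdev IDs unique for the lifetime of
 * the pool removes that ambiguity.
 */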

Testing

The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems.

Per Matt's request, I also made sure that opening the pool read-only from older versions of ZFS is possible:
I created 2 VMs -
[A] latest bits but without the Log Spacemap feature,
[B] latest bits patched with the Log Spacemap feature.

$ # Create a file-based pool in [B] that has the feature activated
$ truncate -s 512m dsk
$ sudo zpool create testpool $(pwd)/dsk
$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool     69.5G  14.1G  55.4G        -         -      -    20%  1.00x    ONLINE  -
testpool   480M   129K   480M        -         -     0%     0%  1.00x    ONLINE  -

$ sudo zpool get feature@log_spacemap testpool
NAME      PROPERTY              VALUE                 SOURCE
testpool  feature@log_spacemap  active                local

$ # export the pool and ensure that there are log spacemaps (debug bits randomly keep logs after export)
$ sudo zpool export testpool
$ sudo zdb -m -e -p . testpool
....
Log Space Maps in Pool:
Log Spacemap object 65 txg 4
space map object 65:
  smp_length = 0x1e0
  smp_alloc = 0x16200
Log Spacemap object 76 txg 5
space map object 76:
  smp_length = 0x3a0
  smp_alloc = 0x2400
Log Spacemap object 77 txg 6
space map object 77:
  smp_length = 0x268
  smp_alloc = 0x1e00
Log Spacemap object 78 txg 17
space map object 78:
  smp_length = 0x208
  smp_alloc = 0x1800
Log Spacemap object 79 txg 26
space map object 79:
  smp_length = 0x2e0
  smp_alloc = 0x1e00

Log Space Map Obsolete Entry Statistics:
10       valid entries out of 20       - txg 4
30       valid entries out of 43       - txg 5
24       valid entries out of 28       - txg 6
20       valid entries out of 22       - txg 17
33       valid entries out of 33       - txg 26
117      valid entries out of 146      - total
$ # Copy the pool over to [A] and try to import it
$ scp <host B>:~/dsk .
dsk                                                                                                   100%  512MB 144.2MB/s   00:03
$ sudo zpool import -d . testpool
This pool uses the following feature(s) not supported by this system:
	com.delphix:log_spacemap (Log metaslab changes on a single spacemap and flush them periodically.)
All unsupported features are only required for writing to the pool.
The pool can be imported using '-o readonly=on'.
cannot import 'testpool': unsupported version or feature

$ # Import the pool read-only:
$ sudo zpool import -o readonly=on -d . testpool
$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool     69.5G  13.8G  55.7G        -         -      -    19%  1.00x    ONLINE  -
testpool   480M    36K   480M        -         -     0%     0%  1.00x    ONLINE  -
$ sudo zpool get all testpool
....
testpool  unsupported@com.delphix:log_spacemap  readonly

Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in the links in the references section. Redoing the same experiments on Linux gave similar results. Below are the specifics of the Linux run.

After the pool reached a stable state, the percentage of time spent in sync pass 1 per TXG was 64% on average for the stock bits, while the log spacemap bits stayed at 95% during the experiment.
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png)

Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits.
(graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png)

As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits.

Another interesting aspect in terms of TXG syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach sync pass 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 in 1%. This emphasizes that not only do we spend less time on metadata, but we also iterate fewer times to convergence in spa_sync() dirtying objects.
(graphs: stock - sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png, lsm - sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png)

Finally, the improvement in IOPS that userland gains from the change is approximately 40%. There is a consistent win in IOPS, as you can see from the graphs below, but the absolute amount of improvement that the log spacemap gives varies within each minute interval.
(graphs: sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png, sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png)

Porting to Other Platforms

For people who want to port this commit to other platforms, below is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent - db58794
Update vdev_is_spacemap_addressable() for new spacemap encoding - 419ba59
Simplify spa_sync by breaking it up to smaller functions - 8dc2197
Factor metaslab_load_wait() in metaslab_load() - b194fab
Rename range_tree_verify to range_tree_verify_not_present - df72b8b
Change target size of metaslabs from 256GB to 16GB - c853f38
zdb -L should skip leak detection altogether - 21e7cf5
vs_alloc can underflow in L2ARC vdevs - 7558997
Simplify log vdev removal code - 6c926f4
Get rid of space_map_update() for ms_synced_length - 425d323
Introduce auxiliary metaslab histograms - 928e8ad
Error path in metaslab_load_impl() forgets to drop ms_sync_lock - 8eef997

References

Background, Motivation, and Internals of the Feature

  • OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ
  • Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)

  • Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/
  • OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw
  • Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320, DLPX-63385

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

@sdimitro added the 'Status: Work in Progress' and 'Missing Template' labels Feb 21, 2019
@sdimitro force-pushed the personal-lsm-1 branch 2 times, most recently from 76ce679 to b14bc54 on February 21, 2019 17:44
@openzfs deleted a comment from Skaronator Feb 22, 2019
@sdimitro force-pushed the personal-lsm-1 branch 5 times, most recently from 7cd2a5f to b2c78ee on February 28, 2019 22:27
@sdimitro changed the title from 'LSM - WIP' to 'Log Spacemap Project' Feb 28, 2019
@sdimitro added the 'Status: Code Review Needed' label and removed the 'Status: Work in Progress' label Mar 18, 2019
@postwait

Hi there, I spent the last few days reviewing this code. It's of great interest to me at work as we land in this pathological case often. I don't know how this community works, would my PR approval be helpful at all? Code's quite clean.

@sdimitro (Contributor, Author)

Thanks for reading through the review and I'm glad to hear that it will help you!

I think we can count you as a reviewer - more reviewers are always welcome. Keep in mind though that since this is a large change and in the metaslab code, I'd personally feel most comfortable merging this once I get feedback from folks whose code I had to change or those who have recently made changes to related codepaths.

In the meantime, feel free to let me know if you have any specific suggestions/feedback from your code review.

Inline review threads (resolved): man/man5/zfs-module-parameters.5, module/zcommon/zfeature_common.c
@ahrens (Member) commented Apr 2, 2019

Regarding the zimport test failures, I think we need to figure out how to change the test such that these pass, and hopefully still get the intended test coverage. For example, we could create the pool without the log spacemap feature enabled. It seems like we would have the same problem for any new features (which are activated immediately), so it would be ideal to add a mechanism in the zimport test to handle this.

@sdimitro (Contributor, Author) commented Apr 3, 2019

Regarding the zimport test failures, I think we need to figure out how to change the test such that these pass, and hopefully still get the intended test coverage. For example, we could create the pool without the log spacemap feature enabled. It seems like we would have the same problem for any new features (which are activated immediately), so it would be ideal to add a mechanism in the zimport test to handle this.

I agree, although I'm not that familiar with the infrastructure changes that would be needed for this. @behlendorf thoughts?

@edillmann (Contributor)

Hi,

Would it be possible to rebase this beautiful piece of code onto master?

Thanks,

@sdimitro force-pushed the personal-lsm-1 branch 2 times, most recently from 95175e3 to 245e482 on April 9, 2019 16:21
@sdimitro (Contributor, Author) commented Apr 9, 2019

@edillmann rebased

@behlendorf (Contributor)

For example, we could create the pool without the log spacemap feature enabled. It seems like we would have the same problem for any new features (which are activated immediately), so it would be ideal to add a mechanism in the zimport test to handle this.

For exactly this reason, this behavior is supported by the existing infrastructure. Unfortunately, it's really not documented anywhere other than in the zimport commit itself, 133a5c6.

    zimport.sh: Allow custom pool create options
    
    Allow custom options to be passed to 'zpool create' when creating
    a new pool.
    
    Normally zimport.sh is intended to prevent accidentally introduced
    incompatibilities so we want the default behavior.  However, when
    introducing a known incompatibility with a feature flag we need a
    way to disable the feature.  By adding a line like the following
    to the commit message the feature can be disabled allowing the
    pool to be compatible with older versions.
    
    TEST_ZIMPORT_CREATE_OPTIONS="-o feature@encryption=disabled"

@sdimitro you're going to want to add the following line to the topmost commit in this PR.

TEST_ZIMPORT_CREATE_OPTIONS="-o feature@log_spacemap=disabled"

Assuming that works as intended, I think we should update the https://github.com/zfsonlinux/zfs/wiki/Buildbot-Options documentation on the wiki to more prominently describe how to use this functionality.

@behlendorf (Contributor)

@sdimitro can you please rebase this on master now that redacted send/recv has been merged.

@sdimitro (Contributor, Author) commented Jul 1, 2019

Just rebased on the latest master. @behlendorf can you take a look when you get the chance, now that redacted send/receive is in?

Inline review threads (resolved): cmd/zdb/zdb.c, man/man5/zfs-module-parameters.5, module/zfs/metaslab.c, module/zfs/spa.c
@behlendorf added the 'Status: Accepted' label and removed the 'Status: Code Review Needed' label Jul 15, 2019
@codecov bot commented Jul 16, 2019

Codecov Report

Merging #8442 into master will increase coverage by 0.07%.
The diff coverage is 96.87%.


@@            Coverage Diff             @@
##           master    #8442      +/-   ##
==========================================
+ Coverage   78.63%    78.7%   +0.07%     
==========================================
  Files         401      402       +1     
  Lines      120157   120977     +820     
==========================================
+ Hits        94481    95211     +730     
- Misses      25676    25766      +90
Flag Coverage Δ
#kernel 79.53% <96.41%> (+0.15%) ⬆️
#user 66.42% <91.48%> (-0.11%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ff9630d...2ca5ef8.

@behlendorf merged commit 93e28d6 into openzfs:master Jul 16, 2019
@satmandu (Contributor)

Will this make it into a 0.8.x point release?

@behlendorf (Contributor)

@satmandu this new feature will be part of the next major release. There's no plan to include it in a 0.8.x point release.

@postwait

Ouch.. so another year or so before we can escape from wicked fragmentation issues? That's a big bummer.

sdimitro added a commit to sdimitro/zfs that referenced this pull request Jul 16, 2019
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes openzfs#8442
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
uqs pushed a commit to freebsd/freebsd-ports that referenced this pull request Jul 20, 2019
- Unbreak on CURRENT

Includes the new Log Spacemap functionality:
openzfs/zfs#8442

PR: 239342
Sponsored by: iXsystems


git-svn-id: svn+ssh://svn.freebsd.org/ports/head@507006 35697150-7ecd-e111-bb59-0022644237b5
Jehops pushed a commit to Jehops/freebsd-ports-legacy that referenced this pull request Jul 21, 2019
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request Apr 28, 2020

Signed-off-by: Bryant G. Ly <bly@catalogicsoftware.com>

Conflicts:
	include/zfeature_common.h
	man/man5/zfs-module-parameters.5
	module/zfs/dsl_pool.c
	module/zfs/spa.c
	tests/zfs-tests/tests/functional/cli_root/zpool_get/zpool_get.cfg