Set aside a metaslab for ZIL blocks #11389

Merged: 1 commit into openzfs:master on Jan 21, 2021

Conversation

@ahrens ahrens (Member) commented Dec 23, 2020

Motivation and Context

Mixing ZIL and normal allocations has several problems:

  1. ZIL blocks are allocated, written to disk, and then freed a few
    seconds later. This leaves behind holes (free segments) where the
    ZIL blocks used to be, which increases fragmentation, which in turn
    hurts performance.

  2. Under moderate load, ZIL allocations are 128KB. If the pool
    is fairly fragmented, there may not be many free chunks of that size.
    This causes ZFS to load more metaslabs to locate free segments of 128KB
    or more. The loading happens synchronously (from zil_commit()), and can
    take around a second even if the metaslab's spacemap is cached in the
    ARC. All concurrent synchronous operations on this filesystem must wait
    while the metaslab is loading. This can cause a significant performance
    impact.

  3. If the pool is very fragmented, there may be zero free chunks of
    128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
    which has an enormous performance impact.

These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.

Description

This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
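To illustrate the mechanism, here is a standalone toy sketch, not the actual patch: the types and names below are simplified stand-ins for vdev_t and vdev_log_mg, and the choice of which metaslab to set aside is shown arbitrarily.

/*
 * Toy model: each top-level vdev tags exactly one of its metaslabs as the
 * embedded-log metaslab, so the normal allocator skips it and ZIL
 * allocations prefer it.  Which metaslab is chosen is an implementation
 * detail; this sketch simply takes the last one.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
	bool	ms_embedded_log;	/* reserved for ZIL allocations */
} toy_metaslab_t;

typedef struct {
	toy_metaslab_t	*vd_ms;		/* this vdev's metaslabs */
	size_t		vd_ms_count;
} toy_vdev_t;

static void
set_aside_embedded_log_metaslab(toy_vdev_t *vd)
{
	for (size_t m = 0; m < vd->vd_ms_count; m++)
		vd->vd_ms[m].ms_embedded_log = false;

	/* A vdev with very few metaslabs may not set one aside at all. */
	if (vd->vd_ms_count < 2)
		return;

	vd->vd_ms[vd->vd_ms_count - 1].ms_embedded_log = true;
}

The review comments below touch on the small-vdev case, where no metaslab is set aside.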

Log (ZIL) blocks can be allocated from the following locations. Each one is
tried in order until the allocation succeeds (see the sketch after this list):

  1. dedicated log vdevs, aka "slog" (spa_log_class)
  2. embedded slog metaslabs (spa_embedded_log_class)
  3. other metaslabs in normal vdevs (spa_normal_class)
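A standalone sketch of this fallback order (the enum and functions below are simplified stand-ins, not the OpenZFS allocation API; on total failure the ZIL falls back to txg_wait_synced()):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum {
	LOG_CLASS,		/* 1. dedicated log vdevs ("slog") */
	EMBEDDED_LOG_CLASS,	/* 2. set-aside metaslabs on normal vdevs */
	NORMAL_CLASS		/* 3. all other metaslabs */
} alloc_class_t;

/* Stand-in for the real allocator; pretend only the normal class has space. */
static bool
try_alloc(alloc_class_t mc, size_t size)
{
	(void) size;
	return (mc == NORMAL_CLASS);
}

/* Try each class in order; give up only if every class fails. */
static bool
zil_block_alloc(size_t size)
{
	const alloc_class_t order[] =
	    { LOG_CLASS, EMBEDDED_LOG_CLASS, NORMAL_CLASS };

	for (size_t i = 0; i < sizeof (order) / sizeof (order[0]); i++) {
		if (try_alloc(order[i], size))
			return (true);
	}
	return (false);
}

int
main(void)
{
	printf("allocated: %s\n", zil_block_alloc(128 * 1024) ? "yes" : "no");
	return (0);
}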

The space required for the embedded slog metaslabs is usually between 0.5% and
1.0% of the pool, and comes out of the existing 3.2% of "slop" space that is
not available for user data.
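For intuition, a rough back-of-envelope (assuming the default target of roughly 200 metaslabs per top-level vdev): setting aside one metaslab per vdev costs about 1/200 ≈ 0.5% of the pool, and vdevs with fewer, proportionally larger metaslabs land closer to the 1.0% end.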

How Has This Been Tested?

On an all-SSD system with 4TB of storage, 87% fragmentation, 60% capacity, and
recordsize=8k, testing shows a ~50% performance increase on random 8k sync
writes. On even more fragmented systems (which hit problem #3 above and call
txg_wait_synced()), the performance improvement can be arbitrarily large
(>100x).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@ahrens ahrens added the Type: Performance and Status: Code Review Needed labels Dec 23, 2020
@ahrens ahrens mentioned this pull request Jan 5, 2021
@@ -6344,6 +6350,17 @@ dump_block_stats(spa_t *spa)
100.0 * alloc / space);
}

if (spa_embedded_log_class(spa)->mc_allocator[0].mca_rotor != NULL) {
A contributor commented:
Is the selection of the embedded_log_class dynamic (i.e., redone at each spa_import)? I'm trying to understand how zdb's instance of the pool knows which metaslab to use for the embedded log, and whether it can differ from what the runtime pool is using.

If zdb can know the actual metaslab, then a follow-on zdb change could augment the zdb -m output by tagging the embedded log metaslab.

@ahrens (Member, Author) replied:
Yes, the metaslab is selected when opening the pool. So zdb wouldn't necessarily select the same metaslab as was used recently (or is currently in use, if the pool is imported while running zdb). But typically, on pools that have moderate or higher fullness or fragmentation, the zil metaslab will really stick out as it will be the only one that's (nearly) empty and unfragmented.
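For example, running zdb -m against the pool prints per-metaslab free-space statistics for each top-level vdev, so on such pools the embedded-log metaslab typically shows up as the one that is nearly empty.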

@sempervictus (Contributor) commented:
I've had this running on a couple test VMs for a week or so - they haven't blown up and do seem to have lower pool fragmentation rates in overwritten pools getting beat up with a bunch of fio scripts. The performance benefit on mostly-full pools is pretty noticeable (this is not on fast io subsystems by any means, ceph on encrypted bcache served up as a qemu volume) - iowait spikes aren't nearly as bad after a few fills.
How safe is the on-disk format for pulling the code into a production branch?

@ahrens ahrens (Member, Author) commented Jan 17, 2021

@sempervictus Thanks for your testing, glad it's working well for you! There's no on-disk format change, so the PR code should be safe to use in production. (We just "happen" to choose different places to allocate ZIL vs other blocks, and the ZIL metaslab is determined each time the pool is opened.) FYI, we have been using a slightly different version of this in the Delphix product for over a year.

@sdimitro sdimitro (Contributor) left a comment:
LGTM, just a couple of questions around small vdevs.

Resolved review threads: module/zfs/vdev.c, module/zfs/vdev_removal.c
@mmaybee mmaybee (Contributor) left a comment:
Modulo Serapheim's concern that there may be cases where you try to passivate a null metaslab pointer with small pools, everything looks good.

Resolved review threads: module/zfs/metaslab.c (one thread marked outdated)
@behlendorf behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Jan 20, 2021
@behlendorf behlendorf merged commit aa755b3 into openzfs:master Jan 21, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#11389
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
Labels
Status: Accepted, Type: Performance
7 participants