Improved zpool status output, list all affected datasets #12812

gamanakis · 2021-12-01T11:30:56Z

Motivation and Context

This is a new effort to merge #9175 in current master.

Description

Currently, determining which datasets are affected by corruption is
a manual process.

The primary difficulty in reporting the list of affected snapshots is
that the snapshot in which the error was originally found may have been
deleted since then. To solve this issue, we add the ID of the head
dataset of the snapshot in which the error was detected to the stored
error report. Then any time a filesystem is deleted, the errors
associated with it are deleted as well. Any time a clone promote occurs,
we modify the reports associated with the original head to refer to the
new head. The stored error reports are identified by this head ID and
the birth time of the block in which the error occurred; some
information about the error itself is also stored.

Once this information is stored, we can find the set of datasets
affected by an error by walking back the list of snapshots in the given
head until we find one with the appropriate birth txg, and then
traversing the snapshots of the clone family, terminating a branch if
the block was replaced in a given snapshot. We then report this
information back to libzfs and to the zpool status command, where it is
displayed as follows:

  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3
        08:27:57 2021
config:

    NAME        STATE     READ WRITE CKSUM
    test        ONLINE       0     0     0
      sdb       ONLINE       0     0 1.58K

errors: Permanent errors have been detected in the following files:

    test@1:/test.0.0
    /test/test.0.0
    /test/1clone/test.0.0
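The traversal that produces a list like the one above can be sketched as follows. This is a simplified model rather than the actual implementation: `Head`, `Snapshot`, `affected_datasets` and the `block_birth` field (the birth txg of the block a dataset references at the damaged location) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Snapshot:
    name: str
    creation_txg: int
    block_birth: Optional[int]                  # block referenced here
    clones: list = field(default_factory=list)  # heads cloned from this snapshot


@dataclass
class Head:
    name: str
    block_birth: Optional[int]
    snapshots: list = field(default_factory=list)  # oldest first


def affected_datasets(head, birth_txg):
    """Names of datasets in the clone family that still reference the block."""
    found = []
    if head.block_birth == birth_txg:
        found.append(head.name)
    # Walk back from the newest snapshot towards the oldest.
    for snap in reversed(head.snapshots):
        if snap.creation_txg < birth_txg:
            break      # snapshot predates the block, and so do all older ones
        if snap.block_birth != birth_txg:
            continue   # block was replaced here: do not follow this branch
        found.append(snap.name)
        for clone in snap.clones:
            found.extend(affected_datasets(clone, birth_txg))
    return found
```

With a head `test`, a snapshot `test@1` and a clone `test/1clone` that all reference the same damaged block, the function returns those three names, which corresponds to the three paths in the status output.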

A new feature flag is introduced to mark the presence of this change, as
well as promotion and backwards compatibility logic. This is an updated
version of #9175. Rebase required fixing the tests, updating the ABI of
libzfs, updating the man pages, fixing bugs, and updating the old
on-disk error logs to the new format upon activation of the feature.

Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>

How Has This Been Tested?

Compiles and runs locally; the two newly introduced tests (zpool_status_00{3,4}) pass.
ZTS does not report any significant errors.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

alek-p · 2021-12-02T21:44:54Z

Thanks for working on this! Looks like it will also help with picking candidate snapshots for healing recv.

gamanakis · 2021-12-02T22:26:06Z

Marked this as ready for review. I incorporated the suggestions by @allanjude from #9175.

module/zcommon/zfeature_common.c

module/zfs/spa_errlog.c

tonyhutter · 2021-12-02T23:22:41Z

Your commit message is currently just:

Improved zpool status output, list all affected datasets
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>

Please add more detail to the commit message. At a minimum, include an example of what the new zpool status -v output looks like. Also add a line saying: "This is an updated version of #9175".

tonyhutter · 2021-12-02T23:37:08Z

You should probably also mention in your commit message that this adds a feature flag.

module/zfs/spa_errlog.c

tonyhutter · 2021-12-02T23:49:35Z

Hmm... am I missing somewhere where you disable headerrlog for zpool_status_004_pos.ksh? These tests look nearly identical:

$ diff -u tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_003_pos.ksh tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_004_pos.ksh
--- tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_003_pos.ksh	2021-12-02 15:46:44.893262446 -0800
+++ tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_004_pos.ksh	2021-12-02 15:46:44.893262446 -0800
@@ -28,7 +28,7 @@
 
 #
 # DESCRIPTION:
-# Verify correct output with 'zpool status -v' after corrupting a file
+# Verify feature@headerrlog=disabled works perfectly.
 #
 # STRATEGY:
 # 1. Create a file system
@@ -66,4 +66,4 @@
 log_must eval "zpool status -v | grep '$TESTPOOL/testclone/10m_file'"
 log_must eval "zpool status -v | grep '$TESTDIR/10m_file'"
 
-log_pass "'zpool status -v' outputs affected filesystem, snapshot & clone"
+log_pass "'feature@headerrlog=disabled works perfectly"

gamanakis · 2021-12-03T08:11:18Z

@tonyhutter thank you for helping out with some rough edges I missed; these are now fixed.

module/zfs/spa_errlog.c

tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_004_pos.ksh

gamanakis · 2021-12-08T08:47:28Z

I also need to fix releasing the hold of a dataset in case of an error in check_filesystem().

This is an updated version of openzfs#9175. Rebase required fixing the tests, updating the ABI of libzfs, updating the man pages, fixing bugs, fixing the error returns, and updating the old on-disk error logs to the new format when activating the feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#9175
Closes openzfs#12812

There is a lock order inversion deadlock between `spa_errlog_lock` and `dp_config_rwlock`:

A thread in `spa_delete_dataset_errlog()` is running from a sync task. It is holding the `dp_config_rwlock` for writer (see `dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.

A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock` (see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as reader).

Note that this was introduced by #12812.

This commit addresses this by defining the lock ordering to be `dp_config_rwlock` first, then `spa_errlog_lock` / `spa_errlist_lock`. `spa_get_errlog()` and `spa_get_errlog_size()` can acquire the locks in this order, and then `process_error_block()` and `get_head_and_birth_txg()` can verify that the `dp_config_rwlock` is already held.

Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many code paths didn't check if `*count` got to zero, instead continuing to overwrite past the beginning of the userspace buffer at `uaddr`.

Tested by having some errors in the pool (via `zinject -t data /path/to/file`), with one thread running `zpool iostat 0.001` and another thread running `zfs destroy` (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14239
Closes #14289
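The lock-ordering rule from this commit message can be illustrated with a small user-space model. It is a sketch only: `OwnedLock` and `Pool` are hypothetical, plain mutexes stand in for the kernel's reader-writer `dp_config_rwlock` and for `spa_errlog_lock`, and the method names merely echo the functions named above.

```python
import threading


class OwnedLock:
    """A mutex that remembers which thread holds it (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def __enter__(self):
        self._lock.acquire()
        self._owner = threading.get_ident()
        return self

    def __exit__(self, *exc):
        self._owner = None
        self._lock.release()

    def held_by_me(self):
        return self._owner == threading.get_ident()


class Pool:
    def __init__(self):
        self.config_lock = OwnedLock()   # stand-in for dp_config_rwlock
        self.errlog_lock = OwnedLock()   # stand-in for spa_errlog_lock
        self.errlog = {1: ["a"], 2: ["b", "c"]}  # head dataset ID -> reports

    def _process_error_block(self, head_ds):
        # Helpers only verify that the config lock is already held;
        # they never try to take it themselves.
        assert self.config_lock.held_by_me()
        return len(self.errlog.get(head_ds, []))

    def get_errlog_size(self):
        # Reader path: config lock first, then the error log lock.
        with self.config_lock, self.errlog_lock:
            return sum(self._process_error_block(ds) for ds in self.errlog)

    def delete_dataset_errlog(self, head_ds):
        # Models the sync-task path, which follows the same order.
        with self.config_lock, self.errlog_lock:
            self.errlog.pop(head_ds, None)
```

Since both paths take the two locks in the same order, neither thread can end up holding one lock while waiting for the other, which is the situation the deadlock report describes.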


New features:
- Fully adaptive ARC eviction (#14359)
- Block cloning (#13392)
- Scrub error log (#12812, #12355)
- Linux container support (#14070, #14097, #12263)
- BLAKE3 Checksums (#12918)
- Corrective "zfs receive" (#9372)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613, #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>


gamanakis mentioned this pull request Dec 1, 2021

Improved zpool status output, list all the dataset affected by errlog. #9175

Closed

12 tasks

gamanakis force-pushed the status_output branch 3 times, most recently from d89a0cf to e8a6616 Compare December 2, 2021 09:35

gamanakis marked this pull request as ready for review December 2, 2021 22:25

tonyhutter reviewed Dec 2, 2021

module/zcommon/zfeature_common.c

tonyhutter reviewed Dec 2, 2021

module/zfs/spa_errlog.c

tonyhutter reviewed Dec 2, 2021

module/zfs/spa_errlog.c

gamanakis force-pushed the status_output branch from e8a6616 to 7d6cf25 Compare December 3, 2021 08:08

gamanakis force-pushed the status_output branch 3 times, most recently from b1a4bcf to 78ba7ab Compare December 3, 2021 09:06

allanjude reviewed Dec 4, 2021


gamanakis force-pushed the status_output branch from 78ba7ab to 72a4ae9 Compare December 6, 2021 10:19

tonyhutter reviewed Dec 6, 2021

module/zfs/spa_errlog.c

tonyhutter reviewed Dec 6, 2021

module/zfs/spa_errlog.c

tonyhutter reviewed Dec 6, 2021

module/zfs/spa_errlog.c

tonyhutter reviewed Dec 6, 2021

tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_004_pos.ksh

gamanakis force-pushed the status_output branch from 72a4ae9 to a7a5e14 Compare December 7, 2021 10:32

gamanakis force-pushed the status_output branch 4 times, most recently from 4b9600b to 04296bf Compare December 9, 2021 11:27

ahrens mentioned this pull request Dec 13, 2022

deadlock between spa_errlog_lock and dp_config_rwlock #14289

Merged

13 tasks

ahrens mentioned this pull request Jan 25, 2023

zloop fails with zdb: leaked MOS object (persistent error log) #14434

Closed

gamanakis mentioned this pull request Mar 15, 2023

Teach zpool scrub to scrub only blocks in error log #12355

Closed

13 tasks
