deadlock between spa_errlog_lock and dp_config_rwlock #14289

ahrens · 2022-12-13T23:14:01Z

Motivation and Context

There is a lock order inversion deadlock between spa_errlog_lock and dp_config_rwlock:

A thread in spa_delete_dataset_errlog() is running from a sync task. It is holding the dp_config_rwlock for writer (see dsl_sync_task_sync()), and waiting for the spa_errlog_lock.

A thread in dsl_pool_config_enter() is holding the spa_errlog_lock (see spa_get_errlog_size()) and waiting for the dp_config_rwlock (as reader).

Note that this was introduced by #12812 @gamanakis.

See #14239

Description

This commit address this by defining the lock ordering to be dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock. spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this order, and then process_error_block() and get_head_and_birth_txg() can verify that the dp_config_rwlock is already held.

How Has This Been Tested?

Tested by having some errors in the pool (via zinject -t data /path/to/file), one thread running zpool iostat 0.001, and another thread runs zfs destroy (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

ryao

I feel like we might want a test case for this to avoid future regressions, but otherwise, this looks good tome.

behlendorf · 2022-12-14T18:10:26Z

After this change the code now reliably asserts when running the zpool_scrub tests on all of the TEST builders.

ahrens · 2022-12-14T22:52:45Z

@behlendorf thanks for bringing that to my attention. I could have changed the lock order there too, but I decided to make a more extensive change to simplify this code, by removing the ability to get the exact number of entries that "would be" returned by spa_get_errlog(). In the course of testing that, I realized that there were many missing checks for overflowing the user's buffer (count going negative), so if the error count changed after the "would be" check, the kernel would overwrite the user's address space. I introduced copyout_entry() to ensure that this check is made in all code paths.

lib/libzfs/libzfs_pool.c

grwilson

Looks good. I see this code snippet would now have an out-of-date comment:

 static int
 status_callback(zpool_handle_t *zhp, void *data)
 {
<snip>
                         /*
                          * If the approximate error count is small, get a
                          * precise count by fetching the entire log and
                          * uniquifying the results.
                          */
                         if (nerr > 0 && nerr < 100 && !cbp->cb_verbose &&
                             zpool_get_errlog(zhp, &nverrlist) == 0) {
                                 nvpair_t *elem;

                                 elem = NULL;
                                 nerr = 0;
                                 while ((elem = nvlist_next_nvpair(nverrlist,
                                     elem)) != NULL) {
                                         nerr++;
                                 }
                         }

This code has been around since the beginning of time so I don't have much context on its useful but updating the comment seems warranted.

ahrens · 2022-12-16T01:35:54Z

@grwilson I agree that behavior is potentially confusing. I've changed it to always display the "approximate" error count.

behlendorf

If you can squash and rebase this we should be able to get a cleaner test run on all the builders.

ikozhukhov · 2022-12-19T19:11:48Z

we can see regression on DilOS with original commit.

This test /redundancy/redundancy_stripe has unstable result.
Launch 10 times on Debian:
/usr/share/zfs/zfs-tests.sh -I 10 -vKR -s 3G -t "tests/functional/redundancy/redundancy_stripe"
A result:
— Configuration —
Runfiles: /var/tmp/zfs-tests.217598.run
STF_TOOLS: /usr/share/zfs/test-runner
STF_SUITE: /usr/share/zfs/zfs-tests
STF_PATH: /var/tmp/constrained_path.HS5szy
FILEDIR: /var/tmp
FILES: /var/tmp/file-vdev0 /var/tmp/file-vdev1 /var/tmp/file-vdev2
LOOPBACKS: /dev/loop0 /dev/loop1 /dev/loop2
DISKS: loop0 loop1 loop2
NUM_DISKS: 3
FILESIZE: 3G
ITERATIONS: 10
TAGS: functional
STACK_TRACER: no
Keep pool(s): testpool_13709
Missing util(s): getfacl pax setfacl
/usr/share/zfs/test-runner/bin/test-runner.py -K -c "/var/tmp/zfs-tests.217598.run" -T "functional" -i "/usr/share/zfs/zfs-tests" -I "10"
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/usr/lib/python3.9/threading.py", line 1266, in run
self.function(*self.args, **self.kwargs)
File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [] []
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:09] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:11] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:12] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:17] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Exception in thread Thread-17:
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/usr/lib/python3.9/threading.py", line 1266, in run
self.function(*self.args, **self.kwargs)
File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [] []
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:06] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:09] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:15] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:12] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Results Summary
PASS 28
Running Time: 00:02:04
Percent passed: 100.0%
Log directory: /var/tmp/test_results/20221214T114848
Tests with results other than PASS that are expected:
Tests with result of PASS that are unexpected:
Tests with results other than PASS that are unexpected:
Also note that failed tests are not registered. The total must be 30.

On DilOS this degradation begun after import this commit:
0409d33

could you please try to check this use case for your testing ?

There is a lock order inversion deadlock between `spa_errlog_lock` and `dp_config_rwlock`: A thread in `spa_delete_dataset_errlog()` is running from a sync task. It is holding the `dp_config_rwlock` for writer (see `dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`. A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock` (see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as reader). Note that this was introduced by openzfs#12812. This commit address this by defining the lock ordering to be dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock. spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this order, and then process_error_block() and get_head_and_birth_txg() can verify that the dp_config_rwlock is already held. Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many code paths didn't check if `*count` got to zero, instead continuing to overwrite past the beginning of the userspace buffer at `uaddr`. Tested by having some errors in the pool (via `zinject -t data /path/to/file`), one thread running `zpool iostat 0.001`, and another thread runs `zfs destroy` (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix. Closes openzfs#14239 Signed-off-by: Matthew Ahrens <mahrens@delphix.com>

ahrens · 2022-12-20T22:51:57Z

@behlendorf rebased and squashed

@ikozhukhov I think you're saying that the below error was introduced by 0409d33, which also caused the bug that I'm fixing here?

File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'

If that's the case then please file a separate bug report as I don't think it's related to the deadlock that I'm addressing in this PR.

That said, it looks like your bug is that the test suite (test-runner.py) is calling a python function with the wrong arguments? I don't see any changes to python files in 0409d33, so I'm not sure how it would cause this, but I could be missing something.

ikozhukhov · 2022-12-21T07:09:12Z

@behlendorf rebased and squashed

@ikozhukhov I think you're saying that the below error was introduced by 0409d33, which also caused the bug that I'm fixing here?
File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'
If that's the case then please file a separate bug report as I don't think it's related to the deadlock that I'm addressing in this PR.

That said, it looks like your bug is that the test suite (test-runner.py) is calling a python function with the wrong arguments? I don't see any changes to python files in 0409d33, so I'm not sure how it would cause this, but I could be missing something.

yes, we can see issue with original commit what you try to fix by this PR.
we can see issue with test /redundancy/redundancy_stripe - it has unstable result if we try to run tests in loop.
it was just example based on Debian 11 instead of on DilOS for reproduce issue.
on DilOS we can see deadlock and couldn't destroy zpool destroy -f testpool - it's busy

gamanakis · 2022-12-21T21:57:35Z

@ahrens thank you so much for taking this over. I was hitting this while working on #12355 but wasn't able to resolve it. With your proposed solution, I will take another shot at finishing #12355.

There is a lock order inversion deadlock between `spa_errlog_lock` and `dp_config_rwlock`: A thread in `spa_delete_dataset_errlog()` is running from a sync task. It is holding the `dp_config_rwlock` for writer (see `dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`. A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock` (see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as reader). Note that this was introduced by openzfs#12812. This commit address this by defining the lock ordering to be dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock. spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this order, and then process_error_block() and get_head_and_birth_txg() can verify that the dp_config_rwlock is already held. Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many code paths didn't check if `*count` got to zero, instead continuing to overwrite past the beginning of the userspace buffer at `uaddr`. Tested by having some errors in the pool (via `zinject -t data /path/to/file`), one thread running `zpool iostat 0.001`, and another thread runs `zfs destroy` (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes openzfs#14239 Closes openzfs#14289

ahrens requested review from behlendorf, grwilson and mmaybee December 13, 2022 23:14

ahrens force-pushed the errlog branch from 2ef38b1 to a76caae Compare December 13, 2022 23:15

behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 14, 2022

ryao approved these changes Dec 14, 2022

View reviewed changes

ryao self-requested a review December 14, 2022 19:48

ahrens force-pushed the errlog branch from a76caae to 3d00930 Compare December 14, 2022 22:48

ahrens force-pushed the errlog branch from 3d00930 to abef0c5 Compare December 14, 2022 23:04

mmaybee approved these changes Dec 15, 2022

View reviewed changes

lib/libzfs/libzfs_pool.c Show resolved Hide resolved

grwilson approved these changes Dec 15, 2022

View reviewed changes

behlendorf approved these changes Dec 19, 2022

View reviewed changes

ahrens force-pushed the errlog branch from 015d919 to 4992841 Compare December 20, 2022 22:41

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 20, 2022

behlendorf merged commit 018f260 into openzfs:master Dec 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deadlock between spa_errlog_lock and dp_config_rwlock #14289

deadlock between spa_errlog_lock and dp_config_rwlock #14289

ahrens commented Dec 13, 2022 •

edited

Loading

ryao left a comment

behlendorf commented Dec 14, 2022

ahrens commented Dec 14, 2022

grwilson left a comment

ahrens commented Dec 16, 2022 •

edited

Loading

behlendorf left a comment

ikozhukhov commented Dec 19, 2022

ahrens commented Dec 20, 2022

ikozhukhov commented Dec 21, 2022

gamanakis commented Dec 21, 2022

deadlock between spa_errlog_lock and dp_config_rwlock #14289

deadlock between spa_errlog_lock and dp_config_rwlock #14289

Conversation

ahrens commented Dec 13, 2022 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

ryao left a comment

Choose a reason for hiding this comment

behlendorf commented Dec 14, 2022

ahrens commented Dec 14, 2022

grwilson left a comment

Choose a reason for hiding this comment

ahrens commented Dec 16, 2022 • edited Loading

behlendorf left a comment

Choose a reason for hiding this comment

ikozhukhov commented Dec 19, 2022

ahrens commented Dec 20, 2022

ikozhukhov commented Dec 21, 2022

gamanakis commented Dec 21, 2022

ahrens commented Dec 13, 2022 •

edited

Loading

ahrens commented Dec 16, 2022 •

edited

Loading