Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock between spa_errlog_lock and dp_config_rwlock #14289

Merged
merged 1 commit into from
Dec 22, 2022

Conversation

ahrens
Copy link
Member

@ahrens ahrens commented Dec 13, 2022

Motivation and Context

There is a lock order inversion deadlock between spa_errlog_lock and dp_config_rwlock:

A thread in spa_delete_dataset_errlog() is running from a sync task. It is holding the dp_config_rwlock for writer (see dsl_sync_task_sync()), and waiting for the spa_errlog_lock.

A thread in dsl_pool_config_enter() is holding the spa_errlog_lock (see spa_get_errlog_size()) and waiting for the dp_config_rwlock (as reader).

Note that this was introduced by #12812 @gamanakis.

See #14239

Description

This commit address this by defining the lock ordering to be dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock. spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this order, and then process_error_block() and get_head_and_birth_txg() can verify that the dp_config_rwlock is already held.

How Has This Been Tested?

Tested by having some errors in the pool (via zinject -t data /path/to/file), one thread running zpool iostat 0.001, and another thread runs zfs destroy (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Contributor

@ryao ryao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we might want a test case for this to avoid future regressions, but otherwise, this looks good tome.

@behlendorf
Copy link
Contributor

After this change the code now reliably asserts when running the zpool_scrub tests on all of the TEST builders.

@ahrens
Copy link
Member Author

ahrens commented Dec 14, 2022

@behlendorf thanks for bringing that to my attention. I could have changed the lock order there too, but I decided to make a more extensive change to simplify this code, by removing the ability to get the exact number of entries that "would be" returned by spa_get_errlog(). In the course of testing that, I realized that there were many missing checks for overflowing the user's buffer (count going negative), so if the error count changed after the "would be" check, the kernel would overwrite the user's address space. I introduced copyout_entry() to ensure that this check is made in all code paths.

lib/libzfs/libzfs_pool.c Show resolved Hide resolved
Copy link
Member

@grwilson grwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I see this code snippet would now have an out-of-date comment:

 static int
 status_callback(zpool_handle_t *zhp, void *data)
 {
<snip>
                         /*
                          * If the approximate error count is small, get a
                          * precise count by fetching the entire log and
                          * uniquifying the results.
                          */
                         if (nerr > 0 && nerr < 100 && !cbp->cb_verbose &&
                             zpool_get_errlog(zhp, &nverrlist) == 0) {
                                 nvpair_t *elem;

                                 elem = NULL;
                                 nerr = 0;
                                 while ((elem = nvlist_next_nvpair(nverrlist,
                                     elem)) != NULL) {
                                         nerr++;
                                 }
                         }

This code has been around since the beginning of time so I don't have much context on its useful but updating the comment seems warranted.

@ahrens
Copy link
Member Author

ahrens commented Dec 16, 2022

@grwilson I agree that behavior is potentially confusing. I've changed it to always display the "approximate" error count.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can squash and rebase this we should be able to get a cleaner test run on all the builders.

@ikozhukhov
Copy link
Contributor

we can see regression on DilOS with original commit.

This test /redundancy/redundancy_stripe has unstable result.
Launch 10 times on Debian:
/usr/share/zfs/zfs-tests.sh -I 10 -vKR -s 3G -t "tests/functional/redundancy/redundancy_stripe"
A result:
— Configuration —
Runfiles: /var/tmp/zfs-tests.217598.run
STF_TOOLS: /usr/share/zfs/test-runner
STF_SUITE: /usr/share/zfs/zfs-tests
STF_PATH: /var/tmp/constrained_path.HS5szy
FILEDIR: /var/tmp
FILES: /var/tmp/file-vdev0 /var/tmp/file-vdev1 /var/tmp/file-vdev2
LOOPBACKS: /dev/loop0 /dev/loop1 /dev/loop2
DISKS: loop0 loop1 loop2
NUM_DISKS: 3
FILESIZE: 3G
ITERATIONS: 10
TAGS: functional
STACK_TRACER: no
Keep pool(s): testpool_13709
Missing util(s): getfacl pax setfacl
/usr/share/zfs/test-runner/bin/test-runner.py -K -c "/var/tmp/zfs-tests.217598.run" -T "functional" -i "/usr/share/zfs/zfs-tests" -I "10"
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/usr/lib/python3.9/threading.py", line 1266, in run
self.function(*self.args, **self.kwargs)
File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [] []
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:09] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:11] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:12] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:17] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Exception in thread Thread-17:
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/usr/lib/python3.9/threading.py", line 1266, in run
self.function(*self.args, **self.kwargs)
File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [] []
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:06] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:09] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:15] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:12] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
Results Summary
PASS 28
Running Time: 00:02:04
Percent passed: 100.0%
Log directory: /var/tmp/test_results/20221214T114848
Tests with results other than PASS that are expected:
Tests with result of PASS that are unexpected:
Tests with results other than PASS that are unexpected:
Also note that failed tests are not registered. The total must be 30.

On DilOS this degradation begun after import this commit:
0409d33

could you please try to check this use case for your testing ?

There is a lock order inversion deadlock between `spa_errlog_lock` and
`dp_config_rwlock`:

A thread in `spa_delete_dataset_errlog()` is running from a sync task.
It is holding the `dp_config_rwlock` for writer (see
`dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.

A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock`
(see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as
reader).

Note that this was introduced by openzfs#12812.

This commit address this by defining the lock ordering to be
dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock.
spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this
order, and then process_error_block() and get_head_and_birth_txg() can
verify that the dp_config_rwlock is already held.

Additionally, a buffer overrun in `spa_get_errlog()` is corrected.  Many
code paths didn't check if `*count` got to zero, instead continuing to
overwrite past the beginning of the userspace buffer at `uaddr`.

Tested by having some errors in the pool (via `zinject -t data
/path/to/file`), one thread running `zpool iostat 0.001`, and another
thread runs `zfs destroy` (in a loop, although it hits the first time).
This reproduces the problem easily without the fix, and works with the
fix.

Closes openzfs#14239
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
@ahrens
Copy link
Member Author

ahrens commented Dec 20, 2022

@behlendorf rebased and squashed

@ikozhukhov I think you're saying that the below error was introduced by 0409d33, which also caused the bug that I'm fixing here?

File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'

If that's the case then please file a separate bug report as I don't think it's related to the deadlock that I'm addressing in this PR.

That said, it looks like your bug is that the test suite (test-runner.py) is calling a python function with the wrong arguments? I don't see any changes to python files in 0409d33, so I'm not sure how it would cause this, but I could be missing something.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 20, 2022
@ikozhukhov
Copy link
Contributor

@behlendorf rebased and squashed

@ikozhukhov I think you're saying that the below error was introduced by 0409d33, which also caused the bug that I'm fixing here?

File "/usr/share/zfs/test-runner/bin/test-runner.py", line 214, in kill_cmd
self.run(False)
TypeError: run() missing 2 required positional arguments: 'kmemleak' and 'kmsg'

If that's the case then please file a separate bug report as I don't think it's related to the deadlock that I'm addressing in this PR.

That said, it looks like your bug is that the test suite (test-runner.py) is calling a python function with the wrong arguments? I don't see any changes to python files in 0409d33, so I'm not sure how it would cause this, but I could be missing something.

yes, we can see issue with original commit what you try to fix by this PR.
we can see issue with test /redundancy/redundancy_stripe - it has unstable result if we try to run tests in loop.
it was just example based on Debian 11 instead of on DilOS for reproduce issue.
on DilOS we can see deadlock and couldn't destroy zpool destroy -f testpool - it's busy

@gamanakis
Copy link
Contributor

@ahrens thank you so much for taking this over. I was hitting this while working on #12355 but wasn't able to resolve it. With your proposed solution, I will take another shot at finishing #12355.

@behlendorf behlendorf merged commit 018f260 into openzfs:master Dec 22, 2022
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
There is a lock order inversion deadlock between `spa_errlog_lock` and
`dp_config_rwlock`:

A thread in `spa_delete_dataset_errlog()` is running from a sync task.
It is holding the `dp_config_rwlock` for writer (see
`dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.

A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock`
(see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as
reader).

Note that this was introduced by openzfs#12812.

This commit address this by defining the lock ordering to be
dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock.
spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this
order, and then process_error_block() and get_head_and_birth_txg() can
verify that the dp_config_rwlock is already held.

Additionally, a buffer overrun in `spa_get_errlog()` is corrected.  Many
code paths didn't check if `*count` got to zero, instead continuing to
overwrite past the beginning of the userspace buffer at `uaddr`.

Tested by having some errors in the pool (via `zinject -t data
/path/to/file`), one thread running `zpool iostat 0.001`, and another
thread runs `zfs destroy` (in a loop, although it hits the first time).
This reproduces the problem easily without the fix, and works with the
fix.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#14239
Closes openzfs#14289
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants