Fix zpl_mount() deadlock #7693

behlendorf · 2018-07-09T18:46:34Z

Description

Commit 93b43af inadvertently introduced the following scenario, which can result in a deadlock. This issue was most easily reproduced by LXD containers using a ZFS storage backend, but it should be reproducible under any workload that frequently mounts and unmounts filesystems.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A
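
The three traces form a circular wait: A waits on B, B waits on C, and C waits on A. As a minimal sketch (a user-space model, not ZFS or kernel code), the annotations can be written as a wait-for table and checked for a cycle. The names `in_wait_cycle`, `any_deadlock`, `deadlock_as_reported`, and `deadlock_if_b_drops_config_lock` are hypothetical and exist only for this illustration. The second scenario assumes thread B no longer holds `dp_config_rwlock` while blocked on `s_umount`; that is one way to break the cycle, not a description of the actual patch.

```c
#include <assert.h>

/* The three threads from the traces above. */
enum { THREAD_A, THREAD_B, THREAD_C, NTHREADS };
#define	NOT_WAITING	(-1)

/*
 * Follow the "waits on" chain beginning at 'start'. Each thread waits on
 * at most one other thread, so the chain either reaches a thread that is
 * not blocked or comes back to 'start', which means 'start' is deadlocked.
 * Returns 1 if 'start' is part of a cycle, 0 otherwise.
 */
static int
in_wait_cycle(const int waits_on[NTHREADS], int start)
{
	int cur = waits_on[start];

	/* A cycle through 'start' has at most NTHREADS edges. */
	for (int steps = 0; steps < NTHREADS && cur != NOT_WAITING; steps++) {
		assert(cur >= 0 && cur < NTHREADS);
		if (cur == start)
			return (1);
		cur = waits_on[cur];
	}
	return (0);
}

/* Returns 1 if any thread is part of a wait cycle. */
static int
any_deadlock(const int waits_on[NTHREADS])
{
	for (int t = 0; t < NTHREADS; t++) {
		if (in_wait_cycle(waits_on, t))
			return (1);
	}
	return (0);
}

/* Wait-for edges exactly as annotated in the traces. */
static int
deadlock_as_reported(void)
{
	const int waits_on[NTHREADS] = {
		[THREAD_A] = THREAD_B,	/* wants dp_config_rwlock as writer */
		[THREAD_B] = THREAD_C,	/* wants s_umount in grab_super() */
		[THREAD_C] = THREAD_A,	/* txg_wait_synced() needs spa_sync() */
	};

	return (any_deadlock(waits_on));
}

/*
 * Assumption for illustration only: B does not hold dp_config_rwlock
 * while it waits for s_umount, so A is never blocked by B.
 */
static int
deadlock_if_b_drops_config_lock(void)
{
	const int waits_on[NTHREADS] = {
		[THREAD_A] = NOT_WAITING,
		[THREAD_B] = THREAD_C,
		[THREAD_C] = THREAD_A,
	};

	return (any_deadlock(waits_on));
}
```

Calling `deadlock_as_reported()` from a `main()` reports the cycle, while `deadlock_if_b_drops_config_lock()` does not: removing any single edge is enough to let the remaining waits drain in order.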

Motivation and Context

Resolve issue #7691. @ColinIanKing @sforshee, can you please review this proposed fix?

How Has This Been Tested?

Locally built and verified that the relevant test cases still pass. Unfortunately, I wasn't able to reproduce the original issue, so I can't verify that this completely resolves it. However, based on the analysis above, it should.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.
Change has been approved by a ZFS on Linux member.

ColinIanKing · 2018-07-11T11:55:37Z

I've tried this fix with LXD, creating 64, then 96, then 128 containers in parallel and also deleting them, without any lockups. Without the fix, I was able to trigger the lockups. Fix looks good to me. Thanks!

alek-p

LGTM, thanks for the detailed write-up on the 3 threads involved.

vazir · 2018-07-20T21:00:00Z

Any idea if this patch is going to be adopted into Ubuntu 18.04 anytime soon?

ColinIanKing · 2018-07-20T21:31:04Z

We have a 3 week release cadence on the kernel for patch integration, testing and release, so it should be landing in the next 3 week cycle.

vazir · 2018-07-21T08:45:12Z

I hit this bug in slightly different circumstances. Invoking "lxc ls" from inside the ZFS-mounted container led to the same behavior everyone here described, and I then had to power off the host to recover. This may have coincided with a manual mount of the container's storage via "zfs mount" before starting the container, which I did not unmount.

Commit 93b43af inadvertently introduced the following scenario which can result in a deadlock. This issue was most easily reproduced by LXD containers using a ZFS storage backend but should be reproducible under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#7598
Closes openzfs#7659
Closes openzfs#7691
Closes openzfs#7693

rincebrain mentioned this pull request Jul 9, 2018

Dedup + LXD leads to permanently hung tasks #7659

Closed

behlendorf requested a review from alek-p July 9, 2018 18:58

alek-p approved these changes Jul 11, 2018

behlendorf force-pushed the issue-7659 branch from 29aa053 to bbee325 Compare July 11, 2018 21:24

behlendorf merged commit ac09630 into openzfs:master Jul 11, 2018

behlendorf deleted the issue-7659 branch April 19, 2021 19:29
