
Make txg_wait_synced conditional in zfsvfs_teardown, for FreeBSD #16268

Merged 1 commit into openzfs:master on Aug 9, 2024
Conversation

asomers
Contributor

@asomers asomers commented Jun 14, 2024

This applies the same change as in #9115 to FreeBSD. This was actually the old behavior in FreeBSD 12; it only regressed when FreeBSD support was added to OpenZFS. The full timeline is laid out in the commit message below.

In my local testing, this vastly improves the reboot speed of a server with a large pool that has 1000 datasets and is resilvering an HDD.

Sponsored by: Axcient
Signed-off-by: Alan Somers <asomers@gmail.com>

Motivation and Context

Without this change, unmount speed is very slow, especially when a scan is ongoing.

Description

Copy the same changes as in #9115 to FreeBSD.

How Has This Been Tested?

Tested locally on a server with about 240 HDDs, 1000 datasets, and an ongoing resilver. Unmount speed went from 1-3 datasets per 10 seconds to "too fast to measure".

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Footnotes

  1. https://github.com/illumos/illumos-gate/commit/5aaeed5c617553c4cec6328c1f4c19079a5a495a

  2. https://github.com/freebsd/freebsd-src/commit/10b9d77bf1ccf2f3affafa6261692cb92cf7e992

  3. https://github.com/openzfs/zfs/pull/7795

  4. https://github.com/openzfs/zfs/pull/9115

  5. https://github.com/openzfs/zfs/pull/8987

This applies the same change in openzfs#9115 to FreeBSD.  This was actually the
old behavior in FreeBSD 12; it only regressed when FreeBSD support was
added to OpenZFS.  As far as I can tell, the timeline went like this:

* Illumos's zfsvfs_teardown used an unconditional txg_wait_synced
* Illumos added the dirty data check [^4]
* FreeBSD merged in Illumos's conditional check [^3]
* OpenZFS forked from Illumos
* OpenZFS removed the dirty data check in openzfs#7795 [^5]
* @mattmacy forked the OpenZFS repo and began to add FreeBSD support
* OpenZFS PR openzfs#9115[^1] recreated the same dirty data check that Illumos
  used, in slightly different form.  At this point the OpenZFS repo did
  not yet have multi-OS support.
* Matt Macy merged in FreeBSD support in openzfs#8987[^2], but it was based on
  slightly outdated OpenZFS code.

In my local testing, this vastly improves the reboot speed of a server
with a large pool that has 1000 datasets and is resilvering an HDD.

[^1]: openzfs#9115
[^2]: openzfs#8987
[^3]: freebsd/freebsd-src@10b9d77
[^4]: illumos/illumos-gate@5aaeed5
[^5]: openzfs#7795

Sponsored by:	Axcient
Signed-off-by:	Alan Somers <asomers@gmail.com>
@amotin
Member

amotin commented Jun 14, 2024

I haven't looked too deeply, but looking at the original panic in #7753 I suspect there may be a race between the last dnode of a dataset being synced in dmu_objset_sync_dnodes() and userquota_updates_task() actually completing in another task.

@tonyhutter
Contributor

@amotin can you give a little more detail about the dmu_objset_sync_dnodes()/userquota_updates_task() race condition you're concerned about? If there is a race, then I'm worried it would affect the Linux zfsvfs_teardown() codepath as well, since this PR basically makes them the same code.

Linux codepath in master:

       // module/os/linux/zfs/zfs_vfsops.c:zfsvfs_teardown()
       /*
        * Evict cached data. We must write out any dirty data before
        * disowning the dataset.
        */
       objset_t *os = zfsvfs->z_os;
       boolean_t os_dirty = B_FALSE;
       for (int t = 0; t < TXG_SIZE; t++) {
              if (dmu_objset_is_dirty(os, t)) {
                     os_dirty = B_TRUE;
                     break;
              }
       }
       if (!zfs_is_readonly(zfsvfs) && os_dirty) {
              txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
       }
       dmu_objset_evict_dbufs(zfsvfs->z_os);
       dsl_dir_t *dd = os->os_dsl_dataset->ds_dir;
       dsl_dir_cancel_waiters(dd);

This PR:

       // module/os/freebsd/zfs/zfs_vfsops.c:zfsvfs_teardown()
       /*
        * Evict cached data. We must write out any dirty data before
        * disowning the dataset.
        */
       objset_t *os = zfsvfs->z_os;
       boolean_t os_dirty = B_FALSE;
       for (int t = 0; t < TXG_SIZE; t++) {
              if (dmu_objset_is_dirty(os, t)) {
                     os_dirty = B_TRUE;
                     break;
              }
       }
       if (!zfs_is_readonly(zfsvfs) && os_dirty)
              txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
       dmu_objset_evict_dbufs(zfsvfs->z_os);
       dd = zfsvfs->z_os->os_dsl_dataset->ds_dir;
       dsl_dir_cancel_waiters(dd);

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 8, 2024
@amotin
Member

amotin commented Aug 8, 2024

@tonyhutter dmu_objset_is_dirty() checks whether the objset has any elements in the os_dirty_dnodes multilist. Entries are deleted from there by dmu_objset_sync_dnodes() before the dnode is actually synced, and are moved into os_synced_dnodes. So the last dnode may not yet be synced at that point, having barely started. I am not sure flushing dbufs from under dnode_sync() is really dangerous, since they should still have references, but I guess they may not be evicted, at the very least.

On top of that, dsl_pool_sync() calls dmu_objset_sync_done() for all datasets only after they have all synced their dnodes, which even more likely has not completed yet. Meanwhile dmu_objset_sync_done() may call userquota_updates_task(), which as far as I can see may dirty a few more dnodes. dnode_sync() for those dnodes is explicitly called later by dmu_objset_sync(), which again makes me think about flushing dbufs under it. After that, the meta-dnode is also synced, and it may also have a bunch of dbufs.

TL;DR: I am not sure it is legal to call dmu_objset_is_dirty() outside of syncing context, and while I am not sure it is fatal in this case, it seems to me it may end up not evicting some of the dbufs from the cache, as was desired. I also wonder whether this code is expected to guarantee that the dataset is synced in case of a crash, if unmount is expected to provide such guarantees.

@amotin
Member

amotin commented Aug 8, 2024

As a quick thought, I wonder if txg_wait_synced(dmu_objset_pool(os), os_dirty ? 0 : spa_syncing_txg(os->os_spa)); would close the race, making sure whatever TXG sync is ongoing now will be completed, and otherwise doing nothing.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 8, 2024
@behlendorf behlendorf merged commit ed0db1c into openzfs:master Aug 9, 2024
23 of 25 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Sponsored by: Axcient
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes openzfs#16268