
Workaround issue cleaning up automounted snapshots on Linux #12670

Merged
tonynguien merged 1 commit into openzfs:master
Nov 3, 2021

Conversation

rincebrain
Contributor

Motivation and Context

Getting annoyed by #11632 popping up

Description

Sometimes, when cleaning up automatic .zfs/snapshot mounts, we go to cancel taskqids that we subsequently can't find before dispatching a new unmount task. On a non-DEBUG build, it just barrels ahead and spawns a new one even though the cancel errored out and didn't set taskqid to TASKQID_INVALID. On a DEBUG build, we get a VERIFY failure instead.

Since a failed cancel seems harmless to ignore here, the logic now shrugs and treats ENOENT (what the cancel returns when the task can't be found) as success, and retries a couple of times before giving up, as before, on EBUSY (the only other status the cancel call can return). A dbgmsg print is added when the retry logic is hit, to make it easier to see what's going on.

How Has This Been Tested?

Well, on my 2.0.6 system, where this was happening regularly after a lot of snapshots were mounted, a previous version (which printed on hitting the retry logic and didn't treat ENOENT as success) printed at most once before succeeding on a subsequent attempt.

(Curiously, at least once it printed, still only once, that it was retrying on taskqid 0 (== TASKQID_INVALID), so...is it getting here before taskqid is properly set/updated somewhere sometimes, somehow, and then it gets updated by someone else in situ? Because otherwise, it should have printed 5 times...hm.)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@bghira

bghira commented Oct 27, 2021

i have no comments on this PR other than to say your work on the project hasn't gone unnoticed and i wanted to thank you for picking it up. you seem to be having fun learning all of the ropes.

@rincebrain
Contributor Author

Heh, more like playing whack-a-mole every time I run into an irritation with the FS I've used for so long rather than just shrugging, but thank you.

I'd say it's all enlightened self-interest, but I don't use encryption and keep staring at bugs in that too...

ASSERT3S(se->se_taskqid, ==, TASKQID_INVALID);
rw_exit(&se->se_taskqid_lock);
Contributor


Let's just move this ASSERT down so it's right before the call to taskq_dispatch_delay under the new lock. I think we can live without ASSERTing this in the delay <= 0 case in the name of readability (even though it's true). Plus it's a bit silly to take the lock here when the ASSERT will get compiled out on production builds anyway.

This could then also be changed to a mutex since this seems to be the only place we take it as a reader.

Contributor Author


Oh huh, I pushed an old version - I thought taking the lock for nothing was silly too, and #ifdef ZFS_DEBUG'd it, not sure where that change went in this push...

But sure, that can be done instead.

module/os/linux/zfs/zfs_ctldir.c (resolved)
@tonynguien
Contributor

@rincebrain Would you mind squashing the commits?

@rincebrain
Contributor Author

rincebrain commented Nov 3, 2021

Squash!

e: And again with a pull --rebase to grab 6d680e6...

On Linux, sometimes, when ZFS goes to unmount an automounted snap,
it fails a VERIFY check on debug builds, because taskq_cancel_id
returned ENOENT after not finding the taskq it was trying to cancel.

This presumably happens when it already died for some reason; in this
case, we don't really mind it already being dead, since we're just
going to dispatch a new task to unmount it right after.

So we just ignore it if we get back ENOENT trying to cancel here,
retry a couple times if we get back the only other possible condition
(EBUSY), and log to dbgmsg if we got anything but ENOENT or success.

(We also add some locking around taskqid, to avoid one or two cases
of two instances of trying to cancel something at once.)

Closes: openzfs#11632

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
@tonynguien
Contributor

> Squash!
>
> e: And again with a pull --rebase to grab 6d680e6...

Great. Thanks!

@tonynguien tonynguien merged commit e79b680 into openzfs:master Nov 3, 2021
@tonynguien tonynguien added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 3, 2021
rincebrain added a commit to rincebrain/zfs that referenced this pull request Aug 18, 2023
[Commit message as above.]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#11632
Closes openzfs#12670
behlendorf pushed a commit that referenced this pull request Aug 25, 2023
[Commit message as above.]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #11632
Closes #12670