
Workaround issue cleaning up automounted snapshots on Linux #12670

Merged
tonynguien merged 1 commit into openzfs:master
Nov 3, 2021

Conversation

rincebrain
Contributor

Motivation and Context

Getting annoyed by #11632 popping up

Description

Sometimes, when cleaning up automatic .zfs/snapshot mounts, we go to cancel taskqids that we subsequently can't find before dispatching a new unmount task. On a non-DEBUG build, it just barrels ahead and spawns a new one even though the cancel errored out and didn't set taskqid to TASKQID_INVALID. On a DEBUG build, we get a VERIFY failure instead.

Since a failed cancel seems harmless to ignore here, the logic now shrugs and treats ENOENT (what the cancel returns when the task can't be found) as success, and retries a couple of times before giving up, as before, on EBUSY (the only other status the cancel call can return). A dbgmsg print is added when the retry logic is hit, to make it easier to see what's going on.

How Has This Been Tested?

Well, on my 2.0.6 system, where this was happening regularly after a lot of snapshots were mounted, a previous version (which printed on hitting the retry logic and didn't treat ENOENT as success) printed at most once before succeeding on a subsequent attempt.

(Curiously, at least once it printed, still only once, that it was retrying on taskqid 0 (== TASKQID_INVALID), so...is it getting here before taskqid is properly set/updated somewhere sometimes, somehow, and then it gets updated by someone else in situ? Because otherwise, it should have printed 5 times...hm.)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@bghira

bghira commented Oct 27, 2021

i have no comments on this PR other than to say your work on the project hasn't gone unnoticed and i wanted to thank you for picking it up. you seem to be having fun learning all of the ropes.

@rincebrain
Contributor Author

Heh, more like playing whack-a-mole every time I run into an irritation with the FS I've used for so long rather than just shrugging, but thank you.

I'd say it's all enlightened self-interest, but I don't use encryption and keep staring at bugs in that too...

ASSERT3S(se->se_taskqid, ==, TASKQID_INVALID);
rw_exit(&se->se_taskqid_lock);
Contributor


Let's just move this ASSERT down so it's right before the call to taskq_dispatch_delay under the new lock. I think we can live without ASSERTing this in the delay <= 0 case in the name of readability (even though it's true). Plus it's a bit silly to take the lock here when the ASSERT will get compiled out on production builds anyway.

This could then also be changed to a mutex since this seems to be the only place we take it as a reader.

Contributor Author


Oh huh, I pushed an old version - I thought taking the lock for nothing was silly too, and #ifdef ZFS_DEBUG'd it, not sure where that change went in this push...

But sure, that can be done instead.

module/os/linux/zfs/zfs_ctldir.c (resolved)
@tonynguien
Contributor

@rincebrain Would you mind squashing the commits?

@rincebrain
Contributor Author

rincebrain commented Nov 3, 2021

Squash!

e: And again with a pull --rebase to grab 6d680e6...

On Linux, sometimes, when ZFS goes to unmount an automounted snap,
it fails a VERIFY check on debug builds, because taskq_cancel_id
returned ENOENT after not finding the taskq it was trying to cancel.

This presumably happens when it already died for some reason; in this
case, we don't really mind it already being dead, since we're just
going to dispatch a new task to unmount it right after.

So we just ignore it if we get back ENOENT trying to cancel here,
retry a couple times if we get back the only other possible condition
(EBUSY), and log to dbgmsg if we got anything but ENOENT or success.

(We also add some locking around taskqid, to avoid one or two cases
of two instances of trying to cancel something at once.)

Closes: openzfs#11632

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
@tonynguien
Contributor

> Squash!
>
> e: And again with a pull --rebase to grab 6d680e6...

Great. Thanks!

@tonynguien tonynguien merged commit e79b680 into openzfs:master Nov 3, 2021
@tonynguien tonynguien added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 3, 2021
rincebrain added a commit to rincebrain/zfs that referenced this pull request Aug 18, 2023
[Commit message as above.]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#11632
Closes openzfs#12670
behlendorf pushed a commit that referenced this pull request Aug 25, 2023
[Commit message as above.]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #11632
Closes #12670