-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Break out of zfs_zget early if unlinked znode #9583
Conversation
Can we please have more of these awesomely documented PR's? 👍 |
module/os/linux/zfs/zfs_znode.c
Outdated
@@ -1111,6 +1111,10 @@ zfs_zget(zfsvfs_t *zfsvfs, uint64_t obj_num, znode_t **zpp) | |||
mutex_exit(&zp->z_lock); | |||
sa_buf_rele(db, NULL); | |||
zfs_znode_hold_exit(zfsvfs, zh); | |||
/* if znode is already marked for deletion, break out early */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line throws a style tantrum, might want to rephrase or split
(basically it's too long, over 80 because it seems to count the tabs as multiple characters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, thanks for the notice! I didn't realize tabs count as 8 spaces, but it should be addressed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know either, maybe rewrite the test a bit someday... Because this should've fitted fine.
So not really your fault either :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your analysis and proposed fix are spot on, thank you for so clearly documenting the issue and getting to the root cause!
module/os/linux/zfs/zfs_znode.c
Outdated
/* if znode is already marked for deletion, break out early */ | ||
if (zp->z_unlinked) { | ||
return (SET_ERROR(ENOENT)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good, I'd just suggest we restructure this block a little bit. What do you think of this (untested) version, it has a couple small advantages:
- This allows us to skip the
igrab()
entirely whenzp->z_unlinked
is set. - There's a single
mutex_exit() / sa_buf_rele() / ZFS_OBJ_HOLD_EXIT()
path. - It aligns this logic a little more closely with the FreeBSD and illumos versions of this function (if you squint). All of the platforms need this same unlinked check, but the VFS integration specifics are different on each platform.
mutex_enter(&zp->z_lock);
ASSERT3U(zp->z_id, ==, obj_num);
/*
* If zp->z_unlinked is set then the znode is marked
* for deletion and should not be discoverable.
*
* If igrab() returns NULL the VFS has independently
* determined the inode should be evicted and has
* called iput_final() to start the eviction process.
* The SA handle is still valid but because the VFS
* requires that the eviction succeed we must drop
* our locks and references to allow the eviction to
* complete. The zfs_zget() may then be retried.
*/
if (zp->z_unlinked) {
err = SET_ERROR(ENOENT);
} else if (igrab(ZTOI(zp)) == NULL) {
err = SET_ERROR(EAGAIN);
cond_resched();
} else {
*zpp = zp;
err = 0;
}
mutex_exit(&zp->z_lock);
sa_buf_rele(db, NULL);
zfs_znode_hold_exit(zfsvfs, zh);
if (err == EAGAIN)
goto again;
return (err);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good to me too, but I like the revised version better. I suggest we go with that and add an original-patch-by line to the commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should have the if (zp->z_unlinked) err = ENOENT;
code, like illumos. @behlendorf's way of doing that seems cleanest to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review! I think we should go with this version as well, it looks much cleaner and has the mentioned advantages over the original patch. Apologies for the initial "naive" fix, I'm still finding my way around the ZFS code and didn't stop to think and look at the illumos code...
I can respin the PR to be more aligned with this one, or we can go with @ryao's original-patch-by suggestion. Either one is fine by me, so please let me know how you would like to proceed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. We're regression testing this, and will double-check with the user/reporter specific workload that it's all good. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hrasiq How about: Who ever gets it in for review first?
It's not worth the time debating who is sending it in.
And don't worry about it, it's not a naive fix you did... People having feedback doesn't mean your code is bad or naive ;)
@mfoliveira What are you regression testing? The proposed changes in this Review or the original fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ornias1993 The proposed changes (reported to be untested). The original fix has been regression tested for the submission.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ornias1993 In that case, allow me to re-submit this one :)
Codecov Report
@@ Coverage Diff @@
## master #9583 +/- ##
==========================================
+ Coverage 79.23% 79.23% +<.01%
==========================================
Files 419 418 -1
Lines 123696 123686 -10
==========================================
- Hits 98014 98008 -6
+ Misses 25682 25678 -4
Continue to review full report at Codecov.
|
If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com>
I've integrated the changes suggested by @behlendorf, but moved the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hadfl thanks for updating this, the refreshed version looks good to me.
@mfoliveira @hrasiq merged, thanks again for running this issue down and verifying the fix! We'll make sure to get this backported for the next point release. |
It looks like Sorry for the delay in getting back to this; had a holiday/travel/conference. |
@mfoliveira Don't be sorry for taking just a few days to get a PR into an opensource project... Take care! :) |
The changes in commit 41e1aa2 / PR #9583 introduced a regression on tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started to fail with errno ENODATA: openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3 <...> fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA The originally proposed change on PR #9583 is not susceptible to it, so just move the code/if-checks around back in that way, to fix it. Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes #9602
If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Closes openzfs#9583
The changes in commit 41e1aa2 / PR openzfs#9583 introduced a regression on tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started to fail with errno ENODATA: openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3 <...> fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA The originally proposed change on PR openzfs#9583 is not susceptible to it, so just move the code/if-checks around back in that way, to fix it. Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes openzfs#9602
If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Closes openzfs#9583
The changes in commit 41e1aa2 / PR openzfs#9583 introduced a regression on tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started to fail with errno ENODATA: openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3 <...> fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA The originally proposed change on PR openzfs#9583 is not susceptible to it, so just move the code/if-checks around back in that way, to fix it. Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes openzfs#9602
If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Closes #9583
The changes in commit 41e1aa2 / PR #9583 introduced a regression on tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started to fail with errno ENODATA: openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3 <...> fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA The originally proposed change on PR #9583 is not susceptible to it, so just move the code/if-checks around back in that way, to fix it. Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes #9602
If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Closes openzfs#9583
The changes in commit 41e1aa2 / PR openzfs#9583 introduced a regression on tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started to fail with errno ENODATA: openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3 <...> fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA The originally proposed change on PR openzfs#9583 is not susceptible to it, so just move the code/if-checks around back in that way, to fix it. Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes openzfs#9602
If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Closes openzfs#9583
This patch is an RFC for a specific issue we've had Ubuntu users report
lately. From our testing, it looks like a valid fix but I'm unsure if it's the
most appropriate one. I would really appreciate suggestions on how to improve it
or any feedback about how to do it in a more fitting manner.
Thanks!
Motivation and Context
We've seen users hitting hangs and blocked tasks in Ubuntu, seemingly due to an
issue in ZFS commit. Usually, we have stack traces like below in the kernel
logs:
The above was collected from an Ubuntu Bionic system running ZFSonLinux 0.8.1
and a mysql workload. After doing some investigation with the help of crash
dumps (excerpts below), we believe this might be a race between the
z_iput
/evict
thread and the ZFS writeback thread.Crash dump analysis
The kworker thread that's currently doing the commit/writeback doesn't seem to
be stalled, but actively working on the writeback up until the hung_task_timeout
NMI:
It looks like the writeback thread is looping in this section of
zfs_zget()
:I've fished around in the kworker's stack to see if we could dig out the inode
that would cause
igrab()
to return NULL:So, the inode lies in
0xffffa04acf9fd2a8
and its state is set to0xa7
. In include/linux/fs.h:The important bit here is
I_FREEING
. This looks to be the case mentioned bythe comment above the
igrab()
call, asigrab()
will return NULL if the inodeis currently being evicted (i.e.
I_FREEING
is set). Furthermore,i_sb_list
has been deleted from this inode, but
i_hash
still has some entries:Looking at the code in
evict()
, this suggests that this inode is currently sitting ininode_wait_for_writeback()
. If we check thez_iput
task:That looks suspiciously close to the
evict()
stack we should be seeing. It alsolooks like
z_iput
is trying to evict our specific inode:It seems that the
z_iput
thread is trying to evict an inode, but is currentlywaiting until the inode has gone through the writeback thread. The writeback
thread on the other hand seems to be locked waiting for the inode eviction to
complete (due to the
igrab()
loop).Description
We've noticed that these issues showed up on systems with slow mechanical
storage, and no SSD or caching devices. We've also noticed that the znodes
responsible for these hangs always had z_unlinked set, indicating that we were
trying to commit files that have since been deleted.
In these cases we've seen,
zfs_get_data()
blocks onzfs_zget()
and can'tcheck if the znode has been marked for deletion (it would return
ENOENT
inthese cases and the writeback thread would continue). This patch tries to detect
if the znode has been marked for deletion inside
zfs_zget()
, and breaks outearly with
ENOENT
if that's the case.How Has This Been Tested?
We've been testing this patch in some Ubuntu environments with the Bionic 4.15
kernel and varied mysql workloads, and results look good so far. We haven't seen
the lockups anymore, and no major breakage of other ZFS functionality has been
observed. This patch has also been tested with the ZFS test suite, and no
obvious regressions have been noticed.
Types of changes
Checklist:
Signed-off-by
.