Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Illumos 5056 - ZFS deadlock on db_mtx and dn_holds #3240

Closed
wants to merge 0 commits into from

Conversation

chrisrd
Copy link
Contributor

@chrisrd chrisrd commented Mar 31, 2015

Pull in Illumos 5056 with a largely mechanical application of Illumos patches.

Let's get the buildbots chewing on this patch stack...

Whilst the individual patches are likely worth applying in their own right at some point, it will be interesting to see if the stack helps with #3225.

@a22f7a0: the freespill stuff in dbuf_free_range() was more than mechanical and I'm not sure I have it right. This was related to @4254acb from @dweeezil and @58806b4 from @nedbass. Would you guys be able to take a look at that in particular?

@kernelOfTruth
Copy link
Contributor

@chrisrd Great work ! 👍

this needs the file policy.h in

lib/libspl/include/sys/policy.h

not exactly sure what contents it needs to contain from https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/policy.h - but I went with the minimum

please adapt in this pull-request: https://github.com/zfsonlinux/zfs/pull/3240.diff the file,
it shows a different file/naming-scheme than the others

/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright 2013, Joyent, Inc. All rights reserved.
 */

#ifndef _LIBSPL_SYS_POLICY_H
#define _LIBSPL_SYS_POLICY_H

#endif /* _LIBSPL_SYS_POLICY_H */

<-- this builds for me (without DEBUG)

let's see if this survives the buildbots ...

@kernelOfTruth
Copy link
Contributor

@chrisrd FYI:

#3241

I'll push more changes, if needed, when anything might come up and I can deal with it

currently pretty busy - but still: let's get this tested and working ASAP

so that @dweeezil and @behlendorf can check whether this does any difference on #3225

@chrisrd
Copy link
Contributor Author

chrisrd commented Mar 31, 2015

@kernelOfTruth The @6dc1dd0 patch does contain the policy.h file in that path.

I'm struggling to understand exactly what the build is doing at the point of failure, it looks like it's extracted stuff from a tar ball and is building in a separate directory, but the policy.h file is missing from the tar ball, e.g. from one of the build failure log files:

Installing zfs-0.6.3-269_g8f8571a.src.rpm
Executing(%prep): /bin/sh -e /tmp/zfs-build--fSgKVAK0/TMP/rpm-tmp.ul0zYZ
+ umask 022
+ cd /tmp/zfs-build--fSgKVAK0/BUILD
+ cd /tmp/zfs-build--fSgKVAK0/BUILD
+ rm -rf zfs-0.6.3
+ /bin/gzip -dc /tmp/zfs-build--fSgKVAK0/SOURCES/zfs-0.6.3.tar.gz
+ /bin/tar -xf -
+ STATUS=0
+ [ 0 -ne 0 ]
+ cd zfs-0.6.3
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ exit 0
Executing(%build): /bin/sh -e /tmp/zfs-build--fSgKVAK0/TMP/rpm-tmp.zrSw2k
+ umask 022
+ cd /tmp/zfs-build--fSgKVAK0/BUILD
+ cd zfs-0.6.3
+ CFLAGS=-O2 -g
+ export CFLAGS
+ CXXFLAGS=-O2 -g
+ export CXXFLAGS
+ FFLAGS=-O2 -g
+ export FFLAGS
+ ./configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/lib64 --libexecdir=/usr/lib/rpm --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --with-config=user --with-udevdir=/lib/udev --with-udevruledir=/lib/udev/rules.d --with-dracutdir=/usr/lib/dracut --disable-static --enable-debug --without-blkid --enable-sysvinit --disable-systemd

...Oh, I just got it! I need to add the file to the Makefile.in in the directory where I added the file, that's used to create the tar ball. I'll push an update soon...

@kernelOfTruth
Copy link
Contributor

I know,

still the naming scheme seems "off" - some files I checked, e.g. vnode.h , types.h

start with

_ LIBSPL _

#ifndef _LIBSPL_SYS_VNODE_H
#define _LIBSPL_SYS_VNODE_H

#endif /* _LIBSPL_SYS_VNODE_H */
#ifndef _LIBSPL_SYS_TYPES_H
#define _LIBSPL_SYS_TYPES_H

anyway - great that you found the cause - let's see if the buildbot can proceed

it didn't in #3241 ;)

(or that Ubuntu 12.04 buildbot is supposed to do that)

@chrisrd
Copy link
Contributor Author

chrisrd commented Mar 31, 2015

OK, let's see how that goes.

I've adjusted the #define per your suggestion whilst I was playing with the patches.

@chrisrd chrisrd force-pushed the illumos-5056 branch 3 times, most recently from a62c0cb to 2d17463 Compare March 31, 2015 16:57
@kernelOfTruth
Copy link
Contributor

okay great, so now it builds 👍

Unable to open /dev/zfs: No such file or directory.
Verify the ZFS module stack is loaded by running '/sbin/modprobe zfs'.
zimport.sh: ZFS modules must be unloaded
program finished with exit code 1

so the modules seemingly aren't being imported ?

or do working

shell_7 test modules

and

shell_9 test ztest

indicate that with

shell_11 test import failed

shell_10 test zconfig failed

show that there are problems with the pool or its configuration other than with the modules ?

...

alright, so the modues are being imported, I'm wondering why the buildbot doesn't show any error (red) on dmesg output

1    persistent zpool.cache             
command timed out: 3600 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1

apparently it's failing a test_1

https://github.com/zfsonlinux/zfs/blob/master/scripts/zconfig.sh

will take a look tomorrow - been a long day & most of this is new to me - never had to deal with these kind of issues 😉

some of you guys probably know by heart what could be the problem

(removed some of the clutter to improve overview)

edit:

[ 1153.590028] VERIFY(!list_is_empty(list)) failed
[ 1153.590048] PANIC at list.h:126:list_remove()

include/sys/list.h

static inline void
list_remove(list_t *list, void *object)
{
    ASSERT(!list_is_empty(list));
    list_del(list_d2l(list, object));
}

@nedbass
Copy link
Contributor

nedbass commented Mar 31, 2015

@chrisrd the freespill handling looks correct to me. Although with the sorted AVL tree I think the for loop will handle the spill block case correctly without all the special handling. I'll do some testing to verify.

@nedbass
Copy link
Contributor

nedbass commented Apr 1, 2015

@chrisrd I am convinced we should drop the special handling involving freespill within the for loop in dbuf_free_range(), and just use that hunk of the Illumos patch as is. The special handling higher up in that function is still needed, however.

@chrisrd
Copy link
Contributor Author

chrisrd commented Apr 1, 2015

@nedbass Thanks - I'll look harder to see if I can follow the reasoning and drop the special handling.

@chrisrd
Copy link
Contributor Author

chrisrd commented Apr 2, 2015

Added 2 new patches fixing issues introduced by Illumos 5056 and causing buildbot failures.

@kernelOfTruth
Copy link
Contributor

./module/zfs/dbuf.c: 2066: line > 80 characters
Makefile:1258: recipe for target 'checkstyle' failed
make: *** [checkstyle] Error 1

wow - besides that everything's green, great job 👍

I'm curious if these changes help with the lockups on ZoL

@chrisrd
Copy link
Contributor Author

chrisrd commented Apr 2, 2015

OK, new push for the style fail

@chrisrd
Copy link
Contributor Author

chrisrd commented Apr 2, 2015

Note: @49ce3ad - Patch out crgetzoneid() (don't pull) is temporary, required for building with Illumos 3897, until openzfs/spl#444 is applied.

@behlendorf
Copy link
Contributor

@chrisrd you can now drop Patch out crgetzomeid() from this patch stack, I've merged the small patch you proposed in to the spl master branch. It would be great if you could refresh and push this patch stack again.

@chrisrd
Copy link
Contributor Author

chrisrd commented Apr 2, 2015

@behlendorf Thanks - Patch out crgetzomeid() dropped and stack pushed again

One of the two buildbot failure (prior to push) looks like a test harness failure rather than related to the patch stack:

http://buildbot.zfsonlinux.org/builders/ubuntu-10.04.3-amd64-builder/builds/2717/steps/git_1/logs/stdio

git_1 cloning zfs
...
error: RPC failed; result=22, HTTP code = 0

The other is #3075:

http://buildbot.zfsonlinux.org/builders/ubuntu-14.04-amd64-current-builder/builds/2411/steps/shell_17/logs/stdio

[ 2152.253306] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 2152.253317] bad because of flags:
[ 2152.253324] flags: 0x2000(writeback)

@chrisrd chrisrd force-pushed the illumos-5056 branch 3 times, most recently from e273b87 to 3be7851 Compare April 3, 2015 06:18
@odoucet
Copy link

odoucet commented Apr 3, 2015

Redone my test experienced here : #3190 (comment) with the pull request #3240 (just before "Patch out crgetzomeid()" drop).
Reminder : It is a 100% read work on 16 threads that was hanging after a few hours.
Conclusion : hanging again with this patch :(

Apr  3 10:42:12 filer4a kernel: INFO: task cat:8406 blocked for more than 120 seconds.
Apr  3 10:42:12 filer4a kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  3 10:42:12 filer4a kernel: cat             D ffffffff81810d00     0  8406  16123 0x00000080
Apr  3 10:42:12 filer4a kernel: ffff881607bbb9c8 0000000000000082 ffff881607bbbfd8 0000000000013140
Apr  3 10:42:12 filer4a kernel: ffff881607bba010 0000000000013140 0000000000013140 0000000000013140
Apr  3 10:42:12 filer4a kernel: ffff881607bbbfd8 0000000000013140 ffff88180087c480 ffff8818069a84c0
Apr  3 10:42:12 filer4a kernel: Call Trace:
Apr  3 10:42:12 filer4a kernel: [<ffffffff815eee69>] schedule+0x29/0x70
Apr  3 10:42:12 filer4a kernel: [<ffffffff815ef17e>] schedule_preempt_disabled+0xe/0x10
Apr  3 10:42:12 filer4a kernel: [<ffffffff815ed8d4>] __mutex_lock_slowpath+0x194/0x240
Apr  3 10:42:12 filer4a kernel: [<ffffffff815ed71b>] mutex_lock+0x2b/0x50
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02ddfcf>] buf_hash_find+0x9f/0x1f0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02e75fe>] arc_read+0xfe/0xaf0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02ec9e5>] ? dbuf_rele_and_unlock+0x265/0x4d0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02ed270>] ? dbuf_fill_done+0xf0/0xf0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02ed748>] dbuf_read_impl+0x278/0x440 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02ef195>] ? __dbuf_hold_impl+0x305/0x580 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02eea02>] dbuf_read+0x102/0x590 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02f8867>] dmu_buf_hold_array_by_dnode+0x127/0x5c0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02f8f85>] dmu_buf_hold_array+0x65/0x90 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa02f8ff1>] dmu_read_uio+0x41/0xd0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa038487c>] ? zfs_range_lock+0x12c/0x180 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa039009a>] zfs_read+0x12a/0x2f0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffffa03a82cb>] zpl_read+0x9b/0xe0 [zfs]
Apr  3 10:42:12 filer4a kernel: [<ffffffff8119bac1>] vfs_read+0xb1/0x130
Apr  3 10:42:12 filer4a kernel: [<ffffffff8119bf4f>] SyS_read+0x5f/0xa0
Apr  3 10:42:12 filer4a kernel: [<ffffffff810e1476>] ? __audit_syscall_exit+0x246/0x2f0
Apr  3 10:42:12 filer4a kernel: [<ffffffff815f9479>] system_call_fastpath+0x16/0x1b

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Apr 4, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Apr 4, 2015
rejects due to openzfs#2129 in:

module/zfs/dbuf.c
module/zfs/zap.c
module/zfs/dsl_dir.c
module/zfs/dsl_dataset.c
module/zfs/dnode.c
@chrisrd
Copy link
Contributor Author

chrisrd commented Apr 5, 2015

Crap! I didn't mean to close this. Again (#3239). I wanted the buildbots to work on the patch stack on top of the newly merged code, so I rebased my local branch and force pushed it. After 15m the new commits hadn't shown up in this pull request, so I thought I'd delete all the changes on my side, force push again, then reapply the patch stack and push once more. But the force push closed the pull request and the 'Reopen and comment' button is greyed out. How is this supposed to work?!?! Sigh. Looks like I have to do another pull request: #3254

@kernelOfTruth kernelOfTruth mentioned this pull request May 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants