Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NFSd unresposive #1790

Closed
ghost opened this issue Oct 16, 2013 · 8 comments
Closed

NFSd unresposive #1790

ghost opened this issue Oct 16, 2013 · 8 comments
Labels
Type: Performance Performance improvement or performance problem
Milestone

Comments

@ghost
Copy link

ghost commented Oct 16, 2013

Hi
we experience frequent lockups of our nfs server (running on kernel RHEL6 2.6.32-358.18.1 with zfsonlinux 0.6.2). Client operations are timing out. The server dumps messages like the one posted below. During the lockups txg_sync seems busy. All file systems are still accessible locally on the server.

Oct 15 23:54:21 fs2 kernel: INFO: task nfsd:6108 blocked for more than 120 seconds.
Oct 15 23:54:21 fs2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 23:54:21 fs2 kernel: nfsd D 0000000000000000 0 6108 2 0x00000080
Oct 15 23:54:21 fs2 kernel: ffff880357cb9740 0000000000000046 ffff88035d224ae0 0000000000000000
Oct 15 23:54:21 fs2 kernel: ffff880357cb9740 ffffffffa01a203a ffff8803611d0c88 ffff880300000000
Oct 15 23:54:21 fs2 kernel: ffff88035d225098 ffff880357cb9fd8 000000000000fb88 ffff88035d225098
Oct 15 23:54:21 fs2 kernel: Call Trace:
Oct 15 23:54:21 fs2 kernel: [] ? kmem_alloc_debug+0x13a/0x4c0 [spl]
Oct 15 23:54:21 fs2 kernel: [] ? prepare_to_wait_exclusive+0x4e/0x80
Oct 15 23:54:21 fs2 kernel: [] cv_wait_common+0x105/0x1c0 [spl]
Oct 15 23:54:21 fs2 kernel: [] ? autoremove_wake_function+0x0/0x40
Oct 15 23:54:21 fs2 kernel: [] ? avl_find+0x60/0xb0 [zavl]
Oct 15 23:54:21 fs2 kernel: [] __cv_wait+0x15/0x20 [spl]
Oct 15 23:54:21 fs2 kernel: [] zfs_range_lock+0x2ac/0x5c0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? mutex_lock+0x1e/0x50
Oct 15 23:54:21 fs2 kernel: [] zfs_write+0x5e7/0xca0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? dnode_hold_impl+0x347/0x560 [zfs]
Oct 15 23:54:21 fs2 kernel: [] zpl_write_common+0x52/0x70 [zfs]
Oct 15 23:54:21 fs2 kernel: [] zpl_write+0x68/0xa0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? zpl_write+0x0/0xa0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] do_loop_readv_writev+0x59/0x90
Oct 15 23:54:21 fs2 kernel: [] do_readv_writev+0x1e6/0x1f0
Oct 15 23:54:21 fs2 kernel: [] ? tsd_exit+0x5f/0x2b0 [spl]
Oct 15 23:54:21 fs2 kernel: [] ? nfsd_setuser+0x126/0x2c0 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] ? mutex_lock+0x1e/0x50
Oct 15 23:54:21 fs2 kernel: [] ? zfs_open+0xa5/0x140 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? zpl_open+0x0/0x90 [zfs]
Oct 15 23:54:21 fs2 kernel: [] vfs_writev+0x46/0x60
Oct 15 23:54:21 fs2 kernel: [] nfsd_vfs_write+0x105/0x430 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] ? dentry_open+0x52/0xc0
Oct 15 23:54:21 fs2 kernel: [] ? nfsd_open+0x1db/0x230 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] nfsd_write+0xe7/0x100 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] nfsd3_proc_write+0xaf/0x140 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] nfsd_dispatch+0xfe/0x240 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] svc_process_common+0x344/0x640 [sunrpc]
Oct 15 23:54:21 fs2 kernel: [] ? default_wake_function+0x0/0x20
Oct 15 23:54:21 fs2 kernel: [] svc_process+0x110/0x160 [sunrpc]
Oct 15 23:54:21 fs2 kernel: [] nfsd+0xc2/0x160 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] ? nfsd+0x0/0x160 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] kthread+0x96/0xa0
Oct 15 23:54:21 fs2 kernel: [] child_rip+0xa/0x20
Oct 15 23:54:21 fs2 kernel: [] ? kthread+0x0/0xa0
Oct 15 23:54:21 fs2 kernel: [] ? child_rip+0x0/0x20

Oct 15 23:54:21 fs2 kernel: INFO: task nfsd:6113 blocked for more than 120 seconds.
Oct 15 23:54:21 fs2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 23:54:21 fs2 kernel: nfsd D 0000000000000007 0 6113 2 0x00000080
Oct 15 23:54:21 fs2 kernel: ffff8803579e3740 0000000000000046 ffff8803579e3708 ffff8803579e3704
Oct 15 23:54:21 fs2 kernel: ffff8803579e3740 ffff88037fc25000 ffff8800282b6700 000000000000b15b
Oct 15 23:54:21 fs2 kernel: ffff8803609ddab8 ffff8803579e3fd8 000000000000fb88 ffff8803609ddab8
Oct 15 23:54:21 fs2 kernel: Call Trace:
Oct 15 23:54:21 fs2 kernel: [] cv_wait_common+0x105/0x1c0 [spl]
Oct 15 23:54:21 fs2 kernel: [] ? autoremove_wake_function+0x0/0x40
Oct 15 23:54:21 fs2 kernel: [] ? avl_find+0x60/0xb0 [zavl]
Oct 15 23:54:21 fs2 kernel: [] __cv_wait+0x15/0x20 [spl]
Oct 15 23:54:21 fs2 kernel: [] zfs_range_lock+0x2ac/0x5c0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? zfs_inode_update+0x110/0x180 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? mutex_lock+0x1e/0x50
Oct 15 23:54:21 fs2 kernel: [] zfs_write+0x5e7/0xca0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? dnode_hold_impl+0x347/0x560 [zfs]
Oct 15 23:54:21 fs2 kernel: [] zpl_write_common+0x52/0x70 [zfs]
Oct 15 23:54:21 fs2 kernel: [] zpl_write+0x68/0xa0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? zpl_write+0x0/0xa0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] do_loop_readv_writev+0x59/0x90
Oct 15 23:54:21 fs2 kernel: [] do_readv_writev+0x1e6/0x1f0
Oct 15 23:54:21 fs2 kernel: [] ? tsd_exit+0x5f/0x2b0 [spl]
Oct 15 23:54:21 fs2 kernel: [] ? nfsd_setuser+0x126/0x2c0 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] ? mutex_lock+0x1e/0x50
Oct 15 23:54:21 fs2 kernel: [] ? zfs_open+0xa5/0x140 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? zpl_open+0x0/0x90 [zfs]
Oct 15 23:54:21 fs2 kernel: [] vfs_writev+0x46/0x60
Oct 15 23:54:21 fs2 kernel: [] nfsd_vfs_write+0x105/0x430 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] ? dentry_open+0x52/0xc0
Oct 15 23:54:21 fs2 kernel: [] ? nfsd_open+0x1db/0x230 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] nfsd_write+0xe7/0x100 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] nfsd3_proc_write+0xaf/0x140 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] nfsd_dispatch+0xfe/0x240 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] svc_process_common+0x344/0x640 [sunrpc]
Oct 15 23:54:21 fs2 kernel: [] ? default_wake_function+0x0/0x20
Oct 15 23:54:21 fs2 kernel: [] svc_process+0x110/0x160 [sunrpc]
Oct 15 23:54:21 fs2 kernel: [] nfsd+0xc2/0x160 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] ? nfsd+0x0/0x160 [nfsd]
Oct 15 23:54:21 fs2 kernel: [] kthread+0x96/0xa0
Oct 15 23:54:21 fs2 kernel: [] child_rip+0xa/0x20
Oct 15 23:54:21 fs2 kernel: [] ? kthread+0x0/0xa0
Oct 15 23:54:21 fs2 kernel: [] ? child_rip+0x0/0x20

@ghost
Copy link
Author

ghost commented Oct 22, 2013

The lockup happens when the arc size reaches it's target size. txg_sync is busy with 100% IO. After a while the situation might recover. Do you have a hint how this can be debugged?

@behlendorf
Copy link
Contributor

Could you dump all the stack from the system with sysrq-t? In particular, we need to see what the txg_sync process is doing. The above stacks suggest just a very highly contented region of a file, they're blocked waiting on a range lock.

@prometheanfire
Copy link
Contributor

I've had similiar performance issues as well, I rolled back to 0.6.1 as it was more stable for me.

#1724

@ryao
Copy link
Contributor

ryao commented Oct 24, 2013

The similarity to issue #1724 suggests that this could be issue #1720, which was fixed by 7a61440. It is a regression that was introduced in 0.6.2, so 0.6.1 is unaffected. The fix will be in 0.6.3, but is trivial to backport to 0.6.2. Gentoo's ZFS packaging has the backport in 0.6.2-r2.

@ghost
Copy link
Author

ghost commented Oct 25, 2013

In the meantime we switched back to 0.6.1. Everything is running fine so far.

@behlendorf
Copy link
Contributor

@jsunderm, @prometheanfire If it's convenient it would great if you could verify that this doesn't happen in master and that this is another duplicate of #1720.

@prometheanfire
Copy link
Contributor

Fix confirmed for me

@behlendorf
Copy link
Contributor

Confirmed fixed in master. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Performance Performance improvement or performance problem
Projects
None yet
Development

No branches or pull requests

3 participants