long hung tasks #796

Closed
mooinglemur opened this issue Jun 23, 2012 · 6 comments
@mooinglemur

Filing bug at request of ryao.

Kernel 3.4.1-vs2.3.3.4 with 3.4 patch from https://bugs.gentoo.org/show_bug.cgi?id=416685

Using git sources of both spl and zfs.

I rebooted into this kernel, and my scrub continued. I then initiated a zfs destroy -r on a filesystem with a few dozen snapshots, and then an rm -r of a large tree in a different filesystem on the same pool. I got some hung tasks for several minutes, but the system eventually unfroze and continued.
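
For reference, the workload described above amounts to roughly the following (dataset names here are hypothetical; bkup0 is the pool from the status output, the scrub had already been started before the reboot):

```shell
# Sketch of the reproduction workload (hypothetical dataset/path names)
zpool scrub bkup0                    # scrub was already in progress across the reboot
zfs destroy -r bkup0/somefs          # filesystem with a few dozen snapshots
rm -r /bkup0/otherfs/large-tree      # large tree in a different fs on the same pool
```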

pool: bkup0
state: ONLINE
scan: scrub in progress since Sun Jun 17 00:50:01 2012
1.59T scanned out of 10.9T at 1.57M/s, (scan is slow, no estimated time)
0 repaired, 14.58% done
config:

    NAME                STATE     READ WRITE CKSUM
    bkup0               ONLINE       0     0     0
      raidz2-0          ONLINE       0     0     0
        dm-name-zfs0_1  ONLINE       0     0     0
        dm-name-zfs0_2  ONLINE       0     0     0
        dm-name-zfs0_3  ONLINE       0     0     0
        dm-name-zfs0_4  ONLINE       0     0     0
        dm-name-zfs0_5  ONLINE       0     0     0
        dm-name-zfs0_6  ONLINE       0     0     0
        dm-name-zfs0_7  ONLINE       0     0     0
        dm-name-zfs0_8  ONLINE       0     0     0

errors: No known data errors

pool: bkup1
state: ONLINE
scan: resilvered 324G in 42h22m with 0 errors on Fri Jun 15 20:11:17 2012
config:

    NAME                                             STATE     READ WRITE CKSUM
    bkup1                                            ONLINE       0     0     0
      raidz1-0                                       ONLINE       0     0     0
        scsi-SATA_ST31000340AS_5QJ0LDCZ-part1        ONLINE       0     0     0
        scsi-SATA_ST31000340NS_9QJ0DJ64-part1        ONLINE       0     0     0
        scsi-SATA_ST31000340NS_3QJ09MHL-part1        ONLINE       0     0     0
        scsi-SATA_ST31000333AS_9TE1HKZ4-part1        ONLINE       0     0     0
        scsi-SATA_ST31000340AS_9QJ1HVQ9-part1        ONLINE       0     0     0
    spares
      scsi-SATA_SAMSUNG_HD103UJS13PJ1LS625210-part1  AVAIL   

errors: No known data errors

[ 599.525371] INFO: task txg_sync:6809 blocked for more than 120 seconds.
[ 599.525375] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 599.525379] txg_sync D ffff88053fcd2400 0 6809 2 0x00000000
[ 599.525387] ffff8805008ef9e0 0000000000000046 ffff880521e82b60 0000000000012400
[ 599.525395] ffff8805008effd8 ffff8805008ee000 0000000000012400 0000000000004000
[ 599.525401] ffff8805008effd8 0000000000012400 ffff88052a208ec0 ffff880521e82b60
[ 599.525408] Call Trace:
[ 599.525419] [] ? _raw_spin_unlock_irqrestore+0x2f/0x40
[ 599.525427] [] ? try_to_wake_up+0xce/0x2a0
[ 599.525433] [] schedule+0x24/0x70
[ 599.525443] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 599.525446] [] ? wake_up_bit+0x40/0x40
[ 599.525449] [] __cv_wait+0xe/0x10 [spl]
[ 599.525457] [] zio_wait+0xf3/0x1b0 [zfs]
[ 599.525465] [] dsl_dataset_destroy_sync+0xa7c/0x1080 [zfs]
[ 599.525468] [] ? __cv_timedwait+0xc5/0x1a0 [spl]
[ 599.525477] [] dsl_sync_task_group_sync+0x123/0x3d0 [zfs]
[ 599.525485] [] dsl_pool_sync+0x20b/0x490 [zfs]
[ 599.525496] [] spa_sync+0x3a2/0x10b0 [zfs]
[ 599.525504] [] txg_init+0x4cb/0x9b0 [zfs]
[ 599.525513] [] ? txg_init+0x210/0x9b0 [zfs]
[ 599.525515] [] ? __thread_create+0x310/0x3a0 [spl]
[ 599.525518] [] __thread_create+0x383/0x3a0 [spl]
[ 599.525520] [] ? __thread_create+0x310/0x3a0 [spl]
[ 599.525522] [] kthread+0x96/0xa0
[ 599.525525] [] kernel_thread_helper+0x4/0x10
[ 599.525527] [] ? retint_restore_args+0xe/0xe
[ 599.525529] [] ? kthread_worker_fn+0x140/0x140
[ 599.525531] [] ? gs_change+0xb/0xb
[ 599.525548] INFO: task zfs:10166 blocked for more than 120 seconds.
[ 599.525549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 599.525550] zfs D ffff88053fc52400 0 10166 10125 0x00000000
[ 599.525552] ffff8804d717fbe8 0000000000000086 ffff880521e824c0 0000000000012400
[ 599.525555] ffff8804d717ffd8 ffff8804d717e000 0000000000012400 0000000000004000
[ 599.525557] ffff8804d717ffd8 0000000000012400 ffff88052a1380c0 ffff880521e824c0
[ 599.525560] Call Trace:
[ 599.525562] [] ? spl_debug_get_subsys+0x51/0x460 [spl]
[ 599.525565] [] ? spl_debug_msg+0x413/0x970 [spl]
[ 599.525567] [] schedule+0x24/0x70
[ 599.525569] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 599.525571] [] ? wake_up_bit+0x40/0x40
[ 599.525573] [] __cv_wait+0xe/0x10 [spl]
[ 599.525581] [] txg_wait_synced+0xb3/0x190 [zfs]
[ 599.525590] [] dsl_sync_task_group_wait+0x109/0x240 [zfs]
[ 599.525597] [] ? dsl_dataset_disown+0x240/0x240 [zfs]
[ 599.525604] [] ? dsl_destroy_inconsistent+0x1c0/0x1c0 [zfs]
[ 599.525612] [] dsl_sync_task_do+0x54/0x80 [zfs]
[ 599.525620] [] dsl_dataset_destroy+0x132/0x490 [zfs]
[ 599.525627] [] ? dsl_dataset_tryown+0x49/0x120 [zfs]
[ 599.525633] [] dmu_objset_destroy+0x36/0x40 [zfs]
[ 599.525639] [] zfs_unmount_snap+0x253/0x480 [zfs]
[ 599.525646] [] pool_status_check+0x196/0x270 [zfs]
[ 599.525648] [] do_vfs_ioctl+0x96/0x500
[ 599.525650] [] ? alloc_fd+0x47/0x140
[ 599.525652] [] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 599.525654] [] sys_ioctl+0x4a/0x80
[ 599.525656] [] system_call_fastpath+0x16/0x1b
[ 599.525658] INFO: task rm:10209 blocked for more than 120 seconds.
[ 599.525659] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 599.525660] rm D ffff88053fc52400 0 10209 10168 0x00000000
[ 599.525662] ffff8804dfff5b48 0000000000000086 ffff880515a5b7e0 0000000000012400
[ 599.525664] ffff8804dfff5fd8 ffff8804dfff4000 0000000000012400 0000000000004000
[ 599.525667] ffff8804dfff5fd8 0000000000012400 ffff88052a1380c0 ffff880515a5b7e0
[ 599.525669] Call Trace:
[ 599.525671] [] ? _raw_spin_unlock_irqrestore+0x2f/0x40
[ 599.525673] [] ? try_to_wake_up+0xce/0x2a0
[ 599.525675] [] schedule+0x24/0x70
[ 599.525677] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 599.525679] [] ? wake_up_bit+0x40/0x40
[ 599.525681] [] __cv_wait+0xe/0x10 [spl]
[ 599.525689] [] txg_wait_open+0x83/0x110 [zfs]
[ 599.525696] [] dmu_tx_wait+0xed/0xf0 [zfs]
[ 599.525703] [] dmu_tx_assign+0x86/0xe90 [zfs]
[ 599.525708] [] ? dmu_buf_rele+0x2b/0x40 [zfs]
[ 599.525714] [] zfs_rmnode+0x127/0x340 [zfs]
[ 599.525720] [] zfs_zinactive+0x8b/0x100 [zfs]
[ 599.525725] [] zfs_inactive+0x66/0x210 [zfs]
[ 599.525729] [] zpl_vap_init+0x603/0x6c0 [zfs]
[ 599.525732] [] evict+0xa6/0x1b0
[ 599.525734] [] iput+0x103/0x210
[ 599.525735] [] do_unlinkat+0x157/0x1c0
[ 599.525738] [] ? sys_newfstatat+0x2e/0x40
[ 599.525740] [] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 599.525742] [] sys_unlinkat+0x1d/0x40
[ 599.525743] [] system_call_fastpath+0x16/0x1b
[ 719.262968] INFO: task txg_sync:6809 blocked for more than 120 seconds.
[ 719.262973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 719.262977] txg_sync D ffff88053fcd2400 0 6809 2 0x00000000
[ 719.262984] ffff8805008ef9e0 0000000000000046 ffff880521e82b60 0000000000012400
[ 719.262992] ffff8805008effd8 ffff8805008ee000 0000000000012400 0000000000004000
[ 719.262999] ffff8805008effd8 0000000000012400 ffff88052a208ec0 ffff880521e82b60
[ 719.263006] Call Trace:
[ 719.263016] [] ? _raw_spin_unlock_irqrestore+0x2f/0x40
[ 719.263024] [] ? try_to_wake_up+0xce/0x2a0
[ 719.263030] [] schedule+0x24/0x70
[ 719.263040] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 719.263046] [] ? wake_up_bit+0x40/0x40
[ 719.263053] [] __cv_wait+0xe/0x10 [spl]
[ 719.263069] [] zio_wait+0xf3/0x1b0 [zfs]
[ 719.263091] [] dsl_dataset_destroy_sync+0xa7c/0x1080 [zfs]
[ 719.263099] [] ? __cv_timedwait+0xc5/0x1a0 [spl]
[ 719.263115] [] dsl_sync_task_group_sync+0x123/0x3d0 [zfs]
[ 719.263122] [] dsl_pool_sync+0x20b/0x490 [zfs]
[ 719.263131] [] spa_sync+0x3a2/0x10b0 [zfs]
[ 719.263139] [] txg_init+0x4cb/0x9b0 [zfs]
[ 719.263148] [] ? txg_init+0x210/0x9b0 [zfs]
[ 719.263150] [] ? __thread_create+0x310/0x3a0 [spl]
[ 719.263153] [] __thread_create+0x383/0x3a0 [spl]
[ 719.263155] [] ? __thread_create+0x310/0x3a0 [spl]
[ 719.263157] [] kthread+0x96/0xa0
[ 719.263160] [] kernel_thread_helper+0x4/0x10
[ 719.263162] [] ? retint_restore_args+0xe/0xe
[ 719.263164] [] ? kthread_worker_fn+0x140/0x140
[ 719.263166] [] ? gs_change+0xb/0xb
[ 719.263183] INFO: task zfs:10166 blocked for more than 120 seconds.
[ 719.263184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 719.263185] zfs D ffff88053fc52400 0 10166 10125 0x00000000
[ 719.263187] ffff8804d717fbe8 0000000000000086 ffff880521e824c0 0000000000012400
[ 719.263190] ffff8804d717ffd8 ffff8804d717e000 0000000000012400 0000000000004000
[ 719.263192] ffff8804d717ffd8 0000000000012400 ffff88052a1380c0 ffff880521e824c0
[ 719.263195] Call Trace:
[ 719.263197] [] ? spl_debug_get_subsys+0x51/0x460 [spl]
[ 719.263200] [] ? spl_debug_msg+0x413/0x970 [spl]
[ 719.263201] [] schedule+0x24/0x70
[ 719.263204] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 719.263206] [] ? wake_up_bit+0x40/0x40
[ 719.263208] [] __cv_wait+0xe/0x10 [spl]
[ 719.263216] [] txg_wait_synced+0xb3/0x190 [zfs]
[ 719.263225] [] dsl_sync_task_group_wait+0x109/0x240 [zfs]
[ 719.263232] [] ? dsl_dataset_disown+0x240/0x240 [zfs]
[ 719.263239] [] ? dsl_destroy_inconsistent+0x1c0/0x1c0 [zfs]
[ 719.263247] [] dsl_sync_task_do+0x54/0x80 [zfs]
[ 719.263254] [] dsl_dataset_destroy+0x132/0x490 [zfs]
[ 719.263261] [] ? dsl_dataset_tryown+0x49/0x120 [zfs]
[ 719.263268] [] dmu_objset_destroy+0x36/0x40 [zfs]
[ 719.263274] [] zfs_unmount_snap+0x253/0x480 [zfs]
[ 719.263280] [] pool_status_check+0x196/0x270 [zfs]
[ 719.263283] [] do_vfs_ioctl+0x96/0x500
[ 719.263285] [] ? alloc_fd+0x47/0x140
[ 719.263288] [] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 719.263289] [] sys_ioctl+0x4a/0x80
[ 719.263291] [] system_call_fastpath+0x16/0x1b
[ 719.263293] INFO: task rm:10209 blocked for more than 120 seconds.
[ 719.263294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 719.263295] rm D ffff88053fc52400 0 10209 10168 0x00000000
[ 719.263297] ffff8804dfff5b48 0000000000000086 ffff880515a5b7e0 0000000000012400
[ 719.263300] ffff8804dfff5fd8 ffff8804dfff4000 0000000000012400 0000000000004000
[ 719.263302] ffff8804dfff5fd8 0000000000012400 ffff88052a1380c0 ffff880515a5b7e0
[ 719.263304] Call Trace:
[ 719.263306] [] ? _raw_spin_unlock_irqrestore+0x2f/0x40
[ 719.263308] [] ? try_to_wake_up+0xce/0x2a0
[ 719.263310] [] schedule+0x24/0x70
[ 719.263312] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 719.263314] [] ? wake_up_bit+0x40/0x40
[ 719.263316] [] __cv_wait+0xe/0x10 [spl]
[ 719.263324] [] txg_wait_open+0x83/0x110 [zfs]
[ 719.263331] [] dmu_tx_wait+0xed/0xf0 [zfs]
[ 719.263338] [] dmu_tx_assign+0x86/0xe90 [zfs]
[ 719.263343] [] ? dmu_buf_rele+0x2b/0x40 [zfs]
[ 719.263350] [] zfs_rmnode+0x127/0x340 [zfs]
[ 719.263355] [] zfs_zinactive+0x8b/0x100 [zfs]
[ 719.263360] [] zfs_inactive+0x66/0x210 [zfs]
[ 719.263365] [] zpl_vap_init+0x603/0x6c0 [zfs]
[ 719.263367] [] evict+0xa6/0x1b0
[ 719.263369] [] iput+0x103/0x210
[ 719.263371] [] do_unlinkat+0x157/0x1c0
[ 719.263374] [] ? sys_newfstatat+0x2e/0x40
[ 719.263375] [] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 719.263377] [] sys_unlinkat+0x1d/0x40
[ 719.263379] [] system_call_fastpath+0x16/0x1b
[ 839.000555] INFO: task txg_sync:6809 blocked for more than 120 seconds.
[ 839.000559] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 839.000563] txg_sync D ffff88053fcd2400 0 6809 2 0x00000000
[ 839.000570] ffff8805008ef9e0 0000000000000046 ffff880521e82b60 0000000000012400
[ 839.000580] ffff8805008effd8 ffff8805008ee000 0000000000012400 0000000000004000
[ 839.000590] ffff8805008effd8 0000000000012400 ffff88052a208ec0 ffff880521e82b60
[ 839.000598] Call Trace:
[ 839.000609] [] ? _raw_spin_unlock_irqrestore+0x2f/0x40
[ 839.000617] [] ? try_to_wake_up+0xce/0x2a0
[ 839.000622] [] schedule+0x24/0x70
[ 839.000633] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 839.000639] [] ? wake_up_bit+0x40/0x40
[ 839.000646] [] __cv_wait+0xe/0x10 [spl]
[ 839.000661] [] zio_wait+0xf3/0x1b0 [zfs]
[ 839.000685] [] dsl_dataset_destroy_sync+0xa7c/0x1080 [zfs]
[ 839.000688] [] ? __cv_timedwait+0xc5/0x1a0 [spl]
[ 839.000696] [] dsl_sync_task_group_sync+0x123/0x3d0 [zfs]
[ 839.000704] [] dsl_pool_sync+0x20b/0x490 [zfs]
[ 839.000713] [] spa_sync+0x3a2/0x10b0 [zfs]
[ 839.000722] [] txg_init+0x4cb/0x9b0 [zfs]
[ 839.000733] [] ? txg_init+0x210/0x9b0 [zfs]
[ 839.000736] [] ? __thread_create+0x310/0x3a0 [spl]
[ 839.000739] [] __thread_create+0x383/0x3a0 [spl]
[ 839.000741] [] ? __thread_create+0x310/0x3a0 [spl]
[ 839.000743] [] kthread+0x96/0xa0
[ 839.000746] [] kernel_thread_helper+0x4/0x10
[ 839.000748] [] ? retint_restore_args+0xe/0xe
[ 839.000749] [] ? kthread_worker_fn+0x140/0x140
[ 839.000751] [] ? gs_change+0xb/0xb
[ 839.000769] INFO: task zfs:10166 blocked for more than 120 seconds.
[ 839.000770] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 839.000771] zfs D ffff88053fc52400 0 10166 10125 0x00000000
[ 839.000773] ffff8804d717fbe8 0000000000000086 ffff880521e824c0 0000000000012400
[ 839.000776] ffff8804d717ffd8 ffff8804d717e000 0000000000012400 0000000000004000
[ 839.000778] ffff8804d717ffd8 0000000000012400 ffff88052a1380c0 ffff880521e824c0
[ 839.000780] Call Trace:
[ 839.000783] [] ? spl_debug_get_subsys+0x51/0x460 [spl]
[ 839.000785] [] ? spl_debug_msg+0x413/0x970 [spl]
[ 839.000787] [] schedule+0x24/0x70
[ 839.000790] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 839.000791] [] ? wake_up_bit+0x40/0x40
[ 839.000794] [] __cv_wait+0xe/0x10 [spl]
[ 839.000802] [] txg_wait_synced+0xb3/0x190 [zfs]
[ 839.000810] [] dsl_sync_task_group_wait+0x109/0x240 [zfs]
[ 839.000817] [] ? dsl_dataset_disown+0x240/0x240 [zfs]
[ 839.000825] [] ? dsl_destroy_inconsistent+0x1c0/0x1c0 [zfs]
[ 839.000833] [] dsl_sync_task_do+0x54/0x80 [zfs]
[ 839.000840] [] dsl_dataset_destroy+0x132/0x490 [zfs]
[ 839.000847] [] ? dsl_dataset_tryown+0x49/0x120 [zfs]
[ 839.000853] [] dmu_objset_destroy+0x36/0x40 [zfs]
[ 839.000860] [] zfs_unmount_snap+0x253/0x480 [zfs]
[ 839.000866] [] pool_status_check+0x196/0x270 [zfs]
[ 839.000868] [] do_vfs_ioctl+0x96/0x500
[ 839.000870] [] ? alloc_fd+0x47/0x140
[ 839.000873] [] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 839.000875] [] sys_ioctl+0x4a/0x80
[ 839.000877] [] system_call_fastpath+0x16/0x1b
[ 839.000878] INFO: task rm:10209 blocked for more than 120 seconds.
[ 839.000879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 839.000880] rm D ffff88053fc52400 0 10209 10168 0x00000000
[ 839.000882] ffff8804dfff5b48 0000000000000086 ffff880515a5b7e0 0000000000012400
[ 839.000885] ffff8804dfff5fd8 ffff8804dfff4000 0000000000012400 0000000000004000
[ 839.000887] ffff8804dfff5fd8 0000000000012400 ffff88052a1380c0 ffff880515a5b7e0
[ 839.000889] Call Trace:
[ 839.000891] [] ? _raw_spin_unlock_irqrestore+0x2f/0x40
[ 839.000893] [] ? try_to_wake_up+0xce/0x2a0
[ 839.000895] [] schedule+0x24/0x70
[ 839.000897] [] __cv_timedwait+0xa7/0x1a0 [spl]
[ 839.000899] [] ? wake_up_bit+0x40/0x40
[ 839.000902] [] __cv_wait+0xe/0x10 [spl]
[ 839.000910] [] txg_wait_open+0x83/0x110 [zfs]
[ 839.000917] [] dmu_tx_wait+0xed/0xf0 [zfs]
[ 839.000923] [] dmu_tx_assign+0x86/0xe90 [zfs]
[ 839.000928] [] ? dmu_buf_rele+0x2b/0x40 [zfs]
[ 839.000935] [] zfs_rmnode+0x127/0x340 [zfs]
[ 839.000940] [] zfs_zinactive+0x8b/0x100 [zfs]
[ 839.000946] [] zfs_inactive+0x66/0x210 [zfs]
[ 839.000950] [] zpl_vap_init+0x603/0x6c0 [zfs]
[ 839.000953] [] evict+0xa6/0x1b0
[ 839.000954] [] iput+0x103/0x210
[ 839.000956] [] do_unlinkat+0x157/0x1c0
[ 839.000959] [] ? sys_newfstatat+0x2e/0x40
[ 839.000961] [] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 839.000962] [] sys_unlinkat+0x1d/0x40
[ 839.000964] [] system_call_fastpath+0x16/0x1b
[ 839.000966] INFO: task zfs:10227 blocked for more than 120 seconds.
[ 839.000967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 839.000968] zfs D ffff88053fc92400 0 10227 10211 0x00000000
[ 839.000970] ffff8804bbfc7bf0 0000000000000082 ffff8804f66690c0 0000000000012400
[ 839.000972] ffff8804bbfc7fd8 ffff8804bbfc6000 0000000000012400 0000000000004000
[ 839.000975] ffff8804bbfc7fd8 0000000000012400 ffff88052a1f7520 ffff8804f66690c0
[ 839.000977] Call Trace:
[ 839.000979] [] ? vsnprintf+0x356/0x610
[ 839.000981] [] ? spl_debug_get_subsys+0x51/0x460 [spl]
[ 839.000984] [] ? spl_debug_msg+0x413/0x970 [spl]
[ 839.000992] [] ? spa_meta_objset+0x19/0x40 [zfs]
[ 839.000994] [] schedule+0x24/0x70
[ 839.000996] [] rwsem_down_failed_common+0xc5/0x150
[ 839.000998] [] rwsem_down_read_failed+0x15/0x17
[ 839.000999] [] call_rwsem_down_read_failed+0x14/0x30
[ 839.001001] [] ? down_read+0x12/0x20
[ 839.001009] [] dsl_dir_open_spa+0xb3/0x540 [zfs]
[ 839.001017] [] ? spa_meta_objset+0x19/0x40 [zfs]
[ 839.001019] [] ? avl_find+0x5a/0xa0 [zavl]
[ 839.001026] [] dsl_dataset_hold+0x3b/0x290 [zfs]
[ 839.001029] [] ? __kmalloc+0x13e/0x1c0
[ 839.001035] [] dmu_objset_hold+0x1f/0x1a0 [zfs]
[ 839.001042] [] zfs_secpolicy_smb_acl+0x2f84/0x3b20 [zfs]
[ 839.001048] [] ? pool_status_check+0x61/0x270 [zfs]
[ 839.001054] [] pool_status_check+0x196/0x270 [zfs]
[ 839.001056] [] do_vfs_ioctl+0x96/0x500
[ 839.001058] [] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 839.001059] [] sys_ioctl+0x4a/0x80
[ 839.001061] [] system_call_fastpath+0x16/0x1b

@behlendorf (Contributor)

The stacks are just advisory, but they seem to indicate the txg_sync thread was blocked on an I/O request for quite some time. Perhaps one of your drives had a hiccup.
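
Some standard ways to check the drive-hiccup theory (not from the thread itself; /dev/sdX is a placeholder for each member disk):

```shell
# Per-vdev READ/WRITE/CKSUM error counters for the affected pool
zpool status -v bkup0

# Watch for one device with outlier await/%util while the pool is busy
iostat -x 5

# Per-disk SMART health; pending/reallocated sectors suggest a flaky drive
smartctl -a /dev/sdX | grep -i -E 'pending|reallocated|error'
```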

@ryanolf

ryanolf commented Jul 10, 2012

I also frequently see lock-ups (sometimes the system seems to have crashed outright) with similar traces in the logs. This typically happens under a reasonably heavy load, e.g. when resilvering/scrubbing and/or snapshotting while rsyncing. I thought maybe this was related to zfs memory issues, but do you think it could be an issue with my disks or eSATA card? Scrubs fix errors on them from time to time. The disks are in a TowerRAID eSATA external enclosure driven by an eSATA RAID card (a RocketRaid 622, but just running them as JBOD), so there are many places for things to go wrong. I'm running the latest stable Ubuntu PPA as of July 10. Our array is an 8TB raidz2 with a 32 GB USB stick cache device; it is about 75% full with compression and dedup enabled, and the system has 8GB RAM. I'd very much appreciate help understanding these stack traces so that I can do something on my end to stop the lockups, and I hope they help you with your great work on zfs.

Looking at other bug reports, it seems this implementation of zfs has a variety of issues with memory and excessive disk access under load. Since scrubbing takes so long on reasonably sized arrays, the system should be able to scrub and still handle other I/O without locking up. Is there perhaps an underlying issue with prioritizing finite resources? This may seem to fall under "optimization", but because zfs systems can easily grow large, some amount of optimization may be required for basic functionality.
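
For the memory theory above, a couple of standard ZFS-on-Linux diagnostics may help (these are not from the original comment; the kstat path and zdb flag are the usual ones on Linux, and "tank" is a hypothetical pool name). With dedup enabled on 8GB of RAM, the dedup table size is worth checking in particular:

```shell
# ARC size vs. its cap, and how often ZFS throttled itself on memory
grep -E '^(size|c_max|memory_throttle_count)' /proc/spl/kstat/zfs/arcstats

# Dedup (DDT) table histogram; each unique block costs RAM to keep the
# table resident, so a large DDT on 8GB can cause exactly this thrashing
zdb -DD tank
```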

Here are my kernel logs from the most recent lockup. The system recovered from this one:

[64198.775024] INFO: task txg_sync:3038 blocked for more than 120 seconds.
[64198.775027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64198.775030] txg_sync        D 0000000000000003     0  3038      2 0x00000000
[64198.775033]  ffff8801edf4bbd0 0000000000000046 ffff8801edf4bb70 ffffffff8105fb02
[64198.775037]  ffff8801edf4bfd8 ffff8801edf4bfd8 ffff8801edf4bfd8 0000000000013780
[64198.775040]  ffff880209dda000 ffff8801f1dd5bc0 ffff8801edf4bbe0 ffff8801cc6231e8
[64198.775044] Call Trace:
[64198.775051]  [<ffffffff8105fb02>] ? default_wake_function+0x12/0x20
[64198.775056]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64198.775070]  [<ffffffffa019dae8>] cv_wait_common+0x98/0x190 [spl]
[64198.775073]  [<ffffffff8108b030>] ? add_wait_queue+0x60/0x60
[64198.775079]  [<ffffffffa019dc13>] __cv_wait+0x13/0x20 [spl]
[64198.775106]  [<ffffffffa02e6a6b>] zio_wait+0xfb/0x170 [zfs]
[64198.775123]  [<ffffffffa028032d>] dsl_pool_sync+0x29d/0x450 [zfs]
[64198.775141]  [<ffffffffa0293e2e>] spa_sync+0x38e/0xa00 [zfs]
[64198.775161]  [<ffffffffa02a3ef6>] txg_sync_thread+0x286/0x450 [zfs]
[64198.775180]  [<ffffffffa02a3c70>] ? txg_init+0x250/0x250 [zfs]
[64198.775185]  [<ffffffffa01970b8>] thread_generic_wrapper+0x78/0x90 [spl]
[64198.775191]  [<ffffffffa0197040>] ? __thread_create+0x310/0x310 [spl]
[64198.775193]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64198.775197]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64198.775200]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64198.775202]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.771919] INFO: task l2arc_feed:1088 blocked for more than 120 seconds.
[64318.771923] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.771925] l2arc_feed      D 0000000000000003     0  1088      2 0x00000000
[64318.771929]  ffff8801fd0d5cf0 0000000000000046 ffff8801fd0d5cd0 ffffffff8104c2e8
[64318.771933]  ffff8801fd0d5fd8 ffff8801fd0d5fd8 ffff8801fd0d5fd8 0000000000013780
[64318.771936]  ffff88017a3f6000 ffff88020067c4d0 ffff8801fd0d5d00 ffff8801fc066e08
[64318.771939] Call Trace:
[64318.771946]  [<ffffffff8104c2e8>] ? __wake_up_common+0x58/0x90
[64318.771954]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.771964]  [<ffffffffa019dae8>] cv_wait_common+0x98/0x190 [spl]
[64318.771968]  [<ffffffff8108b030>] ? add_wait_queue+0x60/0x60
[64318.771974]  [<ffffffffa019dc13>] __cv_wait+0x13/0x20 [spl]
[64318.771998]  [<ffffffffa02e6a6b>] zio_wait+0xfb/0x170 [zfs]
[64318.772009]  [<ffffffffa024e637>] l2arc_feed_thread+0x627/0x850 [zfs]
[64318.772020]  [<ffffffffa024e010>] ? l2arc_evict+0x310/0x310 [zfs]
[64318.772026]  [<ffffffffa01970b8>] thread_generic_wrapper+0x78/0x90 [spl]
[64318.772031]  [<ffffffffa0197040>] ? __thread_create+0x310/0x310 [spl]
[64318.772033]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.772037]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.772040]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.772042]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.772058] INFO: task z_null_iss/0:2977 blocked for more than 120 seconds.
[64318.772059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.772061] z_null_iss/0    D 0000000000000000     0  2977      2 0x00000000
[64318.772063]  ffff8801f0e01cd0 0000000000000046 ffff880210a65c00 ffff8801b0b5c6c0
[64318.772067]  ffff8801f0e01fd8 ffff8801f0e01fd8 ffff8801f0e01fd8 0000000000013780
[64318.772070]  ffff8801fbb40000 ffff8801f944dbc0 ffffffffa02f7110 ffff8801ff7a9ee0
[64318.772073] Call Trace:
[64318.772076]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.772079]  [<ffffffff816588c7>] __mutex_lock_slowpath+0xd7/0x150
[64318.772082]  [<ffffffff816584da>] mutex_lock+0x2a/0x50
[64318.772101]  [<ffffffffa02aebad>] vdev_queue_io+0x6d/0x120 [zfs]
[64318.772119]  [<ffffffffa02e6867>] zio_vdev_io_start+0x1e7/0x2f0 [zfs]
[64318.772136]  [<ffffffffa02e431f>] zio_execute+0x9f/0xf0 [zfs]
[64318.772142]  [<ffffffffa019822b>] taskq_thread+0x24b/0x5a0 [spl]
[64318.772145]  [<ffffffff810562da>] ? finish_task_switch+0x4a/0xf0
[64318.772148]  [<ffffffff8105faf0>] ? try_to_wake_up+0x200/0x200
[64318.772153]  [<ffffffffa0197fe0>] ? task_done+0x140/0x140 [spl]
[64318.772156]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.772158]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.772161]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.772163]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.772169] INFO: task txg_sync:3038 blocked for more than 120 seconds.
[64318.772171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.772172] txg_sync        D 0000000000000003     0  3038      2 0x00000000
[64318.772175]  ffff8801edf4bbd0 0000000000000046 ffff8801edf4bb70 ffffffff8105fb02
[64318.772178]  ffff8801edf4bfd8 ffff8801edf4bfd8 ffff8801edf4bfd8 0000000000013780
[64318.772181]  ffff880209dda000 ffff8801f1dd5bc0 ffff8801edf4bbe0 ffff8801cc6231e8
[64318.772184] Call Trace:
[64318.772187]  [<ffffffff8105fb02>] ? default_wake_function+0x12/0x20
[64318.772189]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.772195]  [<ffffffffa019dae8>] cv_wait_common+0x98/0x190 [spl]
[64318.772198]  [<ffffffff8108b030>] ? add_wait_queue+0x60/0x60
[64318.772203]  [<ffffffffa019dc13>] __cv_wait+0x13/0x20 [spl]
[64318.772220]  [<ffffffffa02e6a6b>] zio_wait+0xfb/0x170 [zfs]
[64318.772237]  [<ffffffffa028032d>] dsl_pool_sync+0x29d/0x450 [zfs]
[64318.772255]  [<ffffffffa0293e2e>] spa_sync+0x38e/0xa00 [zfs]
[64318.772274]  [<ffffffffa02a3ef6>] txg_sync_thread+0x286/0x450 [zfs]
[64318.772292]  [<ffffffffa02a3c70>] ? txg_init+0x250/0x250 [zfs]
[64318.772297]  [<ffffffffa01970b8>] thread_generic_wrapper+0x78/0x90 [spl]
[64318.772302]  [<ffffffffa0197040>] ? __thread_create+0x310/0x310 [spl]
[64318.772304]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.772307]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.772310]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.772312]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.772328] INFO: task kworker/2:3:5687 blocked for more than 120 seconds.
[64318.772329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.772331] kworker/2:3     D 0000000000000002     0  5687      2 0x00000000
[64318.772334]  ffff8801ed3e5c50 0000000000000046 ffff8801ed3e5c10 0000000000000000
[64318.772337]  ffff8801ed3e5fd8 ffff8801ed3e5fd8 ffff8801ed3e5fd8 0000000000013780
[64318.772340]  ffff8800a8e5c4d0 ffff88006b6244d0 ffff8800bf574800 ffff8801f01544b0
[64318.772343] Call Trace:
[64318.772346]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.772349]  [<ffffffff816588c7>] __mutex_lock_slowpath+0xd7/0x150
[64318.772352]  [<ffffffff816584da>] mutex_lock+0x2a/0x50
[64318.772369]  [<ffffffffa02bdaa8>] zfsctl_unmount_snapshot+0x58/0xf0 [zfs]
[64318.772372]  [<ffffffff81057600>] ? dequeue_entity+0x120/0x200
[64318.772375]  [<ffffffff81311bad>] ? __rb_erase_color+0x11d/0x1c0
[64318.772379]  [<ffffffff810126e5>] ? __switch_to+0xf5/0x360
[64318.772381]  [<ffffffff8105613d>] ? set_next_entity+0xad/0xd0
[64318.772398]  [<ffffffffa02bdb40>] ? zfsctl_unmount_snapshot+0xf0/0xf0 [zfs]
[64318.772415]  [<ffffffffa02bdb6e>] zfsctl_expire_snapshot+0x2e/0x60 [zfs]
[64318.772418]  [<ffffffff81084f9a>] process_one_work+0x11a/0x480
[64318.772421]  [<ffffffff81085d44>] worker_thread+0x164/0x370
[64318.772424]  [<ffffffff81085be0>] ? manage_workers.isra.29+0x130/0x130
[64318.772427]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.772429]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.772432]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.772434]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.772436] INFO: task kworker/2:4:5695 blocked for more than 120 seconds.
[64318.772438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.772439] kworker/2:4     D 0000000000000002     0  5695      2 0x00000000
[64318.772442]  ffff88015380bb90 0000000000000046 ffff880215802900 0000000000000000
[64318.772445]  ffff88015380bfd8 ffff88015380bfd8 ffff88015380bfd8 0000000000013780
[64318.772448]  ffff88020030dbc0 ffff8800a8e5c4d0 ffff8801f534f458 7fffffffffffffff
[64318.772451] Call Trace:
[64318.772454]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.772457]  [<ffffffff81658105>] schedule_timeout+0x2a5/0x320
[64318.772459]  [<ffffffff81053a6c>] ? select_idle_sibling+0xcc/0x230
[64318.772462]  [<ffffffff816578ff>] wait_for_common+0xdf/0x180
[64318.772465]  [<ffffffff8105faf0>] ? try_to_wake_up+0x200/0x200
[64318.772468]  [<ffffffff81657a7d>] wait_for_completion+0x1d/0x20
[64318.772471]  [<ffffffff8108a76f>] kthread_create_on_node+0xaf/0x130
[64318.772473]  [<ffffffff81085be0>] ? manage_workers.isra.29+0x130/0x130
[64318.772476]  [<ffffffff81083874>] ? alloc_worker+0x24/0x70
[64318.772479]  [<ffffffff81085936>] create_worker+0x136/0x1b0
[64318.772482]  [<ffffffff81085a15>] maybe_create_worker+0x65/0x100
[64318.772484]  [<ffffffff81085af8>] manage_workers.isra.29+0x48/0x130
[64318.772487]  [<ffffffff81085ecb>] worker_thread+0x2eb/0x370
[64318.772489]  [<ffffffff81085be0>] ? manage_workers.isra.29+0x130/0x130
[64318.772492]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.772494]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.772497]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.772499]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.772501] INFO: task kworker/0:1:13440 blocked for more than 120 seconds.
[64318.772502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.772504] kworker/0:1     D 0000000000000000     0 13440      2 0x00000000
[64318.772507]  ffff880132483ad0 0000000000000046 ffffffff8111a0a5 0000000000011220
[64318.772510]  ffff880132483fd8 ffff880132483fd8 ffff880132483fd8 0000000000013780
[64318.772513]  ffff8800a8e5ade0 ffff880073d2c4d0 ffff880073d2c4d0 7fffffffffffffff
[64318.772516] Call Trace:
[64318.772520]  [<ffffffff8111a0a5>] ? mempool_alloc_slab+0x15/0x20
[64318.772523]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.772525]  [<ffffffff81658105>] schedule_timeout+0x2a5/0x320
[64318.772529]  [<ffffffff8103dcf9>] ? default_spin_lock_flags+0x9/0x10
[64318.772531]  [<ffffffff8105fa7e>] ? try_to_wake_up+0x18e/0x200
[64318.772534]  [<ffffffff816578ff>] wait_for_common+0xdf/0x180
[64318.772537]  [<ffffffff8105faf0>] ? try_to_wake_up+0x200/0x200
[64318.772540]  [<ffffffff81657a7d>] wait_for_completion+0x1d/0x20
[64318.772542]  [<ffffffff81081939>] call_usermodehelper_exec+0x109/0x120
[64318.772559]  [<ffffffffa02bcf7d>] __zfsctl_unmount_snapshot+0xad/0x120 [zfs]
[64318.772576]  [<ffffffffa02bdade>] zfsctl_unmount_snapshot+0x8e/0xf0 [zfs]
[64318.772578]  [<ffffffff81057600>] ? dequeue_entity+0x120/0x200
[64318.772581]  [<ffffffff81311bad>] ? __rb_erase_color+0x11d/0x1c0
[64318.772584]  [<ffffffff810126e5>] ? __switch_to+0xf5/0x360
[64318.772586]  [<ffffffff8105613d>] ? set_next_entity+0xad/0xd0
[64318.772601]  [<ffffffffa02bdb40>] ? zfsctl_unmount_snapshot+0xf0/0xf0 [zfs]
[64318.772616]  [<ffffffffa02bdb6e>] zfsctl_expire_snapshot+0x2e/0x60 [zfs]
[64318.772619]  [<ffffffff81084f9a>] process_one_work+0x11a/0x480
[64318.772622]  [<ffffffff81085d44>] worker_thread+0x164/0x370
[64318.772625]  [<ffffffff81085be0>] ? manage_workers.isra.29+0x130/0x130
[64318.772627]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.772630]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.772633]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.772635]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.772637] INFO: task rsync:13631 blocked for more than 120 seconds.
[64318.772639] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.772640] rsync           D 0000000000000002     0 13631  13630 0x00000000
[64318.772643]  ffff8801eeea56a8 0000000000000082 ffff88015899e540 ffff8801ffa30908
[64318.772646]  ffff8801eeea5fd8 ffff8801eeea5fd8 ffff8801eeea5fd8 0000000000013780
[64318.772649]  ffff88020067ade0 ffff8800a8e58000 ffff8801eeea56b8 ffff8801c295aa28
[64318.772653] Call Trace:
[64318.772655]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.772661]  [<ffffffffa019dae8>] cv_wait_common+0x98/0x190 [spl]
[64318.772663]  [<ffffffff8108b030>] ? add_wait_queue+0x60/0x60
[64318.772669]  [<ffffffffa019dc13>] __cv_wait+0x13/0x20 [spl]
[64318.772686]  [<ffffffffa02e6a6b>] zio_wait+0xfb/0x170 [zfs]
[64318.772698]  [<ffffffffa0256797>] dbuf_read+0x337/0x7b0 [zfs]
[64318.772711]  [<ffffffffa025d490>] dmu_buf_hold+0xf0/0x1a0 [zfs]
[64318.772728]  [<ffffffffa02b1bb9>] zap_idx_to_blk+0xa9/0x140 [zfs]
[64318.772741]  [<ffffffffa025c3c0>] ? ddt_zap_update+0xb0/0xb0 [zfs]
[64318.772758]  [<ffffffffa02b33ea>] fzap_prefetch+0x3a/0x70 [zfs]
[64318.772775]  [<ffffffffa02b84e6>] zap_prefetch_uint64+0x76/0xa0 [zfs]
[64318.772788]  [<ffffffffa025c3d3>] ddt_zap_prefetch+0x13/0x20 [zfs]
[64318.772800]  [<ffffffffa025b14c>] ddt_prefetch+0xbc/0xc0 [zfs]
[64318.772812]  [<ffffffffa02569f3>] ? dbuf_read+0x593/0x7b0 [zfs]
[64318.772817]  [<ffffffffa01939fb>] ? kmem_free_debug+0x4b/0x150 [spl]
[64318.772829]  [<ffffffffa0257919>] ? dbuf_hold_impl+0x99/0xc0 [zfs]
[64318.772840]  [<ffffffffa0256d69>] ? dbuf_rele_and_unlock+0x159/0x200 [zfs]
[64318.772843]  [<ffffffff816584cd>] ? mutex_lock+0x1d/0x50
[64318.772853]  [<ffffffffa024d076>] ? add_reference.isra.9+0xe6/0x120 [zfs]
[64318.772857]  [<ffffffff816584cd>] ? mutex_lock+0x1d/0x50
[64318.772868]  [<ffffffffa02569f3>] ? dbuf_read+0x593/0x7b0 [zfs]
[64318.772871]  [<ffffffff816584cd>] ? mutex_lock+0x1d/0x50
[64318.772874]  [<ffffffff816584cd>] ? mutex_lock+0x1d/0x50
[64318.772891]  [<ffffffffa029fa7d>] ? spa_config_exit+0x7d/0xc0 [zfs]
[64318.772906]  [<ffffffffa0272cb0>] dsl_dataset_block_freeable+0x40/0x60 [zfs]
[64318.772919]  [<ffffffffa026a4f4>] dmu_tx_hold_free+0x464/0x7c0 [zfs]
[64318.772932]  [<ffffffffa025ce37>] dmu_free_long_range_impl+0x177/0x270 [zfs]
[64318.772945]  [<ffffffffa025ec1b>] dmu_free_long_range+0x4b/0x70 [zfs]
[64318.772962]  [<ffffffffa02bf0e0>] zfs_rmnode+0x60/0x340 [zfs]
[64318.772980]  [<ffffffffa02de8f6>] zfs_zinactive+0x86/0xd0 [zfs]
[64318.772997]  [<ffffffffa02d998b>] zfs_inactive+0x6b/0x1e0 [zfs]
[64318.773000]  [<ffffffff811256be>] ? truncate_pagecache+0x5e/0x70
[64318.773017]  [<ffffffffa02ecba8>] zpl_evict_inode+0x28/0x30 [zfs]
[64318.773021]  [<ffffffff811927b8>] evict+0xb8/0x1c0
[64318.773023]  [<ffffffff811929a8>] iput_final+0xe8/0x210
[64318.773025]  [<ffffffff81192b0e>] iput+0x3e/0x50
[64318.773028]  [<ffffffff81187533>] do_unlinkat+0x153/0x1d0
[64318.773031]  [<ffffffff81177fc0>] ? vfs_write+0x110/0x180
[64318.773033]  [<ffffffff8117828a>] ? sys_write+0x4a/0x90
[64318.773036]  [<ffffffff81188106>] sys_unlink+0x16/0x20
[64318.773038]  [<ffffffff81661fc2>] system_call_fastpath+0x16/0x1b
[64318.773040] INFO: task kworker/0:0:13635 blocked for more than 120 seconds.
[64318.773042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.773043] kworker/0:0     D 0000000000000000     0 13635      2 0x00000000
[64318.773046]  ffff8801e8d97c50 0000000000000046 ffff8801e8d97c10 0000000000000000
[64318.773049]  ffff8801e8d97fd8 ffff8801e8d97fd8 ffff8801e8d97fd8 0000000000013780
[64318.773053]  ffff8801252fade0 ffff8800a8e5ade0 ffff8801a6060200 ffff8801f01544b0
[64318.773056] Call Trace:
[64318.773058]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.773061]  [<ffffffff816588c7>] __mutex_lock_slowpath+0xd7/0x150
[64318.773064]  [<ffffffff816584da>] mutex_lock+0x2a/0x50
[64318.773081]  [<ffffffffa02bdaa8>] zfsctl_unmount_snapshot+0x58/0xf0 [zfs]
[64318.773083]  [<ffffffff81057600>] ? dequeue_entity+0x120/0x200
[64318.773086]  [<ffffffff810126e5>] ? __switch_to+0xf5/0x360
[64318.773088]  [<ffffffff8105613d>] ? set_next_entity+0xad/0xd0
[64318.773104]  [<ffffffffa02bdb40>] ? zfsctl_unmount_snapshot+0xf0/0xf0 [zfs]
[64318.773119]  [<ffffffffa02bdb6e>] zfsctl_expire_snapshot+0x2e/0x60 [zfs]
[64318.773121]  [<ffffffff81084f9a>] process_one_work+0x11a/0x480
[64318.773124]  [<ffffffff81085d44>] worker_thread+0x164/0x370
[64318.773127]  [<ffffffff81085be0>] ? manage_workers.isra.29+0x130/0x130
[64318.773129]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.773132]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.773134]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.773137]  [<ffffffff81664130>] ? gs_change+0x13/0x13
[64318.773139] INFO: task kworker/0:2:20829 blocked for more than 120 seconds.
[64318.773140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64318.773142] kworker/0:2     D 0000000000000000     0 20829      2 0x00000000
[64318.773144]  ffff88015ab1fb90 0000000000000046 ffff880034abd870 000000000bd28000
[64318.773148]  ffff88015ab1ffd8 ffff88015ab1ffd8 ffff88015ab1ffd8 0000000000013780
[64318.773151]  ffff880201fdc4d0 ffff8801252fade0 ffff8801a0094b58 7fffffffffffffff
[64318.773154] Call Trace:
[64318.773156]  [<ffffffff81657abf>] schedule+0x3f/0x60
[64318.773159]  [<ffffffff81658105>] schedule_timeout+0x2a5/0x320
[64318.773161]  [<ffffffff81053a6c>] ? select_idle_sibling+0xcc/0x230
[64318.773164]  [<ffffffff816578ff>] wait_for_common+0xdf/0x180
[64318.773167]  [<ffffffff8105faf0>] ? try_to_wake_up+0x200/0x200
[64318.773170]  [<ffffffff81657a7d>] wait_for_completion+0x1d/0x20
[64318.773172]  [<ffffffff8108a76f>] kthread_create_on_node+0xaf/0x130
[64318.773175]  [<ffffffff81085be0>] ? manage_workers.isra.29+0x130/0x130
[64318.773178]  [<ffffffff81083874>] ? alloc_worker+0x24/0x70
[64318.773180]  [<ffffffff81085936>] create_worker+0x136/0x1b0
[64318.773183]  [<ffffffff81085a15>] maybe_create_worker+0x65/0x100
[64318.773186]  [<ffffffff81085af8>] manage_workers.isra.29+0x48/0x130
[64318.773189]  [<ffffffff81085ecb>] worker_thread+0x2eb/0x370
[64318.773191]  [<ffffffff81085be0>] ? manage_workers.isra.29+0x130/0x130
[64318.773194]  [<ffffffff8108a59c>] kthread+0x8c/0xa0
[64318.773196]  [<ffffffff81664134>] kernel_thread_helper+0x4/0x10
[64318.773199]  [<ffffffff8108a510>] ? flush_kthread_worker+0xa0/0xa0
[64318.773201]  [<ffffffff81664130>] ? gs_change+0x13/0x13

@ryao
Contributor

ryao commented Oct 11, 2012

@mooinglemur Is this still an issue in 0.6.0-rc11?

@dswartz
Contributor

dswartz commented Oct 18, 2012

I just had this happen this morning, running the most recent zfsonlinux on Ubuntu 12. It completely broke my ESXi host, which had a datastore mounted on the zfsonlinux VM. I had to reboot the whole host from the previous SAN (OpenIndiana, a Solaris variant), since my wife uses a virtualized Windows XP for telecommuting, and the outage did not go over well at home. I was migrating datasets between disks when this happened.

@elan

elan commented Nov 3, 2012

3.2.0-32-generic #51-Ubuntu SMP, zfs-0.6.0-rc11

I just saw this today. It seemed to make an AFP copy hang for a while, but I observed no ill effects beyond that:

Nov  1 21:06:26 whopper kernel: [30451.753316] umount          D ffffffff81806200     0 21237  21236 0x00000000
Nov  1 21:06:26 whopper kernel: [30451.753322]  ffff880219145d28 0000000000000082 ffff880219145cd8 ffffffffa012d069
Nov  1 21:06:26 whopper kernel: [30451.753327]  ffff880219145fd8 ffff880219145fd8 ffff880219145fd8 00000000000137c0
Nov  1 21:06:26 whopper kernel: [30451.753332]  ffff8804155e9700 ffff880151510000 ffff880219145d38 ffff8804150c8500
Nov  1 21:06:26 whopper kernel: [30451.753337] Call Trace:
Nov  1 21:06:26 whopper kernel: [30451.753368]  [<ffffffffa012d069>] ? dbuf_rele_and_unlock+0x159/0x200 [zfs]
Nov  1 21:06:26 whopper kernel: [30451.753375]  [<ffffffff81658f2f>] schedule+0x3f/0x60
Nov  1 21:06:26 whopper kernel: [30451.753386]  [<ffffffffa006bead>] __taskq_wait_id+0x7d/0x150 [spl]
Nov  1 21:06:26 whopper kernel: [30451.753391]  [<ffffffff811b76b4>] ? fsnotify_clear_marks_by_inode+0x94/0xf0
Nov  1 21:06:26 whopper kernel: [30451.753397]  [<ffffffff8108ab80>] ? add_wait_queue+0x60/0x60
Nov  1 21:06:26 whopper kernel: [30451.753405]  [<ffffffffa006cff3>] __taskq_wait+0x53/0xf0 [spl]
Nov  1 21:06:26 whopper kernel: [30451.753436]  [<ffffffffa0166679>] ? rrw_enter+0x69/0x190 [zfs]
Nov  1 21:06:26 whopper kernel: [30451.753470]  [<ffffffffa01abbfc>] zfs_sb_teardown+0x5c/0x3a0 [zfs]
Nov  1 21:06:26 whopper kernel: [30451.753475]  [<ffffffff81193033>] ? dispose_list+0x23/0x50
Nov  1 21:06:26 whopper kernel: [30451.753505]  [<ffffffffa01abfb0>] zfs_umount+0x30/0xc0 [zfs]
Nov  1 21:06:26 whopper kernel: [30451.753535]  [<ffffffffa01c710e>] zpl_put_super+0xe/0x10 [zfs]
Nov  1 21:06:26 whopper kernel: [30451.753540]  [<ffffffff8117a0e2>] generic_shutdown_super+0x62/0xe0
Nov  1 21:06:26 whopper kernel: [30451.753544]  [<ffffffff8117a1f6>] kill_anon_super+0x16/0x30
Nov  1 21:06:26 whopper kernel: [30451.753572]  [<ffffffffa01c6f0e>] zpl_kill_sb+0x1e/0x30 [zfs]
Nov  1 21:06:26 whopper kernel: [30451.753576]  [<ffffffff8117a83c>] deactivate_locked_super+0x3c/0xa0
Nov  1 21:06:26 whopper kernel: [30451.753580]  [<ffffffff8117b0be>] deactivate_super+0x4e/0x70
Nov  1 21:06:26 whopper kernel: [30451.753584]  [<ffffffff8119773d>] mntput_no_expire+0x9d/0xf0
Nov  1 21:06:26 whopper kernel: [30451.753588]  [<ffffffff81198a6b>] sys_umount+0x5b/0xd0
Nov  1 21:06:26 whopper kernel: [30451.753592]  [<ffffffff81663442>] system_call_fastpath+0x16/0x1b
Nov  1 21:08:26 whopper kernel: [30571.636117] INFO: task umount:21237 blocked for more than 120 seconds.
Nov  1 21:08:26 whopper kernel: [30571.636490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  1 21:08:26 whopper kernel: [30571.657955] umount          D ffffffff81806200     0 21237  21236 0x00000000
Nov  1 21:08:26 whopper kernel: [30571.657965]  ffff880219145d28 0000000000000082 ffff880219145cd8 ffffffffa012d069
Nov  1 21:08:26 whopper kernel: [30571.657978]  ffff880219145fd8 ffff880219145fd8 ffff880219145fd8 00000000000137c0
Nov  1 21:08:26 whopper kernel: [30571.657990]  ffff8804155e9700 ffff880151510000 ffff880219145d38 ffff8804150c8500
Nov  1 21:08:26 whopper kernel: [30571.658003] Call Trace:
Nov  1 21:08:26 whopper kernel: [30571.658044]  [<ffffffffa012d069>] ? dbuf_rele_and_unlock+0x159/0x200 [zfs]
Nov  1 21:08:26 whopper kernel: [30571.658058]  [<ffffffff81658f2f>] schedule+0x3f/0x60
Nov  1 21:08:26 whopper kernel: [30571.658075]  [<ffffffffa006bead>] __taskq_wait_id+0x7d/0x150 [spl]
Nov  1 21:08:26 whopper kernel: [30571.658083]  [<ffffffff811b76b4>] ? fsnotify_clear_marks_by_inode+0x94/0xf0
Nov  1 21:08:26 whopper kernel: [30571.658091]  [<ffffffff8108ab80>] ? add_wait_queue+0x60/0x60
Nov  1 21:08:26 whopper kernel: [30571.658102]  [<ffffffffa006cff3>] __taskq_wait+0x53/0xf0 [spl]
Nov  1 21:08:26 whopper kernel: [30571.658137]  [<ffffffffa0166679>] ? rrw_enter+0x69/0x190 [zfs]
Nov  1 21:08:26 whopper kernel: [30571.658174]  [<ffffffffa01abbfc>] zfs_sb_teardown+0x5c/0x3a0 [zfs]
Nov  1 21:08:26 whopper kernel: [30571.658181]  [<ffffffff81193033>] ? dispose_list+0x23/0x50
Nov  1 21:08:26 whopper kernel: [30571.658214]  [<ffffffffa01abfb0>] zfs_umount+0x30/0xc0 [zfs]
Nov  1 21:08:26 whopper kernel: [30571.658247]  [<ffffffffa01c710e>] zpl_put_super+0xe/0x10 [zfs]
Nov  1 21:08:26 whopper kernel: [30571.658254]  [<ffffffff8117a0e2>] generic_shutdown_super+0x62/0xe0
Nov  1 21:08:26 whopper kernel: [30571.658261]  [<ffffffff8117a1f6>] kill_anon_super+0x16/0x30
Nov  1 21:08:26 whopper kernel: [30571.658292]  [<ffffffffa01c6f0e>] zpl_kill_sb+0x1e/0x30 [zfs]
Nov  1 21:08:26 whopper kernel: [30571.658299]  [<ffffffff8117a83c>] deactivate_locked_super+0x3c/0xa0
Nov  1 21:08:26 whopper kernel: [30571.658305]  [<ffffffff8117b0be>] deactivate_super+0x4e/0x70
Nov  1 21:08:26 whopper kernel: [30571.658312]  [<ffffffff8119773d>] mntput_no_expire+0x9d/0xf0
Nov  1 21:08:26 whopper kernel: [30571.658319]  [<ffffffff81198a6b>] sys_umount+0x5b/0xd0
Nov  1 21:08:26 whopper kernel: [30571.658325]  [<ffffffff81663442>] system_call_fastpath+0x16/0x1b
Nov  1 21:10:26 whopper kernel: [30691.539760] INFO: task umount:21237 blocked for more than 120 seconds.
Nov  1 21:10:26 whopper kernel: [30691.551243] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  1 21:10:26 whopper kernel: [30691.573174] umount          D ffffffff81806200     0 21237  21236 0x00000000
Nov  1 21:10:26 whopper kernel: [30691.573180]  ffff880219145d28 0000000000000082 ffff880219145cd8 ffffffffa012d069
Nov  1 21:10:26 whopper kernel: [30691.573186]  ffff880219145fd8 ffff880219145fd8 ffff880219145fd8 00000000000137c0
Nov  1 21:10:26 whopper kernel: [30691.573190]  ffff8804155e9700 ffff880151510000 ffff880219145d38 ffff8804150c8500
Nov  1 21:10:26 whopper kernel: [30691.573195] Call Trace:
Nov  1 21:10:26 whopper kernel: [30691.573228]  [<ffffffffa012d069>] ? dbuf_rele_and_unlock+0x159/0x200 [zfs]
Nov  1 21:10:26 whopper kernel: [30691.573239]  [<ffffffff81658f2f>] schedule+0x3f/0x60
Nov  1 21:10:26 whopper kernel: [30691.573253]  [<ffffffffa006bead>] __taskq_wait_id+0x7d/0x150 [spl]
Nov  1 21:10:26 whopper kernel: [30691.573260]  [<ffffffff811b76b4>] ? fsnotify_clear_marks_by_inode+0x94/0xf0
Nov  1 21:10:26 whopper kernel: [30691.573269]  [<ffffffff8108ab80>] ? add_wait_queue+0x60/0x60
Nov  1 21:10:26 whopper kernel: [30691.573280]  [<ffffffffa006cff3>] __taskq_wait+0x53/0xf0 [spl]
Nov  1 21:10:26 whopper kernel: [30691.573314]  [<ffffffffa0166679>] ? rrw_enter+0x69/0x190 [zfs]
Nov  1 21:10:26 whopper kernel: [30691.573350]  [<ffffffffa01abbfc>] zfs_sb_teardown+0x5c/0x3a0 [zfs]
Nov  1 21:10:26 whopper kernel: [30691.573357]  [<ffffffff81193033>] ? dispose_list+0x23/0x50
Nov  1 21:10:26 whopper kernel: [30691.573390]  [<ffffffffa01abfb0>] zfs_umount+0x30/0xc0 [zfs]
Nov  1 21:10:26 whopper kernel: [30691.573424]  [<ffffffffa01c710e>] zpl_put_super+0xe/0x10 [zfs]
Nov  1 21:10:26 whopper kernel: [30691.573431]  [<ffffffff8117a0e2>] generic_shutdown_super+0x62/0xe0
Nov  1 21:10:26 whopper kernel: [30691.573438]  [<ffffffff8117a1f6>] kill_anon_super+0x16/0x30
Nov  1 21:10:26 whopper kernel: [30691.573469]  [<ffffffffa01c6f0e>] zpl_kill_sb+0x1e/0x30 [zfs]
Nov  1 21:10:26 whopper kernel: [30691.573476]  [<ffffffff8117a83c>] deactivate_locked_super+0x3c/0xa0
Nov  1 21:10:26 whopper kernel: [30691.573482]  [<ffffffff8117b0be>] deactivate_super+0x4e/0x70
Nov  1 21:10:26 whopper kernel: [30691.573489]  [<ffffffff8119773d>] mntput_no_expire+0x9d/0xf0
Nov  1 21:10:26 whopper kernel: [30691.573496]  [<ffffffff81198a6b>] sys_umount+0x5b/0xd0
Nov  1 21:10:26 whopper kernel: [30691.573503]  [<ffffffff81663442>] system_call_fastpath+0x16/0x1b

@behlendorf
Contributor

It looks like there are a couple unrelated issues mentioned here.

The original problem of zio_wait() being blocked for 120 seconds has been addressed in rc12. That message just indicates a particularly slow IO to the devices, not that anything has gone wrong. As such, it has been converted into a zfs event and the console message suppressed to avoid confusion.
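For anyone triaging logs like the traces quoted above, here is a small illustrative helper (not part of ZFS; the function and variable names are hypothetical) that pulls out which tasks the kernel flagged, based on the standard `INFO: task NAME:PID blocked for more than N seconds.` message format:

```python
import re

# Matches the kernel's hung-task warning line, e.g.
# "INFO: task rsync:13631 blocked for more than 120 seconds."
# \S+ is greedy, so task names containing colons (kworker/0:0) still
# parse correctly: the final ":<digits>" is taken as the PID.
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def blocked_tasks(log_lines):
    """Return (name, pid, seconds) for each hung-task warning found."""
    hits = []
    for line in log_lines:
        m = HUNG_RE.search(line)
        if m:
            hits.append((m.group("name"), int(m.group("pid")), int(m.group("secs"))))
    return hits

log = [
    "[64318.772637] INFO: task rsync:13631 blocked for more than 120 seconds.",
    "[64318.773040] INFO: task kworker/0:0:13635 blocked for more than 120 seconds.",
]
print(blocked_tasks(log))
```

This only summarizes the warnings; as noted above, the warnings themselves indicate slow IO rather than corruption.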

The latter issues look slightly different, but there is not a lot of data to go on. So I'm going to close this issue, and we can open a new one if these problems persist with -rc12 and later.

pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023
…penzfs#796)

= Description

This commit allows us to add an object store bucket as a vdev in an
existing pool, and it is the first part of the DOSE Migration project.

= Note: Forcing Addition From `zpool add`

Attempting to add an object-store vdev without `-f` yields the following
error message:
```
$ sudo zpool add -o object-endpoint=etc.. testpool s3 cloudburst-data-2
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses disk and new vdev is objectstore
```

This is done on purpose for now. Adding an objects-store vdev to a pool
is an irreversible operation and should be handled with caution.

= Note: Syncing Labels & The Uberblock

When starting from a block-based pool and adding an object-store vdev,
there is a point where the object-store vdev is in our config but is
not yet accepting allocations, and therefore we cannot sync the config
to it. That point is the exact TXG where the vdev is added, and we
need to sync its config changes to the labels & uberblock of our
block-based vdevs.

For this reason, I adjusted all the codepaths under `vdev_config_sync()`
to be able to handle the update of the labels and uberblock of the local
devices even when there is an object-store vdev. This way, if the next
TXG fails we have the new vdev somewhere in our config. For all TXGs
from that point on, we always sync the object store's config first. This
is also the config that we always look at first when opening the pool.
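A rough sketch of the ordering described above (hypothetical names and structure; not the actual `vdev_config_sync()` code, just the invariant it maintains):

```python
# Hypothetical model of the config-sync ordering: the object store's
# config is written first each txg, and the labels/uberblocks of the
# local block devices are still updated, so that after a crash the new
# vdev exists in at least one copy of the config.
def config_sync(vdevs, txg):
    """Return the order in which configs/labels are written for this txg."""
    writes = []
    object_store = [v for v in vdevs if v["type"] == "objectstore"]
    block_devs = [v for v in vdevs if v["type"] != "objectstore"]
    # Object-store config first: it is also the copy consulted first
    # when opening the pool.
    for v in object_store:
        writes.append(("object-store-config", v["name"], txg))
    # Local labels & uberblocks are kept up to date even in a hybrid pool.
    for v in block_devs:
        writes.append(("label+uberblock", v["name"], txg))
    return writes

pool = [{"name": "s3-bucket", "type": "objectstore"},
        {"name": "sda", "type": "disk"},
        {"name": "sdb", "type": "disk"}]
print(config_sync(pool, 100))
```

The same function degenerates to plain label updates for a pure block-based pool, which matches the behavior before the vdev is added.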

The above changes to `vdev_config_sync()` also change the behavior of
existing pure (i.e. non-hybrid) object-store pools, which now
occasionally update the labels of their slog devices (e.g. every time
we dirty the pool's config). This should not have any negative effect
on existing pure object-store pools. On the contrary, it should keep
their labels up to date and potentially fix some extreme corner cases
in pool import.

= Note: ZIL allocations

When the pool is in a hybrid state (e.g. backed by both an object store
and block devices) with no slog devices, we could make zil allocations
fall back to the embedded slog or normal class. I left that
functionality as future work.  This is not a prerequisite for DOSE
migration as customers are expected to add zettacache(+ slog) devices as
the first part of migration and therefore their VMs will always have at
least one slog device when the object store vdev is added.

= Note: Storage Pool Checkpoint

Pure object-based pools (e.g. not hybrid ones) do the checkpoint
rewinding process in the object agent. This is a different mechanism
from the storage pool checkpoint in block-based pools. Until we have
the need to make those two mechanisms work well with each other, we
avoid any migrations to the object store while a zpool checkpoint is
in effect. See `spa_ld_checkpoint_rewind()` usage in
`spa_load_impl()` for more info.

= Note: Ordering of import paths

To import a hybrid pool we need to specify two import paths: (1)
the path of the local block devices (e.g. `/dev/..etc`) and (2)
the name of the bucket in the object store. Unfortunately given
how `zpool_find_import_agent()` is implemented, importing hybrid
pools only works if we specify (2) first and then (1) but not the
opposite. Doing the opposite results in the zpool command hanging
(again this is because of the current XXX in the aforementioned
function).

= Note: Testing Losing Power Mid-Addition of the Object Store vdev

I tested that by manually introducing panics like so:
```
diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index 5b55bb275..a82fab841 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -6969,7 +6969,9 @@ spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
         * if we lose power at any point in this sequence, ...
         * steps will be completed the next time we load ...
         */
+       ASSERT(B_FALSE); // <--- panic before config sync
        (void) spa_vdev_exit(spa, vd, txg, 0);
+       ASSERT(B_FALSE); // <--- panic after config sync

        mutex_enter(&spa_namespace_lock);
        spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
```

Importing the pool after the first panic, we come back without
the object-store vdev, as expected. Importing the pool after the
second panic, we come back with the object store but do not
allocate from it or change the spa_pool_type until we create
its metaslabs on disk.
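The two crash points can be modeled as a tiny sketch (hypothetical names; an illustration of the behavior described above, not the actual ZFS code): a crash before the config sync leaves no trace of the new vdev, while a crash after it leaves the vdev in the config but not yet open for allocations.

```python
# Hypothetical sketch of the crash-recovery behavior during vdev add.
def import_pool(crash_point):
    """Simulate what an import sees after a crash during object-store vdev add."""
    config_synced = crash_point != "before-config-sync"
    # Allocations begin only once the add fully completed (metaslabs on disk).
    metaslabs_created = crash_point is None
    if not config_synced:
        return {"has_objstore_vdev": False, "allocating": False}
    return {"has_objstore_vdev": True, "allocating": metaslabs_created}

# Crash before the config sync: the vdev is absent on import.
print(import_pool("before-config-sync"))
# Crash after the config sync but before metaslab creation: the vdev is
# present in the config but not yet used for allocations.
print(import_pool("after-config-sync"))
```

Either way the pool imports into a consistent state, which is the property the manual ASSERT panics above were checking.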

= Next Steps

I'm planning to implement the augmented device-removal logic, which
evacuates data from the block devices to the object store. Once that is
done, I plan to work on avoiding the insertion of all the
evacuated data into the zettacache.

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>