
Crash on scrub #8664

Closed
marker5a opened this issue Apr 24, 2019 · 29 comments · Fixed by #9321
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@marker5a

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version -
Linux Kernel 5.0.9-1-ck
Architecture x86_64
ZFS Version 0.8.0-rc4
SPL Version 0.8.0-rc4

Describe the problem you're observing

I have a ZFS pool running as my root filesystem, and upon starting a scrub on the pool, the system gets a kernel panic and the tty freezes up.

Describe how to reproduce the problem

Run zpool scrub on the pool.
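For example, with zroot as the pool name here:

  zpool scrub zroot
  zpool status -v zroot   # the hang follows shortly after the scrub starts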

Include any warning/errors/backtraces from the system logs

pool: zroot
state: ONLINE
scan: scrub paused since Tue Apr 23 20:09:21 2019
scrub started on Wed Feb 20 22:25:20 2019
0B scanned, 0B issued, 1.80T total
0B repaired, 0.00% done
config:

NAME                              STATE     READ WRITE CKSUM
zroot                             ONLINE       0     0     0
  raidz2-0                        ONLINE       0     0     0
    wwn-0x5000c500a5537122-part2  ONLINE       0     0     0
    wwn-0x5000c5006615a81c-part1  ONLINE       0     0     0
    wwn-0x5000c5007452f435        ONLINE       0     0     0
    wwn-0x50014ee25f72f76b        ONLINE       0     0     0

I'm trying to get a proper log, but wanted to at least share a picture that I was able to take... the system is not currently set up to save kernel panic info to a file, so once I can get more info on that, I will post it.

IMG_20190423_183323
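(For capturing the panic, a serial console is one option; a rough sketch, assuming a ttyS0 serial port and a second machine on the other end of the cable:)

  console=tty0 console=ttyS0,115200n8   # added to the kernel command line
  screen /dev/ttyUSB0 115200            # run on the machine at the other end of the cable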

@behlendorf
Contributor

I can't be entirely certain based on the screenshot, but I believe this is due to a change in the latest 5.0 kernel related to cgroups. I'd suggest rolling back your kernel to 4.19 while we sort this out.

@behlendorf added the Type: Building (Indicates an issue related to building binaries) label Apr 24, 2019
@behlendorf added this to the 0.8.0 milestone Apr 24, 2019
@marker5a
Author

Still seeing similar behavior on 4.18 (tried 4.19 as well); the only difference now is that dmesg isn't spitting anything out.

I'll go back to the 5.0 kernel and try to get better logs, but is there anything in particular of good use to look at? My knowledge of these subjects is not even a fraction of yours, so please bear with me!

@marker5a
Author

I should also point out that booting into another system and accessing this pool as a non-root partition works fine. I need to double-check which kernel was running in that test... will try to get that asap. Also, scrubbing a different pool from the same 5.0 kernel works fine too. It's weird that it seems to just be this combo.

@behlendorf
Contributor

So are you saying everything appears to work fine using the 5.0.9-1-ck kernel on Arch Linux as long as the pool isn't the root pool? That would be strange.

is there anything in particular of good use to look at

If possible it would be great to get the first stack dump from the console logs.

@marker5a
Author

marker5a commented May 1, 2019

Hey behlendorf

That seems to be the case for me... I can scrub it fine with whatever kernel I throw at it, unless it's the root pool. I've been tempted to create a new pool from this dataset and see if the problem moves with the pool, but for the time being, this seems like a good test case for a potential issue... not sure though?

I was finally able to get more data from my setup by using a serial cable to dump things out... I hope this helps further. I should've mentioned earlier... thanks to you (and all the developers) for all you do... I love the project! Thanks for taking the time to look at this!


[ 489.388218]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.388221]  ? preempt_count_add+0x79/0xb0
[  489.388223]  schedule+0x32/0xb0
[  489.388228]  cv_wait_common+0xf9/0x120 [spl]
[  489.388231]  ? wait_woken+0x80/0x80
[  489.388319]  spa_config_enter+0xef/0x100 [zfs]
[  489.388370]  zio_vdev_io_start+0x2de/0x2f0 [zfs]
[  489.388376]  ? tsd_get_by_thread+0x2a/0x40 [spl]
[  489.388425]  zio_nowait+0xc3/0x180 [zfs]
[  489.388457]  arc_read+0x8d6/0xfe0 [zfs]
[  489.388490]  ? dbuf_rele_and_unlock+0x620/0x620 [zfs]
[  489.388523]  dbuf_read+0x265/0xb40 [zfs]
[  489.388563]  dmu_tx_check_ioerr+0x61/0xb0 [zfs]
[  489.388602]  dmu_tx_count_write+0xe2/0x1a0 [zfs]
[  489.388641]  dmu_tx_hold_write+0x3c/0x50 [zfs]
[  489.388691]  zfs_putpage+0x1af/0x520 [zfs]
[  489.388694]  ? rmap_walk_file+0x123/0x260
[  489.388698]  ? percpu_counter_add_batch+0x84/0xb0
[  489.388746]  zpl_putpage+0x31/0x40 [zfs]
[  489.388750]  write_cache_pages+0x1c2/0x430
[  489.388798]  ? zpl_write_common_iovec+0xf0/0xf0 [zfs]
[  489.388800]  ? native_usergs_sysret64+0xf/0x10
[  489.388849]  zpl_writepages+0x8a/0x150 [zfs]
[  489.388851]  do_writepages+0x41/0xd0
[  489.388854]  ? lock_timer_base+0x67/0x80
[  489.388856]  ? try_to_del_timer_sync+0x4d/0x80
[  489.388858]  __writeback_single_inode+0x3d/0x3c0
[  489.388860]  ? del_timer_sync+0x35/0x40
[  489.388862]  writeback_sb_inodes+0x1e8/0x4c0
[  489.388865]  __writeback_inodes_wb+0x5d/0xb0
[  489.388867]  wb_writeback+0x28f/0x340
[  489.388870]  ? cpumask_next+0x16/0x20
[  489.388872]  wb_workfn+0x35d/0x430
[  489.388876]  process_one_work+0x1eb/0x410
[  489.388878]  worker_thread+0x2d/0x3d0
[  489.388881]  ? process_one_work+0x410/0x410
[  489.388883]  kthread+0x112/0x130
[  489.388885]  ? kthread_park+0x80/0x80
[  489.388887]  ret_from_fork+0x35/0x40
[  489.388908] INFO: task txg_sync:606 blocked for more than 120 seconds.
[  489.396078]       Tainted: P           OE     5.0.9-1-ck #1
[  489.402978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.412320] txg_sync        D    0   606      2 0x80000000
[  489.412322] Call Trace:
[  489.412325]  ? __schedule+0x6d9/0xe60
[  489.412328]  ? preempt_count_add+0x79/0xb0
[  489.412330]  ? _raw_spin_lock+0x13/0x30
[  489.412331]  ? preempt_count_add+0x79/0xb0
[  489.412333]  schedule+0x32/0xb0
[  489.412338]  cv_wait_common+0xf9/0x120 [spl]
[  489.412340]  ? wait_woken+0x80/0x80
[  489.412391]  spa_config_enter+0xef/0x100 [zfs]
[  489.412440]  spa_sync+0x197/0xfb0 [zfs]
[  489.412443]  ? _raw_spin_lock+0x13/0x30
[  489.412445]  ? _raw_spin_unlock+0x16/0x30
[  489.412447]  ? preempt_count_add+0x79/0xb0
[  489.412448]  ? _raw_spin_lock+0x13/0x30
[  489.412449]  ? _raw_spin_unlock+0x16/0x30
[  489.412500]  ? spa_txg_history_set.part.2+0xbd/0xd0 [zfs]
[  489.412551]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[  489.412601]  txg_sync_thread+0x28f/0x460 [zfs]
[  489.412653]  ? txg_thread_exit.isra.4+0x60/0x60 [zfs]
[  489.412659]  ? __thread_exit+0x20/0x20 [spl]
[  489.412665]  thread_generic_wrapper+0x6f/0x80 [spl]
[  489.412668]  kthread+0x112/0x130
[  489.412669]  ? kthread_park+0x80/0x80
[  489.412671]  ret_from_fork+0x35/0x40
[  489.412675] INFO: task journal-offline:14234 blocked for more than 120 seconds.
[  489.421012]       Tainted: P           OE     5.0.9-1-ck #1
[  489.421015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.437193] journal-offline D    0 14234      1 0x00000180
[  489.437197] Call Trace:
[  489.437206]  ? __schedule+0x6d9/0xe60
[  489.437281]  ? dmu_zfetch+0x530/0x540 [zfs]
[  489.437285]  ? preempt_count_add+0x79/0xb0
[  489.437287]  ? _raw_spin_lock+0x13/0x30
[  489.437289]  ? preempt_count_add+0x79/0xb0
[  489.437291]  schedule+0x32/0xb0
[  489.437300]  cv_wait_common+0xf9/0x120 [spl]
[  489.437303]  ? wait_woken+0x80/0x80
[  489.437336]  dbuf_read+0x3f5/0xb40 [zfs]
[  489.437370]  ? dbuf_hold_impl+0xa3/0xc0 [zfs]
[  489.437409]  dmu_tx_check_ioerr+0x61/0xb0 [zfs]
[  489.437448]  dmu_tx_count_write+0xe2/0x1a0 [zfs]
[  489.437487]  dmu_tx_hold_write+0x3c/0x50 [zfs]
[  489.437536]  zfs_putpage+0x1af/0x520 [zfs]
[  489.437540]  ? rmap_walk_file+0x123/0x260
[  489.437545]  ? percpu_counter_add_batch+0x84/0xb0
[  489.437593]  zpl_putpage+0x31/0x40 [zfs]
[  489.437596]  write_cache_pages+0x1c2/0x430
[  489.437644]  ? zpl_write_common_iovec+0xf0/0xf0 [zfs]
[  489.437693]  zpl_writepages+0x8a/0x150 [zfs]
[  489.437696]  do_writepages+0x41/0xd0
[  489.437699]  __filemap_fdatawrite_range+0xbe/0xf0
[  489.437702]  filemap_write_and_wait_range+0x30/0x80
[  489.437750]  zpl_fsync+0x37/0x90 [zfs]
[  489.437753]  ? __fget+0x6e/0xa0
[  489.437756]  do_fsync+0x38/0x70
[  489.437758]  __x64_sys_fsync+0x10/0x20
[  489.437761]  do_syscall_64+0x5b/0x170
[  489.437763]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  489.437766] RIP: 0033:0x7f7227fa7807
[  489.437771] Code: Bad RIP value.
[  489.437773] RSP: 002b:00007f7225101db0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[  489.437775] RAX: ffffffffffffffda RBX: 0000000000000016 RCX: 00007f7227fa7807
[  489.437776] RDX: 0000000000000000 RSI: 00007f7227cce8c6 RDI: 0000000000000016
[  489.437777] RBP: 00007f7227cd0120 R08: 00007f7225102700 R09: 00007f7225102700
[  489.437778] R10: 00000000000008c6 R11: 0000000000000293 R12: 0000000000000002
[  489.437779] R13: 00007ffdc2834bcf R14: 00007f7225102700 R15: 0000000000000015
[  489.437785] INFO: task sd-resolve:4715 blocked for more than 120 seconds.
[  489.445325]       Tainted: P           OE     5.0.9-1-ck #1
[  489.452233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.461538] sd-resolve      D    0  4715      1 0x00000180
[  489.461540] Call Trace:
[  489.461543]  ? __schedule+0x6d9/0xe60
[  489.461549]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.461554]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.461557]  ? preempt_count_add+0x79/0xb0
[  489.461559]  schedule+0x32/0xb0
[  489.461563]  cv_wait_common+0xf9/0x120 [spl]
[  489.461565]  ? wait_woken+0x80/0x80
[  489.461616]  spa_config_enter+0xef/0x100 [zfs]
[  489.461666]  zio_vdev_io_start+0x2de/0x2f0 [zfs]
[  489.461672]  ? tsd_get_by_thread+0x2a/0x40 [spl]
[  489.461721]  zio_nowait+0xc3/0x180 [zfs]
[  489.461753]  arc_read+0x8d6/0xfe0 [zfs]
[  489.461786]  ? dbuf_rele_and_unlock+0x620/0x620 [zfs]
[  489.461819]  dbuf_read+0x265/0xb40 [zfs]
[  489.461855]  dmu_buf_hold_array_by_dnode+0x104/0x470 [zfs]
[  489.461892]  dmu_read_uio_dnode+0x47/0xf0 [zfs]
[  489.461895]  ? _raw_spin_unlock+0x16/0x30
[  489.461945]  ? rangelock_enter+0x14d/0x510 [zfs]
[  489.461981]  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[  489.462030]  zfs_read+0x127/0x480 [zfs]
[  489.462079]  zpl_read_common_iovec+0x87/0xd0 [zfs]
[  489.462128]  zpl_iter_read+0xd4/0x120 [zfs]
[  489.462131]  __vfs_read+0x14a/0x1a0
[  489.462134]  vfs_read+0x8a/0x130
[  489.462136]  ksys_read+0x4f/0xb0
[  489.462138]  do_syscall_64+0x5b/0x170
[  489.462140]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  489.462141] RIP: 0033:0x7fb4e6032098
[  489.462144] Code: Bad RIP value.
[  489.462145] RSP: 002b:00007fb4e466e248 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  489.462147] RAX: ffffffffffffffda RBX: 00007fb4dc000b20 RCX: 00007fb4e6032098
[  489.462148] RDX: 0000000000000200 RSI: 00007fb4dc000df0 RDI: 000000000000000f
[  489.462149] RBP: 0000000000000d68 R08: 0000000000000c00 R09: 00007fb4e46746e4
[  489.462150] R10: 00007fb4e466e7a0 R11: 0000000000000246 R12: 00007fb4e60fc5c0
[  489.462152] R13: 00007fb4e60fd1c0 R14: 0000000000000000 R15: 00007fb4e466e790
[  489.462157] INFO: task openvpn:5550 blocked for more than 120 seconds.
[  489.469400]       Tainted: P           OE     5.0.9-1-ck #1
[  489.476325] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.485628] openvpn         D    0  5550      1 0x00000080
[  489.485630] Call Trace:
[  489.485633]  ? __schedule+0x6d9/0xe60
[  489.485638]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.485644]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.485646]  ? preempt_count_add+0x79/0xb0
[  489.485647]  schedule+0x32/0xb0
[  489.485652]  cv_wait_common+0xf9/0x120 [spl]
[  489.485654]  ? wait_woken+0x80/0x80
[  489.485705]  spa_config_enter+0xef/0x100 [zfs]
[  489.485754]  zio_vdev_io_start+0x2de/0x2f0 [zfs]
[  489.485760]  ? tsd_get_by_thread+0x2a/0x40 [spl]
[  489.485809]  zio_nowait+0xc3/0x180 [zfs]
[  489.485840]  arc_read+0x8d6/0xfe0 [zfs]
[  489.485873]  ? dbuf_rele_and_unlock+0x620/0x620 [zfs]
[  489.485922]  ? zio_null+0x2b/0x30 [zfs]
[  489.485955]  dbuf_read+0x265/0xb40 [zfs]
[  489.485995]  dmu_tx_check_ioerr+0x61/0xb0 [zfs]
[  489.486033]  dmu_tx_count_write+0x65/0x1a0 [zfs]
[  489.486072]  dmu_tx_hold_write+0x3c/0x50 [zfs]
[  489.486121]  zfs_write+0x3e3/0xd60 [zfs]
[  489.486126]  ? poll_select_copy_remaining+0x1b0/0x1b0
[  489.486129]  ? kmem_cache_free+0x1d3/0x200
[  489.486177]  zpl_write_common_iovec+0x94/0xf0 [zfs]
[  489.486225]  zpl_iter_write+0xf0/0x130 [zfs]
[  489.486228]  __vfs_write+0x14d/0x1b0
[  489.486230]  vfs_write+0xa9/0x1a0
[  489.486232]  ksys_write+0x4f/0xb0
[  489.486235]  do_syscall_64+0x5b/0x170
[  489.486236]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  489.486238] RIP: 0033:0x7f9a3b1c81c5
[  489.486240] Code: Bad RIP value.
[  489.486241] RSP: 002b:00007ffee2190ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  489.486243] RAX: ffffffffffffffda RBX: 000055792ba5df00 RCX: 00007f9a3b1c81c5
[  489.486244] RDX: 000000000000009e RSI: 00007ffee2190f00 RDI: 0000000000000004
[  489.486245] RBP: 000000000000009e R08: 0000000000000164 R09: 0000000000000000
[  489.486247] R10: 00007ffee2191102 R11: 0000000000000246 R12: 00007ffee2190f00
[  489.486248] R13: 0000000000000004 R14: 000055792ba55500 R15: 0000000000000005
[  489.486252] INFO: task openvpn:5553 blocked for more than 120 seconds.
[  489.493492]       Tainted: P           OE     5.0.9-1-ck #1
[  489.500400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.509712] openvpn         D    0  5553      1 0x00000080
[  489.509713] Call Trace:
[  489.509716]  ? __schedule+0x6d9/0xe60
[  489.509722]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.509724]  ? preempt_count_add+0x79/0xb0
[  489.509726]  schedule+0x32/0xb0
[  489.509730]  cv_wait_common+0xf9/0x120 [spl]
[  489.509732]  ? wait_woken+0x80/0x80
[  489.509783]  spa_config_enter+0xef/0x100 [zfs]
[  489.509832]  zio_vdev_io_start+0x2de/0x2f0 [zfs]
[  489.509839]  ? tsd_get_by_thread+0x2a/0x40 [spl]
[  489.509887]  zio_nowait+0xc3/0x180 [zfs]
[  489.509919]  arc_read+0x8d6/0xfe0 [zfs]
[  489.509951]  ? dbuf_rele_and_unlock+0x620/0x620 [zfs]
[  489.510000]  ? zio_null+0x2b/0x30 [zfs]
[  489.510033]  dbuf_read+0x265/0xb40 [zfs]
[  489.510073]  dmu_tx_check_ioerr+0x61/0xb0 [zfs]
[  489.510111]  dmu_tx_count_write+0x65/0x1a0 [zfs]
[  489.510149]  dmu_tx_hold_write+0x3c/0x50 [zfs]
[  489.510199]  zfs_write+0x3e3/0xd60 [zfs]
[  489.510203]  ? __switch_to_asm+0x34/0x70
[  489.510204]  ? __switch_to_asm+0x40/0x70
[  489.510206]  ? __switch_to_asm+0x34/0x70
[  489.510207]  ? __switch_to_asm+0x40/0x70
[  489.510209]  ? __switch_to_asm+0x34/0x70
[  489.510210]  ? __switch_to_asm+0x40/0x70
[  489.510211]  ? __switch_to_asm+0x34/0x70
[  489.510213]  ? __switch_to_asm+0x40/0x70
[  489.510214]  ? __switch_to_asm+0x34/0x70
[  489.510216]  ? __switch_to_asm+0x40/0x70
[  489.510217]  ? _raw_spin_unlock+0x16/0x30
[  489.510266]  zpl_write_common_iovec+0x94/0xf0 [zfs]
[  489.510314]  zpl_iter_write+0xf0/0x130 [zfs]
[  489.510317]  __vfs_write+0x14d/0x1b0
[  489.510319]  vfs_write+0xa9/0x1a0
[  489.510321]  ksys_write+0x4f/0xb0
[  489.510323]  do_syscall_64+0x5b/0x170
[  489.510325]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  489.510327] RIP: 0033:0x7fa4dfdd11c5
[  489.510329] Code: Bad RIP value.
[  489.510330] RSP: 002b:00007ffc38482808 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  489.510332] RAX: ffffffffffffffda RBX: 000055db0fd0cc40 RCX: 00007fa4dfdd11c5
[  489.510333] RDX: 000000000000009e RSI: 00007ffc38482830 RDI: 0000000000000003
[  489.510334] RBP: 000000000000009e R08: 0000000000000164 R09: 0000000000000000
[  489.510335] R10: 00007ffc38482a32 R11: 0000000000000246 R12: 00007ffc38482830
[  489.510337] R13: 0000000000000003 R14: 000055db0fd04420 R15: 0000000000000000
[  489.510342] INFO: task ntpd:6079 blocked for more than 120 seconds.
[  489.517324]       Tainted: P           OE     5.0.9-1-ck #1
[  489.524237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.533556] ntpd            D    0  6079      1 0x00000080
[  489.533558] Call Trace:
[  489.533561]  ? __schedule+0x6d9/0xe60
[  489.533567]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.533572]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.533574]  ? preempt_count_add+0x79/0xb0
[  489.533576]  schedule+0x32/0xb0
[  489.533580]  cv_wait_common+0xf9/0x120 [spl]
[  489.533582]  ? wait_woken+0x80/0x80
[  489.533633]  spa_config_enter+0xef/0x100 [zfs]
[  489.533682]  zio_vdev_io_start+0x2de/0x2f0 [zfs]
[  489.533688]  ? tsd_get_by_thread+0x2a/0x40 [spl]
[  489.533737]  zio_nowait+0xc3/0x180 [zfs]
[  489.533768]  arc_read+0x8d6/0xfe0 [zfs]
[  489.533801]  ? dbuf_rele_and_unlock+0x620/0x620 [zfs]
[  489.533835]  dbuf_read+0x265/0xb40 [zfs]
[  489.533871]  dmu_buf_hold_array_by_dnode+0x104/0x470 [zfs]
[  489.533907]  dmu_read_uio_dnode+0x47/0xf0 [zfs]
[  489.533909]  ? _raw_spin_unlock+0x16/0x30
[  489.533959]  ? rangelock_enter+0x14d/0x510 [zfs]
[  489.533995]  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[  489.534045]  zfs_read+0x127/0x480 [zfs]
[  489.534094]  zpl_read_common_iovec+0x87/0xd0 [zfs]
[  489.534142]  zpl_iter_read+0xd4/0x120 [zfs]
[  489.534145]  __vfs_read+0x14a/0x1a0
[  489.534147]  vfs_read+0x8a/0x130
[  489.534149]  ksys_read+0x4f/0xb0
[  489.534151]  do_syscall_64+0x5b/0x170
[  489.534153]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  489.534154] RIP: 0033:0x7fdacecc2098
[  489.534157] Code: Bad RIP value.
[  489.534158] RSP: 002b:00007fdacebc5478 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  489.534160] RAX: ffffffffffffffda RBX: 00007fdac8001180 RCX: 00007fdacecc2098
[  489.534161] RDX: 0000000000002000 RSI: 00007fdac8004430 RDI: 0000000000000004
[  489.534162] RBP: 0000000000000d68 R08: 0000000000000000 R09: 0000000000000003
[  489.534163] R10: 00007fdac80008d0 R11: 0000000000000246 R12: 00007fdaced8c5c0
[  489.534164] R13: 00007fdaced8d1c0 R14: 0000000000000000 R15: 0000000000000400
[  489.534171] INFO: task systemd:13225 blocked for more than 120 seconds.
[  489.541509]       Tainted: P           OE     5.0.9-1-ck #1
[  489.548420] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.557743] systemd         D    0 13225      1 0x00000080
[  489.557745] Call Trace:
[  489.557747]  ? __schedule+0x6d9/0xe60
[  489.557753]  ? spl_kmem_cache_alloc+0x112/0x760 [spl]
[  489.557756]  ? preempt_count_add+0x79/0xb0
[  489.557757]  schedule+0x32/0xb0
[  489.557761]  cv_wait_common+0xf9/0x120 [spl]
[  489.557764]  ? wait_woken+0x80/0x80
[  489.557814]  spa_config_enter+0xef/0x100 [zfs]
[  489.557866]  zio_vdev_io_start+0x2de/0x2f0 [zfs]
[  489.557872]  ? tsd_get_by_thread+0x2a/0x40 [spl]
[  489.557921]  zio_nowait+0xc3/0x180 [zfs]
[  489.557952]  arc_read+0x8d6/0xfe0 [zfs]
[  489.557984]  ? dbuf_rele_and_unlock+0x620/0x620 [zfs]
[  489.558018]  dbuf_read+0x265/0xb40 [zfs]
[  489.558054]  dmu_buf_hold_array_by_dnode+0x104/0x470 [zfs]
[  489.558090]  dmu_read_impl+0xa3/0x170 [zfs]
[  489.558126]  dmu_read+0x56/0x90 [zfs]
[  489.558128]  ? _raw_spin_lock+0x13/0x30
[  489.558177]  zfs_getpage+0xe4/0x170 [zfs]
[  489.558226]  zpl_readpage+0x53/0xb0 [zfs]
[  489.558228]  filemap_fault+0x3d0/0x630
[  489.558231]  __do_fault+0x38/0x120
[  489.558233]  __handle_mm_fault+0xb37/0x12a0
[  489.558236]  handle_mm_fault+0x10a/0x250
[  489.558239]  __do_page_fault+0x264/0x4d0
[  489.558241]  do_page_fault+0x32/0x130
[  489.558244]  ? page_fault+0x8/0x30
[  489.558245]  page_fault+0x1e/0x30
[  489.558247] RIP: 0033:0x7f41663d4dd0
[  489.558249] Code: Bad RIP value.
[  489.558250] RSP: 002b:00007fffdb4f7ed8 EFLAGS: 00010202
[  489.558252] RAX: 0000556f1ebbb240 RBX: 0000556f1ebbb970 RCX: 0000556f1ebbb240
[  489.558253] RDX: 0000556f1ebbb800 RSI: 0000556f1ebbb1e0 RDI: 0000556f1ebaae18
[  489.558254] RBP: 0000556f1ebaae18 R08: 0000556f1ebbb229 R09: 0000000000000020
[  489.558255] R10: 00000000000000c0 R11: 00007f41664dfc40 R12: 0000000000000002
[  489.558256] R13: 0000000000000000 R14: 00007fffdb4f7f80 R15: 0000556f1da94e98
[  489.558260] INFO: task zpool:14217 blocked for more than 120 seconds.
[  489.565509]       Tainted: P           OE     5.0.9-1-ck #1
[  489.572425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.581748] zpool           D    0 14217  13847 0x00000080
[  489.581750] Call Trace:
[  489.581752]  ? __schedule+0x6d9/0xe60
[  489.581755]  ? __wake_up_common+0x77/0x140
[  489.581757]  ? preempt_count_add+0x79/0xb0
[  489.581808]  ? vdev_open_children+0x160/0x160 [zfs]
[  489.581809]  schedule+0x32/0xb0
[  489.581815]  taskq_wait+0x72/0xc0 [spl]
[  489.581818]  ? wait_woken+0x80/0x80
[  489.581824]  taskq_destroy+0x40/0x160 [spl]
[  489.581874]  vdev_open_children+0x10a/0x160 [zfs]
[  489.581926]  vdev_root_open+0x37/0x120 [zfs]
[  489.581977]  vdev_open+0xa8/0x710 [zfs]
[  489.582028]  ? vdev_cache_purge+0x6e/0x80 [zfs]
[  489.582078]  vdev_reopen+0x2e/0xc0 [zfs]
[  489.582124]  dsl_scan+0x36/0x120 [zfs]
[  489.582175]  zfs_ioc_pool_scan+0x5a/0xd0 [zfs]
[  489.582226]  zfsdev_ioctl+0x6cf/0x8c0 [zfs]
[  489.582229]  do_vfs_ioctl+0xa4/0x630
[  489.582231]  ? handle_mm_fault+0x10a/0x250
[  489.582233]  ? syscall_trace_enter+0x1d3/0x2d0
[  489.582235]  ksys_ioctl+0x60/0x90
[  489.582237]  __x64_sys_ioctl+0x16/0x20
[  489.582238]  do_syscall_64+0x5b/0x170
[  489.582240]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  489.582242] RIP: 0033:0x7fa734019cbb
[  489.582244] Code: Bad RIP value.
[  489.582245] RSP: 002b:00007ffc61385548 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  489.582247] RAX: ffffffffffffffda RBX: 00007ffc61385580 RCX: 00007fa734019cbb
[  489.582248] RDX: 00007ffc61385580 RSI: 0000000000005a07 RDI: 0000000000000003
[  489.582249] RBP: 00007ffc61388f70 R08: 0000000000000008 R09: 00007fa734065800
[  489.582250] R10: 0000563d7ddd4010 R11: 0000000000000246 R12: 0000000000000001
[  489.582251] R13: 0000563d7ddd7570 R14: 0000000000000000 R15: 0000563d7ddd5430
[  489.582254] INFO: task vdev_open:14218 blocked for more than 120 seconds.
[  489.589876]       Tainted: P           OE     5.0.9-1-ck #1
[  489.596781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.606093] vdev_open       D    0 14218      2 0x80000080
[  489.606095] Call Trace:
[  489.606098]  ? __schedule+0x6d9/0xe60
[  489.606101]  ? preempt_count_add+0x79/0xb0
[  489.606151]  ? vdev_open_children+0x160/0x160 [zfs]
[  489.606153]  schedule+0x32/0xb0
[  489.606159]  taskq_wait+0x72/0xc0 [spl]
[  489.606161]  ? wait_woken+0x80/0x80
[  489.606167]  taskq_destroy+0x40/0x160 [spl]
[  489.606217]  vdev_open_children+0x10a/0x160 [zfs]
[  489.606268]  vdev_raidz_open+0x40/0x170 [zfs]
[  489.606319]  vdev_open+0xa8/0x710 [zfs]
[  489.606321]  ? __wake_up_common_lock+0x89/0xc0
[  489.606371]  vdev_open_child+0x1e/0x40 [zfs]
[  489.606378]  taskq_thread+0x2ca/0x490 [spl]
[  489.606380]  ? wake_up_q+0x70/0x70
[  489.606387]  ? task_done+0x90/0x90 [spl]
[  489.606390]  kthread+0x112/0x130
[  489.606392]  ? kthread_park+0x80/0x80
[  489.606394]  ret_from_fork+0x35/0x40

@qix67

qix67 commented May 10, 2019

Hello,

I have the same issue but with different kernel and ZFS versions. I run Gentoo.

Until 2 weeks ago, my kernel was 4.14.52 with ZoL 0.7.6. My pools were:

  • root:
    3 way-mirror
    sda, sdb, sdc
    3x240GB SSD

  • data:
    raid-z1
    sdd, sde, sdf
    3x2TB 5900rpm HD

Two weeks ago, I slightly upgraded my hardware and updated the kernel and ZoL. The kernel is now 4.19.27 with ZoL 0.7.13. My pools are:

  • boot:
    3 way-mirror
    sda1, sdb1, sdc1
    3x300MB SSD partition

  • root:
    2 way-mirror
    nvme0, nvme1
    2x480GB NVMe

  • data
    raid-z1
    sdd, sde, sdf
    3x2TB 5900rpm HD

    • ZIL mirror:
      sdb2, sdc2
      2x239GB SSD
    • cache:
      sda2
      239GB SSD

The previous setup worked flawlessly for a year, nearly without a reboot, scrubbing the root pool once per week.

Since the update, I have no problem scrubbing the boot and data pools, but most of the time the kernel hangs when I scrub the root pool, always without any message; I can still switch between ttys. Each time, I scrub the boot pool and then the root pool. After each crash, my mariadb database is corrupted beyond recovery and I have to roll back a zfs snapshot to restart it.

Yesterday evening, after one week of uptime, the kernel hung when the root pool scrub started. I rebooted my computer, removed the data pool's ZIL and cache devices, and was able to scrub the root pool without problems.

This morning, I re-added the data pool's ZIL and cache devices and still encountered no problem scrubbing the root pool.

Server activity is quite low. Hardware is a Xeon L5640 (6c/12t) with 72GB ECC RAM. At the moment, more than 32GB of RAM is free and the CPU load average is roughly 0.5.

Pool capacities are:

  • boot 10%
  • root 37%
  • data 65%
    At most, the data ZIL contained 1GB and the cache was nearly full (after 1 week of uptime).

NVMe, SSD and HD temperatures are OK. The NVMe drives stay below 47°C during a scrub, the SSDs below 23°C and the HDs below 30°C.

No hardware problem seems to exist.

Thanks for all your work.

@xyzpublic

0.7.5 with kernel 4.15. While resuming from S3 sleep, the kernel randomly freezes. Any connection here?

@marker5a
Author

marker5a commented May 23, 2019 via email

@qix67

qix67 commented May 23, 2019

I also think so, as scrubbing root on a freshly booted computer works for me.

Regarding my previous post, I confirm the problem. I disabled root scrubbing, and since then my server has not crashed.

I do not use suspend/hibernate on my server.

@behlendorf removed this from the 0.8.0 milestone May 23, 2019
@behlendorf added the Type: Defect (Incorrect behavior, e.g. crash, hang) label and removed the Type: Building (Indicates an issue related to building binaries) label May 23, 2019
@vanes32

vanes32 commented Aug 5, 2019

I have the same issue on my 2 servers. Both are using root pools. The pool is newly created. When I run a scrub right after reboot, it runs fine. After some uptime, the issue happens when I run a scrub. Here are logs (journalctl -f and dmesg -wT) from when the scrub issue happens. The hardware is 100% fine.
journalctl -f.txt
dmesg -wT.txt

@marker5a closed this as completed Aug 8, 2019
@marker5a reopened this Aug 8, 2019
@marker5a
Author

marker5a commented Aug 8, 2019

I have the same issue on my 2 servers. Both are using root pools. The pool is newly created. When I run a scrub right after reboot, it runs fine. After some uptime, the issue happens when I run a scrub. Here are logs (journalctl -f and dmesg -wT) from when the scrub issue happens. The hardware is 100% fine.
journalctl -f.txt
dmesg -wT.txt

Does the same behavior occur if you scrub the pool as a non-root pool? I'm curious to see whether this is tied to it being a root fs as well.

@vanes32

vanes32 commented Aug 8, 2019

I will try to create an additional non-root pool, do some tests, and write my results here.

@marker5a
Author

marker5a commented Aug 8, 2019

@behlendorf FWIW, I do have an interesting data point for this... not sure if it's helpful. If I initiate the scrub from my initramfs, before the real root is handed off, the scrub behaves normally and proceeds OK. From that point on, I can finish booting into the system and the scrub continues without issue.
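For anyone wanting to try the same trick, a rough sketch (the pool name and initramfs details are only examples; break=premount is the mkinitcpio hook, rd.break the dracut equivalent):

  # boot with break=premount (mkinitcpio) or rd.break (dracut) to get an initramfs shell
  zpool import -N zroot    # import without mounting
  zpool scrub zroot        # start the scrub here, then let the boot continue
  exit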

@vanes32

vanes32 commented Aug 11, 2019

Today was auto-scrub day. One of my two servers hung during it. This is the last thing I saw:
IMG_20190811_093108

@marker5a
Author

Today was auto-scrub day. One of my two servers hung during it. This is the last thing I saw:
IMG_20190811_093108

Can you provide specs for your server? CPU and RAM? I've only got 8GB on my machine, and given the point that @behlendorf made about cgroups, I suspect that maybe I just don't have enough RAM to initiate the scrub in the root environment. I did try downgrading to older kernels without success, but I'm still wondering if memory restrictions are at play here.

@vanes32

vanes32 commented Aug 12, 2019

This is my test server: a small J4205 with 8GB of RAM running the latest Proxmox VE 6.0.5 with ZFS 0.8.1, and zfs_arc_max set to 3GB. The problem began with the new kernel.

@pongraczi

Hi, there is now a long thread on the Proxmox forum about this kind of issue:
https://forum.proxmox.com/threads/proxmox-v6-servers-freeze-zvol-blocked-for-more-than-120s.57765/

In short:
Kernel 5.x, ZFS 0.8.x, systems hang unpredictably; some users are able to trigger the hang by starting a scrub directly.
It does not seem to be related to memory, CPU, or zpool layout; a wide range of configs is affected.

@pongraczi

On the forum, wasteground posted a kernel log about the situation:
proxmox3.log

He wrote this:

Okay so I managed to move things around a little earlier than expected, and so the new information
from me is:

  • I stopped the only VM on this machine (my storage VM providing backup space for the proxmox cluster), so no workloads are running on the node
  • I did "echo 1 > /proc/sys/kernel/sysrq" and "echo t > /proc/sysrq-trigger" with remote syslog enabled

I ran one "echo t" initially, then started the zpool scrub, then managed to run "echo t" several more times before everything started falling apart on the box (and then I had to power cycle it, it returns to service just fine and shows nothing about the zpool scrub at all - so I guess the scrub actually never really starts). Trying to run any process on the box at this point that requires disk I/O just stalls and cannot be returned from. Anything in memory/already running that doesn't need disk I/O seems to keep running just fine.

The log is attached - I annotated each section with ">>>" to show where the sysrq-triggers were fired, so hopefully this provides some useful info.

The last "section" of the log shows processes getting stuck because of no I/O i think. It's almost exactly as @sandor mentioned, it's like the box lost all disks.

@behlendorf
Contributor

@pongraczi thank you for posting the logs. I believe I see what's causing the reported hangs. As a workaround you can try setting zfs_vdev_scheduler=none. This should avoid the problem; it looks like 4.12 and newer kernels might encounter it.
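Roughly, on a typical install (the paths are the usual ones, adjust for your distribution; on root-on-ZFS you will also want to regenerate the initramfs so the option takes effect at pool import):

  echo none > /sys/module/zfs/parameters/zfs_vdev_scheduler                # at runtime, if the module is loaded and the parameter is writable
  echo "options zfs zfs_vdev_scheduler=none" >> /etc/modprobe.d/zfs.conf   # persists across reboots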

@pongraczi

@behlendorf thank you, I posted it on the Proxmox forum; I guess some users with such systems will test it soon.

@pongraczi

Just for the record, there are scheduler checks from before and after on the Proxmox forum, and it seems:

  • the /sys/block/sdX/queue/scheduler is [mq-deadline] none before and after, too, so no changes
  • the /sys/module/zfs/parameters/zfs_vdev_scheduler changed from noop to none

One of my systems was not affected, even though it is a remote backup server reached over the internet with really slow bandwidth. It has only hard disks as vdevs; only the log/ZIL is on SSD.

Please check the Proxmox forum for details, as other users posted their configs there.
Just let me know if you need them posted here, and I will copy them over.

@pongraczi

Example of an affected config:

root@lm1:~# zpool status
  pool: rpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 01:08:43 with 0 errors on Sun Jul 14 01:32:44 2019
config:

    NAME                                             STATE     READ WRITE CKSUM
    rpool                                            ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        sde2                                         ONLINE       0     0     0
        sdf2                                         ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        ata-Samsung_SSD_850_PRO_1TB_S252NXAG805442N  ONLINE       0     0     0
        ata-Samsung_SSD_850_PRO_1TB_S252NXAG805106B  ONLINE       0     0     0

errors: No known data errors

  pool: zbackup
state: ONLINE
  scan: scrub repaired 0B in 1 days 03:32:58 with 0 errors on Mon Jul 15 03:57:00 2019
config:

    NAME                                          STATE     READ WRITE CKSUM
    zbackup                                       ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3UZC1F1  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M7ZESTA2  ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M7ZEST6D  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6AVE83Y  ONLINE       0     0     0

errors: No known data errors

The following is the same before and after changing zfs_vdev_scheduler:

root@lm1:~# for i in a b c d e f g h; do cat /sys/block/sd$i/queue/scheduler; done
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none

Before:

root@lm1:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
noop

After:

root@lm1:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
none

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2019
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices.  At the time this was simple
because the kernel provided an API function which did exactly this.

This API was then removed in the Linux 4.12 kernel, which prompted
us to add compatibility code to set the elevator via a usermodehelper.
While well intentioned, this can result in a system hang if the
usermodehelper does not return.  This is because it may be called
with the spa config lock held as a writer, for example via the
zfs_ioc_pool_scan() -> spa_scan() -> spa_scan() -> vdev_reopen()
call path.

Additionally, there's a reasonable argument that setting the elevator
is the responsibility of whatever software is being used to configure
the system.  Therefore, it's proposed that the zfs_vdev_elevator
option be removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#8664
@behlendorf
Contributor

Thanks for verifying the proposed workaround. For those interested, this is caused by an interaction between the scrub, which attempted to reopen the devices, and the compatibility code required for the zfs.ko to set the block device's scheduler. Setting zfs_vdev_scheduler=none prevents the issue by instructing the zfs.ko to never modify the scheduler.

I've opened #9317, which proposes to remove the zfs_vdev_scheduler module option and delegate setting the scheduler as needed entirely to user space. Additional information is provided in the PR and discussion is welcome. This change would apply to the next major release; we should be able to do something less disruptive for the current release.
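As a sketch of what the user-space replacement could look like, a udev rule along these lines (the rule file name and device match are only examples, adjust them for your disks) sets the scheduler for whole disks:

  # /etc/udev/rules.d/66-zfs-scheduler.rules
  ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"
  # then reload: udevadm control --reload-rules && udevadm trigger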

behlendorf pushed a commit that referenced this issue Sep 14, 2019
#9321)

Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices. At the time this was simple
because the kernel provided an API function which did exactly this.

This API was then removed in the Linux 4.12 kernel, which prompted
us to add compatibility code to set the elevator via a usermodehelper.

Unfortunately, changing the elevator via usermodehelper requires reading
some userland binaries, most notably modprobe(8) or sh(1), from a zfs
dataset on systems with root-on-zfs. This can deadlock the system if
used during the following call path, because it may need to read
directly from disk (if the data is not already cached in the ARC) while
holding the spa config lock as a writer:

  zfs_ioc_pool_scan()
    -> spa_scan()
      -> spa_scan()
        -> vdev_reopen()
          -> vdev_elevator_switch()
            -> call_usermodehelper()

While the usermodehelper waits for sh(1), modprobe(8) is blocked in the
ZIO pipeline trying to read from disk:

  INFO: task modprobe:2650 blocked for more than 10 seconds.
       Tainted: P           OE     5.2.14
  modprobe        D    0  2650    206 0x00000000
  Call Trace:
   ? __schedule+0x244/0x5f0
   schedule+0x2f/0xa0
   cv_wait_common+0x156/0x290 [spl]
   ? do_wait_intr_irq+0xb0/0xb0
   spa_config_enter+0x13b/0x1e0 [zfs]
   zio_vdev_io_start+0x51d/0x590 [zfs]
   ? tsd_get_by_thread+0x3b/0x80 [spl]
   zio_nowait+0x142/0x2f0 [zfs]
   arc_read+0xb2d/0x19d0 [zfs]
   ...
   zpl_iter_read+0xfa/0x170 [zfs]
   new_sync_read+0x124/0x1b0
   vfs_read+0x91/0x140
   ksys_read+0x59/0xd0
   do_syscall_64+0x4f/0x130
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This commit changes how we use the usermodehelper functionality from
synchronous (UMH_WAIT_PROC) to asynchronous (UMH_NO_WAIT) which prevents
scrubs, and other vdev_elevator_switch() consumers, from triggering the
aforementioned issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Issue #8664 
Closes #9321
@cyril23

cyril23 commented Sep 14, 2019

@behlendorf Maybe you could have a look at my logs at https://forum.proxmox.com/threads/proxmox-v6-servers-freeze-zvol-blocked-for-more-than-120s.57765/post-267351 just to confirm your solution fits my case, too? That would make me feel a lot better! Thanks

behlendorf added a commit that referenced this issue Sep 25, 2019
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices.  At the time this was simple
because the kernel provided an API function which did exactly this.

This API was then removed in the Linux 4.12 kernel, which prompted
us to add compatibility code to set the elevator via a usermodehelper.
While well intentioned, this introduced a bug which could cause a
system hang; that issue was subsequently fixed by commit 2a0d418.

In order to avoid future bugs in this area, and to simplify the code,
this functionality is being deprecated.  A console warning has been
added to notify any existing consumers and the documentation updated
accordingly.  This option will remain for the lifetime of the 0.8.x
series for compatibility but is planned to be phased out of master.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664
Closes #9317
behlendorf added a commit that referenced this issue Nov 27, 2019
As described in commit f81d5ef the zfs_vdev_elevator module
option is being removed.  Users who require this functionality
should update their systems to set the disk scheduler using a
udev rule.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664
Closes #9417
Closes #9609