
Linux 6.9: delegate/zfs_allow_010_pos hangs in zfs create -V #16089

Closed
robn opened this issue Apr 14, 2024 · 8 comments · Fixed by #16282
Labels: Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

robn (Member) commented Apr 14, 2024

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  12
Kernel Version        6.9-rc[37]
Architecture          x86_64
OpenZFS Version       zfs-2.2.99-475_g04bae5ec9

Describe the problem you're observing

Under 6.9-rcX, the test delegate/zfs_allow_010_pos hangs, and is eventually killed by the test runner. This does not appear to happen on earlier kernels (checked with 6.8.x, 6.1.x and 5.10.x).

The process list shows a zfs create -V process pegging a core:

$ ps -fo pid,user,time,pcpu,args -p 8620
    PID USER         TIME %CPU COMMAND
   8620 staff1   01:20:16  100 zfs create -V 150m testpool/testfs/nvol.create.staff1.14706

Inspecting the process shows it spinning in this loop at the bottom of zfs_ioc_create():

			/*
			 * Volumes will return EBUSY and cannot be destroyed
			 * until all asynchronous minor handling (e.g. from
			 * setting the volmode property) has completed. Wait for
			 * the spa_zvol_taskq to drain then retry.
			 */
			error2 = dsl_destroy_head(fsname);
			while ((error2 == EBUSY) && (type == DMU_OST_ZVOL)) {
				error2 = spa_open(fsname, &spa, FTAG);
				if (error2 == 0) {
					taskq_wait(spa->spa_zvol_taskq);
					spa_close(spa, FTAG);
				}
				error2 = dsl_destroy_head(fsname);
			}

Setting zfs_flags=512 (ZFS_DEBUG_SET_ERROR, which logs every SET_ERROR() call to the ZFS debug log) shows that EBUSY is consistently being returned from this check at the top of dsl_destroy_head_check_impl():

        /*
         * ds_longholds counts long-term holds on the dataset; any
         * unexpected outstanding hold makes this check fail with EBUSY.
         */
        if (zfs_refcount_count(&ds->ds_longholds) != expected_holds)
                return (SET_ERROR(EBUSY));

Typical trace is:

    dsl_destroy_head_check_impl+1
    dsl_destroy_head_check+83
    dsl_sync_task_common+346
    dsl_sync_task+22
    dsl_destroy_head+263
    zfs_ioc_create+517
    zfsdev_ioctl_common+640
    zfsdev_ioctl+79
    __x64_sys_ioctl+147
    do_syscall_64+134
    entry_SYSCALL_64_after_hwframe+113
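
(One quick way to grab a comparable kernel-side stack without extra tooling, assuming the kernel was built with CONFIG_STACKTRACE; a task spinning on-CPU may take a few tries to catch:)

$ sudo cat /proc/8620/stack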

My gut feeling is that it's something around mounts, but I've not had a chance to look properly.

Describe how to reproduce the problem

$ uname -a
Linux shop 6.9.0-rc3 #1 SMP PREEMPT_DYNAMIC Sun Apr 14 20:37:05 AEST 2024 x86_64 GNU/Linux
$ /usr/local/share/zfs/zfs-tests.sh -vKx -t tests/functional/delegate/zfs_allow_010_pos
...
robn added the Type: Defect label Apr 14, 2024
robn (Member, Author) commented May 8, 2024

Retested with 6.9-rc7 (likely final 6.9-rc), no change.

robn (Member, Author) commented May 12, 2024

I did look into this a bit last night, though ultimately didn't get anywhere. There's more change in block/bdev.c between 6.8 and 6.9 than I realised; it's not as simple as just replacing struct bdev_handle with a plain old struct file. It looks like a little more might be required of block devices than before, around exclusive opens and "claims", so zvol might need a bit more work. I haven't been able to pin it down yet, but clearly something somewhere is not dropping a hold on the dataset when it should.

Incidentally, I think my shim for vdev_blkdev_put() should probably be bdev_fput(), not plain fput(). But that's not the whole problem here.

If anyone knows this better than me and is in the mood, feel free to take a look; I'm not reserving this issue for myself, just aware that 6.9 will drop soon and it'd be nice if ZFS worked properly there. I'll keep noodling on this as time permits, probably a couple of hours each weekend :)
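
For reference, a minimal sketch of the swap I mean (the HAVE_BDEV_FPUT guard and the helper's exact shape are illustrative, not the actual compat code):

/*
 * Sketch only. On 6.9+ the bdev is held via a struct file;
 * bdev_fput() drops the exclusive claim ("holder") before the final
 * fput(), while a plain fput() defers the real release to task work
 * that runs after the syscall returns to userspace.
 * HAVE_BDEV_FPUT is a hypothetical configure guard.
 */
static inline void
vdev_blkdev_put(struct file *bdev_file)
{
#ifdef HAVE_BDEV_FPUT
	bdev_fput(bdev_file);
#else
	fput(bdev_file);
#endif
}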

robn (Member, Author) commented May 14, 2024

Confirmed in 6.9.0, not that there was any doubt.

darkbasic commented
I guess it would be advisable to avoid 6.9 until this gets fixed, right?

q66 (Contributor) commented Jun 1, 2024

6.8 is now EOL, so this is kind of unfortunate...

robn mentioned this issue Jun 6, 2024
tonyhutter (Contributor) commented Jun 10, 2024

I'm able to reproduce this on Ubuntu 24 using the prebuilt 6.9.3 kernel debs from https://kernel.ubuntu.com/mainline/v6.9.3/. I noticed this in dmesg when I run zfs_allow_010_pos.ksh:

[   39.256293] UBSAN: array-index-out-of-bounds in /home/hutter/zfs/module/zfs/zap_micro.c:473:34
[   39.256296] index 2 is out of range for type 'mzap_ent_phys_t [1]'
[   39.256298] CPU: 8 PID: 2594 Comm: zpool Tainted: P           OE      6.9.3-060903-generic #202405300957
[   39.256300] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+19570+14a90618 04/01/2014
[   39.256300] Call Trace:
[   39.256302]  <TASK>
[   39.256304]  dump_stack_lvl+0x76/0xa0
[   39.256310]  dump_stack+0x10/0x20
[   39.256311]  __ubsan_handle_out_of_bounds+0xcb/0x110
[   39.256314]  zap_lockdir_impl+0xb29/0xb40 [zfs]
[   39.256450]  zap_lockdir+0xc7/0x110 [zfs]
[   39.256549]  zap_lookup+0x58/0xd0 [zfs]
[   39.256644]  dsl_scan_init+0x156/0x610 [zfs]
[   39.256763]  ? _raw_spin_unlock+0xe/0x40
[   39.256772]  ? arc_cksum_compute.part.0+0x92/0x220 [zfs]
[   39.256882]  ? _raw_spin_unlock+0xe/0x40
[   39.256884]  ? dnode_rele_and_unlock+0x80/0x230 [zfs]
[   39.257003]  ? dnode_rele+0x48/0x90 [zfs]
[   39.257120]  ? zap_create_claim_norm_dnsize+0x14e/0x190 [zfs]
[   39.257231]  dsl_pool_create+0xcc/0x4a0 [zfs]
[   39.257352]  spa_create+0x8b6/0xe30 [zfs]
[   39.257471]  zfs_ioc_pool_create+0xaa/0x340 [zfs]
[   39.257579]  zfsdev_ioctl_common+0x82d/0xae0 [zfs]
[   39.257685]  ? __check_object_size.part.0+0x72/0x150
[   39.257688]  zfsdev_ioctl+0x57/0xf0 [zfs]
[   39.257791]  __x64_sys_ioctl+0xa0/0xf0
[   39.257793]  x64_sys_call+0x143b/0x25c0
[   39.257795]  do_syscall_64+0x7e/0x180
[   39.257797]  ? __memcg_slab_free_hook+0x115/0x180
[   39.257800]  ? fput+0xdb/0x130
[   39.257802]  ? kmem_cache_free+0x3dc/0x400
[   39.257803]  ? fput+0xdb/0x130
[   39.257805]  ? path_openat+0xd3/0x2c0
[   39.257806]  ? do_syscall_64+0x8b/0x180
[   39.257807]  ? do_filp_open+0xc0/0x170
[   39.257809]  ? putname+0x5b/0x80
[   39.257811]  ? do_sys_openat2+0x9f/0xe0
[   39.257813]  ? __x64_sys_openat+0x55/0xa0
[   39.257815]  ? syscall_exit_to_user_mode+0x81/0x270
[   39.257817]  ? do_syscall_64+0x8b/0x180
[   39.257818]  ? syscall_exit_to_user_mode+0x81/0x270
[   39.257820]  ? do_syscall_64+0x8b/0x180
[   39.257821]  ? switch_fpu_return+0x50/0xe0
[   39.257824]  ? syscall_exit_to_user_mode+0x81/0x270
[   39.257825]  ? do_syscall_64+0x8b/0x180
[   39.257827]  ? irqentry_exit_to_user_mode+0x76/0x270
[   39.257828]  ? irqentry_exit+0x43/0x50
[   39.257830]  ? clear_bhb_loop+0x15/0x70
[   39.257832]  ? clear_bhb_loop+0x15/0x70
[   39.257833]  ? clear_bhb_loop+0x15/0x70
[   39.257834]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   39.257835] RIP: 0033:0x79ccf8d24ded
[   39.257845] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[   39.257846] RSP: 002b:00007ffeecdaa2f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   39.257848] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000079ccf8d24ded
[   39.257848] RDX: 00007ffeecdaa3b0 RSI: 0000000000005a00 RDI: 0000000000000003
[   39.257849] RBP: 00007ffeecdaa340 R08: 0000000000000000 R09: 0000000000000000
[   39.257850] R10: 000079ccf901a7a8 R11: 0000000000000246 R12: 000060e82bceb2c0
[   39.257850] R13: 000060e82bcf5cb0 R14: 00007ffeecdaa3b0 R15: 00007ffeecdad9a0
[   39.257852]  </TASK>

I'm using master (20c8bdd)

tonyhutter (Contributor) commented

> UBSAN: array-index-out-of-bounds in /home/hutter/zfs/module/zfs/zap_micro.c:473:34
> index 2 is out of range for type 'mzap_ent_phys_t [1]'

Nevermind, this may just be UBSAN noise.
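
For context on why this is likely noise: the microzap's on-disk header ends in a one-element array that is really used as a variable-length array of chunks filling the rest of the block, a pre-C99 idiom that UBSAN flags. Paraphrased (not verbatim) from the layout in zap_impl.h:

typedef struct mzap_phys {
	uint64_t mz_block_type;	/* ZBT_MICRO */
	uint64_t mz_salt;
	uint64_t mz_normflags;
	uint64_t mz_pad[5];
	/*
	 * Declared [1] but indexed up to however many chunks fit in
	 * the block, so mz_chunk[2] is valid at runtime yet out of
	 * bounds for the declared type; exactly what UBSAN reports.
	 */
	mzap_ent_phys_t mz_chunk[1];
} mzap_phys_t;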

tonyhutter added a commit to tonyhutter/zfs that referenced this issue Jun 18, 2024
The 6.9 kernel behaves differently in how it releases block devices.  In
the common case it will async release the device only after the return to
userspace.  This is different from the 6.8 and older kernels which
release the block devices synchronously.  To get around this, call
add_disk() from a workqueue so that the kernel uses a different
codepath to release our zvols in the way we expect.  This stops
zfs_allow_010_pos from hanging.

Fixes: openzfs#16089
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
tonyhutter (Contributor) commented

Fix is here: #16282

TL;DR: The 6.9 kernel asynchronously releases block devices at return-to-userspace time, while 6.8 and older release them synchronously. We need synchronous release since we do create+destroy in ZFS_IOC_CREATE, and all references to the zvol need to be released before we do the "destroy" part of it. The workaround is to call add_disk() in a kernel thread, which changes the kernel's release codepath to do what we want.
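
Roughly, the shape of that workaround (a minimal sketch with illustrative names; see #16282 for the real diff):

/* Needs <linux/workqueue.h>, <linux/completion.h>, <linux/blkdev.h>. */
struct zvol_add_work {
	struct work_struct work;
	struct gendisk *disk;
	int error;
	struct completion done;
};

static void
zvol_add_disk_fn(struct work_struct *w)
{
	struct zvol_add_work *zw =
	    container_of(w, struct zvol_add_work, work);

	/* add_disk() now runs in a kworker, not the ioctl task. */
	zw->error = add_disk(zw->disk);
	complete(&zw->done);
}

static int
zvol_add_disk_deferred(struct gendisk *disk)
{
	struct zvol_add_work zw = { .disk = disk };

	INIT_WORK_ONSTACK(&zw.work, zvol_add_disk_fn);
	init_completion(&zw.done);
	schedule_work(&zw.work);
	wait_for_completion(&zw.done);
	destroy_work_on_stack(&zw.work);
	return (zw.error);
}

With add_disk() called from a worker thread, the kernel releases the zvol through the codepath we expect, so the destroy half of ZFS_IOC_CREATE no longer spins on EBUSY.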

tonyhutter added a commit to tonyhutter/zfs that referenced this issue Jun 18, 2024
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Jun 19, 2024
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Jun 25, 2024
calccrypto pushed a commit to hpc/zfs that referenced this issue Jul 3, 2024
robn pushed a commit to robn/zfs that referenced this issue Jul 17, 2024
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Sep 4, 2024