WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) #14879

tonyhutter · 2023-05-18T20:11:06Z

Motivation and Context

Description

We have recently been seeing a lot of zvol_misc test failures when blk-mq was enabled on F38 and Centos 9 (#14872). The failures look to be caused by kernel memory corruption.

This fix removes a slightly dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state".

How Has This Been Tested?

Ran zvol_misc with debug (asserts enabled) on Fedora 38 and reproduced the crashes in first 1-3 runs. Re-ran zvol_misc with the fix 10 times and didn't see the issue.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

tonyhutter · 2023-05-19T00:29:10Z

Darn it... looks like it is still failing on buildbot...

[15182.246076] VERIFY(zh->zh_claim_txg == 0) failed
[15182.249873] PANIC at zil.c:904:zil_create()
[15182.253289] Showing stack for process 146962
[15182.256925] CPU: 1 PID: 146962 Comm: zvol Tainted: P           OE      6.2.15-300.fc38.x86_64 #1
[15182.263694] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[15182.267893] Call Trace:
[15182.270656]  <TASK>
[15182.273315]  dump_stack_lvl+0x43/0x60
[15182.276535]  spl_panic+0xfc/0x120 [spl]
[15182.279832]  ? preempt_count_add+0x6a/0xa0
[15182.283178]  ? _raw_spin_lock+0x13/0x40
[15182.286431]  spl_assert+0x17/0x20 [zfs]
[15182.289996]  zil_create+0x428/0x590 [zfs]
[15182.293592]  zil_process_commit_list+0x343/0x530 [zfs]
[15182.297556]  ? _raw_spin_lock+0x13/0x40
[15182.300823]  zil_commit_writer+0x100/0x1b0 [zfs]
[15182.304617]  zil_commit_impl+0x5d/0xa0 [zfs]
[15182.308292]  zvol_write+0x39b/0x510 [zfs]
[15182.311882]  ? __schedule+0x3d4/0x13e0
[15182.315120]  zvol_write_task+0xe/0x20 [zfs]
[15182.318783]  taskq_thread+0x1fa/0x410 [spl]
[15182.322187]  ? __pfx_default_wake_function+0x10/0x10
[15182.325849]  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
[15182.329777]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[15182.333357]  kthread+0xe9/0x110
[15182.336365]  ? __pfx_kthread+0x10/0x10
[15182.339624]  ret_from_fork+0x2c/0x50
[15182.342819]  </TASK>

http://build.zfsonlinux.org/builders/Fedora%2038%20x86_64%20%28TEST%29/builds/241/steps/shell_4/logs/console

behlendorf

It's probably a good idea to drop this optimization either way since it seems like it's assuming a little too much.

tonyhutter · 2023-05-19T18:09:55Z

@behlendorf yea, there may actually be two issue afoot here. The crashes that were fixed by the commit, and the VERIFY(zh->zh_claim_txg == 0) problem. I'm going to have to do some debugging though buildbot since I can't reproduce the VERIFY() locally.

So don't pull this in yet.

We have recently been seeing a lot of zvol_misc test failures when blk-mq was enabled on F38 and Centos 9 (openzfs#14872). The failures look to be caused by kernel memory corruption. This fix removes a slightly dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state". Fixes: openzfs#14872 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Requires-builders: fedora38

This reverts commit e197bb2.

ryao · 2023-05-23T18:08:31Z

module/zfs/zil.c

@@ -1134,6 +1170,7 @@
 	if (zh->zh_claim_txg == 0 && !BP_IS_HOLE(&zh->zh_log)) {
 		(void) zil_parse(zilog, zil_claim_log_block,
 		    zil_claim_log_record, tx, first_txg, B_FALSE);
+		zil_log(zilog, "%s: setting %p to %d\n", __func__, zh, first_txg);


This should not have been dismissed. It is an actual bug that will cause us to print the upper 32-bits on big-endian machines. Also, high txg numbers (2^31 and up) will not be printed correctly on any machine.

We should use %llu and typecast to (u_longlong_t) here like we do elsewhere in the code.

@ryao I'm just doing some "debug via buildbot" testing right now. This is just instrumentation that will not be checked in. I'll mark this PR as a WIP for now.

Carry on then. :)

module/zfs/dmu_objset.c

+
+#ifdef _KERNEL
+       zil_log(os->os_zil, "%s: beginn %p txg %llu, txg os_zil_header %llu, os->os_phys->os_zil_header %p", __func__, os->os_zil->zl_header, os->os_zil->zl_header->zh_claim_txg,
+		os->os_phys->os_zil_header.zh_claim_txg, os->os_phys->os_zil_header);


tonyhutter · 2023-05-25T18:35:02Z

I have a little more information now.

The issue is that the zvol's zil header has zh->zh_claim_txg == 103 when it should be 0. This often happens in the zvol_misc_trim test, which does the following:

< setup zvol_misc_trim pool >
< export pool >
< import it... >
...
Psudocode for the import:

// Read in the objset for the zvol

dmu_objset_open_impl()
	err = arc_read(NULL, spa, os->os_rootbp,
            arc_getbuf_func, &os->os_phys_buf,
            ZIO_PRIORITY_SYNC_READ, zio_flags, &aflags, &zb);

	
	// os->os_phys_buf now contains the zvol metadata.
	// overlay our 'os_phys' metadata struct on top of the data
	os->os_phys = os->os_phys_buf->b_data;

	// The zh_claim_txg is 103 when we expected 0
	os->os_phys->os_zil_header.zh_claim_txg == 103

So it appears the 103 txg was written out to disk first, and we just read it back on import. I'm still digging..

Signed-off-by: Tony Hutter <hutter2@llnl.gov> Requires-builders: fedora38

tonyhutter · 2023-06-06T05:10:34Z

I'm not super familiar with this code, but I added in some debug to print the block pointers:

dmu_objset_open_impl()
	...
              zil_log_blkptr(os->os_rootbp);

              dprintf_bp(os->os_rootbp, "reading %s", "");
              err = arc_read(NULL, spa, os->os_rootbp,
                  arc_getbuf_func, &os->os_phys_buf,
                  ZIO_PRIORITY_SYNC_READ, zio_flags, &aflags, &zb);

My zil_log_blkptr() function prints the values for os->os_rootbp (below). In this particular test run, I saw "bad" block pointers with zh_claim_txg == 27 some of the time (note the "birth=27L/27P) and the good block pointers with zh_claim_txg == 0 some of the time ("birth=5L/5P"). I also noticed the bad block points had fill=2 while the good ones had fill=6:

[ 1101.098857] 6: 1100715961623 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178
[ 1101.124620] 10: 1100715961623 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.187188] 20: 1100716961635 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.229582] 27: 1100718961659 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.271953] 34: 1100720961683 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.314321] 41: 1100720961683 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178
[ 1101.358674] 48: 1100721961696 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.401093] 55: 1100722961708 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.533617] 62: 1100723961720 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.579798] 70: 1100741961937 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178
[ 1101.624386] 77: 1100776962361 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178

sempervictus · 2023-09-07T15:45:02Z

@tonyhutter - what is the state of this PR? It is both marked as WIP and approved while addressing what seems to be a critical concern (data corruption).

tonyhutter · 2023-09-07T16:10:36Z

@sempervictus long story short - this PR doesn't fix the problem as I had originally thought. The issue also seems to affect the non-blk-mq case as well. I'm not sure this is necessarily data corruption.

I started going down the rabbit hole investigating it in #14879 (comment) but it fell off my radar when some higher-priority stuff from work came in. I do see there's been a bunch of zil chances since I put out this PR, so I'd be curious to see if this is still an issue:

eda3fcd ZIL: Second attempt to reduce scope of zl_issuer_lock.
8e20e0f ZIL: Replay blocks without next block pointer.
b22bab2 Remove fastwrite mechanism.
2848de1 Remove zl_issuer_lock from zil_suspend().
2cb992a ZIL: Fix config lock deadlock.
233425a Again fix race between zil_commit() and zil_suspend().
a9d6b06 ZIL: Fix another use-after-free.
8e8acab Fix memory leak in zil_parse().
55b1842 ZIL: Fix race introduced by f63811f.
482da24 ZIL: Allow to replay blocks of any size.
b6fbe61 zil: Add some more statistics.
f63811f ZIL: Reduce scope of per-dataset zl_issuer_lock.

amotin · 2024-11-01T02:01:24Z

This seems outdated. This PR is reported to not fix the issue, while while it is already closed by other PR.

tonyhutter mentioned this pull request May 18, 2023

ZTS: zvol_misc_trim disable blk mq #14870

Merged

bwatkinson approved these changes May 18, 2023

View reviewed changes

ryao approved these changes May 18, 2023

View reviewed changes

tonyhutter mentioned this pull request May 19, 2023

ZTS: zvol_misc_trim VERIFY(zh->zh_claim_txg == 0) on F38 #14872

Closed

behlendorf approved these changes May 19, 2023

View reviewed changes

behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label May 19, 2023

tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 5fc370c to 964ed30 Compare May 19, 2023 18:03

tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 964ed30 to e5beb37 Compare May 19, 2023 18:11

tonyhutter force-pushed the blkmq-fix-zvol-misc branch from e5beb37 to 85f2e36 Compare May 19, 2023 22:26

Revert "Optionally skip zil_close during zvol_create_minor_impl"

a0df678

This reverts commit e197bb2.

tonyhutter force-pushed the blkmq-fix-zvol-misc branch 2 times, most recently from f911bbf to 6f432ec Compare May 23, 2023 00:33

github-advanced-security bot found potential problems May 23, 2023

View reviewed changes

tonyhutter changed the title ~~zvol: Fix zvol_misc crashes when using blk-mq~~ WIP: zvol: Fix zvol_misc crashes when using blk-mq May 23, 2023

tonyhutter force-pushed the blkmq-fix-zvol-misc branch 6 times, most recently from 1d21963 to 1a65401 Compare May 24, 2023 17:29

behlendorf added Status: Work in Progress Not yet ready for general review and removed Status: Accepted Ready to integrate (reviewed, tested) labels May 24, 2023

tonyhutter force-pushed the blkmq-fix-zvol-misc branch 5 times, most recently from 8b1f67d to 41866b7 Compare May 24, 2023 23:26

tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 41866b7 to 100ffed Compare May 24, 2023 23:52

github-advanced-security bot found potential problems May 25, 2023

View reviewed changes

tonyhutter changed the title ~~WIP: zvol: Fix zvol_misc crashes when using blk-mq~~ WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) May 25, 2023

tonyhutter force-pushed the blkmq-fix-zvol-misc branch 13 times, most recently from f0beef6 to 032c449 Compare May 26, 2023 22:51

TEST ONLY Force blk-mq to be unsettable for sanity

14f3491

Signed-off-by: Tony Hutter <hutter2@llnl.gov> Requires-builders: fedora38

tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 032c449 to 14f3491 Compare June 2, 2023 00:56

amotin closed this Nov 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) #14879

WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) #14879

tonyhutter commented May 18, 2023

tonyhutter commented May 19, 2023

behlendorf left a comment

tonyhutter commented May 19, 2023

ryao May 23, 2023 •

edited

Loading

tonyhutter May 23, 2023

ryao May 23, 2023

tonyhutter commented May 25, 2023

tonyhutter commented Jun 6, 2023

sempervictus commented Sep 7, 2023

tonyhutter commented Sep 7, 2023

amotin commented Nov 1, 2024

WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) #14879

WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) #14879

Conversation

tonyhutter commented May 18, 2023

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

tonyhutter commented May 19, 2023

behlendorf left a comment

Choose a reason for hiding this comment

tonyhutter commented May 19, 2023

ryao May 23, 2023 • edited Loading

Choose a reason for hiding this comment

tonyhutter May 23, 2023

Choose a reason for hiding this comment

ryao May 23, 2023

Choose a reason for hiding this comment

tonyhutter commented May 25, 2023

tonyhutter commented Jun 6, 2023

sempervictus commented Sep 7, 2023

tonyhutter commented Sep 7, 2023

amotin commented Nov 1, 2024

ryao May 23, 2023 •

edited

Loading