Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) #14879

Closed
wants to merge 3 commits into from

Conversation

tonyhutter
Copy link
Contributor

Motivation and Context

Fix #14872

Description

We have recently been seeing a lot of zvol_misc test failures when blk-mq was enabled on F38 and Centos 9 (#14872). The failures look to be caused by kernel memory corruption.

This fix removes a slightly dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state".

How Has This Been Tested?

Ran zvol_misc with debug (asserts enabled) on Fedora 38 and reproduced the crashes in first 1-3 runs. Re-ran zvol_misc with the fix 10 times and didn't see the issue.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tonyhutter
Copy link
Contributor Author

Darn it... looks like it is still failing on buildbot...

[15182.246076] VERIFY(zh->zh_claim_txg == 0) failed
[15182.249873] PANIC at zil.c:904:zil_create()
[15182.253289] Showing stack for process 146962
[15182.256925] CPU: 1 PID: 146962 Comm: zvol Tainted: P           OE      6.2.15-300.fc38.x86_64 #1
[15182.263694] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[15182.267893] Call Trace:
[15182.270656]  <TASK>
[15182.273315]  dump_stack_lvl+0x43/0x60
[15182.276535]  spl_panic+0xfc/0x120 [spl]
[15182.279832]  ? preempt_count_add+0x6a/0xa0
[15182.283178]  ? _raw_spin_lock+0x13/0x40
[15182.286431]  spl_assert+0x17/0x20 [zfs]
[15182.289996]  zil_create+0x428/0x590 [zfs]
[15182.293592]  zil_process_commit_list+0x343/0x530 [zfs]
[15182.297556]  ? _raw_spin_lock+0x13/0x40
[15182.300823]  zil_commit_writer+0x100/0x1b0 [zfs]
[15182.304617]  zil_commit_impl+0x5d/0xa0 [zfs]
[15182.308292]  zvol_write+0x39b/0x510 [zfs]
[15182.311882]  ? __schedule+0x3d4/0x13e0
[15182.315120]  zvol_write_task+0xe/0x20 [zfs]
[15182.318783]  taskq_thread+0x1fa/0x410 [spl]
[15182.322187]  ? __pfx_default_wake_function+0x10/0x10
[15182.325849]  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
[15182.329777]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[15182.333357]  kthread+0xe9/0x110
[15182.336365]  ? __pfx_kthread+0x10/0x10
[15182.339624]  ret_from_fork+0x2c/0x50
[15182.342819]  </TASK>

http://build.zfsonlinux.org/builders/Fedora%2038%20x86_64%20%28TEST%29/builds/241/steps/shell_4/logs/console

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably a good idea to drop this optimization either way since it seems like it's assuming a little too much.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label May 19, 2023
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 5fc370c to 964ed30 Compare May 19, 2023 18:03
@tonyhutter
Copy link
Contributor Author

@behlendorf yea, there may actually be two issue afoot here. The crashes that were fixed by the commit, and the VERIFY(zh->zh_claim_txg == 0) problem. I'm going to have to do some debugging though buildbot since I can't reproduce the VERIFY() locally.

So don't pull this in yet.

@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 964ed30 to e5beb37 Compare May 19, 2023 18:11
We have recently been seeing a lot of zvol_misc test failures when
blk-mq was enabled on F38 and Centos 9 (openzfs#14872).  The failures look
to be caused by kernel memory corruption.

This fix removes a slightly dubious optimization in
zfs_uiomove_bvec_rq() that saved the iterator contents of a
rq_for_each_segment().  This optimization allowed restoring the "saved
state" from a previous rq_for_each_segment() call on the same uio so
that you wouldn't need to iterate though each bvec on every
zfs_uiomove_bvec_rq() call.  However, if the kernel is manipulating
the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq()
calls, then it could result in corruption from using the "saved state".

Fixes: openzfs#14872
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Requires-builders: fedora38
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch from e5beb37 to 85f2e36 Compare May 19, 2023 22:26
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch 2 times, most recently from f911bbf to 6f432ec Compare May 23, 2023 00:33
@@ -1134,6 +1170,7 @@
if (zh->zh_claim_txg == 0 && !BP_IS_HOLE(&zh->zh_log)) {
(void) zil_parse(zilog, zil_claim_log_block,
zil_claim_log_record, tx, first_txg, B_FALSE);
zil_log(zilog, "%s: setting %p to %d\n", __func__, zh, first_txg);

Check failure

Code scanning / CodeQL

Wrong type of arguments to formatting function

This argument should be of type 'int' but is of type 'unsigned long long'. This argument should be of type 'int' but is of type 'unsigned long'.
Copy link
Contributor

@ryao ryao May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not have been dismissed. It is an actual bug that will cause us to print the upper 32-bits on big-endian machines. Also, high txg numbers (2^31 and up) will not be printed correctly on any machine.

We should use %llu and typecast to (u_longlong_t) here like we do elsewhere in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryao I'm just doing some "debug via buildbot" testing right now. This is just instrumentation that will not be checked in. I'll mark this PR as a WIP for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carry on then. :)

@tonyhutter tonyhutter changed the title zvol: Fix zvol_misc crashes when using blk-mq WIP: zvol: Fix zvol_misc crashes when using blk-mq May 23, 2023
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch 6 times, most recently from 1d21963 to 1a65401 Compare May 24, 2023 17:29
@behlendorf behlendorf added Status: Work in Progress Not yet ready for general review and removed Status: Accepted Ready to integrate (reviewed, tested) labels May 24, 2023
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch 5 times, most recently from 8b1f67d to 41866b7 Compare May 24, 2023 23:26
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 41866b7 to 100ffed Compare May 24, 2023 23:52

#ifdef _KERNEL
zil_log(os->os_zil, "%s: beginn %p txg %llu, txg os_zil_header %llu, os->os_phys->os_zil_header %p", __func__, os->os_zil->zl_header, os->os_zil->zl_header->zh_claim_txg,
os->os_phys->os_zil_header.zh_claim_txg, os->os_phys->os_zil_header);

Check failure

Code scanning / CodeQL

Wrong type of arguments to formatting function

This argument should be of type 'void *' but is of type 'zil_header'.
@tonyhutter
Copy link
Contributor Author

I have a little more information now.

The issue is that the zvol's zil header has zh->zh_claim_txg == 103 when it should be 0. This often happens in the zvol_misc_trim test, which does the following:

< setup zvol_misc_trim pool >
< export pool >
< import it... >
...
Psudocode for the import:

// Read in the objset for the zvol

dmu_objset_open_impl()
	err = arc_read(NULL, spa, os->os_rootbp,
            arc_getbuf_func, &os->os_phys_buf,
            ZIO_PRIORITY_SYNC_READ, zio_flags, &aflags, &zb);

	
	// os->os_phys_buf now contains the zvol metadata.
	// overlay our 'os_phys' metadata struct on top of the data
	os->os_phys = os->os_phys_buf->b_data;

	// The zh_claim_txg is 103 when we expected 0
	os->os_phys->os_zil_header.zh_claim_txg == 103

So it appears the 103 txg was written out to disk first, and we just read it back on import. I'm still digging..

@tonyhutter tonyhutter changed the title WIP: zvol: Fix zvol_misc crashes when using blk-mq WIP: zvol: Fix zvol_misc VERIFY(zh->zh_claim_txg == 0) May 25, 2023
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch 13 times, most recently from f0beef6 to 032c449 Compare May 26, 2023 22:51
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Requires-builders: fedora38
@tonyhutter tonyhutter force-pushed the blkmq-fix-zvol-misc branch from 032c449 to 14f3491 Compare June 2, 2023 00:56
@tonyhutter
Copy link
Contributor Author

I'm not super familiar with this code, but I added in some debug to print the block pointers:

dmu_objset_open_impl()
	...
              zil_log_blkptr(os->os_rootbp);

              dprintf_bp(os->os_rootbp, "reading %s", "");
              err = arc_read(NULL, spa, os->os_rootbp,
                  arc_getbuf_func, &os->os_phys_buf,
                  ZIO_PRIORITY_SYNC_READ, zio_flags, &aflags, &zb);

My zil_log_blkptr() function prints the values for os->os_rootbp (below). In this particular test run, I saw "bad" block pointers with zh_claim_txg == 27 some of the time (note the "birth=27L/27P) and the good block pointers with zh_claim_txg == 0 some of the time ("birth=5L/5P"). I also noticed the bad block points had fill=2 while the good ones had fill=6:

[ 1101.098857] 6: 1100715961623 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178
[ 1101.124620] 10: 1100715961623 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.187188] 20: 1100716961635 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.229582] 27: 1100718961659 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.271953] 34: 1100720961683 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.314321] 41: 1100720961683 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178
[ 1101.358674] 48: 1100721961696 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.401093] 55: 1100722961708 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.533617] 62: 1100723961720 DVA[0]=<0:20001c00:200> DVA[1]=<0:30001c00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=5L/5P fill=6 cksum=0000000e143d7d70:000005721de153dd:000116dfa095ccfa:0026591ebe7c0265
[ 1101.579798] 70: 1100741961937 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178
[ 1101.624386] 77: 1100776962361 DVA[0]=<0:90000000:200> DVA[1]=<0:a0000000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=27L/27P fill=2 cksum=0000000d102d2340:0000050e160e48be:000101a68036040b:0023298747d9b178

@sempervictus
Copy link
Contributor

@tonyhutter - what is the state of this PR? It is both marked as WIP and approved while addressing what seems to be a critical concern (data corruption).

@tonyhutter
Copy link
Contributor Author

@sempervictus long story short - this PR doesn't fix the problem as I had originally thought. The issue also seems to affect the non-blk-mq case as well. I'm not sure this is necessarily data corruption.

I started going down the rabbit hole investigating it in #14879 (comment) but it fell off my radar when some higher-priority stuff from work came in. I do see there's been a bunch of zil chances since I put out this PR, so I'd be curious to see if this is still an issue:

eda3fcd ZIL: Second attempt to reduce scope of zl_issuer_lock.
8e20e0f ZIL: Replay blocks without next block pointer.
b22bab2 Remove fastwrite mechanism.
2848de1 Remove zl_issuer_lock from zil_suspend().
2cb992a ZIL: Fix config lock deadlock.
233425a Again fix race between zil_commit() and zil_suspend().
a9d6b06 ZIL: Fix another use-after-free.
8e8acab Fix memory leak in zil_parse().
55b1842 ZIL: Fix race introduced by f63811f.
482da24 ZIL: Allow to replay blocks of any size.
b6fbe61 zil: Add some more statistics.
f63811f ZIL: Reduce scope of per-dataset zl_issuer_lock.

@amotin
Copy link
Member

amotin commented Nov 1, 2024

This seems outdated. This PR is reported to not fix the issue, while while it is already closed by other PR.

@amotin amotin closed this Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Work in Progress Not yet ready for general review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZTS: zvol_misc_trim VERIFY(zh->zh_claim_txg == 0) on F38
6 participants