Add scratch object creation at the beginning of the reflow process #10

Closed

Conversation

fuporovvStack

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

vdev_raidz_expand_t *vre = zio->io_private;

if (zio->io_error == 0)
	vre->vre_scratch_devices++;
Owner

need some locking on this (or use atomic ops), since this can be called concurrently when each child completes its write.
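
A minimal sketch of what the atomic variant could look like; the callback name here is made up, and it assumes vre_scratch_devices is a uint64_t so that atomic_inc_64() applies:

/* Hypothetical scratch-write completion callback (sketch only). */
static void
raidz_scratch_child_write_done(zio_t *zio)
{
	vdev_raidz_expand_t *vre = zio->io_private;

	/*
	 * Child writes complete concurrently, so bump the counter with
	 * an atomic op instead of a bare increment.
	 */
	if (zio->io_error == 0)
		atomic_inc_64(&vre->vre_scratch_devices);
}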

Author

TODO: FIX IT

zio_nowait(zio_vdev_child_io(write_zio, NULL,
    raidvd->vdev_child[i],
    0,
    abd, VDEV_BOOT_SIZE,
Owner

Are we sure that the boot region can't be used with RAIDZ?

Can we add some checks that the boot region is big enough that we won't have overlapping writes? I think it just has to be at least nchildren << ashift.
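
A sketch of what such a check could look like, assuming it runs somewhere that raidvd (the expanded raidz top-level vdev) is in scope; the exact placement is an open question:

/*
 * Sketch: each child must be able to hold one sector per child of the
 * expanded vdev in its boot region, or the scratch writes would overlap.
 */
ASSERT3U(VDEV_BOOT_SIZE, >=,
    (uint64_t)raidvd->vdev_children << raidvd->vdev_ashift);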

Author

When I considered the boot region as the scratch object, I used the following calculations:
VDEV_BOOT_SIZE is 3.5M (7ULL << 19) per child. The maximum number of children is 255.
255 * (7ULL << 19) = 935854080 bytes => max scratch object size

nchildren << ashift in the case of 4k ashift:
255 << 12 = 1044480 bytes

So, if I am correct, it should be enough, excluding the bootloader if it is present, but I prefer not to account for it for now.
Later we will need to add logic which detects the boot loader and checks its size.
The simplest way would be to zero the boot region at pool creation and check the size of the non-zeroed data to account for it.

Also, you mentioned "Copy first new_ncols^2 sectors to scratch object".
Could you please explain why it is new_ncols^2 there but "nchildren << ashift" on the other side? I can't quite see how they line up.

Owner

Sure, I think the design doc talks about "Copy first new_ncols^2 sectors to scratch object". This is the same as each device having "nchildren << ashift" bytes copied to it. Since each device has nchildren sectors, and there are nchildren devices, we have nchildren * nchildren = nchildren^2 sectors copied total (across all devices).

The reason we need to copy this much data is so that after the scratch data, there is at least one whole stripe (aka one whole row, which is at most old_nchildren << ashift bytes) of not-yet-overwritten old data before the next old block to copy. We need this because we read this old data (from just before the "next old block to copy") when there's a stripe that crosses the "next block to copy" boundary, vre_offset_phys. This is implemented by the code in vdev_raidz_map_alloc_expanded() following this comment:

		 * If we are in the middle of a reflow, and any part of this
		 * row has not been copied, then use the old location of
		 * this row.
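
A toy restatement of that arithmetic, using hypothetical helpers that are not in the patch: new_ncols sectors on each of new_ncols children is new_ncols^2 sectors in total, which is nchildren << ashift bytes per device:

/* Bytes of scratch data copied to each child: nchildren sectors. */
static inline uint64_t
scratch_bytes_per_child(uint64_t new_ncols, uint64_t ashift)
{
	return (new_ncols << ashift);
}

/* Sectors copied across all children: nchildren * nchildren. */
static inline uint64_t
scratch_sectors_total(uint64_t new_ncols)
{
	return (new_ncols * new_ncols);
}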

@@ -3141,9 +3217,12 @@ raidz_reflow_impl(vdev_t *vd, vdev_raidz_expand_t *vre, range_tree_t *rt,
ZIO_FLAG_CANFAIL,
raidz_reflow_write_done, rra);

roffset = (blkid / old_children) << ashift;
if (vre->vre_scratch_devices != 0)
	roffset -= VDEV_BOOT_SIZE;
Owner

in this case the roffset has to be < VDEV_BOOT_SIZE, and this goes negative, right?

Author

Yes, but that is compensated by the following expression in zio_vdev_child_io():

if (vd->vdev_ops->vdev_op_leaf) {
	ASSERT0(vd->vdev_children);
	offset += VDEV_LABEL_START_SIZE;
}
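
A sketch of that offset math, relying on VDEV_LABEL_START_SIZE being defined as 2 * sizeof (vdev_label_t) + VDEV_BOOT_SIZE; the helper name is hypothetical:

/* Where a scratch child I/O lands on disk (sketch only). */
static inline uint64_t
scratch_child_io_physical_offset(uint64_t roffset)
{
	/* roffset < VDEV_BOOT_SIZE, so this wraps around (goes "negative"). */
	uint64_t child_offset = roffset - VDEV_BOOT_SIZE;

	/*
	 * zio_vdev_child_io() adds VDEV_LABEL_START_SIZE for leaf vdevs,
	 * which cancels the VDEV_BOOT_SIZE term; the result is
	 * roffset + 2 * sizeof (vdev_label_t), i.e. inside the boot
	 * region that sits just after the two front labels.
	 */
	return (child_offset + VDEV_LABEL_START_SIZE);
}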

Comment on lines 3042 to 3043
* Invalidate scratch object on first vre_offset_phys update.
* Enable first metaslab.
Owner

I think we can assert that progress has been made past the scratch size?

The code flow here of when we use the scratch object is a little hard for me to follow and verify that it's correct. I wonder if we could be more explicit about invalidating the scratch object once we make progress past its size.
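
Something like the following could make the invalidation point explicit; raidz_scratch_size is a hypothetical value holding the amount of reflow progress covered by the scratch copy, in the same units as vre_offset_phys:

/*
 * Sketch: by the time the scratch object is invalidated, the reflow
 * must have progressed past the region the scratch copy protects.
 */
ASSERT3U(vre->vre_offset_phys, >=, raidz_scratch_size);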

Author

I added logging to raidz_reflow_impl() and logged the cases where the device indexes and offsets are equal for reading and writing.
I found that it does not work for now, because we can get sectors overwritten after the first vre_offset_phys update.
It is possible to not invalidate the scratch object at all, at least until the next expansion, but we do need to enable the first metaslab.

@@ -3141,9 +3217,12 @@ raidz_reflow_impl(vdev_t *vd, vdev_raidz_expand_t *vre, range_tree_t *rt,
ZIO_FLAG_CANFAIL,
raidz_reflow_write_done, rra);

roffset = (blkid / old_children) << ashift;
if (vre->vre_scratch_devices != 0)
	roffset -= VDEV_BOOT_SIZE;
Owner

The problem that the scratch space is addressing is:

  • we just started an expansion
  • a single stripe has overwritten itself, such that it now has 2 sectors on the same disk
  • that disk dies
  • we now need to reconstruct this stripe, but we don't have enough of its sectors

To address this we need to ensure that when moving blocks, a stripe never overwrites itself, so that the stripe is always available intact at its old location.

So the place where we need to use the scratch copy is when doing a normal (or scrub/resilver) read. We don't need to use the scratch copy when doing the reflow read (although it shouldn't hurt).
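
A rough sketch of that decision, with a hypothetical helper; the real logic would live in the raidz read/map-construction path:

/*
 * Sketch: only normal/scrub/resilver reads that fall inside the
 * scratch-protected region need to be redirected to the scratch copy;
 * the reflow's own copy reads can keep using the old locations.
 */
static boolean_t
raidz_read_uses_scratch(vdev_raidz_expand_t *vre, uint64_t offset,
    uint64_t scratch_size, boolean_t is_reflow_read)
{
	return (!is_reflow_read && vre->vre_scratch_devices != 0 &&
	    offset < scratch_size);
}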

I don't think we can test this case currently, because we don't handle disk failure during reflow. That's another thing on the to-do list. We need to pause the reflow and wait for the disk to be replaced and resilvered.

However, I think we can test this by reading blocks that are in the beginning of the disk, while we are reflowing the beginning of the disk. Today this will simply read from the old location, see this code in vdev_raidz_map_alloc_expanded():

		/*
		 * If we are in the middle of a reflow, and any part of this
		 * row has not been copied, then use the old location of
		 * this row.
		 */
		int row_phys_cols = physical_cols;
		if (b + (logical_cols - nparity) > reflow_offset >> ashift)
			row_phys_cols--;

I think that if the expansion has overwritten a stripe with itself, the read will not get the right data (because the old location has been partially overwritten). We could test this by pausing the reflow in the very beginning, and then reading the block containing the partially-overwritten stripe. I think that @stuartmaybee is working on some code for pausing the reflow for testing purposes.

Note that in this test case, we actually still have all the data and we could write code to read this "split stripe" partially from the old location and partially from the new location. But that code doesn't exist, because it won't be needed once we solve the problem of handling resilvering while in the middle of reflow (by using the scratch copy).

Owner

Oh, the reflow-pausing code is already there: zfs_raidz_expand_max_offset_pause.

vdev_raidz_t *vdrz = (vdev_raidz_t *)vd->vdev_tsd;
if (vdrz->vd_physical_width - 1 ==
    vdrz->vn_vre.vre_scratch_devices)
	metaslab_disable(vd->vdev_ms[0]);
Owner

Can we assert that the scratch size is entirely contained in the first metaslab?
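
A sketch of such an assertion; raidz_scratch_size is again hypothetical, expressed in the same offset space as the metaslabs, while ms_start and ms_size are the existing metaslab_t fields:

/*
 * Sketch: the region protected by the scratch copy should be entirely
 * covered by the first metaslab, the one disabled above.
 */
metaslab_t *msp = vd->vdev_ms[0];
ASSERT3U(raidz_scratch_size, <=, msp->ms_start + msp->ms_size);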

Author

Yep, it needs to be checked.

@fuporovvStack
Author

OK, thanks a lot for your comments. It seems I need some more time to understand it.
I also need to weigh the pros and cons of storing the scratch object in the MOS as an alternative to the current implementation.

@ahrens
Owner

ahrens commented Sep 1, 2020

@fuporovvStack I think that using the boot space can work, and is probably simpler than storing the scratch data in the MOS. I think that there's enough space in the boot region and it isn't typically used for boot blocks with RAIDZ. We just need to be able to verify those assumptions.

- make scratch_devices variable increment atomic
- improve metaslab disabling logic
- invalidate scratch object when it offset exceeded instead of first reflow sync
@fuporovvStack
Author

Closing because this is no longer relevant.
See: openzfs@9310a69

ahrens pushed a commit that referenced this pull request Mar 9, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501