Add scratch object creation at the beginning of the reflow process #10
Conversation
module/zfs/vdev_raidz.c
Outdated
vdev_raidz_expand_t *vre = zio->io_private;

if (zio->io_error == 0)
	vre->vre_scratch_devices++;
need some locking on this (or use atomic ops), since this can be called concurrently when each child completes its write.
TODO: FIX IT
zio_nowait(zio_vdev_child_io(write_zio, NULL,
    raidvd->vdev_child[i],
    0,
    abd, VDEV_BOOT_SIZE,
Are we sure that the boot region can't be used with RAIDZ?
Can we add some checks that the boot region is big enough that we won't have overlapping writes? I think it just has to be at least nchildren << ashift.
When I considered the boot sector as the scratch object, I used the following calculations:
VDEV_BOOT_SIZE is 3.5M (7ULL << 19) per child. The maximum number of children is 255.
255 * (7ULL << 19) = 935854080 bytes => max scratch object size
nchildren << ashift in the case of 4k ashift:
255 << 12 = 1044480 bytes
So, if I am correct, it should be enough. Of course, this excludes the bootloader if one is present, but I prefer not to account for it for now.
Later we will need to add logic that detects the bootloader and checks its size.
The simplest way would be to zero the boot sector at pool creation time and then check the size of the non-zeroed data to account for it.
Also, you mentioned "Copy first new_ncols^2 sectors to scratch object".
Could you please explain why it is new_ncols^2 in one place and "nchildren << ashift" in the other? I cannot get it clear.
Sure, I think the design doc talks about "Copy first new_ncols^2 sectors to scratch object". This is the same as each device having "nchildren << ashift" bytes copied to it. Since each device has nchildren sectors, and there are nchildren devices, we have nchildren * nchildren = nchildren^2 sectors copied total (across all devices).
The reason we need to copy this much data is so that after the scratch data, there is at least one whole stripe (aka one whole row, which is at most old_nchildren << ashift bytes) of not-yet-overwritten old data before the next old block to copy. We need this because we read this old data (from just before the "next old block to copy", vre_offset_phys) when there's a stripe that crosses the "next block to copy" boundary. This is implemented by the code in vdev_raidz_map_alloc_expanded() following this comment:
* If we are in the middle of a reflow, and any part of this
* row has not been copied, then use the old location of
* this row.
@@ -3141,9 +3217,12 @@ raidz_reflow_impl(vdev_t *vd, vdev_raidz_expand_t *vre, range_tree_t *rt,
	    ZIO_FLAG_CANFAIL,
	    raidz_reflow_write_done, rra);

	roffset = (blkid / old_children) << ashift;
	if (vre->vre_scratch_devices != 0)
		roffset -= VDEV_BOOT_SIZE;
In this case roffset has to be < VDEV_BOOT_SIZE, so this goes negative, right?
It is compensated by the following expression in zio_vdev_child_io():
if (vd->vdev_ops->vdev_op_leaf) {
ASSERT0(vd->vdev_children);
offset += VDEV_LABEL_START_SIZE;
}
module/zfs/vdev_raidz.c
Outdated
 * Invalidate scratch object on first vre_offset_phys update.
 * Enable first metaslab.
I think we can assert that progress has been made past the scratch size?
The code flow here of when we use the scratch object is a little hard for me to follow and verify that it's correct. I wonder if we could be more explicit about invalidating the scratch object once we make progress past its size.
I added logs to raidz_reflow_impl() and logged the cases where the device indexes and offsets are equal for reading and writing.
I found that it does not work right now, because sectors can get overwritten after the first vre_offset_phys update.
It is possible not to invalidate the scratch object at all, at least until the next expansion process, but we do need to enable the first metaslab.
@@ -3141,9 +3217,12 @@ raidz_reflow_impl(vdev_t *vd, vdev_raidz_expand_t *vre, range_tree_t *rt,
	    ZIO_FLAG_CANFAIL,
	    raidz_reflow_write_done, rra);

	roffset = (blkid / old_children) << ashift;
	if (vre->vre_scratch_devices != 0)
		roffset -= VDEV_BOOT_SIZE;
The problem that the scratch space is addressing is:
- we just started an expansion
- a single stripe has overwritten itself, such that it now has 2 sectors on the same disk
- that disk dies
- we now need to reconstruct this stripe, but we don't have enough of its sectors
To address this we need to ensure that when moving blocks, a stripe never overwrites itself, so that the stripe is always available intact at its old location.
So the place where we need to use the scratch copy is when doing a normal (or scrub/resilver) read. We don't need to use the scratch copy when doing the reflow read (although it shouldn't hurt).
I don't think we can test this case currently, because we don't handle disk failure during reflow. That's another thing on the to-do list. We need to pause the reflow and wait for the disk to be replaced and resilvered.
However, I think we can test this by reading blocks that are in the beginning of the disk, while we are reflowing the beginning of the disk. Today this will simply read from the old location; see this code in vdev_raidz_map_alloc_expanded():
/*
* If we are in the middle of a reflow, and any part of this
* row has not been copied, then use the old location of
* this row.
*/
int row_phys_cols = physical_cols;
if (b + (logical_cols - nparity) > reflow_offset >> ashift)
row_phys_cols--;
I think that if the expansion has overwritten a stripe with itself, the read will not get the right data (because the old location has been partially overwritten). We could test this by pausing the reflow in the very beginning, and then reading the block containing the partially-overwritten stripe. I think that @stuartmaybee is working on some code for pausing the reflow for testing purposes.
Note that in this test case, we actually still have all the data and we could write code to read this "split stripe" partially from the old location and partially from the new location. But that code doesn't exist, because it won't be needed once we solve the problem of handling resilvering while in the middle of reflow (by using the scratch copy).
Oh, the reflow-pausing code is already there, zfs_raidz_expand_max_offset_pause
module/zfs/vdev.c
Outdated
vdev_raidz_t *vdrz = (vdev_raidz_t *)vd->vdev_tsd;
if (vdrz->vd_physical_width - 1 ==
    vdrz->vn_vre.vre_scratch_devices)
	metaslab_disable(vd->vdev_ms[0]);
Can we assert that the scratch size is entirely contained in the first metaslab?
Yep, it needs to be checked.
Ok, thanks a lot for your comments. It seems like I need some more time to understand it.
@fuporovvStack I think that using the boot space can work, and is probably simpler than storing the scratch data in the MOS. I think that there's enough space in the boot region and it isn't typically used for boot blocks with RAIDZ. We just need to be able to verify those assumptions.
- make scratch_devices variable increment atomic
- improve metaslab disabling logic
- invalidate scratch object when its offset is exceeded, instead of on the first reflow sync
Closing because this is no longer relevant.
Under certain loads, the following panic is hit:

panic: page fault
KDB: stack backtrace:
#0 0xffffffff805db025 at kdb_backtrace+0x65
#1 0xffffffff8058e86f at vpanic+0x17f
#2 0xffffffff8058e6e3 at panic+0x43
#3 0xffffffff808adc15 at trap_fatal+0x385
#4 0xffffffff808adc6f at trap_pfault+0x4f
#5 0xffffffff80886da8 at calltrap+0x8
#6 0xffffffff80669186 at vgonel+0x186
#7 0xffffffff80669841 at vgone+0x31
#8 0xffffffff8065806d at vfs_hash_insert+0x26d
#9 0xffffffff81a39069 at sfs_vgetx+0x149
#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
#11 0xffffffff8065a28c at lookup+0x45c
#12 0xffffffff806594b9 at namei+0x259
#13 0xffffffff80676a33 at kern_statat+0xf3
#14 0xffffffff8067712f at sys_fstatat+0x2f
#15 0xffffffff808ae50c at amd64_syscall+0x10c
#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following panic:

panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
cpuid = 17
KDB: stack backtrace:
#0 0xffffffff805e29c5 at kdb_backtrace+0x65
#1 0xffffffff8059620f at vpanic+0x17f
#2 0xffffffff81a27f4a at spl_panic+0x3a
#3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
#4 0xffffffff8066fdee at vinactivef+0xde
#5 0xffffffff80670b8a at vgonel+0x1ea
#6 0xffffffff806711e1 at vgone+0x31
#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
#8 0xffffffff81a39069 at sfs_vgetx+0x149
#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
#10 0xffffffff80661c2c at lookup+0x45c
#11 0xffffffff80660e59 at namei+0x259
#12 0xffffffff8067e3d3 at kern_statat+0xf3
#13 0xffffffff8067eacf at sys_fstatat+0x2f
#14 0xffffffff808b5ecc at amd64_syscall+0x10c
#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501