Add support for autoexpand property #7629
Conversation
Force-pushed from 37581be to 5cca586.
@behlendorf I downloaded this patch and have been trying to verify it locally without any success yet. The …
Never mind! I hadn't reloaded properly and the new version of …
Rebased on master to drop unrelated changes from the patch stack, no other changes included.
I believe that something for non-zvol vdevs has been broken in this change. When I resize a scsi device with vmware, rescan the scsi bus, and run …
@shartse it sounds like the kernel code may be getting a stale cached version of the partition table. Can you check if reinstating the …
@behlendorf Sure - I'll test that now. And yeah, I agree that we should have some non-zvol tests. I don't know a lot about the scsi_debug devices - I noticed we could power them on and off in …
@shartse we should be able to, they're basically a tiny ramdisk layered below the scsi driver. You can use them to set up tests which might otherwise be difficult to orchestrate, like injecting IO or data errors and making sure they're handled properly all the way up the stack. Or, in our case, resizing an actual scsi device without needing a real scsi device.
@behlendorf Ok, that's good to know. I can start working on a set of … And adding back in the call to …
@shartse I was able to reproduce the issue you were seeing with scsi devices and have added a commit which resolves the issue in my testing on Ubuntu 18.04. I still need to verify it works as intended on older kernels, but it's there for your review.
Force-pushed from 79a648a to 03f5015.
Refreshed and ready for review. The updated PR addresses all known issues and includes @shartse's additional test case which uses a real (scsi_debug) device.
Rebased on master.
I'll keep looking and test this locally as well.
module/zfs/vdev_disk.c (Outdated)

```c
uint64_t available = sectors - used;
return (available << SECTOR_BITS);

/*
 * Returns the maximum expansion capacity of the block device, When the
```
Can you specify the units of the value returned?
Will do.
module/zfs/vdev_disk.c (Outdated)

```c
/*
 * Returns the maximum expansion capacity of the block device, When the
 * vdev has been created as a 'wholedisk' then expansion may be possible.
 * Before any expansion is performed the partition layout is verified to
```
The rest of this comment makes it seem like the layout verification and resizing happen in this function. Maybe specify which functions those things happen in?
Good idea, I'll clarify this.
module/zfs/vdev_disk.c (Outdated)

```c
if (wholedisk && bdev->bd_part != NULL && bdev != bdev->bd_contains) {
	available = i_size_read(bdev->bd_contains->bd_inode) -
	    ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
```
Can you add a comment (here or at the top of the function) explaining why we're doing this calculation?
Will do.
module/zfs/vdev_disk.c (Outdated)

```c
if (wholedisk && bdev->bd_part != NULL && bdev != bdev->bd_contains) {
	available = i_size_read(bdev->bd_contains->bd_inode) -
	    ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
	    PARTITION_END_ALIGNMENT) << SECTOR_BITS);
```
Are we sure this calculation matches the slice_size calculation in libzfs_pool.c `zpool_label_disk()`? The current code there aligns the slice size; this does not. The failure mode is that the pool can think it is bigger than slice 0 (aka part0).
My understanding when I worked on this was that `EFI_MIN_RESV_SIZE + NEW_START_BLOCK + PARTITION_END_ALIGNMENT` is a conservative estimate of the non-part0 space on the device. In `zpool_label_disk()` the size of part0 is effectively computed as `P2ALIGN(total_size - (EFI_MIN_RESV_SIZE + NEW_START_BLOCK), PARTITION_END_ALIGNMENT)`. The size of the padding caused by the alignment can be anything from 0 to `PARTITION_END_ALIGNMENT`. By assuming we were padded with the full `PARTITION_END_ALIGNMENT` bytes, we actually report a size for part0 that could be slightly smaller than is really available. However, I don't think this is a big issue because this is just the size reported to the user as expandsize, and once the pool is actually expanded it will use as much space as fits evenly into metaslabs.
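To make the arithmetic concrete, here is a minimal self-contained C sketch comparing the two computations. The constant values and the disk size are illustrative assumptions only; the real definitions live in the ZFS headers.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Align x down to a multiple of align (a power of two), as in the ZFS sources. */
#define P2ALIGN(x, align)	((x) & -(align))

int
main(void)
{
	/* Illustrative values only; the real constants live in the ZFS headers. */
	const uint64_t SECTOR = 512;			/* bytes per sector */
	const uint64_t NEW_START_BLOCK = 2048;		/* sectors */
	const uint64_t EFI_MIN_RESV_SIZE = 16384;	/* sectors */
	const uint64_t PARTITION_END_ALIGNMENT = 2048;	/* sectors */

	uint64_t total = 10000831348736ULL;		/* hypothetical disk size, bytes */

	/* zpool_label_disk(): the actual part0 size, aligned down. */
	uint64_t part0 = P2ALIGN(total / SECTOR -
	    (EFI_MIN_RESV_SIZE + NEW_START_BLOCK),
	    PARTITION_END_ALIGNMENT) * SECTOR;

	/* The conservative estimate discussed above, in bytes. */
	uint64_t estimate = total - ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
	    PARTITION_END_ALIGNMENT) * SECTOR);

	printf("part0 = %" PRIu64 ", estimate = %" PRIu64 ", slack = %" PRIu64 "\n",
	    part0, estimate, part0 - estimate);
	return (0);
}
```

Since `P2ALIGN` discards at most `PARTITION_END_ALIGNMENT - 1` sectors while the estimate subtracts the full `PARTITION_END_ALIGNMENT`, the estimate can only under-report, never over-report, which is the conservative behavior described above.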
Except when it isn't. Observe the following failure:

```
[root@host4602 relling]# zpool create xp /dev/sddj

Message from syslogd@host4602 at Jul 5 11:49:36 ...
 kernel:VERIFY3(io_offset + io_size <= bdev->bd_inode->i_size) failed (10000820355072 <= 10000747462656)
```

And the size of part0 is 10000747462656. I believe this is because we're picking the wrong size, though I haven't compiled a new version to test. Meanwhile we work around it by creating the pool with part0 instead of the whole disk.
Interesting - can you describe the circumstances of the failure a little more? Was `/dev/sddj` ever expanded? Is it still showing up with the latest version of this patch?

I'm not sure why this calculation would ever end up determining the size of the partition. My understanding is that `efi_use_whole_disk()` is where the partition size is updated after an expand, and that doesn't use what we computed for `asize`.
Quite simply, we cannot create a pool on a 10TB disk using the whole-disk method. Prior to the commit for expansion, we were able to create such a pool. It is unclear to me whether this is noticeable on a non-debug kernel.
@richardelling can you post the exact size of the device in bytes? I'd like to see if I can reproduce what you're seeing.
@richardelling I believe I see what's going on, and this commit should in fact resolve the issue accidentally introduced by the first expansion commit. It was possible `bdev_capacity()` could slightly over-report the partition size, leading to the ASSERT you're seeing in the `wholedisk` case. This is no longer possible since `bdev_capacity()` was split into `bdev_capacity()` and `bdev_max_capacity()`.
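For readers following the thread, here is a condensed sketch of the split described above. It is paraphrased from the discussion and the code excerpts earlier in this review, not the verbatim ZFS source; the EFI and alignment constants come from the ZFS headers.

```c
#include <linux/fs.h>		/* i_size_read() */
#include <linux/blkdev.h>	/* struct block_device */

/*
 * Condensed sketch, paraphrased from the discussion above; not the
 * verbatim ZFS source.  EFI_MIN_RESV_SIZE, NEW_START_BLOCK,
 * PARTITION_END_ALIGNMENT, and SECTOR_BITS come from the ZFS headers.
 */

/* Size of the device as currently opened (the partition, if any). */
static uint64_t
bdev_capacity(struct block_device *bdev)
{
	return (i_size_read(bdev->bd_inode));
}

/*
 * Maximum size the vdev could be expanded to.  For a wholedisk vdev
 * this is the capacity of the underlying disk less a conservative
 * estimate of the space consumed by the EFI label and alignment.
 */
static uint64_t
bdev_max_capacity(struct block_device *bdev, uint64_t wholedisk)
{
	uint64_t available;

	if (wholedisk && bdev->bd_part != NULL &&
	    bdev != bdev->bd_contains) {
		available = i_size_read(bdev->bd_contains->bd_inode) -
		    ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
		    PARTITION_END_ALIGNMENT) << SECTOR_BITS);
	} else {
		available = bdev_capacity(bdev);
	}

	return (available);
}
```

With this split, the current size and the expansion ceiling are reported separately, so the vdev can never believe it is larger than the partition it is actually using.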
Looks good!
@richardelling this updated version should address your concerns. If possible, it would be great if you could verify the wholedisk case on the system where you ran into problems.
Thanks @shartse I agree with your analysis. @behlendorf I'll schedule time on the system to test and report back.
Codecov Report
```
@@            Coverage Diff            @@
##           master    #7629    +/-   ##
=========================================
+ Coverage   78.08%   78.28%    +0.2%
=========================================
  Files         368      368
  Lines      111899   111952      +53
=========================================
+ Hits        87378    87644     +266
+ Misses      24521    24308     -213
```
Continue to review full report at Codecov.
While the autoexpand property may seem like a small feature, it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that, with a few modifications for Linux, it can be supported.

Auto-expand works as follows: when a block device is modified (re-sized, closed after being open r/w, etc.) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found, the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size.

Finally, if the property autoexpand=on, a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size, which allows the new capacity to be used.

In order to make all of the above possible the following changes were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests. These tests now create a pool which is layered on a loopback, scsi_debug, and file vdev. This allows for testing of a non-partitioned block device (loopback), a partitioned block device (scsi_debug), and a file which does not receive udev change events. This provides better test coverage, and by removing the layering on ZFS volumes the issues surrounding layering one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid. This allows for matching by guid rather than path, which is a more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality, which can be abandoned in favor of the kernel-provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being reopened. This is important to prevent errors from occurring for any configuration-related IOs which bypass the SCL_ZIO lock. The zpool_reopen_007_pos.ksh test case was added to verify IO errors are never observed when reopening. This is not expected to impact IO performance.

Additional fixes which aren't critical but were discovered and resolved in the course of developing this functionality:

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for ZFS volumes. This is as good as a unique physical path; while the volumes are no longer used in the test cases for other reasons, this improvement was included.

Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#120
Issue openzfs#2437
Issue openzfs#5771
Issue openzfs#7366
Issue openzfs#7582
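As a rough illustration of the reopen locking described in the commit message, here is a hedged sketch using the SPL rwlock API. The `vd_lock` field and the surrounding function shapes follow the description, not necessarily the exact ZFS source.

```c
/*
 * Hedged sketch of the reopen locking described above.  The vd_lock
 * field and the surrounding structure are modeled on the commit
 * message, not copied from the ZFS source.
 */
typedef struct vdev_disk {
	struct block_device	*vd_bdev;
	krwlock_t		vd_lock;	/* held across close/reopen */
} vdev_disk_t;

static int
vdev_disk_io(vdev_disk_t *vd)
{
	/* Config-related IOs which bypass SCL_ZIO take the lock as readers. */
	rw_enter(&vd->vd_lock, RW_READER);
	if (vd->vd_bdev == NULL) {
		rw_exit(&vd->vd_lock);
		return (SET_ERROR(ENXIO));
	}
	/* ... submit the bio against vd->vd_bdev ... */
	rw_exit(&vd->vd_lock);
	return (0);
}

static int
vdev_disk_reopen(vdev_disk_t *vd)
{
	/*
	 * Writer: the device is briefly closed so the partition table
	 * can be re-read, then reopened at its (possibly larger) size.
	 */
	rw_enter(&vd->vd_lock, RW_WRITER);
	/* ... close vd->vd_bdev, re-read the partition table, reopen ... */
	rw_exit(&vd->vd_lock);
	return (0);
}
```

This is why zpool_reopen_007_pos.ksh can assert that no IO errors are observed during a reopen: readers simply block for the short window while the writer holds the lock.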
@shartse tests complete and the results are: …
Great!
`zvol_disk_open()` waits for up to `zfs_vdev_open_timeout_ms` (1 second by default), e.g. if the block device does not exist. While in this loop, it calls `schedule_timeout()`. The problem is that `schedule_timeout()` may not actually cause the thread to go off-CPU. Per the "documentation" (comment in the source code):

```
 * The function behavior depends on the current task state:
 * %TASK_RUNNING - the scheduler is called, but the task does not sleep
 *   at all. That happens because sched_submit_work() does nothing for
 *   tasks in %TASK_RUNNING state.
```

In my experience, `schedule_timeout()` never sleeps from this code path. This is especially noticeable if `zfs_vdev_open_timeout_ms` has been increased from its default. This commit uses `msleep()` to actually sleep. Note that this is how it was before openzfs#7629.
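To illustrate the fix, here is a hedged kernel-style sketch of such a retry loop; `open_bdev()` is a hypothetical stand-in for the real open attempt, and the loop structure is illustrative rather than copied from the source.

```c
#include <linux/blkdev.h>	/* struct block_device */
#include <linux/delay.h>	/* msleep() */
#include <linux/err.h>		/* IS_ERR(), ERR_PTR() */

/* Hypothetical helper standing in for the real open attempt. */
extern struct block_device *open_bdev(const char *path);

static struct block_device *
open_with_timeout(const char *path, unsigned int timeout_ms)
{
	struct block_device *bdev = ERR_PTR(-ENXIO);
	unsigned int waited_ms = 0;

	while (IS_ERR(bdev) && waited_ms < timeout_ms) {
		bdev = open_bdev(path);
		if (IS_ERR(bdev)) {
			/*
			 * schedule_timeout() would return immediately
			 * here because the task is still TASK_RUNNING;
			 * msleep() sets the task state itself and
			 * reliably blocks for the interval.
			 */
			msleep(10);
			waited_ms += 10;
		}
	}
	return (bdev);
}
```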
Description
While the autoexpand property may seem like a small feature, it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that, with a few modifications for Linux, it can be supported.

Auto-expand works as follows: when a block device is modified (re-sized, closed after being open r/w, etc.) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found, the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size.

Finally, if the autoexpand property is on, a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size, which allows the new capacity to be used.
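Schematically, the ZED side of this flow might look like the following sketch. Only zfs_deliver_dle() is named in the description above, so its signature here and the surrounding helper are illustrative stand-ins, not the actual ZED source.

```c
#include <string.h>
#include <libudev.h>

/* Hypothetical signature; only the function name comes from the description. */
extern void zfs_deliver_dle(struct udev_device *dev);

/*
 * Hedged sketch of the ZED udev-event path described above; the helper
 * name and structure are illustrative.
 */
static void
zed_handle_change_event(struct udev_device *dev)
{
	const char *fstype =
	    udev_device_get_property_value(dev, "ID_FS_TYPE");

	/* Only disks/partitions blkid identified as pool members matter. */
	if (fstype == NULL || strcmp(fstype, "zfs_member") != 0)
		return;

	/*
	 * Hand the change event along; matching against imported pool
	 * vdevs by vdev_guid and the reopen/expand happen downstream.
	 */
	zfs_deliver_dle(dev);
}
```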
Motivation and Context
Issue #120
Issue #2437
Issue #5771
Issue #7366
Issue #7582
How Has This Been Tested?
Full local run of ZTS on CentOS 7. Overnight run of the `zpool_expand` ZTS test group in a loop. No failures observed. Ready for additional buildbot testing for other distributions.

Types of changes
Checklist:

All commit messages are properly formatted and contain `Signed-off-by`.