Add support for autoexpand property #7629
Conversation
Force-pushed from 37581be to 5cca586.
@behlendorf I downloaded this patch and have been trying to verify it locally without any success yet. The …
Never mind! I hadn't reloaded properly and the new version of …
Rebased on master to drop unrelated changes from the patch stack, no other changes included.
I believe that something for non-zvol vdevs has been broken in this change. When I resize a scsi device with vmware, rescan the scsi bus, and run …
@shartse it sounds like the kernel code may be getting a stale cached version of the partition table. Can you check if reinstating the …
@behlendorf Sure - I'll test that now. And yeah, I agree that we should have some non-zvol tests. I don't know a lot about the scsi_debug devices - I noticed we could power them on and off in …
@shartse we should be able to, they're basically a tiny ramdisk layered below the scsi driver. You can use them to set up tests which might otherwise be difficult to orchestrate, like injecting IO or data errors and making sure they're handled properly all the way up the stack. Or, in our case, resizing an actual scsi device without needing a real scsi device.
@behlendorf Ok, that's good to know. I can start working on a set of … And adding back in the call to …
@shartse I was able to reproduce the issue you were seeing with scsi devices and have added a commit which resolves the issue in my testing on Ubuntu 18.04. I still need to verify it works as intended on older kernels, but it's there for your review.
Force-pushed from 79a648a to 03f5015.
Refreshed and ready for review. The updated PR addresses all known issues and includes @shartse's additional test case which uses a real (scsi_debug) device.
Rebased on master.
I'll keep looking and test this locally as well.
module/zfs/vdev_disk.c (Outdated)

```c
uint64_t available = sectors - used;
return (available << SECTOR_BITS);

/*
 * Returns the maximum expansion capacity of the block device, When the
```
Can you specify the units of the value returned?
Will do.
module/zfs/vdev_disk.c (Outdated)

```c
/*
 * Returns the maximum expansion capacity of the block device, When the
 * vdev has been created as a 'wholedisk' then expansion may be possible.
 * Before any expansion is performed the partition layout is verified to
```
The rest of this comment makes it seem like the layout verification and resizing happen in this function. Maybe specify which functions those things happen in?
Good idea, I'll clarify this.
module/zfs/vdev_disk.c (Outdated)

```c
if (wholedisk && bdev->bd_part != NULL && bdev != bdev->bd_contains) {
	available = i_size_read(bdev->bd_contains->bd_inode) -
	    ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
```
Can you add a comment (here or at the top of the function) explaining why we're doing this calculation?
Will do.
module/zfs/vdev_disk.c (Outdated)

```c
if (wholedisk && bdev->bd_part != NULL && bdev != bdev->bd_contains) {
	available = i_size_read(bdev->bd_contains->bd_inode) -
	    ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
	    PARTITION_END_ALIGNMENT) << SECTOR_BITS);
```
Are we sure this calculation matches the slice_size calculation in libzfs_pool.c `zpool_label_disk()`? The current code there aligns the slice size; this does not. The failure mode is that the pool can think it is bigger than slice 0 (aka part0).
My understanding when I worked on this was that `EFI_MIN_RESV_SIZE + NEW_START_BLOCK + PARTITION_END_ALIGNMENT` is a conservative estimate of the non-part0 space on the device. In `zpool_label_disk()` the size of part0 is effectively computed as `P2ALIGN(total_size - (EFI_MIN_RESV_SIZE + NEW_START_BLOCK), PARTITION_END_ALIGNMENT)`. The size of the padding caused by the alignment can be anything from 0 to `PARTITION_END_ALIGNMENT`. By assuming we were padded with the full `PARTITION_END_ALIGNMENT` bytes, we actually report a size for part0 that could be slightly smaller than is really available. However, I don't think this is a big issue because this is just the size reported to the user as expandsize, and once the pool is actually expanded it will use as much space as fits evenly into metaslabs.
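To make the arithmetic concrete, here is a minimal self-contained C sketch comparing the two computations. The constant values and the disk size are illustrative assumptions only; the real definitions live in the ZFS headers.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Align x down to a multiple of align (a power of two), as in the ZFS sources. */
#define P2ALIGN(x, align)	((x) & -(align))

int
main(void)
{
	/* Illustrative values only; the real constants live in the ZFS headers. */
	const uint64_t SECTOR = 512;			/* bytes per sector */
	const uint64_t NEW_START_BLOCK = 2048;		/* sectors */
	const uint64_t EFI_MIN_RESV_SIZE = 16384;	/* sectors */
	const uint64_t PARTITION_END_ALIGNMENT = 2048;	/* sectors */

	uint64_t total = 10000831348736ULL;		/* hypothetical disk size, bytes */

	/* zpool_label_disk(): the actual part0 size, aligned down. */
	uint64_t part0 = P2ALIGN(total / SECTOR -
	    (EFI_MIN_RESV_SIZE + NEW_START_BLOCK),
	    PARTITION_END_ALIGNMENT) * SECTOR;

	/* The conservative estimate discussed above, in bytes. */
	uint64_t estimate = total - ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
	    PARTITION_END_ALIGNMENT) * SECTOR);

	printf("part0 = %" PRIu64 ", estimate = %" PRIu64 ", slack = %" PRIu64 "\n",
	    part0, estimate, part0 - estimate);
	return (0);
}
```

Since `P2ALIGN` discards at most `PARTITION_END_ALIGNMENT - 1` sectors while the estimate subtracts the full `PARTITION_END_ALIGNMENT`, the estimate can only under-report, never over-report, which is the conservative behavior described above.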
Except when it isn't. Observe the following failure:

```
[root@host4602 relling]# zpool create xp /dev/sddj

Message from syslogd@host4602 at Jul 5 11:49:36 ...
 kernel:VERIFY3(io_offset + io_size <= bdev->bd_inode->i_size) failed (10000820355072 <= 10000747462656)
```

And the size of part0 is 10000747462656. I believe this is because we're picking the wrong size, though I haven't compiled a new version to test. Meanwhile we work around it by creating the pool with part0 instead of the whole disk.
Interesting - can you describe the circumstances of the failure a little more? Was `/dev/sddj` ever expanded? Is it still showing up with the latest version of this patch?

I'm not sure why this calculation would ever end up determining the size of the partition. My understanding is that `efi_use_whole_disk()` is where the partition size is updated after an expand, and that doesn't use what we computed for `asize`.
Quite simply, we cannot create a pool on a 10TB disk using the whole-disk method. Prior to the commit for expansion, we were able to create such a pool. It is unclear to me whether this is noticeable on a non-debug kernel.
@richardelling can you post the exact size of the device in bytes? I'd like to see if I can reproduce what you're seeing.
@richardelling I believe I see what's going on, and this commit should in fact resolve the issue accidentally introduced by the first expansion commit. It was possible `bdev_capacity()` could slightly over-report the partition size, leading to the ASSERT you're seeing in the `wholedisk` case. This is no longer possible since `bdev_capacity()` was split into `bdev_capacity()` and `bdev_max_capacity()`.
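For readers following the thread, here is a condensed sketch of the split described above. It is paraphrased from the discussion and the code excerpts earlier in this review, not the verbatim ZFS source; the EFI and alignment constants come from the ZFS headers.

```c
#include <linux/fs.h>		/* i_size_read() */
#include <linux/blkdev.h>	/* struct block_device */

/*
 * Condensed sketch, paraphrased from the discussion above; not the
 * verbatim ZFS source.  EFI_MIN_RESV_SIZE, NEW_START_BLOCK,
 * PARTITION_END_ALIGNMENT, and SECTOR_BITS come from the ZFS headers.
 */

/* Size of the device as currently opened (the partition, if any). */
static uint64_t
bdev_capacity(struct block_device *bdev)
{
	return (i_size_read(bdev->bd_inode));
}

/*
 * Maximum size the vdev could be expanded to.  For a wholedisk vdev
 * this is the capacity of the underlying disk less a conservative
 * estimate of the space consumed by the EFI label and alignment.
 */
static uint64_t
bdev_max_capacity(struct block_device *bdev, uint64_t wholedisk)
{
	uint64_t available;

	if (wholedisk && bdev->bd_part != NULL &&
	    bdev != bdev->bd_contains) {
		available = i_size_read(bdev->bd_contains->bd_inode) -
		    ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
		    PARTITION_END_ALIGNMENT) << SECTOR_BITS);
	} else {
		available = bdev_capacity(bdev);
	}

	return (available);
}
```

With this split, the current size and the expansion ceiling are reported separately, so the vdev can never believe it is larger than the partition it is actually using.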
Looks good!
@richardelling this updated version should address your concerns. If possible, it would be great if you could verify the wholedisk case on the system where you ran into problems.
Thanks @shartse I agree with your analysis. @behlendorf I'll schedule time on the system to test and report back.
Codecov Report
```
@@            Coverage Diff            @@
##           master    #7629    +/-   ##
=========================================
+ Coverage   78.08%   78.28%    +0.2%
=========================================
  Files         368      368
  Lines      111899   111952      +53
=========================================
+ Hits        87378    87644     +266
+ Misses      24521    24308     -213
```
Continue to review full report at Codecov.
While the autoexpand property may seem like a small feature, it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that, with a few modifications for Linux, it can be supported.

Auto-expand works as follows: when a block device is modified (re-sized, closed after being open r/w, etc.) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found, the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size.

Finally, if the property autoexpand=on, a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size, which allows the new capacity to be used.

In order to make all of the above possible the following changes were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests. These tests now create a pool which is layered on a loopback, scsi_debug, and file vdev. This allows for testing of a non-partitioned block device (loopback), a partitioned block device (scsi_debug), and a file which does not receive udev change events. This provides better test coverage, and by removing the layering on ZFS volumes the issues surrounding layering one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid. This allows for matching by guid rather than path, which is a more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality, which can be abandoned in favor of the kernel-provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being reopened. This is important to prevent errors from occurring for any configuration-related IOs which bypass the SCL_ZIO lock. The zpool_reopen_007_pos.ksh test case was added to verify IO errors are never observed when reopening. This is not expected to impact IO performance.

Additional fixes which aren't critical but were discovered and resolved in the course of developing this functionality:

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for ZFS volumes. This is as good as a unique physical path; while the volumes are no longer used in the test cases for other reasons, this improvement was included.

Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#120
Issue openzfs#2437
Issue openzfs#5771
Issue openzfs#7366
Issue openzfs#7582
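As a rough illustration of the reopen locking described in the commit message, here is a hedged sketch using the SPL rwlock API. The `vd_lock` field and the surrounding function shapes follow the description, not necessarily the exact ZFS source.

```c
/*
 * Hedged sketch of the reopen locking described above.  The vd_lock
 * field and the surrounding structure are modeled on the commit
 * message, not copied from the ZFS source.
 */
typedef struct vdev_disk {
	struct block_device	*vd_bdev;
	krwlock_t		vd_lock;	/* held across close/reopen */
} vdev_disk_t;

static int
vdev_disk_io(vdev_disk_t *vd)
{
	/* Config-related IOs which bypass SCL_ZIO take the lock as readers. */
	rw_enter(&vd->vd_lock, RW_READER);
	if (vd->vd_bdev == NULL) {
		rw_exit(&vd->vd_lock);
		return (SET_ERROR(ENXIO));
	}
	/* ... submit the bio against vd->vd_bdev ... */
	rw_exit(&vd->vd_lock);
	return (0);
}

static int
vdev_disk_reopen(vdev_disk_t *vd)
{
	/*
	 * Writer: the device is briefly closed so the partition table
	 * can be re-read, then reopened at its (possibly larger) size.
	 */
	rw_enter(&vd->vd_lock, RW_WRITER);
	/* ... close vd->vd_bdev, re-read the partition table, reopen ... */
	rw_exit(&vd->vd_lock);
	return (0);
}
```

This is why zpool_reopen_007_pos.ksh can assert that no IO errors are observed during a reopen: readers simply block for the short window while the writer holds the lock.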
@shartse tests complete and the results are: …
Great!
`zvol_disk_open()` waits for up to `zfs_vdev_open_timeout_ms` (1 second by default), e.g. if the block device does not exist. While in this loop, it calls `schedule_timeout()`. The problem is that `schedule_timeout()` may not actually cause the thread to go off-CPU. Per the "documentation" (comment in the source code):

```
 * The function behavior depends on the current task state:
 * %TASK_RUNNING - the scheduler is called, but the task does not sleep
 *   at all. That happens because sched_submit_work() does nothing for
 *   tasks in %TASK_RUNNING state.
```

In my experience, `schedule_timeout()` never sleeps from this code path. This is especially noticeable if `zfs_vdev_open_timeout_ms` has been increased from its default. This commit uses `msleep()` to actually sleep. Note that this is how it was before openzfs#7629.
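To illustrate the fix, here is a hedged kernel-style sketch of such a retry loop; `open_bdev()` is a hypothetical stand-in for the real open attempt, and the loop structure is illustrative rather than copied from the source.

```c
#include <linux/blkdev.h>	/* struct block_device */
#include <linux/delay.h>	/* msleep() */
#include <linux/err.h>		/* IS_ERR(), ERR_PTR() */

/* Hypothetical helper standing in for the real open attempt. */
extern struct block_device *open_bdev(const char *path);

static struct block_device *
open_with_timeout(const char *path, unsigned int timeout_ms)
{
	struct block_device *bdev = ERR_PTR(-ENXIO);
	unsigned int waited_ms = 0;

	while (IS_ERR(bdev) && waited_ms < timeout_ms) {
		bdev = open_bdev(path);
		if (IS_ERR(bdev)) {
			/*
			 * schedule_timeout() would return immediately
			 * here because the task is still TASK_RUNNING;
			 * msleep() sets the task state itself and
			 * reliably blocks for the interval.
			 */
			msleep(10);
			waited_ms += 10;
		}
	}
	return (bdev);
}
```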
Description
While the autoexpand property may seem like a small feature, it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that, with a few modifications for Linux, it can be supported.

Auto-expand works as follows: when a block device is modified (re-sized, closed after being open r/w, etc.) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found, the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size.

Finally, if the autoexpand property is on, a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size, which allows the new capacity to be used.
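Schematically, the ZED side of this flow might look like the following sketch. Only zfs_deliver_dle() is named in the description above, so its signature here and the surrounding helper are illustrative stand-ins, not the actual ZED source.

```c
#include <string.h>
#include <libudev.h>

/* Hypothetical signature; only the function name comes from the description. */
extern void zfs_deliver_dle(struct udev_device *dev);

/*
 * Hedged sketch of the ZED udev-event path described above; the helper
 * name and structure are illustrative.
 */
static void
zed_handle_change_event(struct udev_device *dev)
{
	const char *fstype =
	    udev_device_get_property_value(dev, "ID_FS_TYPE");

	/* Only disks/partitions blkid identified as pool members matter. */
	if (fstype == NULL || strcmp(fstype, "zfs_member") != 0)
		return;

	/*
	 * Hand the change event along; matching against imported pool
	 * vdevs by vdev_guid and the reopen/expand happen downstream.
	 */
	zfs_deliver_dle(dev);
}
```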
Motivation and Context
Issue #120
Issue #2437
Issue #5771
Issue #7366
Issue #7582
How Has This Been Tested?
Full local run of ZTS on CentOS 7. Overnight run of the `zpool_expand` ZTS test group in a loop. No failures observed. Ready for additional buildbot testing for other distributions.

Types of changes
Checklist:

All commit messages are properly formatted and contain `Signed-off-by`.