vdev expansion doesn't work #808
Found it. There are blatant mistakes in [...]
Work in progress: https://github.com/dechamps/zfs/compare/expand

Even with these fixes, I'm still getting errors: [...]
This means the [...]. Also, [...]
I looked for another EBUSY problem (read through issue #250, issue #440, and also issue #4) some months ago, but without success either. You may try this with files as vdevs; it seems to work then, at least for replacing a vdev with a spare via `zpool replace`. It may be related. It would be great if there were a solution for both cases.
This might be caused by the fact that ZFS opens all of the block devices `O_EXCL` to prevent accidental concurrent access. Several other open issues, as mentioned by pyavdr, may share the same root cause. A good solution for this needs to be found. As for autoexpand not working: that's because we don't have a zeventd in place yet to handle the expand event and invoke [...]. Still, your initial fixes look like a good start!
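For illustration, a minimal sketch of the kind of exclusive open described above, assuming a direct open(2) call; this is not the actual ZFS code path:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/*
 * Illustrative sketch only, not the actual ZFS code: opening a block
 * device with O_EXCL makes the kernel refuse the open with EBUSY if
 * anyone else already holds an exclusive reference (ZFS itself
 * included), which is one plausible source of the EBUSY errors
 * discussed in this thread.
 */
int
open_disk_exclusive(const char *path)
{
	int fd = open(path, O_RDWR | O_EXCL);

	if (fd < 0 && errno == EBUSY)
		(void) fprintf(stderr, "%s: device is busy "
		    "(exclusively held)\n", path);

	return (fd);
}
```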
It's worse than that, actually: no matter if [...]. I just wrote a working fix which moves the [...]. It seems I'm hitting yet another issue though: the algorithm in [...]
Mmm... regarding [...]: I really have no idea why it thinks [...]. In the meantime, I'm making good progress on [...]
It looks as if zprop_name_to_prop() will use the column name as an alias for the property name when called in user space by the zpool command. See propname_match(). That's likely why both `autoexpand` and `expand` work. As for getting [...]
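A minimal sketch of that aliasing behavior; the type and function names here are hypothetical, not the actual libzfs code:

```c
#include <string.h>

/*
 * Hypothetical sketch of the matching described above: in user space,
 * a property lookup matches either the property's canonical name or
 * its column (header) name, so "autoexpand" and its column alias
 * "expand" both resolve to the same property.
 */
typedef struct prop_desc {
	const char *pd_name;	/* canonical property name */
	const char *pd_colname;	/* column header, usable as an alias */
} prop_desc_t;

static int
propname_matches(const prop_desc_t *pd, const char *name, int userspace)
{
	if (strcmp(name, pd->pd_name) == 0)
		return (1);
	/* Only user-space callers accept the column name as an alias. */
	if (userspace && pd->pd_colname != NULL &&
	    strcmp(name, pd->pd_colname) == 0)
		return (1);
	return (0);
}
```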
The error handling code around zpool_relabel_disk() is either nonexistent or wrong: the function call itself is not checked, and zpool_relabel_disk() generates error messages from an uninitialized buffer.

Before:

```
# zpool online -e homez sdb; echo $?
cannot relabel 'sdb1': unable to open device: 2
0
```

After:

```
# zpool online -e homez sdb; echo $?
cannot expand sdb: cannot relabel 'sdb1': unable to open device: 2
1
```

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#808
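The shape of the fix is roughly the following; a sketch under the assumption of a simplified zpool_relabel_disk() signature, not the literal patch:

```c
#include <stdio.h>
#include <libzfs.h>

/*
 * Sketch of the fix described above: check the result of
 * zpool_relabel_disk() and only build the error message once a
 * failure has actually happened, instead of printing an
 * uninitialized buffer and exiting 0. The signature of
 * zpool_relabel_disk() is simplified here for illustration.
 */
static int
expand_disk(libzfs_handle_t *hdl, const char *disk_path, const char *vdev)
{
	char errbuf[1024];

	if (zpool_relabel_disk(hdl, disk_path) != 0) {
		(void) snprintf(errbuf, sizeof (errbuf),
		    "cannot expand %s: relabel of '%s' failed",
		    vdev, disk_path);
		(void) fprintf(stderr, "%s\n", errbuf);
		return (1);	/* make the failure visible to the caller */
	}
	return (0);
}
```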
Currently, zpool_vdev_online() calls zpool_relabel_disk() with a short partition device name, which is obviously wrong because (1) zpool_relabel_disk() expects a full, absolute path to use with open() and (2) efi_write() must be called on an opened disk device, not a partition device. With this patch, zpool_relabel_disk() gets called with a full disk device path. The path is determined using the same algorithm as zpool_find_vdev().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#808
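A sketch of the idea, with the bare "/dev" prefix as an assumption for the example (the real code resolves paths the same way zpool_find_vdev() does):

```c
#include <stdio.h>

/*
 * Illustrative sketch: turn a short vdev name such as "sdb" into the
 * absolute whole-disk path that zpool_relabel_disk() needs for
 * open() and efi_write().
 */
static void
disk_path_from_name(const char *name, char *path, size_t len)
{
	if (name[0] == '/')
		(void) snprintf(path, len, "%s", name);	/* already absolute */
	else
		(void) snprintf(path, len, "/dev/%s", name);
}
```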
Commit e5dc681 changed EFI_NUMPAR from 9 to 128. This means that the on-disk EFI label has efi_nparts = 128 instead of 9. The index of the reserved partition, however, is still 8. This breaks efi_use_whole_disk(), which uses efi_nparts - 1 as the index of the reserved partition.

This commit fixes efi_use_whole_disk() for the case where the index of the reserved partition is not efi_nparts - 1. It rewrites the algorithm and makes it more robust by using the order of the partitions instead of their numbering: it assumes that the last non-empty partition is the reserved partition, and that the non-empty partition before that is the data partition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#808
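The order-based scan the commit describes looks roughly like this; a simplified sketch, not the libefi code (which works on dk_gpt/dk_part structures):

```c
/*
 * Walk the partition table and treat the last non-empty slot as the
 * reserved partition and the non-empty slot before it as the data
 * partition, instead of assuming the reserved partition sits at
 * index efi_nparts - 1.
 */
typedef struct part {
	unsigned long long p_size;	/* 0 means the slot is empty */
} part_t;

static int
find_data_and_reserved(const part_t *parts, int nparts,
    int *datap, int *resvp)
{
	int i, last = -1, prev = -1;

	for (i = 0; i < nparts; i++) {
		if (parts[i].p_size != 0) {
			prev = last;
			last = i;
		}
	}
	if (last < 0 || prev < 0)
		return (-1);	/* need at least a data + reserved pair */

	*datap = prev;	/* data partition precedes the reserved one */
	*resvp = last;
	return (0);
}
```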
When the zettacache does an index merge, it also adds the new and changed entries to the index entry cache. In certain configurations, manipulating the index entry cache can be very time consuming and can have a big impact on the performance of concurrent zettacache activities. This is especially noticeable when the zettacache doesn't have a lot of data in it, e.g. when initially filling up. Additionally, the current data structure used for the index entry cache is not very memory efficient; a lot of memory is used by its internal overheads.

This commit changes the data structure used by the index entry cache to a sharded, 16-way associative, roughly-LRU cache. Each entry can be stored in any of 16 "slots", which are searched when doing a lookup. When inserting and all 16 slots are full, the slot whose IndexValue has the oldest Atime is evicted. Each shard of the index is locked separately, allowing concurrent access to the overall entry cache.

This improves performance in several ways:

- The index entry cache can be updated concurrently with lookups, so zettacache lookup/insert performance is not impacted as much by merging. On a workload of random reads causing inserts to the zettacache via sibling block ingestion, without this commit a merge causes insertion performance to drop to ~45% (420,000 -> 190,000 inserts/sec).
- The time to update the index entry cache is reduced, so the overall time to do a merge is reduced. The time to perform a merge when the index size is small is reduced to 20% (a 5x improvement, 93 -> 19 seconds).
- The number of entries that can be cached in the given RAM budget is roughly tripled. The new memory usage per entry is 37% of the previous (65 -> 24 bytes per entry; the IndexEntry size is 23 bytes).
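As a rough illustration of that layout, here is a toy, single-shard version in C (the zettacache itself is Rust, and these structures are invented for the example):

```c
#include <stdint.h>

#define	WAYS	16

typedef struct slot {
	uint64_t s_key;
	uint64_t s_value;
	uint64_t s_atime;	/* 0 means the slot is empty */
} slot_t;

typedef struct cache_set {
	slot_t cs_slots[WAYS];
} cache_set_t;

/*
 * Insert into a 16-way set: use the first empty slot, otherwise evict
 * the slot with the oldest access time (rough LRU), as described in
 * the commit message above. Sharding and per-shard locking are
 * omitted for brevity; this is a toy model, not the zettacache code.
 */
static void
set_insert(cache_set_t *set, uint64_t key, uint64_t value, uint64_t now)
{
	slot_t *victim = &set->cs_slots[0];
	int i;

	for (i = 0; i < WAYS; i++) {
		slot_t *s = &set->cs_slots[i];

		if (s->s_atime == 0) {		/* empty slot: take it */
			victim = s;
			break;
		}
		if (s->s_atime < victim->s_atime)
			victim = s;		/* oldest atime so far */
	}
	victim->s_key = key;
	victim->s_value = value;
	victim->s_atime = now;
}
```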
…ndex run (openzfs#845)

The zettacache index cache is updated as part of merging the PendingChanges into the on-disk index. The merge task sends the updates to the checkpoint task as part of a `MergeProgress` message. The index cache updates are then made from a spawned blocking (CPU-bound) task. The updates are completed (waited for) before the next checkpoint completes.

During the merge, it's expected that lookups can see IndexEntry's from the old index, either from reading the old index itself or from the index entry cache. These stale entries are "corrected" by either `PendingChanges::update()`'s call to `Remap::remap()`, or `MergeState::entry_disposition()`'s check of `PendingChanges::freeing()`. When the `MergeMessage::Complete` is received, it calls `Locked::rotate_index()`, which deletes the old on-disk index, and calls `PendingChanges::set_remap(None)` and `Locked::merge.take()`. This ends the stale entry "corrections" mentioned above, which are no longer necessary because we can no longer see stale entries from the old on-disk index.

The problem occurs when the `MergeMessage::Complete` is received and processed before the spawned blocking task completes. In that case, we end the stale entry "corrections", but we can still see stale entries from the index cache. This PR addresses the problem by waiting for the index cache updates to complete before processing the `MergeMessage::Complete`.

The problem was introduced by openzfs#808.
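The ordering fix amounts to a drain-before-finalize pattern; a hypothetical C/pthreads sketch of that pattern (the actual code is Rust/tokio, and these names are invented):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int updates_in_flight;

/* Called by each cache-update task as it finishes. */
void
cache_update_done(void)
{
	pthread_mutex_lock(&lock);
	if (--updates_in_flight == 0)
		pthread_cond_broadcast(&drained);
	pthread_mutex_unlock(&lock);
}

/*
 * Before acting on the merge-complete message, wait until all
 * in-flight index cache updates have drained, so no stale entries
 * can be served after the old index's corrections are torn down.
 */
void
handle_merge_complete(void)
{
	pthread_mutex_lock(&lock);
	while (updates_in_flight > 0)
		pthread_cond_wait(&drained, &lock);
	pthread_mutex_unlock(&lock);

	/* ...now it is safe to rotate the index and drop remaps... */
}
```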
Using latest SPL/ZFS from master on Debian squeeze (6.0.5) with Linux 3.2.22, neither `autoexpand=on`/`expand=on` nor `zpool online -e` works:

(...expanded sdb from 5G to 10G...)

Note the serious WTF with `zpool online -e` behaving differently depending on whether I am running it from `/dev` or not. A comment from 10 months ago seems to indicate that it worked at the time for at least one person, so I guess this is a regression?