vdev expansion doesn't work #808

Closed · dechamps opened this issue Jul 6, 2012 · 8 comments
@dechamps (Contributor) commented Jul 6, 2012

Using the latest SPL/ZFS from master on Debian squeeze (6.0.5) with Linux 3.2.22, neither autoexpand=on, expand=on, nor zpool online -e works:

# cat /sys/block/sdb/size 
10485760
# zpool create -f -o autoexpand=on homez sdb
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
homez  4.97G   108K  4.97G     0%  1.00x  ONLINE  -

(...expanded sdb from 5G to 10G...)

# echo 1 > /sys/block/sdb/device/rescan
# cat /sys/block/sdb/size 
20971520
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
homez  4.97G  91.5K  4.97G     0%  1.00x  ONLINE  -
# zpool online -e homez sdb
: cannot relabel 'sdb1': unable to open device: 2
# cd /dev && zpool online -e homez sdb
: cannot relabel 'sdb1': unable to read disk capacity
root@linux-zfs-test:~# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
homez  4.97G   128K  4.97G     0%  1.00x  ONLINE  -
# zpool export homez
# zpool import -o expand=on homez
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
homez  4.97G   142K  4.97G     0%  1.00x  ONLINE  -

Note the serious WTF with zpool online -e behaving differently depending on whether I am running it from /dev or not.

A comment from 10 months ago seems to indicate that it worked at the time for at least one person, so I guess this is a regression?

@dechamps (Contributor, Author) commented Jul 6, 2012

Found it. There are blatant mistakes in zpool_vdev_online() and zpool_relabel_disk() for this code path. I'm fixing them; expect a pull request soon.

@dechamps (Contributor, Author) commented Jul 6, 2012

Work in progress: https://github.com/dechamps/zfs/compare/expand

Even with these fixes, I'm still getting errors:

# zpool online -e homez sdb
the kernel failed to rescan the partition table: 16
cannot expand sdb: cannot relabel '/dev/sdb': unable to read disk capacity

This means the BLKRRPART ioctl called by efi_write() returns EBUSY, which is probably because the disk device is already open.
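
For reference, the rescan boils down to roughly this (a minimal standalone sketch, not the libefi code itself); if any partition of the disk is held open elsewhere, the ioctl fails with EBUSY:

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>   /* BLKRRPART */

    /* Ask the kernel to re-read the partition table of a whole-disk device. */
    static int
    rescan_partitions(const char *disk)   /* e.g. "/dev/sdb" */
    {
            int fd, error = 0;

            if ((fd = open(disk, O_RDWR)) < 0)
                    return (errno);

            if (ioctl(fd, BLKRRPART) < 0)
                    error = errno;  /* EBUSY if any partition is still open */

            (void) close(fd);
            return (error);
    }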

Also, zpool import -o expand=on doesn't seem to use this code path (in fact, I suspect it doesn't even try to touch the partition table at all), which means there are probably other issues to deal with.

@pyavdr (Contributor) commented Jul 8, 2012

I looked into another EBUSY problem some months ago (see issue #250, issue #440, and also issue #4), but without success either. You may try this with files as vdevs; it seems to work then, at least for replacing a vdev with a spare via zpool replace. It may be related. It would be great if there were a solution for both cases.

@behlendorf (Contributor) commented

This might be caused by the fact that ZFS opens all of the block devices O_EXCL to prevent accidental concurrent access. Several of the other open issues mentioned by pyavdr may share the same root cause. A good solution for this needs to be found.

As for autoexpand not working: that's because we don't have a zeventd in place yet to handle the expand event and invoke zpool online -e from user space.

Still, your initial fixes look like a good start!

@dechamps (Contributor, Author) commented

> This might be caused by the fact that ZFS opens all of the block devices O_EXCL to prevent accidental concurrent access.

It's worse than that, actually: regardless of whether O_EXCL is used, as long as one of the disk's partitions is open, BLKRRPART will return EBUSY.

I just wrote a working fix which moves the BLKRRPART ioctl from user space into the module, in vdev_disk_open(). This way, we can issue the ioctl while the partition is closed. The code is non-trivial since it uses kernel APIs to get from the partition device to the whole-disk device, but it seems to work. I'll clean it up and publish the fix soon.
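
For the curious, the idea looks roughly like this (a sketch assuming 3.2-era kernel APIs such as blkdev_get_by_path(), get_gendisk(), bdget(), blkdev_get() and ioctl_by_bdev(); it illustrates the approach, it is not the actual patch):

    /*
     * Sketch only (kernel context: <linux/fs.h>, <linux/genhd.h>,
     * <linux/blkdev.h>): find the whole disk backing the partition at
     * "path", make sure the partition itself is closed, then ask the
     * kernel to re-read the partition table.  Error handling abbreviated.
     */
    static int
    vdev_disk_rrpart_sketch(const char *path, fmode_t mode, void *holder)
    {
            struct block_device *part_bdev, *disk_bdev;
            struct gendisk *disk;
            int partno, error;

            /* Open the partition just long enough to find its gendisk. */
            part_bdev = blkdev_get_by_path(path, mode, holder);
            if (IS_ERR(part_bdev))
                    return (PTR_ERR(part_bdev));

            disk = get_gendisk(part_bdev->bd_dev, &partno);
            blkdev_put(part_bdev, mode);    /* partition must not stay open */
            if (disk == NULL)
                    return (-ENXIO);

            /* Open the whole disk and re-read its partition table. */
            error = -ENXIO;
            disk_bdev = bdget(disk_devt(disk));
            if (disk_bdev != NULL) {
                    error = blkdev_get(disk_bdev, mode, holder);
                    if (error == 0) {
                            error = ioctl_by_bdev(disk_bdev, BLKRRPART, 0);
                            blkdev_put(disk_bdev, mode);
                    }
            }

            put_disk(disk);         /* drop the get_gendisk() reference */
            return (error);
    }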

It seems I'm hitting yet another issue, though: the algorithm in efi_use_whole_disk() doesn't work properly; it doesn't expand anything, and the main partition stays the same size. I'm investigating.

@dechamps (Contributor, Author) commented

Mmm... regarding zpool import -o expand=on, I'm beginning to wonder whether expand actually exists as a pool import option. I can't find the code for it, there's nothing in the man page, and then there's this:

# zpool import homez -o expand=on -o autoexpand=on
property 'autoexpand' specified multiple times

I really have no idea why it thinks autoexpand is the same thing as expand, or what code triggers this behavior. Anyway, grepping the whole code base doesn't yield anything, so I'm inclined to think expand is not, and never was, a pool import option. If someone has a man page or some code which suggests the contrary, I'd be glad to hear it.

In the meantime, I'm making good progress on zpool online -e.

@behlendorf (Contributor) commented

It looks as if zprop_name_to_prop() will use the column name as an alias for the property name when called in user space by the zpool command. See propname_match(). That's likely why both expand and autoexpand work. However, it's just an alias and not the official property name, so I wouldn't suggest people use it.
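
Paraphrasing, the matching logic amounts to something like the following (an illustrative sketch, not the exact zprop_common.c source); assuming autoexpand's zpool list column header is EXPAND, the lowercased column name matches the string expand:

    #include <ctype.h>
    #include <string.h>

    /*
     * Sketch: in user space, a property may be matched either by its real
     * name or by its lowercased column name (e.g. "autoexpand" vs. "EXPAND").
     */
    static int
    propname_match_sketch(const char *p, size_t len,
        const char *propname, const char *colname)
    {
            size_t c;

            if (len == strlen(propname) && strncmp(p, propname, len) == 0)
                    return (1);

            if (colname == NULL || len != strlen(colname))
                    return (0);

            for (c = 0; c < len; c++)
                    if (p[c] != tolower((unsigned char)colname[c]))
                            return (0);

            return (1);
    }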

As for getting zpool online -e working, that's great. I'm looking forward to seeing the patches. Maybe I can twist your arm into fixing zpool replace next (issue #250 and issue #440), now that you're familiar with this code. :)

@dechamps (Contributor, Author) commented

Here you go: #822

> Maybe I can twist your arm into fixing zpool replace next (issue #250 and issue #440), now that you're familiar with this code. :)

Sure, if I have some time, I'll look into that.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jul 12, 2012
The error handling code around zpool_relabel_disk() is either nonexistent
or wrong. The function call itself is not checked, and
zpool_relabel_disk() generates error messages from an uninitialized
buffer.

Before:

    # zpool online -e homez sdb; echo $?
    `: cannot relabel 'sdb1': unable to open device: 2
    0

After:

    # zpool online -e homez sdb; echo $?
    cannot expand sdb: cannot relabel 'sdb1': unable to open device: 2
    1

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#808
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jul 12, 2012
Currently, zpool_vdev_online() calls zpool_relabel_disk() with a short
partition device name, which is obviously wrong because (1)
zpool_relabel_disk() expects a full, absolute path to use with open()
and (2) efi_write() must be called on an opened disk device, not a
partition device.

With this patch, zpool_relabel_disk() gets called with a full disk
device path. The path is determined using the same algorithm as
zpool_find_vdev().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#808
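
(For illustration, the net effect is to turn a short partition name like sdb1 into the whole-disk path /dev/sdb before relabeling. A simplified standalone sketch follows; the helper name is made up, and the real code resolves the path by walking the pool's vdev tree the way zpool_find_vdev() does rather than by string manipulation.)

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Simplified sketch: map a short partition name such as "sdb1" to the
     * full whole-disk device path "/dev/sdb".  (Illustrative only; device
     * naming schemes such as NVMe or /dev/disk/by-id are not handled.)
     */
    static int
    whole_disk_path(const char *short_part, char *out, size_t outlen)
    {
            char name[256];
            size_t len;

            (void) strncpy(name, short_part, sizeof (name) - 1);
            name[sizeof (name) - 1] = '\0';

            /* Strip a trailing partition number, e.g. "sdb1" -> "sdb". */
            len = strlen(name);
            while (len > 0 && isdigit((unsigned char)name[len - 1]))
                    name[--len] = '\0';

            if (len == 0)
                    return (-1);

            (void) snprintf(out, outlen, "/dev/%s", name);
            return (0);
    }

For example, whole_disk_path("sdb1", buf, sizeof (buf)) yields "/dev/sdb", which can then be open()ed directly.
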
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jul 12, 2012
Commit e5dc681 changed EFI_NUMPAR from 9 to 128. This means that the
on-disk EFI label has efi_nparts = 128 instead of 9. The index of the
reserved partition, however, is still 8. This breaks
efi_use_whole_disk(), which uses efi_nparts-1 as the index of the
reserved partition.

This commit fixes efi_use_whole_disk() when the index of the reserved
partition is not efi_nparts-1. It rewrites the algorithm and makes it
more robust by using the order of the partitions instead of their
numbering. It assumes that the last non-empty partition is the reserved
partition, and that the non-empty partition before that is the data
partition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#808
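
(The partition-order approach described above boils down to something like this sketch; the struct fields follow libefi's dk_gpt/dk_part layout from <sys/efi_partition.h>, but the function itself is illustrative, not the actual efi_use_whole_disk() code.)

    /*
     * Sketch: walk the label from the end.  The last non-empty entry is
     * assumed to be the reserved partition and the previous non-empty
     * entry the data partition, regardless of how many slots (efi_nparts)
     * the label has or which indexes they happen to occupy.
     */
    static int
    find_resv_and_data(struct dk_gpt *label, int *resv, int *data)
    {
            int i;

            *resv = *data = -1;

            for (i = label->efi_nparts - 1; i >= 0; i--) {
                    if (label->efi_parts[i].p_size == 0)
                            continue;       /* empty slot */
                    if (*resv == -1)
                            *resv = i;      /* last non-empty: reserved */
                    else {
                            *data = i;      /* previous non-empty: data */
                            return (0);
                    }
            }
            return (-1);            /* label not in the expected shape */
    }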