[WIP] Add pool prop partition to disable auto partition #6277
Conversation
It would be great to optionally provide this functionality. I know I've wanted it for a long time.
In addition to legacy and raw as values for the partition property, it would be useful if it took a size for the primary partition. For example, setting partition=100G would result in 100G partitions for ZFS, leaving the rest of the device available (a parsing sketch follows below). One use case for this would be wanting to manually over-provision an SSD.
We should also consider removing the code which creates the unused -part9 partition, which is confusing, although we do want to avoid actually including that small amount of space in -part1.
Since part of the motivation for this change is getting autoexpand working, those test cases should be updated and enabled as part of this work.
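For illustration, if the size form were added, the property value could be parsed roughly along these lines. This is only a hypothetical sketch (the helper name is invented); a real patch would more likely reuse libzfs's existing size-string parsing (e.g. zfs_nicestrtonum).

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: turn a value like "100G" into a byte count. */
static int
parse_partition_size(const char *val, uint64_t *bytes)
{
	char *end;
	uint64_t n = strtoull(val, &end, 10);
	uint64_t mult = 1;

	if (end == val)				/* no digits at all */
		return (-1);

	switch (toupper((unsigned char)*end)) {
	case '\0':				break;
	case 'K':	mult = 1ULL << 10;	break;
	case 'M':	mult = 1ULL << 20;	break;
	case 'G':	mult = 1ULL << 30;	break;
	case 'T':	mult = 1ULL << 40;	break;
	default:
		return (-1);
	}
	if (*end != '\0' && end[1] != '\0')	/* trailing junk after suffix */
		return (-1);

	*bytes = n * mult;
	return (0);
}
```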
cmd/zpool/zpool_main.c
Outdated
/* Make new pools default to partition=raw */
if (add_prop_list_default(zpool_prop_to_name(ZPOOL_PROP_PARTITION),
    "raw", &props, B_TRUE))
	goto errout;
I'd suggest we leave the default behavior as "legacy", at least initially, to minimize any disruption (a sketch of the adjusted default follows the list below). The existing partitioning scheme does provide a few benefits which users probably don't realize they're depending on.
- The primary partition (-part1) created for ZFS is automatically aligned on a 1MB boundary. This alignment can be critical for performance when creating a pool on some hardware RAID products.
- Leaving a small bit of space at the end of the device (-part9) helps avoid problems when mixing nominally matching drives from different vendors, which can differ very slightly in total capacity.
- Potentially other issues related to autoreplace.
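Concretely, keeping "legacy" as the default would presumably just mean changing the value string in the hunk above; a sketch, not the final diff:

```c
/* Make new pools default to partition=legacy to preserve existing behavior */
if (add_prop_list_default(zpool_prop_to_name(ZPOOL_PROP_PARTITION),
    "legacy", &props, B_TRUE))
	goto errout;
```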
So I have a few questions: would using a raw device cause it to not be aligned at 1MB?
And is leaving space at the end currently only done in the wholedisk=1 case?
> So I have a few questions: would using a raw device cause it to not be aligned at 1MB?

Unless ZFS creates -part1, which is explicitly aligned at 1MB, things will just be aligned at the start of the device. That's usually fine for standard hard drives, but it can lead to performance problems with other kinds of block devices, which may include a small header that messes up the alignment. This has led to issues with Linux multipath devices or hardware/software RAID devices which have an internal RAID stripe size. Starting at a 1MB offset into the device helps avoid these kinds of issues (see the sketch at the end of this comment).

> And is leaving space at the end currently only done in the wholedisk=1 case?

Right. The wholedisk=0 case is effectively the same as the raw case.
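To make the alignment point concrete, here is a small standalone sketch (not ZFS code); the 256 KiB stripe and 17 KiB offset are invented example numbers for the kind of backing-device geometry discussed above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Is the byte offset `off` a multiple of the given boundary? */
static bool
is_aligned(uint64_t off, uint64_t boundary)
{
	return ((off % boundary) == 0);
}

int
main(void)
{
	uint64_t stripe = 256 * 1024;		/* hypothetical internal RAID stripe */
	uint64_t part1_start = 1024 * 1024;	/* ZFS -part1 starts at 1 MiB */
	uint64_t odd_start = 17 * 1024;		/* hypothetical unaligned start */

	/* 1 MiB is a multiple of any power-of-two stripe size up to 1 MiB. */
	printf("1 MiB start aligned to 256 KiB stripe:  %s\n",
	    is_aligned(part1_start, stripe) ? "yes" : "no");
	printf("17 KiB start aligned to 256 KiB stripe: %s\n",
	    is_aligned(odd_start, stripe) ? "yes" : "no");
	return (0);
}
```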
I don't think I understand how it works. If RAID or whatever messes up the alignment, why would starting at a 1MB offset fix it? Wouldn't it still align to the same place, just 1MB apart?
cmd/zpool/zpool_vdev.c
Outdated
update_vdev_config_dev_strs(nv);

if (!is_spare(NULL, path))
	(void) zero_label(path);
Why special case spares here?
zero_label will fail to open the device because it's already open exclusively. The same check is already done for the legacy wholedisk case below, so it should be done here as well.
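Restating the hunk above with a comment, as a sketch rather than the final diff:

```c
/*
 * A configured spare is already held open exclusively, so zero_label()
 * would fail to open the device; skip it in that case, matching what
 * the legacy wholedisk path already does.
 */
if (!is_spare(NULL, path))
	(void) zero_label(path);
```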
Got it
This is actually independent of this patch. Currently if you have a wholedisk=0 spare, and you do a replace with that spare, it will say device busy, but continue to do the replace, which is really confusing.
Then why don't we pull this small fix into its own PR?
Is this really a good idea? I thought ZFS wouldn't perform well if it's close to full, and forcing it to use a smaller partition would mean it would get close to full easily.
100G was just an example, and I don't think this would be commonly used by most people. But there are definitely cases where it would be useful. Manually over-provisioned SSDs are a good example of this. We've deployed ZFS pools on SSDs where we're more concerned about the lifespan and performance of the SSD than its total capacity. In that case, we create a partition which uses only half of the total capacity. The SSD can then usually do a much better job of wear leveling and garbage collection since it has a lot of erased flash. Historically, another use case for this kind of thing was short-stroking traditional hard drives, although that's less common these days.
There is a yet-to-be-ported OpenZFS change somewhat related to this which also needs to be considered: OpenZFS 7446 "zpool create should support EFI system partition".
Update: rebased to master, changed the pool create default to legacy, made zed follow the partition property. I'll leave the partition=100G work to anyone who's interested in it.
Likely not relevant for proper server installs, more so for workstations, servers with a GUI installed, or desktop users, but we finally got the udisks2 udev.d rules into a pull request upstream which have the Solaris/BSD/ZoL partition type IDs and such (so device mapper and udisks2 don't attempt to do anything weird with zfs_member stuff or grab locks on them, preventing re-scanning of changed ZFS layouts). Edit: there are relevant upstream reports relating to device mapper/dmsetup & udisks2, so this makes me wonder how some installs will handle scanning of raw devices if a partition-less scheme is used. (Edit: I think dm will still attempt to scan it.) AFAIK even btrfs uses partitions, which is kind of the Linux way of things. I do fully support this if it works as intended, but identifying ZFS disks could become problematic, as well as proper alignment, maybe?
@Trucido Edit: Also, blkid will detect zfs_member regardless of whether it's on a partition or not. So if you want a udev rule to block all ZFS members, just use zfs_member as the key.
Setting a partition size (or maybe some kind of reservation) may also be useful to make sure that replacement drives will fit: after a few years a drive can be discontinued by the manufacturer or become unavailable in stores, and a different model/manufacturer can be a few hundred MBs smaller, causing ZFS to refuse to resilver.
I like that idea too.
Is there any update on this request? I actually need this change. It doesn't affect performance or anything like that, but here's what I ran into: I'm using ZFS on an EBS device, and after I resized the EBS volume, partition 9 ended up in the middle of the disk and blocked zpool autoexpand. I had to delete partition 9 manually and reboot the EC2 instance because partprobe somehow couldn't recognize the size change. That's really annoying. Please finish this change and merge it ASAP.
@tuxoko The wiki is perhaps misworded about the purpose of the udev rule. The last bit is the important part: it simply appends an extra ENV to udev which instructs udisks2 (now storaged?) to ignore partitions which match that rule. For example, if you have a GUI workstation with udisks2 running and open a file manager, you won't see a ton of spam with EVERY single partition of every ZFS disk and non-functional mount/unmount/eject entries. This also had an unfortunate side effect of triggering device mapper to attempt to map things it shouldn't have, or udisks would grab some sort of lock preventing pools from being updated when destroying/creating and adding devices. I forget now since it's been so long, but I ran into this issue on a Debian testbed (with minimal GUI) a while ago. We're not explicitly telling device mapper to ignore zfs or zvols here, just telling udisks2 to mind its own business (which really needs to be merged upstream if it hasn't already; they've already got a long list of RAID-related ignore rules). I created the rule based on some existing rules. There are more fine-grained rules (already upstream) to handle mapping zvols if that was your concern. Edit: speaking of udev rules, there is another one I use to fix some grub/dracut derps. I forget where it came from or what its purpose even was anymore, since the other rules should in theory cover this stuff. Maybe it was to fix the broken systemd import service before the fix for import-cache/import-scan.
@tuxoko in the interest of keeping the OpenZFS implementations in sync as much as possible, what do you think about adopting the upstream change #94 (comment) to resolve this?
@kpande no, you'd need to manually set the bootfs property.
I noticed that this PR hasn't been updated in some time. We would like to see this feature added (or resolved as mentioned above), but it seems that it isn't quite complete, so we're going to close it for now. When you have time to revisit this work, feel free to reopen this PR or open a new one. Thanks for your contribution to OpenZFS, and please continue to open PRs to address issues that you encounter!
Please people, someone, get this done!
Description
zfsonlinux always partitions the disk when it detects that the given device is a whole disk. This legacy behavior from Illumos has no apparent benefit on Linux, but it has some downsides besides confusion, e.g. autoexpand not working and switching to a dm device requiring partprobe.

We add a pool property `partition` to be set during pool create. It currently has two values, legacy and raw. When set to legacy, it will behave as it did before. When set to raw, it will always use the device as is, without partitioning, even if it's a whole disk.

This property applies to all commands that add disks to a pool, so zpool add/attach/replace will partition or not partition based on the property of the target pool.

A pool without this property will be treated as legacy. Newly created pools will by default have partition=legacy.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Note
I use PROP_ONETIME for the property, but it seems that this is not enforced at all, so you can still modify it after the fact. You shouldn't change it after the fact, though, as it would cause the partition suffix to be appended to device names incorrectly.
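To summarize the intended behavior from the description above, here is a minimal decision sketch; the helper is hypothetical and not part of the patch:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: commands that add disks (zpool add/attach/replace)
 * consult the target pool's partition property.  A missing property is
 * treated as legacy, and only legacy auto-partitions whole disks.
 */
static bool
should_auto_partition(const char *partition_prop, bool wholedisk)
{
	if (partition_prop == NULL || strcmp(partition_prop, "legacy") == 0)
		return (wholedisk);
	/* partition=raw: use the device as-is, even if it's a whole disk. */
	return (false);
}
```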