
[WIP] Add pool prop partition to disable auto partition #6277

Closed
wants to merge 2 commits

Conversation

Contributor

@tuxoko tuxoko commented Jun 27, 2017

Description

zfsonlinux always partitions a disk when it detects that the given device is a whole disk. This legacy behavior from Illumos has no apparent benefit on Linux, but it does have some downsides besides the confusion it causes, e.g. autoexpand, and switching to a dm device requires partprobe.

We add a pool property, partition, to be set during pool create. It currently has two values, legacy and raw. When set to legacy, it behaves as it always has. When set to raw, it always uses the device as-is, without partitioning, even if it's a whole disk.

This property applies to all commands that add disks to a pool, so zpool add/attach/replace will partition or not partition based on the property of the target pool.

A pool without this property is treated as legacy. Newly created pools will have partition=legacy by default.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>

Note

I use PROP_ONETIME for the property, but it seems that this is not enforced at all, so you can still modify it after the fact. You shouldn't change it after the fact, though, as it would cause the device name suffix to be appended incorrectly.
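For illustration, a minimal sketch of how a disk-adding path could consult the property before labeling a whole disk. The exact helper names in the patch may differ; zhp, wholedisk, devname, ret and g_zfs are assumed to exist in the surrounding code, while zpool_get_prop() and zpool_label_disk() are existing libzfs calls:

char partval[ZFS_MAXPROPLEN];
boolean_t do_partition = B_TRUE;            /* "legacy" default */

/* Read the pool-level "partition" property from the target pool. */
if (zpool_get_prop(zhp, ZPOOL_PROP_PARTITION, partval,
    sizeof (partval), NULL, B_FALSE) == 0 &&
    strcmp(partval, "raw") == 0)
    do_partition = B_FALSE;                 /* use the device as-is */

if (wholedisk && do_partition) {
    /* legacy behavior: write an EFI label and hand -part1 to ZFS */
    ret = zpool_label_disk(g_zfs, zhp, devname);
}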

Contributor Author

tuxoko commented Jun 27, 2017

Related to #3452 #3458

Contributor

@behlendorf behlendorf left a comment


It would be great to optionally provide this functionality. I know I've wanted it for a long time.

In addition to legacy and raw as values for the partition property it would be useful if it took a size for the primary partition. For example, setting partition=100G would result in 100G partitions for ZFS leaving the rest of the device available. One use case for this would be wanting to manually over-provision an SSD.

We should also consider removing the code which creates the unused -part9, which is confusing, although we do want to avoid actually including that small amount of space in -part1.

Since part of the motivation for this change is getting autoexpand working those test cases should be updated and enabled as part of this work.
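If the property did take a size, the value could presumably be parsed with the existing zfs_nicestrtonum() helper. A rough sketch only; value, hdl and errbuf are assumed locals, and how the patch would actually validate this is an open question:

uint64_t psize;

/* e.g. value == "100G"; accepts the usual human-readable suffixes */
if (zfs_nicestrtonum(hdl, value, &psize) != 0)
    return (zfs_error(hdl, EZFS_BADPROP, errbuf));

/* psize would then bound the size of -part1 instead of using the whole disk */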

/* Make new pools default to partition=raw */
if (add_prop_list_default(zpool_prop_to_name(ZPOOL_PROP_PARTITION),
    "raw", &props, B_TRUE))
    goto errout;
Contributor


I'd suggest we leave the default behavior as "legacy" at least initially to minimize any disruption. The existing partitioning scheme does provide a few benefits which users probably don't realize they're depending on.

  • The primary partition (-part1) created for ZFS is automatically aligned on a 1MB boundary. This alignment can be critical for performance when creating a pool on some hardware RAID products.
  • Leaving a small bit of space at the end of the device (-part9) helps avoid problems when mixing nominally matching drives from different vendors, which can differ very slightly in total capacity.
  • Potentially other issues related to autoreplace.

Contributor Author


So I have a few questions: would using a raw device cause it to not be aligned to 1MB?
And is leaving space at the end currently only done in the wholedisk=1 case?

Contributor


So I have a few questions: would using a raw device cause it to not be aligned to 1MB?

Unless ZFS creates -part1, which is explicitly aligned at 1MB, things will just be aligned at the start of the device. That's usually fine for standard hard drives, but it can lead to performance problems with other kinds of block devices, which may include a small header that can throw off the alignment. This has led to issues with Linux multipath devices and hardware/software RAID devices which have an internal RAID stripe size. Starting at a 1MB offset into the device helps avoid these kinds of issues.

And is leaving space at the end currently only done in the wholedisk=1 case?

Right. The wholedisk=0 case is effectively the same as the raw case.
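As a rough illustration of why the fixed 1MB offset is convenient (assuming 512-byte sectors; the actual stripe size depends on the device):

/* 1 MiB start offset, expressed in 512-byte sectors */
uint64_t start_bytes = 1ULL << 20;          /* 1048576 */
uint64_t start_sector = start_bytes / 512;  /* 2048    */

/*
 * 1 MiB is a multiple of every common power-of-two stripe or erase-block
 * size (4K, 64K, 256K, 1M, ...), so I/O that is aligned within -part1
 * stays aligned on the underlying device as well.
 */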

Contributor Author

@tuxoko tuxoko Jun 29, 2017


I don't think I understand how that works. If RAID or whatever messes up the alignment, why would starting at a 1MB offset fix it? Wouldn't it still align to the same place, just 1MB apart?

update_vdev_config_dev_strs(nv);

(void) zero_label(path);
if (!is_spare(NULL, path))
Contributor


Why special case spares here?

Contributor Author

@tuxoko tuxoko Jun 29, 2017


zero_label will fail to open the device because it's already open exclusively. It's already done for the legacy wholedisk case below, so it should be done here as well.
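In other words, the intent is roughly the following guarded call (a sketch assembled from the hunk quoted above and the explanation here):

if (!is_spare(NULL, path))
    (void) zero_label(path);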

Contributor


Got it

Contributor Author


This is actually independent of this patch. Currently if you have a wholedisk=0 spare, and you do a replace with that spare, it will say device busy, but continue to do the replace, which is really confusing.

Contributor


Then why don't we pull this small fix into its own PR?

Contributor Author

tuxoko commented Jun 29, 2017

In addition to legacy and raw as values for the partition property it would be useful if it took a size for the primary partition. For example, setting partition=100G would result in 100G partitions for ZFS leaving the rest of the device available. One use case for this would be wanting to manually over-provision an SSD.

Is this really a good idea? I thought ZFS wouldn't perform well if it's close to full, and forcing it to use a smaller partition would mean it gets close to full more easily.

Contributor

behlendorf commented Jun 29, 2017

Is this really a good idea? I thought ZFS wouldn't perform well if it's close to full,

100G was just an example, and I don't think this would be commonly used by most people. But there are definitely cases where it would be useful.

Manually over-provisioned SSDs are a good example of this. We've deployed ZFS pools on SSDs where we're more concerned about the lifespan and performance of the SSD than its total capacity; in that case we create a partition which only uses half of the total capacity. The SSD can then usually do a much better job of wear leveling and garbage collection since it has a lot of erased flash.

Historically, another use case for this kind of thing was short-stroking traditional hard drives, although that's less common these days.

@behlendorf
Contributor

There is a yet-to-be-ported OpenZFS change somewhat related to this which also needs to be considered.

OpenZFS 7446 zpool create should support efi system partition

@behlendorf added the "Type: Feature" and "Status: Work in Progress" labels on Jul 10, 2017
Contributor Author

tuxoko commented Aug 14, 2017

Update: rebased to master, changed the pool create default to legacy, and made zed follow the partition property. I'll leave the partition=100G stuff to anyone who's interested in it.


Trucido commented Aug 14, 2017

Likely not relevant for proper server installs, more so for workstations, servers with a GUI installed, or desktop users, but we finally got the udisks2 udev.d rules into a pull request upstream which have the solaris/bsd/ZoL partition IDs and such (so device mapper and udisks2 don't attempt to do anything weird with zfs_member devices or grab locks on them, which prevents re-scanning of changed ZFS layouts).

Edit: relevant upstream reports relating to device mapper/dmsetup & udisks2:
https://bugs.freedesktop.org/show_bug.cgi?id=100864
storaged-project/udisks#376
Edit2: And our related wiki entry (though we're still missing SLOG ID): https://github.com/zfsonlinux/zfs/wiki/FAQ#udisks2-creating-devmapper-entries-for-zvol

So this makes me wonder how some installs will handle scanning of raw devices if a partition-less scheme is used (edit: I think dm will still attempt to scan them). AFAIK even btrfs uses partitions, which is kind of the Linux way of things. I do fully support this, however, if it works as intended, but identifying ZFS disks could become problematic, as could proper alignment, maybe?

Chunwei Chen added 2 commits August 14, 2017 14:37
Contributor Author

tuxoko commented Aug 14, 2017

@Trucido
https://github.com/zfsonlinux/zfs/wiki/FAQ#udisks2-creating-devmapper-entries-for-zvol
This wiki entry doesn't make any sense to me. The description says it wants to ignore zvol changes, but the udev rule below is trying to ignore zfs_member devices, which are pool devices, not zvols.

Edit: Also, blkid will detect zfs_member regardless of whether it's on a partition or not. So if you want a udev rule to block all zfs_member devices, just use zfs_member as the key.
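For example, a single rule along these lines (an untested sketch; the rule file name is hypothetical) should cover zfs_member devices whether or not they sit on a partition:

# e.g. /etc/udev/rules.d/99-udisks2-ignore-zfs.rules (hypothetical file name)
ENV{ID_FS_TYPE}=="zfs_member", ENV{UDISKS_IGNORE}="1"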


serg-ku commented Aug 15, 2017

Setting a partition size (or maybe some kind of reservation) may also be useful to make sure that replacement drives will fit, because after a few years a drive can be discontinued by the manufacturer or become unavailable in stores, and a different model/manufacturer can be a few hundred MB smaller, causing ZFS to refuse to resilver.
For example, some RAID cards truncate the usable space of a drive to the nearest 10GB multiple. I've seen a lot of situations where, after a few years, it was difficult to find exactly the same drive, all spares were already used, and other drives had a different size in MB, so the only solution was buying and installing a much more expensive drive with 1.5x or even 2x the capacity.

Contributor

jumbi77 commented Aug 15, 2017

For example, setting partition=100G would result in 100G partitions for ZFS leaving the rest of the device available. One use case for this would be wanting to manually over-provision an SSD.

I like that idea too.

@quakelee

Is there any update on this request? I actually need this change. It may not affect performance or other things, but what I ran into is that I'm using ZFS on an EBS device: after I resized the EBS volume, partition 9 ended up in the middle of the disk and blocked zpool's auto-expand. I had to delete partition 9 manually and reboot the EC2 instance, because partprobe somehow couldn't recognize the size change. That's really annoying. Please finish this change and merge it ASAP.


Trucido commented Feb 7, 2018

@Trucido
https://github.com/zfsonlinux/zfs/wiki/FAQ#udisks2-creating-devmapper-entries-for-zvol
This wiki entry doesn't make any sense to me. The description says it wants to ignore zvol changes, but the udev rule below is trying to ignore zfs_member devices, which are pool devices, not zvols.

Edit: Also, blkid will detect zfs_member regardless of whether it's on a partition or not. So if you want a udev rule to block all zfs_member devices, just use zfs_member as the key.

@tuxoko
Sorry for the late response to this; I noticed it from the comment above that bumped this issue/pull.

The wiki is perhaps misworded about the purpose of the udev rule. The last bit is the important part: ENV{UDISKS_IGNORE}="1"
(I went a little overkill on matching to be sure we wouldn't mismatch)

This simply appends an extra ENV to udev which instructs udisks2 (now storaged?) to ignore partitions which match that rule. For example, if you have a GUI workstation with udisks2 running and open a file manager, you won't see a ton of spam with EVERY single partition of every ZFS disk and non-functional mount/unmount/eject entries.

This also had an unfortunate side effect of triggering device mapper to attempt to map stuff it shouldn't have, or udisks would grab some sort of lock preventing pools from being updated when destroying/creating/adding things. I forget now since it's been so long, but I ran into this issue on a Debian testbed (with a minimal GUI) a while ago.

We're not explicitly telling device mapper to ignore zfs or zvols here, just telling udisks2 to mind its own business (which really needs to be merged upstream if it hasn't been already; they've already got a long list of RAID-related ignore rules). I created the rule based on some rules in /usr/lib/udev/rules.d/80-udisks2.rules which were intended to ignore certain RAID-related stuff.

There are more fine-grained rules (already upstream) to handle mapping zvols if that was your concern (60-zvol.rules, 90-zfs.rules, etc., and various grub-related ones floating around too).

Edit: speaking of udev rules, here is another one I use to fix some grub/dracut derps. I forget where it came from or what its purpose even was anymore, since the other rules should in theory cover this stuff. Maybe it was to fix the broken systemd import service before the fix for import-cache/import-scan:
#/etc/udev/rules.d/70-zfs-grub-fix.rules
KERNEL=="sd*[0-9]", IMPORT{parent}=="ID_*", ENV{ID_FS_TYPE}=="zfs_member", SYMLINK+="$env{ID_BUS}-$env{ID_SERIAL}-part%n"

@behlendorf
Contributor

@tuxoko in the interests of keeping the OpenZFS implementations in sync as much as possible, what do you think about adopting the upstream change #94 (comment) to resolve this.

@behlendorf
Contributor

@kpande no, you'd need to manually set the bootfs property.

Member

ahrens commented Jun 4, 2021

I noticed that this PR hasn't been updated in some time. We would like to see this feature added (or resolved as mentioned above), but it seems that it isn't quite complete, so we're going to close it for now. When you have time to revisit this work, feel free to reopen this PR or open a new one. Thanks for your contribution to OpenZFS, and please continue to open PRs to address issues that you encounter!


pepa65 commented Nov 28, 2023

Please people, someone, get this done!

Labels
Status: Inactive, Status: Revision Needed, Type: Feature