Fix for hotplug issues #13797

Merged: 1 commit merged into openzfs:master from the NAS-115238 branch on Sep 28, 2022

Conversation

@ixhamza (Contributor) commented Aug 24, 2022

Motivation and Context

Fix for hotplug issues

Description

ZED relies on udev to match vdev guids when a device is removed. However,
udev does not hold the correct blkid information for the vdev, so the vdev
fails to match when detached and we have to rely on the fault handler to
make the device unavailable. This PR triggers a disk change event whenever
a new vdev is added, so that the blkid information is synced with udev.
This PR also changes the device state to REMOVED instead of UNAVAIL
whenever the device is unplugged.
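
A minimal sketch of the kernel-side idea, with a hypothetical helper name (the PR's actual change may differ): emitting a change uevent on the underlying disk prompts udev to re-run blkid and refresh its cached ID_FS_* properties.

/*
 * Illustrative sketch only; the helper name is hypothetical and the
 * PR's actual code may differ.  A KOBJ_CHANGE uevent prompts udev to
 * re-probe the disk and refresh its cached blkid (ID_FS_*) properties.
 */
#include <linux/blkdev.h>
#include <linux/kobject.h>

static void
vdev_disk_announce_change(struct block_device *bdev)
{
	struct gendisk *disk = bdev->bd_disk;

	kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
}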

How Has This Been Tested?

By hotplugging vdevs through QEMU

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@ixhamza (Contributor, Author) commented Aug 24, 2022

cc: @amotin, @freqlabs

@ghost ghost requested review from behlendorf and removed request for behlendorf August 24, 2022 19:27
Review threads on cmd/zed/agents/zfs_retire.c (resolved), module/os/linux/spl/spl-generic.c (outdated, resolved), module/os/linux/zfs/vdev_disk.c (outdated, resolved), module/zfs/vdev.c (outdated, resolved), and module/zfs/spa.c (outdated, resolved).
@ghost commented Aug 24, 2022

For more context:

The first part of this change deals with the fact that udev caches blkid information, which includes fields parsed from vdev labels:

#  blkid --probe /dev/sde2
/dev/sde2: VERSION="5000" LABEL="storage" UUID="5081467553143690352" UUID_SUB="8098199377452528262" BLOCK_SIZE="4096" TYPE="zfs_member" USAGE="filesystem" PART_ENTRY_SCHEME="gpt" PART_ENTRY_UUID="550c557e-c90e-11e9-bd70-ac1f6b0adb04" PART_ENTRY_TYPE="516e7cba-6ecf-11d6-8ff8-00022d09712b" PART_ENTRY_NUMBER="2" PART_ENTRY_OFFSET="4194432" PART_ENTRY_SIZE="1949330696" PART_ENTRY_DISK="8:64"
# udevadm info /dev/sde2 | grep ID_FS_
E: ID_FS_VERSION=5000
E: ID_FS_LABEL=storage
E: ID_FS_LABEL_ENC=storage
E: ID_FS_UUID=5081467553143690352
E: ID_FS_UUID_ENC=5081467553143690352
E: ID_FS_UUID_SUB=8098199377452528262
E: ID_FS_UUID_SUB_ENC=8098199377452528262
E: ID_FS_TYPE=zfs_member
E: ID_FS_USAGE=filesystem

The LABEL field is the pool name, UUID is the pool guid, and UUID_SUB is the vdev guid, among other fields.
A change event on the block device is needed for udev to invalidate its cache and re-probe the device whenever we update these fields in a vdev label, for example when we create/reguid/destroy a pool or add/replace/remove vdevs. Otherwise udev reports incorrect guids to zed, and zed ignores events for the device.
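
As an illustrative sketch of the consumer side (hypothetical function name, not zed's actual code), these are the properties a zed-style daemon reads through libudev; when udev's cache is stale they do not match the pool and vdev guids, so the event is effectively dropped:

/* Illustrative sketch; not zed's code, and the function name is hypothetical. */
#include <libudev.h>
#include <stdint.h>
#include <stdlib.h>

static int
guids_from_udev(struct udev_device *dev, uint64_t *pool_guid,
    uint64_t *vdev_guid)
{
	/* ID_FS_UUID / ID_FS_UUID_SUB come from udev's cached blkid probe. */
	const char *pool = udev_device_get_property_value(dev, "ID_FS_UUID");
	const char *vdev = udev_device_get_property_value(dev, "ID_FS_UUID_SUB");

	if (pool == NULL || vdev == NULL)
		return (-1);

	*pool_guid = strtoull(pool, NULL, 10);
	*vdev_guid = strtoull(vdev, NULL, 10);
	return (0);
}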

The second part of this change deals with device removal. There exists VDEV_STATE_REMOVED for a device that has been detached from the system. Both FreeBSD and Linux already use this state when an I/O failure indicates the device is gone. On FreeBSD there is also a callback in the kernel when a device is removed, which we use to update the state accordingly without having to wait for I/O to fail. There is no such mechanism on Linux, so we have to rely on udev/zed.
There is currently no way to request VDEV_STATE_REMOVED from userland, so it has been added to libzfs using the existing ZFS_IOC_VDEV_SET_STATE ioctl, and zed now uses this to mark the vdev as removed when it receives the removal event from udev.
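
A minimal sketch of what such a libzfs call could look like, modeled on existing state-setting helpers such as zpool_vdev_fault(). The helper name is hypothetical, it assumes libzfs-internal headers, and the merged code may differ; the PR ultimately exposes zpool_vdev_remove_wanted() instead.

/*
 * Illustrative sketch modeled on zpool_vdev_fault(); hypothetical helper
 * that assumes libzfs internals, not the PR's final code.
 */
static int
zpool_vdev_mark_removed(zpool_handle_t *zhp, uint64_t guid)
{
	zfs_cmd_t zc = {"\0"};
	libzfs_handle_t *hdl = zhp->zpool_hdl;

	(void) strlcpy(zc.zc_name, zhp->zpool_name, sizeof (zc.zc_name));
	zc.zc_guid = guid;
	zc.zc_cookie = VDEV_STATE_REMOVED;	/* requested vdev state */
	zc.zc_obj = VDEV_AUX_NONE;

	if (zfs_ioctl(hdl, ZFS_IOC_VDEV_SET_STATE, &zc) == 0)
		return (0);

	return (zpool_standard_error(hdl, errno, "cannot mark vdev REMOVED"));
}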

@amotin (Member) left a comment:

I could go on at length about cosmetics, but I have no real objections.

Review threads on cmd/zed/agents/zfs_retire.c (two threads, outdated, resolved) and module/zfs/vdev.c (outdated, resolved).
@ghost ghost requested a review from tonyhutter August 25, 2022 16:14
@ghost added the "Status: Code Review Needed" (Ready for review and testing) label, Aug 25, 2022
@tonyhutter (Contributor) commented:

Couple of thoughts before I take a look at the code:

We've debated in the past whether or not to remove a vdev from a pool on a udev remove event. On the surface it seems perfectly logical. udev tells you the disk has been removed, so remove it from the pool. Makes sense.

However, at very large scale (1000s of disks), you often see odd disk behavior that you have to account for. Like, multiple NVMe drives, on multiple nodes, all simultaneously disappearing and re-appearing within a few seconds of each other. I've seen SAS disks disappear/re-appear as well. And as long as the disks come back within the IO timeout amount of time, and are healthy after they come back, then the user wouldn't notice anything (maybe even more so if you're running Lustre on top of ZFS, with its own timeouts/retries on top of it?). Just to quantify this strangeness a little - I'm seeing over 1000 zed "EC_dev_remove" events across all our disks in the last month (of which only a handful are actual disk replacements). Had this "remove vdev on udev remove" been enabled, we would have seen tons of pools fault at the same time, causing mass downtime and admin intervention. Since we didn't have it enabled, there were no issues - all the IOs rode through the intermittent period of disks disappearing/re-appearing.

In short, I think you should make this configurable as a new autoremove pool property that is off by default. We already have an autoreplace property, so it makes sense to have an autoremove property too.

Also, I'm not in favor of the UNAVAIL -> REMOVED rename. "UNAVAIL" already has a lot of historical weight (just google "ZFS unavail"), and it might break scripts if we rename it. UNAVAIL also implies that the drive could come back to life later, which would certainly be accurate in the case of our disappearing/re-appearing drives.

@amotin (Member) commented Aug 26, 2022

@tonyhutter From that perspective, I think it would be reasonable for ZED to just not make too-quick extra moves, like kicking in spares, for a few minutes after a disk disappears, just in case it returns. Anyway, a rebuild will likely take hours, so a few minutes do not matter much. Pardon me if that logic is already there. But if the disk has really disappeared according to the block layer, it won't handle any new I/O, so hiding it from ZFS is not very productive. On the contrary, we have seen that disk references held by ZFS cause actual problems for NVMe and the surrounding subsystems, which cannot free their resources.

We also have a huge fleet of ZFS systems, with perhaps not multiple thousands but up to a thousand disks each, based on FreeBSD. And on FreeBSD a disk's disappearance is immediately reported to ZFS within the kernel, without any additional daemon participation. The daemon handles only replacement. And we feel good enough about that.

@ghost ghost added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Code Review Needed Ready for review and testing labels Aug 29, 2022
@ghost commented Aug 29, 2022

We found more issues with removal detection that need to be fixed.

@tonyhutter (Contributor) commented:

@amotin

From that perspective, I think it would be reasonable for ZED to just not make too-quick extra moves, like kicking in spares, for a few minutes after a disk disappears, just in case it returns.

Yeah, something like an "autoremove after N seconds" timer would be good. We would probably want the autoreplace timer to be longer than the IO timeout (plus the IO retries). The autoremove timer could be added in a future PR.

@ghost commented Aug 29, 2022

Physically removed vdevs are already supposed to transition to VDEV_STATE_REMOVED on IO errors even on Linux, but it doesn't work for several reasons.

zfs_check_media_change() for kernels 5.10+ has a bug: it always returns 0.

Beyond that, the bdev_check_media_change() KPI it uses doesn't work to check for a hot-plugged device. For NVMe, the check_events member of struct block_device_operations isn't implemented, so it can't work at all. For the drivers that do implement it, it can only work if the device identifies as removable (/sys/block/*/removable). For example, in a few systems I have handy, a USB flash drive is removable, but a SATA disk in a hot-swap enclosure attached to a SAS controller is not.

This would affect older kernels with check_media_change() as well. So in most cases the check to see if a device was removed just isn't working as intended.

UPDATE: even SAS disks report 0 for removable. This ever working would be very rare indeed.
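
Summarizing that limitation in code form (an illustrative sketch with a hypothetical helper name, not anything from the PR):

/*
 * Illustrative sketch, hypothetical helper; not the PR's code.
 * Why the media-change path rarely fires for hot-plugged vdevs on Linux.
 */
#include <linux/blkdev.h>

static bool
vdev_media_change_can_work(struct block_device *bdev)
{
	struct gendisk *disk = bdev->bd_disk;

	/* NVMe does not implement ->check_events, so no change is ever reported. */
	if (disk->fops->check_events == NULL)
		return (false);

	/*
	 * Drivers that do implement it only generate events for devices
	 * flagged removable (/sys/block/<dev>/removable), which most SAS
	 * and SATA disks in hot-swap bays are not.
	 */
	if (!(disk->flags & GENHD_FL_REMOVABLE))
		return (false);

	return (true);
}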

@ixhamza (Contributor, Author) commented Sep 1, 2022

@freqlabs I just added an updated version and split my work into multiple commits.
cc: @amotin

@ixhamza ixhamza force-pushed the NAS-115238 branch 6 times, most recently from 6be1cf4 to be4e34c Compare September 2, 2022 13:35
@tonyhutter (Contributor) commented:

I tested this a little using a draid1:4d:9c:1s pool of 9 NVMe drives. I tried powering an NVMe drive off and on using /sys/bus/pci/slots/<slot>/power, and all seemed to work as expected. I plan on doing some more real-world testing next week with spinning disks in a SAS enclosure.

@ixhamza (Contributor, Author) commented Sep 18, 2022

@tonyhutter Thank you for your description. I debugged the issue and it seems to be related to virtio disks. A vdev should have a devid ("/dev/disk/by-id") so it can be identified during removal. However, by default virtio disks do not seem to have a devid unless the serial attribute is hardcoded for the specified disk (systemd/systemd#17670). Even if we provide the serial attribute for virtio disks, zed won't be able to find the devid, since there is no handling for virtio disks in zfs_device_get_devid(). If you test this on real hardware or emulated disks, it should not be a problem. Normal vdevs won't be affected by this since they are identified by pool_guid and vdev_guid. Maybe I can dig more into it in the future.
@freqlabs

@ixhamza ixhamza force-pushed the NAS-115238 branch 2 times, most recently from 6a45dd2 to 4bd16a4 Compare September 20, 2022 14:34
@tonyhutter (Contributor) commented:

@ixhamza - I went back to testing on real-world NVMe devices and have some new results. The good news is that the cache device now correctly shows REMOVED on removal:

# zpool status
  pool: tank
 state: ONLINE
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    ONLINE       0     0     0
	spares
	  nvme1n1    AVAIL   

# echo 0 > /sys/bus/pci/slots/0/power 
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    REMOVED      0     0     0
	spares
	  nvme1n1    AVAIL   

The bad news is that the spares do not show REMOVED:

# echo 0 > /sys/bus/pci/slots/1/power 
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    REMOVED      0     0     0
	spares
	  nvme1n1    AVAIL   

errors: No known data errors

# file /dev/nvme1n1
/dev/nvme1n1: cannot open `/dev/nvme1n1' (No such file or directory)

I also verified that in both removal cases (the cache and the spare), udev remove events were generated. So ZED should be able to key off those removal events for spares as well, although it might require some additional code.

@ixhamza (Contributor, Author) commented Sep 20, 2022

@tonyhutter you are right, hotplugging of spares is indeed not supported in this patch, since a spare does not contain the label that we use to identify other vdevs.
@freqlabs - do you think this is something we could cover in future work? With reference to our earlier discussion, it does not seem very straightforward to handle in this PR.

@ghost commented Sep 20, 2022

Yes, we can address handling spares next.

@ixhamza (Contributor, Author) commented Sep 26, 2022

@behlendorf, @tonyhutter it would be appreciated if you could provide feedback or tag the relevant reviewers for this PR. These patches are critical for our TrueNAS BETA release.
@amotin, @freqlabs

@tonyhutter (Contributor) commented:

@ixhamza apologies for the delay, I'll take a look right now

Comment on lines 3076 to 3084
/*
 * Remove the specified vdev. Called from zed on udev remove event.
 */
int
zpool_vdev_remove_wanted(zpool_handle_t *zhp, const char *path)

@tonyhutter (Contributor) commented on Sep 26, 2022:

Both zpool_vdev_remove_wanted() and zpool_vdev_remove() are exported by libzfs and have the same signature:

/*
 * Remove the given device.
 */
int
zpool_vdev_remove(zpool_handle_t *zhp, const char *path)

Can you add some additional comments to differentiate what zpool_vdev_remove_wanted() does vs zpool_vdev_remove()?

@ixhamza (Contributor, Author) replied:

That's a very good point. zpool_vdev_remove_wanted() asynchronously removes the vdev whereas zpool_vdev_remove() does it synchronously. I will test it again to recall what issues I faced when using zpool_vdev_remove() and will add the comment accordingly.
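
For illustration, one possible shape for such a differentiating comment (hypothetical wording, not the text that landed in the PR):

/*
 * zpool_vdev_remove(): synchronously remove the given device from the
 * pool configuration.
 *
 * zpool_vdev_remove_wanted(): asynchronously flag the specified vdev
 * for removal, to be processed later; called from zed on a udev
 * remove event.
 */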

@tonyhutter (Contributor) replied:

👍

@ixhamza (Contributor, Author) replied:

Just updated the comment. Thanks!

@tonyhutter (Contributor) commented:

@ixhamza it looks good - I really don't have any more comments. I would recommend you squash your commits and use the (slightly modified) commit message:

zed: mark disks as REMOVED when they are removed

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

@ixhamza ixhamza force-pushed the NAS-115238 branch 2 times, most recently from 3d01ee7 to c357488 Compare September 26, 2022 23:38
@ixhamza (Contributor, Author) commented Sep 26, 2022

@tonyhutter Thank you so much for your review and for verifying it on your hardware. I have incorporated your feedback into this PR.

@ghost commented Sep 27, 2022

I've offered a few more suggestions in private for Ameer to incorporate, including fixing a nearby memory leak when "remove" events are ignored.

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
@ixhamza (Contributor, Author) commented Sep 27, 2022

Thank you @freqlabs. I have incorporated your suggested changes.

@behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label, Sep 27, 2022
@behlendorf behlendorf merged commit 55c1272 into openzfs:master Sep 28, 2022
andrewc12 pushed commits referencing this pull request to andrewc12/openzfs on Oct 2 and Oct 6, 2022, each carrying the same commit message:

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes openzfs#13797
@ixhamza ixhamza deleted the NAS-115238 branch August 2, 2023 17:17