Fix for hotplug issues #13797

Merged: 1 commit merged into openzfs:master from the NAS-115238 branch on Sep 28, 2022

Conversation

@ixhamza (Contributor) commented Aug 24, 2022

Motivation and Context

Fix for hotplug issues

Description

ZED relies on udev to match vdev guids when a device is removed. However,
udev does not hold the correct blkid information for the vdev, so the vdev
fails to match when detached and we have to rely on the fault handler to
make the device unavailable. This PR triggers a disk change event whenever
a new vdev is added, so that the blkid information is synced with udev.
This PR also changes the device state to REMOVED instead of UNAVAIL
whenever the device is unplugged.
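
A minimal sketch of the kernel-side idea, with a hypothetical helper name (the PR's actual change may differ): emitting a change uevent on the underlying disk prompts udev to re-run blkid and refresh its cached ID_FS_* properties.

/*
 * Illustrative sketch only; the helper name is hypothetical and the
 * PR's actual code may differ.  A KOBJ_CHANGE uevent prompts udev to
 * re-probe the disk and refresh its cached blkid (ID_FS_*) properties.
 */
#include <linux/blkdev.h>
#include <linux/kobject.h>

static void
vdev_disk_announce_change(struct block_device *bdev)
{
	struct gendisk *disk = bdev->bd_disk;

	kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
}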

How Has This Been Tested?

By hotplugging vdevs through QEMU

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@ixhamza (Contributor, Author) commented Aug 24, 2022

cc: @amotin, @freqlabs

@ghost ghost requested review from behlendorf and removed request for behlendorf August 24, 2022 19:27
Review threads on cmd/zed/agents/zfs_retire.c (resolved), module/os/linux/spl/spl-generic.c (outdated, resolved), module/os/linux/zfs/vdev_disk.c (outdated, resolved), module/zfs/vdev.c (outdated, resolved), and module/zfs/spa.c (outdated, resolved).
@ghost commented Aug 24, 2022

For more context:

The first part of this change deals with the fact that udev caches blkid information, which includes fields parsed from vdev labels:

#  blkid --probe /dev/sde2
/dev/sde2: VERSION="5000" LABEL="storage" UUID="5081467553143690352" UUID_SUB="8098199377452528262" BLOCK_SIZE="4096" TYPE="zfs_member" USAGE="filesystem" PART_ENTRY_SCHEME="gpt" PART_ENTRY_UUID="550c557e-c90e-11e9-bd70-ac1f6b0adb04" PART_ENTRY_TYPE="516e7cba-6ecf-11d6-8ff8-00022d09712b" PART_ENTRY_NUMBER="2" PART_ENTRY_OFFSET="4194432" PART_ENTRY_SIZE="1949330696" PART_ENTRY_DISK="8:64"
# udevadm info /dev/sde2 | grep ID_FS_
E: ID_FS_VERSION=5000
E: ID_FS_LABEL=storage
E: ID_FS_LABEL_ENC=storage
E: ID_FS_UUID=5081467553143690352
E: ID_FS_UUID_ENC=5081467553143690352
E: ID_FS_UUID_SUB=8098199377452528262
E: ID_FS_UUID_SUB_ENC=8098199377452528262
E: ID_FS_TYPE=zfs_member
E: ID_FS_USAGE=filesystem

The LABEL field is the pool name, UUID is the pool guid, and UUID_SUB is the vdev guid, among other fields.
A change event on the block device is needed for udev to invalidate its cache and re-probe the device whenever we update these fields in a vdev label, for example when we create/reguid/destroy a pool or add/replace/remove vdevs. Otherwise udev reports incorrect guids to zed, and zed ignores events for the device.
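
As an illustrative sketch of the consumer side (hypothetical function name, not zed's actual code), these are the properties a zed-style daemon reads through libudev; when udev's cache is stale they do not match the pool and vdev guids, so the event is effectively dropped:

/* Illustrative sketch; not zed's code, and the function name is hypothetical. */
#include <libudev.h>
#include <stdint.h>
#include <stdlib.h>

static int
guids_from_udev(struct udev_device *dev, uint64_t *pool_guid,
    uint64_t *vdev_guid)
{
	/* ID_FS_UUID / ID_FS_UUID_SUB come from udev's cached blkid probe. */
	const char *pool = udev_device_get_property_value(dev, "ID_FS_UUID");
	const char *vdev = udev_device_get_property_value(dev, "ID_FS_UUID_SUB");

	if (pool == NULL || vdev == NULL)
		return (-1);

	*pool_guid = strtoull(pool, NULL, 10);
	*vdev_guid = strtoull(vdev, NULL, 10);
	return (0);
}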

The second part of this change deals with device removal. There exists VDEV_STATE_REMOVED for a device that has been detached from the system. Both FreeBSD and Linux already use this state when an I/O failure indicates the device is gone. On FreeBSD there is also a callback in the kernel when a device is removed, which we use to update the state accordingly without having to wait for I/O to fail. There is no such mechanism on Linux, so we have to rely on udev/zed.
There is currently no way to request VDEV_STATE_REMOVED from userland, so it has been added to libzfs using the existing ZFS_IOC_VDEV_SET_STATE ioctl, and zed now uses this to mark the vdev as removed when it receives the removal event from udev.
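
A minimal sketch of what such a libzfs call could look like, modeled on existing state-setting helpers such as zpool_vdev_fault(). The helper name is hypothetical, it assumes libzfs-internal headers, and the merged code may differ; the PR ultimately exposes zpool_vdev_remove_wanted() instead.

/*
 * Illustrative sketch modeled on zpool_vdev_fault(); hypothetical helper
 * that assumes libzfs internals, not the PR's final code.
 */
static int
zpool_vdev_mark_removed(zpool_handle_t *zhp, uint64_t guid)
{
	zfs_cmd_t zc = {"\0"};
	libzfs_handle_t *hdl = zhp->zpool_hdl;

	(void) strlcpy(zc.zc_name, zhp->zpool_name, sizeof (zc.zc_name));
	zc.zc_guid = guid;
	zc.zc_cookie = VDEV_STATE_REMOVED;	/* requested vdev state */
	zc.zc_obj = VDEV_AUX_NONE;

	if (zfs_ioctl(hdl, ZFS_IOC_VDEV_SET_STATE, &zc) == 0)
		return (0);

	return (zpool_standard_error(hdl, errno, "cannot mark vdev REMOVED"));
}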

@amotin (Member) left a comment:

I could go on at length about cosmetics, but I have no real objections.

Review threads on cmd/zed/agents/zfs_retire.c (two threads, outdated, resolved) and module/zfs/vdev.c (outdated, resolved).
@ghost ghost requested a review from tonyhutter August 25, 2022 16:14
@ghost added the "Status: Code Review Needed" (Ready for review and testing) label, Aug 25, 2022
@tonyhutter (Contributor) commented:

Couple of thoughts before I take a look at the code:

We've debated in the past whether or not to remove a vdev from a pool on a udev remove event. On the surface it seems perfectly logical. udev tells you the disk has been removed, so remove it from the pool. Makes sense.

However, at very large scale (1000s of disks), you often see odd disk behavior that you have to account for. Like, multiple NVMe drives, on multiple nodes, all simultaneously disappearing and re-appearing within a few seconds of each other. I've seen SAS disks disappear/re-appear as well. And as long as the disks come back within the IO timeout amount of time, and are healthy after they come back, then the user wouldn't notice anything (maybe even more so if you're running Lustre on top of ZFS, with its own timeouts/retries on top of it?). Just to quantify this strangeness a little - I'm seeing over 1000 zed "EC_dev_remove" events across all our disks in the last month (of which only a handful are actual disk replacements). Had this "remove vdev on udev remove" been enabled, we would have seen tons of pools fault at the same time, causing mass downtime and admin intervention. Since we didn't have it enabled, there were no issues - all the IOs rode through the intermittent period of disks disappearing/re-appearing.

In short, I think you should make this configurable as a new autoremove pool property that is off by default. We already have an autoreplace property, so it makes sense to have an autoremove property too.

Also, I'm not in favor of the UNAVAIL -> REMOVED rename. "UNAVAIL" already has a lot of historical weight (just google "ZFS unavail"), and it might break scripts if we rename it. UNAVAIL also implies that the drive could come back to life later, which would certainly be accurate in the case of our disappearing/re-appearing drives.

@amotin (Member) commented Aug 26, 2022

@tonyhutter From that perspective, I think it would be reasonable for ZED to just not make too-quick extra moves, like kicking in spares, for a few minutes after a disk disappears, just in case it returns. Anyway, a rebuild will likely take hours, so a few minutes do not matter much. Pardon me if that logic is already there. But if the disk has really disappeared according to the block layer, it won't handle any new I/O, so hiding it from ZFS is not very productive. On the contrary, we have seen that disk references held by ZFS cause actual problems for NVMe and the surrounding subsystems, which cannot free their resources.

We also have a huge fleet of ZFS systems, with perhaps not multiple thousands but up to a thousand disks each, based on FreeBSD. And on FreeBSD a disk's disappearance is immediately reported to ZFS within the kernel, without any additional daemon participation. The daemon handles only replacement. And we feel good enough about that.

@ghost ghost added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Code Review Needed Ready for review and testing labels Aug 29, 2022
@ghost commented Aug 29, 2022

We found more issues with removal detection that need to be fixed.

@tonyhutter (Contributor) commented:

@amotin

From that perspective, I think it would be reasonable for ZED to just not make too-quick extra moves, like kicking in spares, for a few minutes after a disk disappears, just in case it returns.

Yeah, something like an "autoremove after N seconds" timer would be good. We would probably want the autoreplace timer to be longer than the IO timeout (plus the IO retries). The autoremove timer could be added in a future PR.

@ghost commented Aug 29, 2022

Physically removed vdevs are already supposed to transition to VDEV_STATE_REMOVED on IO errors even on Linux, but it doesn't work for several reasons.

zfs_check_media_change() for kernels 5.10+ has a bug: it always returns 0.

Beyond that, the bdev_check_media_change() KPI it uses doesn't work to check for a hot-plugged device. For NVMe, the check_events member of struct block_device_operations isn't implemented, so it can't work at all. For the drivers that do implement it, it can only work if the device identifies as removable (/sys/block/*/removable). For example, in a few systems I have handy, a USB flash drive is removable, but a SATA disk in a hot-swap enclosure attached to a SAS controller is not.

This would affect older kernels with check_media_change() as well. So in most cases the check to see if a device was removed just isn't working as intended.

UPDATE: even SAS disks report 0 for removable. This ever working would be very rare indeed.
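
Summarizing that limitation in code form (an illustrative sketch with a hypothetical helper name, not anything from the PR):

/*
 * Illustrative sketch, hypothetical helper; not the PR's code.
 * Why the media-change path rarely fires for hot-plugged vdevs on Linux.
 */
#include <linux/blkdev.h>

static bool
vdev_media_change_can_work(struct block_device *bdev)
{
	struct gendisk *disk = bdev->bd_disk;

	/* NVMe does not implement ->check_events, so no change is ever reported. */
	if (disk->fops->check_events == NULL)
		return (false);

	/*
	 * Drivers that do implement it only generate events for devices
	 * flagged removable (/sys/block/<dev>/removable), which most SAS
	 * and SATA disks in hot-swap bays are not.
	 */
	if (!(disk->flags & GENHD_FL_REMOVABLE))
		return (false);

	return (true);
}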

@ixhamza (Contributor, Author) commented Sep 1, 2022

@freqlabs I just added an updated version and split my work into multiple commits.
cc: @amotin

@ixhamza ixhamza force-pushed the NAS-115238 branch 6 times, most recently from 6be1cf4 to be4e34c Compare September 2, 2022 13:35
@tonyhutter (Contributor) commented:

I tested this a little using a draid1:4d:9c:1s pool of 9 NVMe drives. I tried powering an NVMe drive off and on using /sys/bus/pci/slots/<slot>/power, and all seemed to work as expected. I plan on doing some more real-world testing next week with spinning disks in a SAS enclosure.

@ixhamza (Contributor, Author) commented Sep 18, 2022

@tonyhutter Thank you for your description. I debugged the issue and it seems to be related to virtio disks. A vdev should have a devid ("/dev/disk/by-id") so it can be identified during removal. However, by default virtio disks do not seem to have a devid unless the serial attribute is hardcoded for the specified disk (systemd/systemd#17670). Even if we provide the serial attribute for virtio disks, zed won't be able to find the devid, since there is no handling for virtio disks in zfs_device_get_devid(). If you test this on real hardware or emulated disks, it should not be a problem. Normal vdevs won't be affected by this since they are identified by pool_guid and vdev_guid. Maybe I can dig more into it in the future.
@freqlabs

@ixhamza ixhamza force-pushed the NAS-115238 branch 2 times, most recently from 6a45dd2 to 4bd16a4 Compare September 20, 2022 14:34
@tonyhutter (Contributor) commented:

@ixhamza - I went back to testing on real-world NVMe devices and have some new results. The good news is that the cache device now correctly shows REMOVED on removal:

# zpool status
  pool: tank
 state: ONLINE
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    ONLINE       0     0     0
	spares
	  nvme1n1    AVAIL   

# echo 0 > /sys/bus/pci/slots/0/power 
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    REMOVED      0     0     0
	spares
	  nvme1n1    AVAIL   

The bad news is that the spares do not show REMOVED:

# echo 0 > /sys/bus/pci/slots/1/power 
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    REMOVED      0     0     0
	spares
	  nvme1n1    AVAIL   

errors: No known data errors

# file /dev/nvme1n1
/dev/nvme1n1: cannot open `/dev/nvme1n1' (No such file or directory)

I also verified that in both removal cases (the cache and the spare), udev remove events were generated. So ZED should be able to key off those removal events for spares as well, although it might require some additional code.

@ixhamza (Contributor, Author) commented Sep 20, 2022

@tonyhutter you are right, hotplugging of spares is indeed not supported in this patch, since a spare does not contain the label that we use to identify other vdevs.
@freqlabs - do you think this is something we could cover in future work? With reference to our earlier discussion, it does not seem very straightforward to handle in this PR.

@ghost commented Sep 20, 2022

Yes, we can address handling spares next.

@ixhamza (Contributor, Author) commented Sep 26, 2022

@behlendorf, @tonyhutter it would be appreciated if you could provide feedback or tag the relevant reviewers for this PR. These patches are critical for our TrueNAS BETA release.
@amotin, @freqlabs

@tonyhutter (Contributor) commented:

@ixhamza apologies for the delay, I'll take a look right now

Comment on lines 3076 to 3084
/*
 * Remove the specified vdev. Called from zed on udev remove event.
 */
int
zpool_vdev_remove_wanted(zpool_handle_t *zhp, const char *path)

@tonyhutter (Contributor) commented on Sep 26, 2022:

Both zpool_vdev_remove_wanted() and zpool_vdev_remove() are exported by libzfs and have the same signature:

/*
 * Remove the given device.
 */
int
zpool_vdev_remove(zpool_handle_t *zhp, const char *path)

Can you add some additional comments to differentiate what zpool_vdev_remove_wanted() does vs zpool_vdev_remove()?

@ixhamza (Contributor, Author) replied:

That's a very good point. zpool_vdev_remove_wanted() asynchronously removes the vdev whereas zpool_vdev_remove() does it synchronously. I will test it again to recall what issues I faced when using zpool_vdev_remove() and will add the comment accordingly.
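
For illustration, one possible shape for such a differentiating comment (hypothetical wording, not the text that landed in the PR):

/*
 * zpool_vdev_remove(): synchronously remove the given device from the
 * pool configuration.
 *
 * zpool_vdev_remove_wanted(): asynchronously flag the specified vdev
 * for removal, to be processed later; called from zed on a udev
 * remove event.
 */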

@tonyhutter (Contributor) replied:

👍

@ixhamza (Contributor, Author) replied:

Just updated the comment. Thanks!

@tonyhutter (Contributor) commented:

@ixhamza it looks good - I really don't have any more comments. I would recommend you squash your commits and use the (slightly modified) commit message:

zed: mark disks as REMOVED when they are removed

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

@ixhamza ixhamza force-pushed the NAS-115238 branch 2 times, most recently from 3d01ee7 to c357488 Compare September 26, 2022 23:38
@ixhamza (Contributor, Author) commented Sep 26, 2022

@tonyhutter Thank you so much for your review and for verifying it on your hardware. I have incorporated your feedback into this PR.

@ghost commented Sep 27, 2022

I've offered a few more suggestions in private for Ameer to incorporate, including fixing a nearby memory leak when "remove" events are ignored.

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
@ixhamza (Contributor, Author) commented Sep 27, 2022

Thank you @freqlabs. I have incorporated your suggested changes.

@behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label, Sep 27, 2022
@behlendorf behlendorf merged commit 55c1272 into openzfs:master Sep 28, 2022
andrewc12 pushed commits referencing this pull request to andrewc12/openzfs on Oct 2 and Oct 6, 2022, each carrying the same commit message:

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes openzfs#13797
@ixhamza ixhamza deleted the NAS-115238 branch August 2, 2023 17:17