
Issuing "zpool trim" locks up zfs and makes pool importable only in RO mode #16056

Open
gititdone876 opened this issue Apr 3, 2024 · 17 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@gititdone876

gititdone876 commented Apr 3, 2024

System information

Type Version/Name
Distribution Name Manjaro / EndeavourOS
Distribution Version 23.1.4 / ?
Kernel Version 6.6.23, 6.7.7, 6.8.2
Architecture x86_64
OpenZFS Version 2.2.3

Describe the problem you're observing

After ZFS was updated from 2.2.2 to 2.2.3, a pool which had autotrim=on locked up ZFS on import. There was iowait activity and ZFS could not be killed, so I had to do a hard shutdown. I somehow managed to turn off autotrim, scrubbed with no issues and used the pool for a while, until I accidentally issued a manual zpool trim, which led to the same problem. After a restart, zpool import would not import the pool and would not exit; this time there was no iowait activity, but I still had to do a hard shutdown. This was on Manjaro with kernel 6.7.7-1. I tried on a different machine with EndeavourOS, kernel 6.6.23-1-lts and zfs-2.2.3: same thing. Importing read-only worked, but several combinations of kernels 6.6.23-1-lts and 6.8.2-arch2-1 with zfs-2.2.3 and the latest from git had the exact same problem: zpool import does not exit and does not import. While the pool was imported read-only, zpool get all showed capacity 0%, which is not true. Files were readable, including partial ones that were being written during the hang.

Building zfs-2.2.2 and running kernel 6.6.23-1-lts on the EndeavourOS machine solved the problem. As soon as I ran zpool import, the TRIM operation that had failed previously went ahead and finished successfully, and the pool is now importable and writable with zfs-2.2.3 and kernel 6.7.7-1 on the Manjaro system. Scrubbed with no errors. I'm guessing the issue is with zfs-2.2.3, not a specific kernel.

The pool is on a single external USB SMR drive (WD50NDZW-11BCSS1).

Describe how to reproduce the problem

Issue a zpool trim, or have autotrim=on, with an SMR HDD connected via USB on zfs-2.2.3.
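In other words, roughly (the pool name below is just a placeholder; either path triggers the hang on zfs-2.2.3 with the USB SMR drive attached):

zpool trim tank
# or:
zpool set autotrim=on tank
zpool export tank && zpool import tank   # the hang then occurs during import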

EDIT: 2024-04-08 Here come the logs

Some of the errors might be identical, sorry about that. I just put everything between the problematic zpool trim and the eventual resolution in one file in chronological order. I don't remember when I tried importing as normal and when as read-only. Check the output of zpool history -i for an overview.

oopsie.log
zpool_history.log
zpool_get_all.log

@gititdone876 gititdone876 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Apr 3, 2024
@gititdone876 gititdone876 changed the title Autotrim and manually issuing "zpool trim" locks up zfs and makes pool importable only in RO mode Issuing "zpool trim" locks up zfs and makes pool importable only in RO mode Apr 3, 2024
@IvanVolosyuk
Contributor

Probably SMR is to blame.

@gititdone876
Author

Probably SMR is to blame.

@IvanVolosyuk Why do you think so?

@IvanVolosyuk
Contributor

ZFS on SMR drives has a lot of issues; it is generally recommended to avoid those drives. Search Google for 'zfs smr' for the full picture, e.g. https://www.tspi.at/2022/11/26/whynosmrforzfs.html

SMR drives live their own life, and it might be that it wasn't the change of ZFS version that caused it to start working again, but just the passage of time and the drive's internal SMR logic making progress. It might be useful to swap between ZFS 2.2.2 and 2.2.3 several times to see if it is indeed related to some change there. How easily repeatable the problem is would be critical information.

When the pool fails to import, it might actually be waiting for the drive to respond to the issued commands, so it might simply take a long time for the drive to answer. Look at the dmesg logs for disk command failures / timeouts.

There are two changes in 2.2.3 which look relevant:
08fd5cc
07cf973
You can try to build 2.2.3 with those changes reverted to see if it is the cause.
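A rough sketch of what that could look like (not exact build steps for any particular distro; packaging/DKMS/install steps omitted, and the short hashes are the two commits above):

git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.2.3
git revert --no-edit 08fd5cc 07cf973
sh autogen.sh && ./configure && make -s -j$(nproc)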

@usaleem-ix
Contributor

I have tried reproducing the issue on non-SMR drives but I am not able to do so. I agree with what @IvanVolosyuk has suggested. Trying 2.2.3 without the two commits mentioned might help narrow down the issue.

@gititdone876
Author

gititdone876 commented Apr 6, 2024

@IvanVolosyuk @usaleem-ix I don't have experience with tracking down bugs. What should I do before I try it again to ensure useful information is gathered, and where would it most likely be logged?

When the pool fails to import, it might actually be waiting for the drive to respond to the issued commands, so it might simply take a long time for the drive to answer. Look at the dmesg logs for disk command failures / timeouts.

I actually make a point of keeping an eye on these drives by watching CPU and IO activity, especially during writes. I also listen to them and touch them to see what's going on. SMR seems to be implemented differently in different drives and across manufacturers. In this particular case I'm absolutely sure there was no activity that I might have missed. After issuing the TRIM I heard the heads park as ZFS locked up. During the subsequent import attempts, the drive was not trying to read anything past the point where the import process hung.

There might be some interaction between what ZFS tried to do to make the drive complete a TRIM and the drive's firmware. I suspect the subsequent imports did not succeed because it might have looked like the drive was in the middle of an active TRIM operation (or something else) when it actually wasn't. Version 2.2.3 refused to proceed, while version 2.2.2 just imported and completed the pending TRIM. Now, was it something in the firmware that was left in a weird state, or was it something with the file system on disk?

I'm aware of the drawbacks of SMR. I'm only using such disks for sequential writes of mostly big files, and I have found that TRIM helps a lot once the filesystem has had time to open up all kinds of inconvenient holes due to repeated file deletions and writes. Drive-managed SMR is supposed to be completely transparent, so I don't get how ZFS could have a problem with this particular technology (excluding use cases where SMR is practically unusable regardless of the file system).

@IvanVolosyuk
Contributor

I would have started by trying to consistently reproduce this with ZFS 2.2.3:
1. Enable autotrim, since you said it was working fine with ZFS 2.2.2.
2. Boot with ZFS 2.2.3 and observe whether you can no longer import the pool. Wait about 10 minutes to see if it is just being slow. Examine dmesg for errors.
3. Boot with ZFS 2.2.2 and observe whether you can import the pool normally.
4. Repeat (2, 3) several times; 5 times will do :)
This will confirm that the problem is indeed in the 2.2.3 ZFS version.

Capturing stack traces of the ZFS kernel threads and zpool import can be useful for investigating a possible deadlock:
$ for i in /proc/[0-9]*; do read -r -n 1 z < $i/comm; if [ "$z" = "z" ]; then echo === $(cat $i/comm) === ; cat $i/stack; fi; done
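The same thing, reformatted for readability (run it under bash; reading /proc/<pid>/stack needs root):

# dump kernel stacks of tasks whose name starts with "z" (zfs/zpool/z_* threads)
for i in /proc/[0-9]*; do
    read -r -n 1 z < "$i/comm"        # first character of the task name
    if [ "$z" = "z" ]; then
        echo "=== $(cat "$i/comm") ==="
        cat "$i/stack"
    fi
done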

Reverting those 2 commits and trying ZFS 2.2.3 without them can also help further narrow down the problem.

@gititdone876
Author

Will do. It will be in a few days though. I don't have a place to dump the drive's contents right now.

If anyone has a WD Elements or MyPassport 2.5" 5TB drive they don't mind messing with, please go ahead in the meantime. This particular one is a MyPassport WD50NDZW-11BCSS1, the slower 4800 RPM variant; some Elements and MyPassport units have a 5400 RPM variant of this drive. If it's a problem with ZFS, it should reproduce with other drives as long as they are trimmable, and possibly only with an HDD rather than an SSD, but who knows.

BTW, I ran that line and learned that zsh and bash apparently work differently when it comes to the "read" part; zsh seems to ignore the "-n 1 z", maybe. It's unclear to me whether "read" is a builtin or a standalone program in either case.

@kiler129

kiler129 commented Apr 7, 2024

I think I hit the same problem as @gititdone876. It appears the problem started somewhere in a recent version (I'm not 100% sure which one, as I was doing updates over the last few days). Currently running:

# zfs --version
zfs-2.2.3-1~bpo12+1
zfs-kmod-2.2.3-1~bpo12+1

# lsb_release  -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)
Release:	12
Codename:	bookworm

# uname -a
Linux <redacted> 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

First I started getting kernel-reported hangs for txg_sync and userland apps using the pool. The pool seemingly worked and was readable and writable to a point, which was deceptive. However, attempting to execute a scrub, or sometimes even status, resulted in the kernel hanging. Every time the pool was imported during boot I got a message similar to this one:
(screenshot of the import-time kernel error, not reproduced here)

Trying to narrow down the issue, I got the system to hang countless times. Since the error was pointing at an SSD, I just replaced the SSD with a similar one, copying it over with dd externally. After rebooting, the error on import was gone. I executed a scrub, which started and then quickly froze at this stage:

# zpool status -v
  pool: nvr4t
 state: ONLINE
  scan: scrub in progress since Sat Apr  6 22:37:56 2024
	4.42T / 17.9T scanned at 3.05G/s, 28.1G / 17.9T issued at 19.4M/s
	0B repaired, 0.15% done, 11 days 03:56:52 to go
config:

	NAME        STATE     READ WRITE CKSUM
	nvr4t       ONLINE       0     0     0
	  sdb       ONLINE       0     0     0
	special
	  sdc       ONLINE       0     0     0

errors: No known data errors

# zpool list -v
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nvr4t      20.5T  17.9T  2.56T        -         -     0%    87%  1.00x    ONLINE  -
  sdb      20.0T  17.7T  2.29T        -         -     0%  88.5%      -    ONLINE
special        -      -      -        -         -      -      -      -         -
  sdc       466G   187G   277G        -         -    45%  40.2%      -    ONLINE

Looking at the kernel log, the dreaded error is back:

[Sat Apr  6 22:36:12 2024] Initializing XFRM netlink socket
[Sat Apr  6 22:38:02 2024] sd 7:0:0:3: [sdc] tag#117 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Sat Apr  6 22:38:02 2024] sd 7:0:0:3: [sdc] tag#117 Sense Key : Illegal Request [current]
[Sat Apr  6 22:38:02 2024] sd 7:0:0:3: [sdc] tag#117 Add. Sense: Invalid command operation code
[Sat Apr  6 22:38:02 2024] sd 7:0:0:3: [sdc] tag#117 CDB: Write same(16) 93 08 00 00 00 00 12 00 29 b0 00 00 00 50 00 00
[Sat Apr  6 22:38:02 2024] critical target error, dev sdc, sector 302000560 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 2
[Sat Apr  6 22:38:02 2024] zio pool=nvr4t vdev=/dev/sdc1 error=121 type=6 offset=154623238144 size=40960 flags=524480
[Sat Apr  6 22:42:42 2024] INFO: task txg_sync:1143 blocked for more than 120 seconds.
[Sat Apr  6 22:42:42 2024]       Tainted: P           OE      6.1.0-18-amd64 #1 Debian 6.1.76-1
[Sat Apr  6 22:42:42 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr  6 22:42:42 2024] task:txg_sync        state:D stack:0     pid:1143  ppid:2      flags:0x00004000
[Sat Apr  6 22:42:42 2024] Call Trace:
[Sat Apr  6 22:42:42 2024]  <TASK>
[Sat Apr  6 22:42:42 2024]  __schedule+0x34d/0x9e0
[Sat Apr  6 22:42:42 2024]  schedule+0x5a/0xd0
[Sat Apr  6 22:42:42 2024]  schedule_timeout+0x94/0x150
[Sat Apr  6 22:42:42 2024]  ? __bpf_trace_tick_stop+0x10/0x10
[Sat Apr  6 22:42:42 2024]  io_schedule_timeout+0x4c/0x80
[Sat Apr  6 22:42:42 2024]  __cv_timedwait_common+0x12f/0x170 [spl]
[Sat Apr  6 22:42:42 2024]  ? cpuusage_read+0x10/0x10
[Sat Apr  6 22:42:42 2024]  __cv_timedwait_io+0x15/0x20 [spl]
[Sat Apr  6 22:42:42 2024]  zio_wait+0x136/0x2b0 [zfs]
[Sat Apr  6 22:42:42 2024]  spa_sync+0x57/0xfd0 [zfs]
[Sat Apr  6 22:42:42 2024]  ? vdev_get_stats_ex+0x250/0x330 [zfs]
[Sat Apr  6 22:42:42 2024]  ? _raw_spin_unlock+0xa/0x30
[Sat Apr  6 22:42:42 2024]  ? spa_txg_history_set.part.0+0xa7/0xd0 [zfs]
[Sat Apr  6 22:42:42 2024]  ? spa_txg_history_init_io+0x113/0x120 [zfs]
[Sat Apr  6 22:42:42 2024]  txg_sync_thread+0x1fa/0x390 [zfs]
[Sat Apr  6 22:42:42 2024]  ? txg_fini+0x260/0x260 [zfs]
[Sat Apr  6 22:42:42 2024]  ? spl_taskq_fini+0x80/0x80 [spl]
[Sat Apr  6 22:42:42 2024]  thread_generic_wrapper+0x5a/0x70 [spl]
[Sat Apr  6 22:42:42 2024]  kthread+0xda/0x100
[Sat Apr  6 22:42:42 2024]  ? kthread_complete_and_exit+0x20/0x20
[Sat Apr  6 22:42:42 2024]  ret_from_fork+0x22/0x30
[Sat Apr  6 22:42:42 2024]  </TASK>

It appears that something is truly broken in the on-disk data, as I replaced the disk in question and the error still persists. None of the disks are SMR. The HDD is an ST22000NM001E-3HM103, while the current SSD is a Samsung SSD 870 EVO 500GB (previously an 860 EVO).

@IvanVolosyuk
Contributor

IvanVolosyuk commented Apr 7, 2024

This actually looks like something related to the recent async TRIM changes: 08fd5cc.

@IvanVolosyuk
Contributor

IvanVolosyuk commented Apr 7, 2024

The Samsung 860/870 EVO have TRIM-specific quirks in Linux (drivers/ata/libata-core.c):

/* devices that don't properly handle queued TRIM commands */
ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM | ATA_HORKAGE_NO_NCQ_ON_ATI

I wonder if we accidentally bypassed some of the quirk checks now.

@kiler129

kiler129 commented Apr 7, 2024

@IvanVolosyuk Interesting find, I need to look around to see if I have a spare SSD that isn't one of these models. Linking the source for posterity: https://github.com/torvalds/linux/blob/f2f80ac809875855ac843f9e5e7480604b5cbff5/drivers/ata/libata-core.c#L4208

I downgraded ZFS to 2.2.2 and the same DISCARD lockup continues:

[   96.712287] Initializing XFRM netlink socket
[  123.582755] sd 8:0:0:3: [sdc] tag#68 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  123.582779] sd 8:0:0:3: [sdc] tag#68 Sense Key : Illegal Request [current]
[  123.582788] sd 8:0:0:3: [sdc] tag#68 Add. Sense: Invalid command operation code
[  123.582797] sd 8:0:0:3: [sdc] tag#68 CDB: Write same(16) 93 08 00 00 00 00 11 80 3b 68 00 00 00 58 00 00
[  123.582806] critical target error, dev sdc, sector 293616488 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2

Looking at the git release history, the TRIM performance improvement was introduced in 2.2.3 (#15843). Excuse my stupid questions, but would the ZIL contain a broken discard request in this case, or does downgrading to 2.2.2 and seeing the same behavior prove that it's not the new TRIM changes?


Edit:
Downgraded even lower:

# zfs --version
zfs-2.2.0-1~exp1
zfs-kmod-2.2.0-1~exp1

and the result is the same as above. I wonder if 2.2.3 somehow damaged the on-disk structure. I cannot try 2.1.x, as this pool has been upgraded with vdev_zaps_v2, introduced in 2.2.0.
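For what it's worth, the state of that feature on this pool can be checked with something like:

zpool get feature@vdev_zaps_v2 nvr4t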

For now I have come back to 2.2.2, disabled autotrim on the zpool, and am running a scrub just in case. I bet that if I trigger zpool trim manually now it will crash in the same way. It may be some kernel change, so I can try downgrading the kernel when the scrub finishes (~1 day).

@IvanVolosyuk
Contributor

There might be two separate bugs here, one from the original poster and one from you. The change was indeed in 2.2.3, so if your problem is reproducible before that, it should be something else. You should probably create a separate issue, as your case looks quite different from the original poster's.

@gititdone876
Author

Turns out I have logs for the zpool trim incident, and also for the resolution with 2.2.2 on the other machine. I'll post them in a few hours, when I get a chance. Unfortunately the logs for the initial hang, caused by autotrim being on, have been deleted.

I'm not a coder, but I've been looking around some kernel code and learning more about ZFS since this happened, trying to understand how the Linux storage stack is organized and how ZFS grafts itself into it, USB-attached drives included. If someone can point me to a good resource (preferably with pictures) detailing how things are organized, that would be awesome. I've started watching this: ZFS TRIM Explained by Brian Behlendorf. It turns out the TRIM progress is stored on disk; I guess that's why, when I imported the pool on the second machine, 2.2.2 finished what 2.2.3 couldn't complete.

Should I include the logs in this comment, or edit my original post?

@robn
Member

robn commented Apr 8, 2024

@gititdone876 Should I include the logs in this comment, or edit my original post?

Probably doesn't matter too much; personally I'd add to the original post with a line "edit 2024-04-xx adding logs as requested in [link to comment]". But I don't think it matters so much as long as we can all find it!

Turns out I have logs for the zpool trim incident. Also for the resolution with 2.2.2 on the other machine.

Excellent; being able to compare will be gold.

For what its worth, zpool trim -c poolname should totally cancel a running TRIM, including removing the stored progress data. From there, zpool trim poolname should start the entire process over from scratch, issuing new TRIM IO for all the "gaps" on the disk. Granted, the disk hardware itself may "remember" to some degree, changing the characteristics, but all other things being equal that should be the whole reproducer. Not that I'm recommending crashing your pool repeatedly!
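Concretely, something along these lines (with your pool name) would be the whole loop; zpool status -t shows the per-vdev TRIM state:

zpool trim -c poolname     # cancel the running TRIM and drop the stored progress
zpool trim poolname        # start a fresh TRIM over all free space
zpool status -t poolname   # per-vdev TRIM state / progress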

Incidentally, 2.2.3 (via 08fd5cc via #15843) does introduce buggy TRIM error handling on some kernel versions. I've spent the morning chasing that down and am just finishing up my testing; I should be posting a PR within the next 24h. However, if I'm right, it won't "fix" issues with the TRIM itself; rather, it'll just stop the pool hanging, so I'm not sure it actually properly resolves this issue. I only mention it to say that I'm working on it :)

@gititdone876
Author

@robn @usaleem-ix Added logs in original post.

@robn
Member

robn commented Apr 12, 2024

@gititdone876 thank you. "oopsie" :)

So, I'm not totally sure what's happening, because this particular kernel version (6.7) is responding in a way I can't quite see my way through. Specifically, kernel: zio ... type=6 shows that the async discard path is being used, and OpenZFS did receive the error, so it should have returned the IO properly and not hung.

If you can be bothered reproducing it again on 6.7 + 2.2.3, I'd love it if you could post the contents of the following two files just after the TRIM errors are thrown (while the pool is stuck, if you like):

  • /proc/spl/kstat/zfs/POOL_NAME/iostats
  • /sys/block/sda/queue/discard_granularity

The first will show whether the OpenZFS TRIM subsystem is receiving the errors at all. If so, then it shouldn't be hanging, though it might be the earlier failure causing the hang (that is, this issue might not be entirely about TRIM). If it's not getting the errors, though, #16070 should fix that part (#16081 for 2.2). Even if that's not it, I'd still like to see it again with that patch applied; it reorganises the whole error response path, so it shouldn't be possible for anything to fall through the cracks.

Meanwhile, if discard_granularity is set to 0, then this device doesn't even support TRIM, and OpenZFS shouldn't have even allowed you to issue it. #16068 is the fix for that (2.2 TBD, but it'll be the same patch).
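Something like this, run from another shell while the import/TRIM is stuck, should capture both (substitute the pool name and whichever /sys/block device the USB disk actually is):

cat /proc/spl/kstat/zfs/POOL_NAME/iostats        # shows whether the TRIM subsystem saw the errors
cat /sys/block/sda/queue/discard_granularity     # 0 means the device reports no TRIM support
dmesg | tail -n 50                               # recent DISCARD / zio errors for context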

"Crash your pool, apply these patches then try to crash it again" is a lot to ask, so I understand if you would rather not. Avoiding TRIM until 2.2.4 is out (which will have all these fixes) is also a good option for you from here, and then we can go around again and see if there's more to all this.

Thanks for the info, it is helping a lot!

@gititdone876
Author

Meanwhile, if discard_granularity is set to 0, then this device doesn't even support TRIM, and OpenZFS shouldn't have even allowed you to issue it

This is what lsblk -DS says:

NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO HCTL    TYPE VENDOR MODEL                REV      SERIAL     TRAN
sdb         0        4K       4G         0 1:0:0:0 disk WD     WDC WD50NDZW-11BCSS1 01.01A01 1234567890 usb

The FW version makes an appearance, I think. I have a Seagate SMR disk that shows zeros for DISC-GRAN and DISC-MAX. BTW, I had to manually change SCSI settings to be able to trim a very similar WD 5TB disk with fstrim, but that one has BTRFS on it. Maybe there's a kernel reason why it's not enabled by default? I guess that because ZFS replaces a lot of the storage stack it's able to manage things like TRIM on its own, and this is the only difference between the two cases. I've just recently started learning about the more technical and low-level aspects of storage, so I might be wrong here. ZFS is to blame for the new rabbit hole; it's just such a cool design from a user's perspective, and I'm sure it's fascinating underneath. I'm in the process of migrating a horrendous storage practice entirely to ZFS, and this is actually why I'm not too keen on crashing this pool on this drive. I'm using it to move things around at the moment. I could try to mess with it after I'm done.
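For reference, I believe the SCSI setting in question was the SCSI disk provisioning_mode knob, something like this (sdb as an example device; whether unmap is the right mode depends on the USB bridge, and the setting doesn't persist across replugs):

grep . /sys/block/sdb/device/scsi_disk/*/provisioning_mode    # often "full" by default
echo unmap | sudo tee /sys/block/sdb/device/scsi_disk/*/provisioning_mode
lsblk -DS /dev/sdb                                            # DISC-GRAN / DISC-MAX should now be non-zero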

I did trim this pool one more time on the machine with zfs-2.2.2 since this particular problem got resolved, and it went fine. I also switched to a 6.6 LTS kernel on the Manjaro machine, since I think I read that support for the newest kernels is not perfect yet. Should I not be trimming at all, even with the previous ZFS version from before the async TRIM change was introduced?

I'm keeping an eye on this, while I learn more. Let me know if I can provide any more relevant information.
