Data corruption after TRIM #14513
n.b. It'd be useful to know the model and make of the SSD, and firmware version, as some SSDs have had... questionable TRIM implementations. (Not saying ZFS can't have a bug here, just useful as a data point if it turns out that it only breaks on WD SSDs or Samsung or something.)

e: It'd also be interesting if you have any of the […]. It could, of course, also be native encryption being on fire, because why not.
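(For reference, a hedged way to pull that information off a SATA SSD with smartmontools; /dev/sda here is just a placeholder for whichever device backs the pool:)

    smartctl -i /dev/sda   # prints Device Model, Serial Number and Firmware Version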
I have used it with this file system for more than a year and never got an error. It went without trimming, because I only noticed today that […].

The model of the drive is WDC WDS120G2G0A-00JH30. It's worth noting that there is a swap partition on the drive (sda3) which has been used extensively (swappiness 60), but I never got a kernel panic or any other memory error that would indicate the drive fails under the swap partition. Nor did I get an error on the BTRFS boot partition.
Of course, I'm not trying to suggest you did something wrong, or that it's necessarily anything else at fault; I'm just trying to figure out what makes your setup different from the many other people who haven't been bitten by this. This suggests TRIM support on that drive might be in The Bad Place(tm) sometimes, and I don't see an existing erratum in Linux's libata for it? It's unclear why btrfs might not be upset about that, though; maybe it's not using enough queued IO, or not doing writes at the same time? You could test this, I think, by forcibly telling Linux to apply that erratum to that drive. Something like […].
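(The original snippet is missing above; as a hedged sketch, the libata.force kernel parameter can apply per-device quirks, e.g. the noncqtrim flag that turns off queued DSM TRIM. The 1.00 port/device ID below is a placeholder and would need to match the drive's actual ATA port:)

    # on the kernel command line: disable queued TRIM for ATA port 1, device 0
    libata.force=1.00:noncqtrim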
I've been hit by this issue as well.
I also tried setting the queue depth to 1, without improvement. I can repro it every time by trimming one device and then scrubbing. That device will show 10 to 30 checksum errors, which are fixed by the scrub.
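(For reference, a hedged sketch of that repro sequence; tank and sda are placeholders for the actual pool and device, and the queue depth knob is the generic SCSI sysfs one:)

    echo 1 > /sys/block/sda/device/queue_depth   # limit the command queue, as tried above
    zpool trim tank sda                          # trim a single device in the pool
    zpool wait -t trim tank                      # wait for the TRIM to finish
    zpool scrub tank
    zpool status -v tank                         # checksum errors show up here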
You're absolutely sure you're actually running 2.2.2? Because #15395 would be the obvious reason to suspect something going foul here, and that's in 2.2.0+.
Yes, I'm sure.
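(For completeness, a hedged way to double-check which module is actually loaded, since packaged userland and kernel-module versions can drift apart:)

    zfs version                   # prints both the userland and kernel module versions
    cat /sys/module/zfs/version   # version string of the currently loaded module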
#15588 becomes my default guess for a thing to try, then, though I'd also test that you can't reproduce this with a non-ZFS configuration, to be sure it's not just the drive doing something ridiculous with TRIM.
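(A rough sketch of such a non-ZFS check, assuming a spare partition /dev/sdX1 and a scratch mount point are available; all the paths are placeholders:)

    mkfs.ext4 /dev/sdX1
    mount /dev/sdX1 /mnt/trimtest
    cp -a /some/test/data /mnt/trimtest/keep     # data to verify later
    cp -a /some/test/data /mnt/trimtest/junk     # data to delete, to create trimmable space
    (cd /mnt/trimtest/keep && find . -type f -exec sha256sum {} + > /root/before.sums)
    rm -rf /mnt/trimtest/junk
    fstrim -v /mnt/trimtest                      # discard the freed blocks
    (cd /mnt/trimtest/keep && sha256sum -c /root/before.sums)   # mismatches would point at the drive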
Hmm, okay. I'll wait for the next minor release that will include it. This is a production system, so I can't test it with non-ZFS filesystems at the moment, though I could buy a replacement drive from a different manufacturer and swap out one of the two drives in the mirror vdev. Is there any unofficial list of the best SSDs to use with ZFS? I know Samsung SATA drives have NCQ TRIM issues; should I get a Micron one? I'm a bit lost.
I wouldn't expect that fix to go into a minor release, though I could be wrong. I'm not really aware of any SSDs that should have visible issues with ZFS where the underlying issue is an interaction with the SSD. #14793 has a number of people getting upset about certain models of SSD, but a number of them also reported that getting a better PSU made the issues disappear, so who knows. There's also #15588 complicating matters, but that seems more controller- and memory-pressure-specific, not drive-specific.
I highly doubt #15588 will have anything to do with it - it doesn't change TRIM.
I suspect this may be down to this model of drive's firmware, so I've ordered two commonly used drives to replace them. We'll see whether this fixes the issue in the next few days.
Update: after switching to 870 Evos the issue is no longer reproducible. No corruption or checksum errors on TRIM. |
I've got the same error on nearly the same SSDs. In my case it's a Western Digital WDS240G2G0A-00JH30 and a SanDisk SDSSDA-240G (they are both the same drive, one is just rebranded). When debugging I tested this error with a lot of different drives, and it only seems to affect these two. One notable difference between these two drives and all the other ones I've tested is that only these drives use DRAT TRIM (deterministic read after TRIM). All the other ones I have tested either claim to have non-deterministic TRIM or RZAT TRIM (return zero after TRIM). I'm wondering if maybe ZFS has a problem with DRAT TRIM SSDs, or if it's just the SanDisk / WD SSDs behaving badly.

I also verified through testing that it's not the PSU / SATA controller / SATA cables / RAM / CPU causing these issues. I've tried three different machines with different hardware, and the error always manifests only with those two SSDs. The S.M.A.R.T. values are all OK, and the error only occurs after deleting some files, then running a trim, and then a scrub. Without trim, no error occurs at all.

System info:
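(A hedged way to check which TRIM behaviour a drive advertises; hdparm -I prints the relevant IDENTIFY bits, with /dev/sda again just a placeholder:)

    hdparm -I /dev/sda | grep -i trim
    # "Deterministic read data after TRIM"  -> DRAT drive
    # "Deterministic read ZEROs after TRIM" -> RZAT drive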
cf. #15395 in 2.1.14. |
I am now using TRIM without any issues, on the same drive that I originally opened the issue about. However, there are some differences since then:
I don't know how these could affect the bug, but I thought I'd throw this info in here in case someone finds it useful.
I have now retested it.
If you've updated to 2.1.14 or newer and are no longer seeing this, it was probably caused by #15395, which has been resolved.
I retested it on 2.2.2-4 in Debian 13 and the issue is still present. |
That's unfortunate; version 2.2.2 does include the fix for #15395.
It's unlikely that #15395 caused it, because I did not put the system under any significant write workload while trimming.
I am facing the exact same situation using a Marvell-based SanDisk SSD Plus. There had not been any issues until I enabled auto-trim, and the issues only occurred after a TRIM operation, as can be seen in the zpool history. In terms of my setup, I am also using ashift=9 and native encryption, and am running NixOS. However, I will also try to raise the ashift value to 12, in case the sector size is being misreported.

Update: apparently no issues so far when invoking trim manually.
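(A hedged sketch of how to check what the drive reports versus what the pool actually uses; note that ashift is fixed per vdev at creation time, so moving to 12 means recreating the pool or adding new vdevs. tank and the device paths are placeholders:)

    lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda     # logical vs physical sector size the drive reports
    zdb -C tank | grep ashift                  # ashift the existing vdevs were created with
    zpool create -o ashift=12 tank /dev/sda2   # force 4K sectors when creating a new pool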
System information
Describe the problem you're observing
Data corruption in some files and a lot of checksum errors.
Describe how to reproduce the problem
I ran zpool scrub on a simple zpool, consisting of a single partition on a 120 GiB SSD. It finished without any errors (neither data nor checksum errors). Then I ran zpool trim on the pool. A few minutes later the status report showed 2 checksum errors, then a few more. (I'm not sure whether the first error appeared before or after the TRIM.) I ran zpool trim again, because the first time it finished quite fast despite there being much space left to trim; a few more errors appeared. (Probably not related to running TRIM again, because the errors have been increasing constantly since then.) Then I thought that maybe the first scrub, before the trim, had somehow missed the errors, so I ran zpool scrub again. The errors grew to 100-300 in the first few seconds, so I stopped it. At this point I couldn't use zfs send to save a snapshot, so I used rsync to back up the data. It reported I/O errors for a few files.

It's an SSD which reports a 512-byte block size, so I used ashift=9. Later I learned that most SSDs actually use larger blocks but report 512; I don't know whether that affected the trim.

The sda1 BTRFS partition contains ~2 GiB of data (kernels, initramfs, and boot-related stuff). I ran a TRIM and a btrfs scrub on it. No errors were found.

A SMART extended self-test was running when I first ran the zpool scrub (and possibly when I ran zpool trim the first time, but it may have finished by then).

Native encryption and compression are active.
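(As an aside on the zfs send / rsync step above, a hedged sketch of how the affected files can be listed and the error counters cleared once good copies are restored; tank is a placeholder pool name:)

    zpool status -v tank    # with permanent errors present, lists the affected file paths
    zpool clear tank        # reset error counters after restoring or deleting the damaged files
    zpool scrub tank        # re-verify the whole pool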
Partition scheme:
Output of zpool status before submitting the issue:

Include any warning/errors/backtraces from the system logs
No errors in dmesg.