Input/output errors and QAT dc_fails #9784
Normally when the QAT request fails, ZFS falls back to the CPU. This is usually very rare on our production systems, maybe like 1 : 100M requests, and an error is logged to dmesg by the QAT driver:
My guess is that in your case the data for some reason is corrupt and cannot be decompressed by the QAT or the CPU gzip implementation; the checksum (post-compression) can still be valid. Just a guess. I hope someone else can chime in on that.
If the data is corrupt then that is also concerning, since there's no previous indication of corruption. This is a new ZFS volume and the data was copied from a backup with rsync. I can check the secondary backups to see if restoring these files has any effect.
Restoring the file from the secondary backup has worked fine, so I guess the data was corrupted somehow.
That is worrisome. I hope someone can chime in and elaborate on what could be causing such corruption. Are you sure the hardware is all OK? No bus or memory errors?
I've just looked over last week's system log files, and there aren't any messages about memory errors or other errors. All the scrubs, consistency checks, patrol reads, and verifications have completed successfully. But in the previous week's log files, from when I was copying the data, I do see several "page allocation failure" messages from processes like swapper, rpciod, and nfsd. That sounds concerning, but I didn't really find anything negative about it out there; it seems to be more of a warning. For example,
Any idea what that's about?
Still getting read errors from bacula when it's trying to read (back up) the same files I just restored (from its backup from several months ago), the "secondary backup" I was speaking about previously. For example, rsync gets an I/O error as well:
But it's not all of the files. For example, the "X" file from yesterday that I restored from backup is still reading fine. dc_fails is up to 8905 now… nothing new in dmesg or the syslog.
I'm trying to restore the same files from the backup again, and I notice that dc_fails is steadily going up; it's over 9000 now and increasing. So at this point it seems to be a problem with ZFS sending data to the QAT card, as well as reading data with the QAT card. Here is one of the files causing problems, restored from the backup to our web server (they are all 2.4 GB, sorry): O_nivara/canu2nd/trimming/O_nivara.ovlStore/0022. @luki-mbi, perhaps you could see if your QAT setup has problems with this file? I could try stopping the qat_service too and see what happens, but that would slow down the system too much, so it isn't a viable workaround.
Seems to work fine for me.
No data errors, and performance is decent too for a single thread (~400 MB/sec).
So our QAT card is broken then, I guess? Are you also using ZFS 0.8.2? You compiled it from source using --with-qat, right?
When qat_compress() fails to allocate the required contiguous memory it mistakenly returns success. This prevents the fallback software compression from taking over and (un)compressing the block. Resolve the issue by correctly setting the local 'status' variable on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL to ensure qat_compress() always fails safe to guard against any similar bugs in the future. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9784
@AGI-admin I've identified the issue and have opened #9788 with the fix. The change is straightforward, but unfortunately I don't have easy access to a QAT accelerator so I wasn't able to test it myself. The issue is that when memory on the system becomes highly fragmented, some QAT-related memory allocations may fail. Those failures were not being reported to the higher layers of ZFS, which prevented the software fallback from handling the compression. This fits with the logs you posted, which clearly show that memory was highly fragmented on your system when writing these blocks, i.e.
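The fragmentation mentioned above can be observed directly on a running system: /proc/buddyinfo lists, per memory zone, how many free blocks of each order remain, and the contiguous buffers qat_compress() needs come from the higher orders. A hedged sketch in plain shell (the field positions follow the standard buddyinfo layout; `frag_summary` is a hypothetical helper, not part of ZFS or the QAT driver):

```shell
#!/bin/sh
# Hedged sketch: summarize physical-memory fragmentation from /proc/buddyinfo.
# qat_compress() needs physically contiguous (high-order) buffers; when the
# free-block counts for order >= 4 drop toward zero, such allocations start
# failing and "page allocation failure" messages like the ones above appear.
frag_summary() {
    # Each line looks like:  Node 0, zone   Normal  N0 N1 ... N10
    # where Nk is the count of free blocks of order k (fields 5..15).
    awk '{ printf "node %s zone %s free blocks (order 4..10):", $2, $4;
           for (i = 9; i <= NF; i++) printf " %s", $i; print "" }' "${1:-/proc/buddyinfo}"
}
[ -r "${1:-/proc/buddyinfo}" ] && frag_summary "$@"
```

All-zero counts on the right-hand side of the output would be consistent with the allocation failures described here.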
Thanks @behlendorf … but is it normal to see dc_fails increasing without any "page allocation failure" messages? Those page allocation failures were happening a couple of weeks ago, but not lately, while these I/O errors are still happening…
The allocation failure messages might have been suppressed by the kernel, but yes, I'd have expected to see at least one. You may also want to check the QAT accelerator firmware counters; they may indicate there's an issue on the hardware side. If the QAT were to return success when in fact there was a failure, you'd see a similar issue.
We don't have any of those counters; /sys/kernel/debug is empty:
You'll need to mount debugfs first: `mount -t debugfs none /sys/kernel/debug`.
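Once debugfs is mounted, the QAT driver exposes per-device counter files there (fw_counters, and on some driver versions cnv_errors, as shown later in this thread). A small sketch to dump whatever is present; the qat_* directory names are examples from this thread, not guaranteed on every setup:

```shell
#!/bin/sh
# Dump QAT counter files found under debugfs. The qat_*/fw_counters and
# qat_*/cnv_errors paths come from the Intel QAT driver; adjust if yours
# differ. Prints nothing if no QAT debugfs entries exist.
qat_counters() {
    dir=${1:-/sys/kernel/debug}
    for f in "$dir"/qat_*/fw_counters "$dir"/qat_*/cnv_errors; do
        [ -e "$f" ] || continue
        echo "== $f =="
        cat "$f"
    done
}
qat_counters "$@"
```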
OK, thanks. Here's what we have there now; anything in particular to look for?
Good question. I was hoping it would be more self-evident; the documentation doesn't appear to describe the individual counters in section 3.7. If you're willing to rebuild ZFS, we can patch the code to log the exact error.
Yes, I'm trying to rebuild it right now with #9788.
@behlendorf I'm getting an error when compiling qat_compress.c now:
@AGI-admin it looks like you may need to re-run
Yeah, I did that after decompressing the source archive… I have to change
Ugh, this is such a mess. I think it's back to this commit: #9268
Ahh yes, sorry I'd forgotten about that issue. We'll make sure it gets included in the next tag.
Actually I think it was already merged, since the code matches, so I'm not sure why the above error is happening :(
So is this bug corrupting our data? If so, is it just the data that we happen to see I/O errors with? In that case we can restore the individual files from our backup…
Hello @behlendorf, I applied the patch and recompiled ZFS, and I'm still getting input/output errors :( I restored files from backup and still can't read them. For example:
Then I deleted this 0017 file and restored it from a backup from 2019-08-26 (long before ZFS and QAT were installed in the system); dc_fails increased again, and the file is still unreadable:
@luki-mbi thanks for the info! Please upgrade to the latest ZFS release and see if the issue is fixed. I copied/read more than 2 TB of files overnight and no issue was found on ZFS 0.8.4.
@wli5 I am sad to report that this bug is not resolved. I did a large send (37 TiB); upon reading the results (also via zfs send, but redirecting to …) the corruption reappeared. How would you like to proceed? I'm happy to assist any way I can (information, testing, etc.). This is a high-priority issue for us, so I'm willing to make the time if I can be of assistance.
@aqw have you updated to 0.8.4? Can you please let me know your test method? What we did was copy many files to the pool and read them back to verify the contents are the same.
@wli5 Yes, as I mentioned above, I am running ZFS 0.8.4 and QAT driver 1.7.l.4.10.0. The machine is running Debian 10 on the 4.19 kernel. The method I used to test:
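The write-then-read-back check both sides describe can be sketched like this. `verify_roundtrip` is a hypothetical helper: paths are placeholders, `cp` stands in for whatever moves the data onto the QAT-backed dataset, and sha256sum stands in for whatever integrity check you prefer:

```shell
#!/bin/sh
# Hedged sketch of the round-trip test: copy a tree onto the compressed,
# QAT-backed dataset, read everything back, and compare per-file checksums.
verify_roundtrip() {
    src=$1 dst=$2
    cp -R "$src" "$dst"
    # Checksum every file relative to its tree root so the two lists compare
    a=$( (cd "$src" && find . -type f -exec sha256sum {} +) | sort )
    b=$( (cd "$dst" && find . -type f -exec sha256sum {} +) | sort )
    if [ "$a" = "$b" ]; then
        echo "OK: data read back intact"
    else
        echo "FAIL: checksum mismatch"
    fi
}
# e.g. verify_roundtrip /srv/source /tank/gzip-qat/copy
```

On a healthy system this should always print OK; a FAIL here while `zpool status` reports no errors matches the symptom described in this issue.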
I noted these numbers beforehand, so it's possible that there's a mistake here. I need to re-run the test to confirm, and capture the full contents before and after.
I have good news: I have a 500G dataset with which I can reliably reproduce this bug using the above method, even after a reboot. I tested 5 times. The point where it fails in the data stream is not consistent, suggesting that something about this data increases the probability of encountering the bug, but it isn't a simple "this file/block always corrupts". I am not sure if I can share this dataset (I need to check first), but I am happy to provide information and debugging.
@aqw when you saw the failure, could the files still be read? Just wondering if the software fallback worked. If you can continue narrowing it down to a small set of data and share it with us, that would be helpful for debugging the issue.
@wli5 The files with corrupted blocks can't be read… Please re-read what I have written above. Something, somewhere, is still generating corrupted blocks. I don't know if that's a result of the fallback failing, or the worse scenario that QAT (rarely) generates blocks that it thinks are valid but actually aren't. I also tried @luki-mbi's suggestion to limit the ARC, as it helped them. I capped it at 50% (leaving ~128GB free) and the problem is sadly still reproducible. I will see if I can get a smaller dataset that triggers this problem. I was already excited that I could reliably trigger this with just 500GB, as it's such a low-frequency problem. My suspicion is that this problem happens with many small files. The dataset that reliably triggers this bug is 1.5 million JSON files totaling ~500GB (average size ~330KB).
@aqw are you able to reduce the file size?
@wli5 I have not yet been able to generate a smaller dataset that tickles this. I'm still looking.
I came back to this, and I can only consistently reproduce it with that 500GB dataset. Other workloads do still cause this problem, but I can't make the failures reliable. @wli5 What information can I provide to help you move forward on this? Currently, all of our QAT cards remain idle due to this bug.
@aqw is it possible to figure out which piece of data caused the issue? Is it on the same block or random? Can you just test part of the data within the 500GB to see which part caused the issue?
It's not always the same block (oddly enough), but it is definitely not random. It's usually around similar blocks. I don't know how to use zdb for this.
@behlendorf maybe Brian can give some advice on the zdb usage?
@aqw is the 500GB a single file or multiple files? Can you narrow down which file caused the issue?
@wli5 I have provided much of this info before (see above). The 500G is the size of the dataset. I am primarily testing this by using zfs send. I have isolated the files it fails on (simple JSON files); sending just those does not cause the bug. This bug behaves as if the QAT card needs to be primed with a certain type of data, and then another type of data causes corrupted output. That's my best guess.
@aqw I see… thanks for the info. Can your files be shared so we can reproduce this in our lab?
Hi, here is what `cat cnv_errors` (under /sys/kernel/debug/qat_c3xxx_0000:01:00.0) and `cat qat` (under /proc/spl/kstat/zfs) show: … The affected files are kvm/qemu qcow2 files from active virtual machines. How can I help to solve this error?
Hi @meckiemac, thanks! Can you please share more details on how to reproduce the issue?
In my case it's my home server, a Supermicro SYS-E200-9A. The system is on an NVMe in the original NVMe slot, and the user data resides on another M.2 NVMe via an adapter in the PCIe slot. A third HDD is connected via SATA as a backup disk. I don't know when the error was introduced, but it seems reproducible. The virtual machine experiencing the issue was created with virt-install, and the disk (qcow2) was 30GB internally. There were 2-4 hours between creation and backup the last time I experienced it. The high error count comes from the retries of the disk subsystem in the virtual machine. With its 4 cores, the Atom is sometimes under heavy load; I also have a Nextcloud instance running, and so on. So the reproducible steps are: create a virtual machine with Windows 10 as the guest … and wait a few hours of write activity, in my case. (It happened before, but with a longer time between creation and detection.)
Thanks @meckiemac, but I don't quite understand. You have a Linux machine running ZFS on an NVMe disk and a SATA drive, and you have a Windows 10 virtual machine on this Linux server… then how do you reproduce the issue? You said during backups; how? From the Windows VM?
Currently I have disabled gzip compression on large files and virtual machines. I also do regular zfs send runs (discarding the output) to check the rest of the data. So far the issue has only been found with virtual machine disk files (qcow2). Does this help you understand the scenario better?
What I can offer is to dump out the affected block(s) so we can check the content to find out where it came from. But I need to know how to extract such a block. With this we may get an idea of whether:
~Andreas
@meckiemac Hi Andreas, thanks for the info! If you write/read the qcow2 file to ZFS with compression, can the issue be triggered?
I haven't tried to trigger the corruption on purpose yet; the system is also my home "production" system, so I'm a bit careful. I found another compression corruption on the SATA disk (backup), in the OSX "Time Machine Backup" space, which is provided via Samba to my OSX clients. The big difference here is that there are a lot of smaller files compared to the virtual machines. Currently I'm trying to clear this corruption, since Time Machine is very sensitive to disk errors, and finally also switch off compression here. What I can try to set up is a script which installs Windows virtual machines and then checks the written data in a loop until we trigger the corruption, since this is the most obvious area where this error occurred. With the script, if successful, we could try to collect more data or modify the system and observe the behavior.
If you can provide the script, it would also help us reproduce the issue. Thanks!
We have tried to reproduce this issue by creating VMs and backing up snapshots regularly to a compressed ZFS pool (QAT enabled), running repeatedly for more than 2 weeks, without any issue.
@wli5 my process would look exactly like that, but I haven't found the time to implement it yet; family and job currently have priority. The interesting part: it is not only bound to VMs, since I have the same corruption in backup data written through a Samba server. So I guess it is connected to the system load too. To be able to automate my VM I need to create an unattended setup for Windows 10 first, which takes some time; sorry for the delay. The current workaround for me is switching off compression. The achieved compression ratio is negligible compared to my available space.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Getting some Input/output errors when trying to read some data. "zpool status" shows all disks are ONLINE and "No known data errors", last scrub just finished earlier today. Noticed that the dc_fails counter is increasing when the I/O errors happen, so I'm guessing it's related to the QAT code in ZFS or the QAT card itself? The QAT card status is:
and there aren't any errors popping up in the system log when the I/O errors happen. I could open a case with Intel if you think it would be helpful.
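To correlate the I/O errors with QAT, the dc_fails counter from /proc/spl/kstat/zfs/qat can be sampled around a failing read. A hedged sketch (`qat_delta` is a hypothetical helper; the kstat path is the one shown in this issue):

```shell
#!/bin/sh
# Hedged sketch: sample the dc_fails counter from the ZFS QAT kstat before
# and after running a command that triggers the read errors, and report the
# delta. A nonzero delta ties the I/O error to a QAT decompression failure.
dc_fails() {
    # kstat rows are "name  type  value"; take the last field of dc_fails
    awk '$1 == "dc_fails" { print $NF }' "$1"
}
qat_delta() {
    kstat=$1; shift
    before=$(dc_fails "$kstat")
    "$@" || true                  # run the reproducer, ignore its exit status
    after=$(dc_fails "$kstat")
    echo "dc_fails: $before -> $after (delta $((after - before)))"
}
# e.g. qat_delta /proc/spl/kstat/zfs/qat md5sum /tank/path/to/bad-file
```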
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
There are no warnings/errors/backtraces in the system logs; just the dc_fails counter is incremented.
Thanks