Input/output errors and QAT dc_fails #9784

Closed
AGI-chandler opened this issue Dec 29, 2019 · 81 comments
Labels: Status: Stale (no recent activity for issue), Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@AGI-chandler

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  6.10
Linux Kernel          2.6.32-754.23.1.el6
Architecture          x86_64
ZFS Version           0.8.2-1
SPL Version           0.8.2-1

Describe the problem you're observing

I'm getting Input/output errors when trying to read some data. "zpool status" shows all disks are ONLINE and "No known data errors", and the last scrub finished earlier today. I noticed that the dc_fails counter increases when the I/O errors happen, so I'm guessing it's related to the QAT code in ZFS or the QAT card itself? The QAT card status is:

# /etc/init.d/qat_service status
Checking status of all devices.
There is 3 QAT acceleration device(s) in the system:
 qat_dev0 - type: c6xx,  inst_id: 0,  node_id: 1,  bsf: 0000:84:00.0,  #accel: 5 #engines: 10 state: up
 qat_dev1 - type: c6xx,  inst_id: 1,  node_id: 1,  bsf: 0000:85:00.0,  #accel: 5 #engines: 10 state: up
 qat_dev2 - type: c6xx,  inst_id: 2,  node_id: 1,  bsf: 0000:86:00.0,  #accel: 5 #engines: 10 state: up
#

and there aren't any errors popping up in the system log when the I/O errors happen. I could open a case with Intel if you think it would be helpful?
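
(A quick way to keep an eye on just the failure counters while reproducing, assuming the standard watch and grep utilities; shown only as a convenience:)

$ watch -n 5 'grep -E "dc_fails|crypt_fails|cksum_fails" /proc/spl/kstat/zfs/qat'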

Describe how to reproduce the problem

$ cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 361599045964 3369212093252152
name                            type data
comp_requests                   4    284279855
comp_total_in_bytes             4    35765197023232
comp_total_out_bytes            4    12613266941171
decomp_requests                 4    250580310
decomp_total_in_bytes           4    12703419899870
decomp_total_out_bytes          4    31075581360640
**dc_fails                        4    5804**
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    1396
cksum_total_in_bytes            4    155779072
cksum_fails                     4    0 

$ rsync -avP X /tmp
sending incremental file list
X
  2444094260 100%   49.18MB/s    0:00:47 (xfer#1, to-check=0/1)
rsync: read errors mapping "X": Input/output error (5)
WARNING: X failed verification -- update discarded (will try again).
X
  2444094260 100%   50.98MB/s    0:00:45 (xfer#2, to-check=0/1)
rsync: read errors mapping "X": Input/output error (5)
ERROR: X failed verification -- update discarded.

sent 4888785353 bytes  received 52 bytes  51733178.89 bytes/sec
total size is 2444094260  speedup is 0.50
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]

$ cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 361599045964 3369334123310545
name                            type data
comp_requests                   4    284279939
comp_total_in_bytes             4    35765206926336
comp_total_out_bytes            4    12613268958496
decomp_requests                 4    250617594
decomp_total_in_bytes           4    12707033557470
decomp_total_out_bytes          4    31080467294208
**dc_fails                        4    5808**
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    1396
cksum_total_in_bytes            4    155779072
cksum_fails                     4    0
$ 

Include any warning/errors/backtraces from the system logs

There are no warnings/errors/backtraces in the system logs; only the dc_fails counter is incremented.

Thanks

@luki-mbi

Normally when a QAT request fails, ZFS falls back to the CPU. This is very rare on our production systems, maybe 1 in 100M requests, and an error is logged to dmesg by the QAT driver:

[5344115.233436] dcCompression_ProcessCallback() - : The slice hang is detected on the compression slice

My guess is that in your case the data for some reason is corrupt and cannot be decompressed by the QAT or the CPU gzip implementation; the checksum (post-compression) can still be valid. Just a guess. I hope someone else can chime in on that.

@AGI-chandler
Author

If the data is corrupt then that is also concerning, since there's no previous indication of corruption. This is a new ZFS volume and the data was copied from a backup with rsync. I can check the secondary backups to see if restoring these files has any effect.

@AGI-chandler
Author

Restoring the file from the secondary backup has worked fine, so I guess the data was corrupted somehow.

@luki-mbi

That is worrisome. I hope someone can chime in and elaborate on what could be causing such corruption.
Possibly the data got corrupted during or after compression (by the QAT?), and the checksum was computed for the corrupted data. So to ZFS all looks fine: the data is readable and the checksum matches. It just cannot be decompressed and results in an I/O error.

Are you sure the hardware is all OK? No bus or memory errors?

@AGI-chandler
Author

AGI-chandler commented Dec 29, 2019

I've just looked over last week's system log files, and there aren't any messages about memory errors or other errors. All the scrubs, consistency checks, patrol reads, and verifications have completed successfully.

But when I checked the previous week's log files, from when I was copying data, I do see several "page allocation failure" messages from processes like swapper, rpciod, and nfsd. That sounds concerning, but I didn't find much about it online beyond it being treated as a warning. For example,

Dec 18 20:22:20 pac kernel: nfsd: page allocation failure. order:4, mode:0x20
Dec 18 20:22:20 pac kernel: Pid: 13417, comm: nfsd Tainted: P           -- ------------    2.6.32-754.23.1.el6.x86_64 #1
Dec 18 20:22:20 pac kernel: Call Trace:
Dec 18 20:22:20 pac kernel: [<ffffffff81143c71>] ? __alloc_pages_nodemask+0x7e1/0x960
Dec 18 20:22:20 pac kernel: [<ffffffff81186412>] ? kmem_getpages+0x62/0x170
Dec 18 20:22:20 pac kernel: [<ffffffff8118702a>] ? fallback_alloc+0x1ba/0x270
Dec 18 20:22:20 pac kernel: [<ffffffff81186a87>] ? cache_grow+0x2d7/0x320
Dec 18 20:22:20 pac kernel: [<ffffffff81186da9>] ? ____cache_alloc_node+0x99/0x160
Dec 18 20:22:20 pac kernel: [<ffffffff8147da22>] ? pskb_expand_head+0x62/0x280
Dec 18 20:22:20 pac kernel: [<ffffffff811879f9>] ? __kmalloc+0x199/0x230
Dec 18 20:22:20 pac kernel: [<ffffffff8147da22>] ? pskb_expand_head+0x62/0x280
Dec 18 20:22:20 pac kernel: [<ffffffffa0885e3e>] ? __nf_conntrack_find+0x15e/0x170 [nf_conntrack]
Dec 18 20:22:20 pac kernel: [<ffffffff8147e30a>] ? __pskb_pull_tail+0x2aa/0x360
Dec 18 20:22:20 pac kernel: [<ffffffff8148ddf9>] ? harmonize_features+0x29/0x70
Dec 18 20:22:20 pac kernel: [<ffffffff8148e474>] ? dev_hard_start_xmit+0x1c4/0x490
Dec 18 20:22:20 pac kernel: [<ffffffff814ac26a>] ? sch_direct_xmit+0x15a/0x1c0
Dec 18 20:22:20 pac kernel: [<ffffffff8148ea30>] ? dev_queue_xmit+0x270/0x360
Dec 18 20:22:20 pac kernel: [<ffffffff81495566>] ? neigh_connected_output+0xc6/0x110
Dec 18 20:22:20 pac kernel: [<ffffffff814cb67d>] ? ip_finish_output+0x29d/0x380
Dec 18 20:22:20 pac kernel: [<ffffffff814cb818>] ? ip_output+0xb8/0xc0
Dec 18 20:22:20 pac kernel: [<ffffffff814caa2f>] ? __ip_local_out+0x9f/0xb0
Dec 18 20:22:20 pac kernel: [<ffffffff814caa6b>] ? ip_local_out+0x2b/0x40
Dec 18 20:22:20 pac kernel: [<ffffffff814caf80>] ? ip_queue_xmit+0x190/0x420
Dec 18 20:22:20 pac kernel: [<ffffffff814e0d9a>] ? tcp_transmit_skb+0x4ea/0x930
Dec 18 20:22:20 pac kernel: [<ffffffff814e34ac>] ? tcp_write_xmit+0x1dc/0xaa0
Dec 18 20:22:20 pac kernel: [<ffffffff814e40a0>] ? __tcp_push_pending_frames+0x30/0xe0
Dec 18 20:22:20 pac kernel: [<ffffffff814d28f9>] ? tcp_push+0x69/0x90
Dec 18 20:22:20 pac kernel: [<ffffffff814d423c>] ? tcp_sendpage+0x51c/0x590
Dec 18 20:22:20 pac kernel: [<ffffffff814721bb>] ? kernel_sendpage+0x5b/0x90
Dec 18 20:22:20 pac kernel: [<ffffffffa064da2d>] ? svc_send_common+0xfd/0x160 [sunrpc]
Dec 18 20:22:20 pac kernel: [<ffffffffa064db02>] ? svc_sendto+0x72/0x1f0 [sunrpc]
Dec 18 20:22:20 pac kernel: [<ffffffffa06dcc81>] ? nfsd4_release_compoundargs+0x31/0x80 [nfsd]
Dec 18 20:22:20 pac kernel: [<ffffffffa06da328>] ? nfsd4_proc_compound+0x238/0x4b0 [nfsd]
Dec 18 20:22:20 pac kernel: [<ffffffffa064e6fd>] ? auth_domain_put+0x1d/0x80 [sunrpc]
Dec 18 20:22:20 pac kernel: [<ffffffffa064dd19>] ? svc_tcp_sendto+0x39/0xa0 [sunrpc]
Dec 18 20:22:20 pac kernel: [<ffffffffa0659789>] ? svc_send+0xb9/0x100 [sunrpc]
Dec 18 20:22:20 pac kernel: [<ffffffff81070a10>] ? default_wake_function+0x0/0x20
Dec 18 20:22:20 pac kernel: [<ffffffffa064b730>] ? svc_process+0x130/0x160 [sunrpc]
Dec 18 20:22:20 pac kernel: [<ffffffffa06c7d22>] ? nfsd+0xc2/0x160 [nfsd]
Dec 18 20:22:20 pac kernel: [<ffffffffa06c7c60>] ? nfsd+0x0/0x160 [nfsd]
Dec 18 20:22:20 pac kernel: [<ffffffff810aac40>] ? kthread+0xa0/0xc0
Dec 18 20:22:20 pac kernel: [<ffffffff81565820>] ? child_rip+0x20/0x30
Dec 18 20:22:20 pac kernel: [<ffffffff810aaba0>] ? kthread+0x0/0xc0
Dec 18 20:22:20 pac kernel: [<ffffffff81565800>] ? child_rip+0x0/0x30

Any idea what that's about?

@AGI-chandler
Author

I'm still getting read errors from Bacula when it tries to read (back up) the same files I just restored from its backup from several months ago, the "secondary backup" I mentioned previously. For example,
29-Dec 16:00 bacula-fd JobId 28943: Error: Read error on file O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017. ERR=Input/output error

rsync hits an I/O error as well,

$ rsync -avP O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017 /tmp
sending incremental file list
0017
  2485036820 100%   85.53MB/s    0:00:27 (xfer#1, to-check=0/1)
rsync: read errors mapping "O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017": Input/output error (5)

sent 2485340260 bytes  received 31 bytes  84248823.42 bytes/sec
total size is 2485036820  speedup is 1.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
$ 

But it's not all the files. For example the "X" file from yesterday that I restored from backup is still reading fine.

dc_fails is up to 8905 now.... nothing new in dmesg or the syslog

@AGI-chandler
Author

I'm trying to restore the same files from the backup again, and I'm noticing that dc_fails is steadily going up; it's over 9000 now and increasing. So at this point it seems to be a problem with ZFS sending data to the QAT card, as well as reading data back through it.

Here is one of the files causing problems, restored from the backup to our web server (they are all 2.4 GB, sorry): O_nivara/canu2nd/trimming/O_nivara.ovlStore/0022. @luki-mbi perhaps you could see if your QAT setup has problems with this file?

I could try to stop the qat_service too and see what happens, but that will slow down the system too much so it isn't a viable work-around.
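
(Side note: rather than stopping qat_service entirely, it may be enough to disable just the ZFS-side offload via the zfs_qat_*_disable module parameters; a sketch, assuming the 0.8.x parameters can be toggled at runtime. Writing 1 disables the offload, so compression falls back to the software gzip path; writing 0 re-enables it.)

# echo 1 > /sys/module/zfs/parameters/zfs_qat_compress_disable
# echo 1 > /sys/module/zfs/parameters/zfs_qat_checksum_disable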

@luki-mbi

Seems to work fine for me.

  • Original file in /net/scratch (no compression)
  • Copy in /projects, compressed with QAT (using a C627 chipset)
  • Copy in /archive, compressed with QAT (using a DH8950 PCIe card)
# md5sum /net/scratch/lugo/0022 /projects/0022 /archive/lugo/0022 
d27eb19aef7965f71ad75dedcf4e155b  /net/scratch/lugo/0022
d27eb19aef7965f71ad75dedcf4e155b  /projects/0022
d27eb19aef7965f71ad75dedcf4e155b  /archive/lugo/0022

# du -sh /net/scratch/lugo/0022 /projects/0022 /archive/lugo/0022 
2.4G	/net/scratch/lugo/0022
1.9G	/projects/0022
1.8G	/archive/lugo/0022

# dd if=0022  of=/dev/null bs=1M
2366+1 records in
2366+1 records out
2481227080 bytes (2.5 GB, 2.3 GiB) copied, 6.09046 s, 407 MB/s

No data errors, and performance is decent too for a single thread (~400 MB/sec).

@AGI-chandler
Author

So our QAT card is broken then, I guess? Are you also using ZFS 0.8.2? You compiled it from source using --with-qat, right?

behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 30, 2019
When qat_compress() fails to allocate the required contiguous memory
it mistakenly returns success.  This prevents the fallback software
compression from taking over and (un)compressing the block.

Resolve the issue by correctly setting the local 'status' variable
on all exit paths.  Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#9784
@behlendorf
Contributor

@AGI-admin I've identified the issue and have opened #9788 with the fix. The change is straightforward, but unfortunately I don't have easy access to a QAT accelerator, so I wasn't able to test it myself.

The issue is that when memory on the system becomes highly fragmented, some QAT-related memory allocations may fail. Those failures were not being reported to the higher layers of ZFS, which prevented the software fallback from handling the compression. This fits with the logs you posted, which clearly show that memory was highly fragmented on your system when writing these blocks, i.e.

nfsd: page allocation failure. order:4, mode:0x20
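
(A quick way to gauge whether free memory is currently too fragmented for such allocations: /proc/buddyinfo lists free block counts per zone for orders 0 through 10, and an order:4 request needs a free block in the order-4 column or higher.)

$ cat /proc/buddyinfo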

@AGI-chandler
Author

Thanks @behlendorf ... but is it normal to see dc_fails increasing without any "page allocation failure" messages? Those page allocation failures were happening a couple of weeks ago, but not lately, while these I/O errors are still happening...

@behlendorf
Contributor

The allocation failure messages might have been suppressed by the kernel, but yes I'd have expected to see at least one. You may also want to check the QAT accelerator firmware counters, they may indicate there's an issue on the hardware side. If the QAT were to return success when in fact there was a failure, you'd see a similar issue.

/sys/kernel/debug/qat_*/fw_counters

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Dec 30, 2019
@AGI-chandler
Author

We don't have any of those counters, /sys/kernel/debug is empty,

$ cat /sys/kernel/debug/qat_*/fw_counters
cat: /sys/kernel/debug/qat_*/fw_counters: No such file or directory
$ ls /sys/kernel/debug/
$ 

@behlendorf
Contributor

You'll need to mount debugfs manually, here's a link to the relevant documentation. To summarize:

mount -t debugfs nodev /sys/kernel/debug
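
(If you want it mounted automatically at every boot, an /etc/fstab entry along these lines should also work:)

debugfs  /sys/kernel/debug  debugfs  defaults  0  0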

@AGI-chandler
Author

OK, thanks. Here's what we have there now; anything in particular to look for?

# cat /sys/kernel/debug/qat_*/fw_counters
+------------------------------------------------+
| FW Statistics for Qat Device                   |
+------------------------------------------------+
| Firmware Requests [AE  0]:         12149803939 |
| Firmware Responses[AE  0]:         12149803939 |
+------------------------------------------------+
| Firmware Requests [AE  1]:         11501471210 |
| Firmware Responses[AE  1]:         11501471210 |
+------------------------------------------------+
| Firmware Requests [AE  2]:         12149817476 |
| Firmware Responses[AE  2]:         12149817476 |
+------------------------------------------------+
| Firmware Requests [AE  3]:         11501576382 |
| Firmware Responses[AE  3]:         11501576382 |
+------------------------------------------------+
| Firmware Requests [AE  4]:         12148747996 |
| Firmware Responses[AE  4]:         12148747996 |
+------------------------------------------------+
| Firmware Requests [AE  5]:         11501307552 |
| Firmware Responses[AE  5]:         11501307552 |
+------------------------------------------------+
| Firmware Requests [AE  6]:         12145310500 |
| Firmware Responses[AE  6]:         12145310500 |
+------------------------------------------------+
| Firmware Requests [AE  7]:         11500510194 |
| Firmware Responses[AE  7]:         11500510194 |
+------------------------------------------------+
| Firmware Requests [AE  8]:         12142185315 |
| Firmware Responses[AE  8]:         12142185315 |
+------------------------------------------------+
| Firmware Requests [AE  9]:         11499848788 |
| Firmware Responses[AE  9]:         11499848788 |
+------------------------------------------------+
+------------------------------------------------+
| FW Statistics for Qat Device                   |
+------------------------------------------------+
| Firmware Requests [AE  0]:           625652871 |
| Firmware Responses[AE  0]:           625652871 |
+------------------------------------------------+
| Firmware Requests [AE  1]:           586996627 |
| Firmware Responses[AE  1]:           586996627 |
+------------------------------------------------+
| Firmware Requests [AE  2]:           625656789 |
| Firmware Responses[AE  2]:           625656789 |
+------------------------------------------------+
| Firmware Requests [AE  3]:           587004550 |
| Firmware Responses[AE  3]:           587004550 |
+------------------------------------------------+
| Firmware Requests [AE  4]:           625588159 |
| Firmware Responses[AE  4]:           625588159 |
+------------------------------------------------+
| Firmware Requests [AE  5]:           586988266 |
| Firmware Responses[AE  5]:           586988266 |
+------------------------------------------------+
| Firmware Requests [AE  6]:           625391062 |
| Firmware Responses[AE  6]:           625391062 |
+------------------------------------------------+
| Firmware Requests [AE  7]:           586948068 |
| Firmware Responses[AE  7]:           586948068 |
+------------------------------------------------+
| Firmware Requests [AE  8]:           625196343 |
| Firmware Responses[AE  8]:           625196343 |
+------------------------------------------------+
| Firmware Requests [AE  9]:           586918530 |
| Firmware Responses[AE  9]:           586918530 |
+------------------------------------------------+
+------------------------------------------------+
| FW Statistics for Qat Device                   |
+------------------------------------------------+
| Firmware Requests [AE  0]:            18575554 |
| Firmware Responses[AE  0]:            18575554 |
+------------------------------------------------+
| Firmware Requests [AE  1]:            18575271 |
| Firmware Responses[AE  1]:            18575271 |
+------------------------------------------------+
| Firmware Requests [AE  2]:            18574983 |
| Firmware Responses[AE  2]:            18574983 |
+------------------------------------------------+
| Firmware Requests [AE  3]:            18575350 |
| Firmware Responses[AE  3]:            18575350 |
+------------------------------------------------+
| Firmware Requests [AE  4]:            18575441 |
| Firmware Responses[AE  4]:            18575441 |
+------------------------------------------------+
| Firmware Requests [AE  5]:            18574861 |
| Firmware Responses[AE  5]:            18574861 |
+------------------------------------------------+
| Firmware Requests [AE  6]:            18575357 |
| Firmware Responses[AE  6]:            18575357 |
+------------------------------------------------+
| Firmware Requests [AE  7]:            18575451 |
| Firmware Responses[AE  7]:            18575451 |
+------------------------------------------------+
| Firmware Requests [AE  8]:            18574719 |
| Firmware Responses[AE  8]:            18574719 |
+------------------------------------------------+
| Firmware Requests [AE  9]:            18575148 |
| Firmware Responses[AE  9]:            18575148 |
+------------------------------------------------+
# 

@behlendorf
Contributor

Good question. I was hoping it would be more self-evident, but the documentation doesn't appear to describe the individual counters in section 3.7. If you're willing to patch the ZFS code and rebuild, we can add logging for the exact error.

@AGI-chandler
Author

Yes, I'm trying to rebuild it right now with #9788.
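
(For reference, after applying the patch the rebuild sequence is roughly the following; the QAT driver path is only an example and depends on where the Intel driver source was unpacked:)

$ cd /usr/local/src/zfs-0.8.2
$ ./autogen.sh
$ ./configure --with-qat=/usr/local/src/QAT1.7
$ make -j$(nproc)
# make install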

@AGI-chandler
Author

@behlendorf I'm getting an error when compiling qat_compress.c now,

/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:544: error: expected declaration specifiers or ‘...’ before ‘zfs_kernel_param_t’
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c: In function ‘param_set_qat_compress’:
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:547: error: ‘kp’ undeclared (first use in this function)
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:547: error: (Each undeclared identifier is reported only once
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:547: error: for each function it appears in.)
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c: At top level:
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:565: warning: initialization from incompatible pointer type

@behlendorf
Contributor

@AGI-admin it looks like you may need to re-run ./autogen.sh and ./configure against the intended kernel. For some reason, the wrong kernel interface was used. There might be a typo in the PR but I didn't see anything when visually inspecting it.

@AGI-chandler
Author

Yeah, I did that after decompressing the source archive... I had to change #include <sys/qat.h> to #include "qat.h", but otherwise it builds fine with the old qat_compress.c.

@AGI-chandler
Author

Ugh, this is such a mess. I think it goes back to this commit: #9268

@behlendorf
Contributor

Ahh yes, sorry I'd forgotten about that issue. We'll make sure it gets included in the next tag.

@AGI-chandler
Author

Actually, I think it was already merged, since the code matches, so I'm not sure why the above error is happening :(

behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 31, 2019
@AGI-chandler
Author

So is this bug corrupting our data? If so, is it just the data that we happen to see I/O errors with? In that case we can restore the individual files from our backup....

@AGI-chandler
Author

Hello @behlendorf, I applied the patch and recompiled ZFS, and I'm still getting input/output errors :( I restored files from backup and still can't read them. For example,

tmp# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 1271105125637 22259398341141
name                            type data
comp_requests                   4    34962
comp_total_in_bytes             4    4335009280
comp_total_out_bytes            4    2155182815
decomp_requests                 4    249190
decomp_total_in_bytes           4    15959668224
decomp_total_out_bytes          4    32395348992
dc_fails                        4    19
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    550
cksum_total_in_bytes            4    63078400
cksum_fails                     4    0

# rsync -avP O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017 /tmp
sending incremental file list
0017
  2485036820 100%  106.94MB/s    0:00:22 (xfer#1, to-check=0/1)
rsync: read errors mapping "O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017": Input/output error (5)
WARNING: 0017 failed verification -- update discarded (will try again).
0017
  2485036820 100%  102.12MB/s    0:00:23 (xfer#2, to-check=0/1)
rsync: read errors mapping "O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017": Input/output error (5)
ERROR: 0017 failed verification -- update discarded.

sent 4970680473 bytes  received 52 bytes  109245725.82 bytes/sec
total size is 2485036820  speedup is 0.50
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 1271105125637 22318995971136
name                            type data
comp_requests                   4    34984
comp_total_in_bytes             4    4337058304
comp_total_out_bytes            4    2155334849
decomp_requests                 4    286905
decomp_total_in_bytes           4    19604866560
decomp_total_out_bytes          4    37337207296
dc_fails                        4    27
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    550
cksum_total_in_bytes            4    63078400
cksum_fails                     4    0
# 

Then I deleted this 0017 file and restored it from a backup from 2019-08-26 (long before ZFS and QAT were installed in the system); dc_fails increased again, and the file is still unreadable,

# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 1271105125637 22716806520828
name                            type data
comp_requests                   4    54054
comp_total_in_bytes             4    6833537024
comp_total_out_bytes            4    3983695013
decomp_requests                 4    287172
decomp_total_in_bytes           4    19614187520
decomp_total_out_bytes          4    37369810432
dc_fails                        4    34
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    550
cksum_total_in_bytes            4    63078400
cksum_fails                     4    0

# rsync -avP O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017 /tmp
sending incremental file list
0017
  2485036820 100%  112.86MB/s    0:00:20 (xfer#1, to-check=0/1)
rsync: read errors mapping "O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017": Input/output error (5)
WARNING: 0017 failed verification -- update discarded (will try again).
0017
  2485036820 100%  105.85MB/s    0:00:22 (xfer#2, to-check=0/1)
rsync: read errors mapping "O_sativa117425/canu1st/trimming/canu1st.ovlStore/0017": Input/output error (5)
ERROR: 0017 failed verification -- update discarded.

sent 4970680473 bytes  received 52 bytes  111700685.96 bytes/sec
total size is 2485036820  speedup is 0.50
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]

# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 1271105125637 22814570223417
name                            type data
comp_requests                   4    54095
comp_total_in_bytes             4    6837144576
comp_total_out_bytes            4    3983944970
decomp_requests                 4    325519
decomp_total_in_bytes           4    23296637440
decomp_total_out_bytes          4    42393592832
dc_fails                        4    44
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    550
cksum_total_in_bytes            4    63078400
cksum_fails                     4    0
# 

@wli5
Contributor

wli5 commented Aug 7, 2020

@luki-mbi thanks for the info! Please upgrade to the latest ZFS release and see if the issue is fixed. I copied and read back more than 2 TB of files overnight and found no issue on ZFS 0.8.4.

@aqw

aqw commented Aug 9, 2020

@wli5 I am sad to report that this bug is not resolved. I did a large send (37 TiB); upon reading the results (also via zfs send, but redirecting to /dev/null), I get a decompress failure, and the send fails.

How would you like to proceed? I'm happy to assist any way I can (information, testing, etc). This is a high priority issue for us, so I'm willing to make the time if I can be of assistance.

@wli5
Contributor

wli5 commented Aug 10, 2020

@aqw have you updated to 0.8.4? Can you please let me know your test method? What we did was copy many files to the pool and read them back to verify the contents are the same.

@aqw

aqw commented Aug 10, 2020

have you updated to 0.8.4?

@wli5 Yes, as I mentioned above, I am running ZFS 0.8.4 and QAT driver 1.7.l.4.10.0. The machine is running Debian 10 on the 4.19 kernel.

The method I used to test:

# a large pool with plenty of free space
$ zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP   HEALTH  ALTROOT
hdd1   327T   149T   179T        -         -     1%    45%  1.00x   ONLINE  -

# enable QAT compression and checksumming for ZFS
$ echo 0 >| /sys/module/zfs/parameters/zfs_qat_compress_disable
$ echo 0 >| /sys/module/zfs/parameters/zfs_qat_checksum_disable

# source is gzip-9
$ zfs get compress hdd1/data1/project
NAME               PROPERTY      VALUE     SOURCE
hdd1/data1/project compression   gzip-9    local

# create destination
$ zfs create hdd1/test/qat

# send 38T; without -C, so it will recompress on the receive
$ zfs send -vR hdd1/data1/project@2020-05-01_05.20.01--1y | zfs receive -F hdd1/test/qat

# read back the results
$ zfs send -vR hdd1/test/qat@2020-05-01_05.20.01--1y > /dev/null
[...]
warning: cannot send: 'hdd1/test/qat/digital_biomarker@2020-01-14_09.46.39--1y': Input/output error
  • /proc/spl/kstat/zfs/qat shows an increase of 43 dc_fails
  • /sys/kernel/debug/qat_c6xx_*/cnv_errors shows an increase of 42 failures.

I noted these numbers beforehand, so it's possible that there's a mistake here. I need to re-run the test to confirm, and capture the full contents before and after.

@aqw

aqw commented Aug 10, 2020

I have good news: I have a 500G dataset that I can reliably reproduce this bug with using the above method. Even after a reboot. I tested 5 times.

The point where it fails in the data stream is not consistent, suggesting that something about this data increases the probability of encountering the bug, but isn't a simple "this file/block always corrupts".

I am not sure if I can share this dataset (I need to check first), but I am happy to provide information and debugging.

@wli5
Contributor

wli5 commented Aug 12, 2020

@aqw when you saw the failure, could the files still be read? Just wondering if the software fallback worked. If you can narrow it down to a small set of data and share it with us, that would be helpful for debugging the issue.

@aqw

aqw commented Aug 13, 2020

@wli5 The files with corrupted blocks can't be read... If zfs itself can't read the block, nothing else in the OS can.

Please re-read what I have written above. /proc/spl/kstat/zfs/qat and /sys/kernel/debug/qat_c6xx_*/cnv_errors are off by 1. ZFS and QAT are both catching all of the same failures during compress (at least they end up with the same number, which is a good sign). The off-by-1 comes from reading the results (and failing). That failure is seen by ZFS but not by QAT. This makes sense, because the QAT stats are only for Compress and Verify (not for decompress alone).
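
(One way to capture exact before/after deltas of the two counters around a test run; the snapshot file paths are arbitrary:)

# grep dc_fails /proc/spl/kstat/zfs/qat > /tmp/before; cat /sys/kernel/debug/qat_c6xx_*/cnv_errors >> /tmp/before
# zfs send -vR hdd1/test/qat@2020-05-01_05.20.01--1y > /dev/null
# grep dc_fails /proc/spl/kstat/zfs/qat > /tmp/after;  cat /sys/kernel/debug/qat_c6xx_*/cnv_errors >> /tmp/after
# diff /tmp/before /tmp/after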

Something, somewhere, is still generating corrupted blocks. I don't know if that's a result of the fall-back failing or the worse-scenario that QAT (rarely) generates blocks that it thinks are valid, but actually aren't.

I also tried @luki-mbi's suggestion to limit the ARC, as it helped them. I capped it at 50% (leaving ~128GB free) and the problem is sadly still reproducible.
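
(For completeness, such a cap can be applied at runtime via the zfs_arc_max module parameter; the 128 GiB value below matches the figure mentioned here and is of course machine-specific:)

# echo $((128 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max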

I will see if I can get a smaller dataset that triggers this problem. I was already excited that I could reliably trigger this with just 500GB, as it's such a low-frequency problem.

My suspicion is that this problem happens with many small files. The dataset that reliably triggers this bug is 1.5 million JSON files totaling ~500GB (average size ~330KB).

@wli5
Contributor

wli5 commented Aug 17, 2020

@aqw are you able to reduce the file size?

@aqw

aqw commented Aug 17, 2020

@wli5 I have not yet been able to generate a smaller dataset that tickles this. I'm still looking.

@aqw

aqw commented Sep 30, 2020

I came back to this, and I can only consistently reproduce it with that 500GB dataset. Other workloads do still cause this problem, but I can't get the reliability of failure up.

@wli5 What information can I provide that will help you move forward on this? Currently, all of our QAT cards remain idle due to this bug.

@wli5
Contributor

wli5 commented Sep 30, 2020

@aqw is it possible to figure out which piece of data caused the issue? Is it on the same block or random? Can you just test part of the data within the 500GB to see which part caused the issue?

@aqw

aqw commented Sep 30, 2020

It's not always the same block (oddly enough), but is definitely not random. It's usually around similar blocks.

I don't know how to use zdb well enough to inspect the failed blocks themselves. Any tips?
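
(Not an authoritative answer, but roughly the zdb workflow one would expect; the file path and the object/DVA values below are placeholders, and the -R flag syntax is taken from the zdb man page, so double-check it against your version.)

# 1) object number of the failing file (the inode number reported here equals the ZFS object number)
$ ls -i /hdd1/test/qat/path/to/failing-file

# 2) print its block pointers; the L0 entries show DVAs of the form 0:1a2b3c000:40000
$ zdb -ddddd hdd1/test/qat <object-number>

# 3) read a block back by DVA: ':d' asks zdb to decompress it, ':r' dumps the raw (compressed) bytes
$ zdb -R hdd1 0:1a2b3c000:40000:d
$ zdb -R hdd1 0:1a2b3c000:40000:r > /tmp/block.raw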

@wli5
Contributor

wli5 commented Sep 30, 2020

@behlendorf maybe Brian can give some advice on zdb?

@wli5
Contributor

wli5 commented Oct 5, 2020

@aqw is the 500GB a single file or multiple files? Can you narrow down which file caused the issue?

@aqw

aqw commented Oct 5, 2020

@wli5 I have provided much of this info before (see above).

The 500G is the size of the dataset. I am primarily testing this by using zfs send to send the entire dataset. I do this, because the dataset is 1.5 million files (average size ~330KB). Using rsync also encounters the bug, but I use zfs send because it's faster than rsync.

I have isolated the files it fails on (simple JSON files). Sending just those does not cause the bug.

This bug behaves as if the QAT card needs to be primed with a certain type of data, then another type of data causes corrupted output. That's my best guess.

@wli5
Contributor

wli5 commented Oct 7, 2020

@aqw I see.. thanks for the info. Can your files be shared so we can reproduce in our lab?

@meckiemac

Hi,
I'm running into exactly the same issue.
Ubuntu 20.04.1
Kernel 5.4.0-58-generic
QAT driver qat1.7.l.4.12.0-00011
Engine:
There is 1 QAT acceleration device(s) in the system:
qat_dev0 - type: c3xxx, inst_id: 0, node_id: 0, bsf: 0000:01:00.0, #accel: 3 #engines: 6 state: up
Processor: Intel(R) Atom(TM) CPU C3558 @ 2.20GHz

root@dc1:/sys/kernel/debug/qat_c3xxx_0000:01:00.0# cat cnv_errors
+-----------------------------------------------------------------+
| CNV Error Freq Statistics for Qat Device |
+-----------------------------------------------------------------+
|[AE 0]: TotalErrors: 6 : LastError: Decomp Error [ -7] |
+-----------------------------------------------------------------+
|[AE 1]: TotalErrors: 3 : LastError: Decomp Error [ -7] |
+-----------------------------------------------------------------+
|[AE 2]: TotalErrors: 4 : LastError: Decomp Error [ -7] |
+-----------------------------------------------------------------+
|[AE 3]: TotalErrors: 2 : LastError: Decomp Error [ -7] |
+-----------------------------------------------------------------+
|[AE 4]: TotalErrors: 7 : LastError: Decomp Error [ -7] |
+-----------------------------------------------------------------+
|[AE 5]: TotalErrors: 9 : LastError: Decomp Error [ -7] |
+-----------------------------------------------------------------+

root@dc1:/proc/spl/kstat/zfs# cat qat
19 1 0x01 17 4624 9104174767 242634634083171
name                            type data
comp_requests                   4    5304289
comp_total_in_bytes             4    693237692928
comp_total_out_bytes            4    119198066061
decomp_requests                 4    1649025
decomp_total_in_bytes           4    79376894049
decomp_total_out_bytes          4    190214495232
dc_fails                        4    188577
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    370667
cksum_total_in_bytes            4    5627505664
cksum_fails                     4    0

The affected files are kvm/qemu qcow2 files from active virtual machines.

How can I help to solve this error?

@wli5
Contributor

wli5 commented Jan 6, 2021

Hi @meckiemac thanks! Can you please share more details on how to reproduce the issue?

@meckiemac

In my case it's my home server, a Supermicro SYS-E200-9A. The system is on an NVMe in the original NVMe slot, and the user data resides on another M.2 NVMe via an adapter in the PCIe slot. A third HDD is connected via SATA as a backup disk.
Both the data NVMe and the backup SATA drive are on ZFS; the system NVMe runs ext4 via LVM. The machine has 32GB of ECC RAM, zfs scrub is clean, and the ZFS ARC size was the default 1/2 of RAM but is now reduced to 8GB (after the errors).
The error was found during backups via syncoid from the data to the backup medium.

I don't know when the error was introduced, but it seems reproducible. The virtual machine experiencing the issue was created with virt-install, and its internal (qcow2) disk size was 30GB. There were 2-4 hours between creation and backup the last time I experienced it. The high error count comes from retries by the disk subsystem in the virtual machine.

With its 4 cores the Atom is sometimes under considerable load. I also have a Nextcloud instance running, and so on.

So the reproduction step is: create a virtual machine with Windows 10 as the guest … and, in my case, wait a few hours of write activity. (It happened before, but with a longer time between creation and detection.)

@wli5
Contributor

wli5 commented Jan 13, 2021

Thanks @meckiemac, but I don't quite understand. You have a Linux machine with ZFS running on an NVMe disk and a SATA drive, and you have a Windows 10 virtual machine on this Linux server... then how do you reproduce the issue? You said it happens during backups; how, exactly? From the Windows VM?

@meckiemac

meckiemac commented Jan 25, 2021

  • To reproduce the issue I just create a Windows 10 virtual machine (btw, I had this with XP too) and/or let it run for a short time. The virtual machines run on the NVMe. This was enough to produce corruption.
  • The corruption was detected during an almost immediate syncoid backup of the filesystem to the backup disk (all ZFS). After detection, any read of the file containing the VM failed, whether from inside or outside (dd, Windows disk backup). The only recovery method so far is dd_rescue.

Currently I have disabled gzip compression on large files and virtual machines. I also do a regular zfs send to /dev/null to check the rest of the data. So far the issue has only been found with virtual machine disk files (qcow2).

Does this help you understand the scenario better?

@meckiemac

What I can offer is to dump out the affected block(s) so we can check the content and find out where it came from. But I need to know how to extract the block. With this we may get an idea whether:

  • it was wrongly referenced after compression (then it should not be decompressible and we see some interesting memory content)
  • or wrongly referenced before compression (then it is decompressible and we see some interesting content, memory or disk)

What do you think? If you don't think this is a good idea, then let me know and I'll delete the qcow2 file and free up some space.

~Andreas

@wli5
Contributor

wli5 commented Jan 26, 2021

@meckiemac Hi Andreas, thanks for the info! If you write/read the qcow2 file to ZFS with compression, can the issue be triggered?

@meckiemac

I haven't tried to trigger the corruption on purpose yet. The system is also my home "production" system, so I'm a bit careful. I found another compression corruption on the SATA disk (backup) in the OS X "Time Machine Backup" space, which is provided via Samba to my OS X clients. The big difference here is that we have a lot of smaller files compared to the virtual machines. Currently I'm trying to clear this corruption, since Time Machine is very sensitive to disk errors, and will finally also switch off compression here.

What I can try to set up is a script that installs Windows virtual machines and then checks the written data in a loop until we trigger the corruption, since this is the most obvious area where the error occurred. If the script is successful, we could try to collect more data or modify the system and observe the behavior.

@wli5
Contributor

wli5 commented Jan 27, 2021

If you can provide the script, it would also help us reproduce the issue. Thanks!

@wli5
Contributor

wli5 commented Feb 23, 2021

We have tried to reproduce this issue by creating VMs and regularly backing up snapshots to a compressed ZFS pool (QAT enabled), running repeatedly for more than 2 weeks, without hitting any issue.

@meckiemac

@wli5 my process would look exactly the same, but I haven't found the time to implement it yet; family and job currently have priority. The interesting part: it is not bound only to VMs, since I have the same corruption in backup data written through a Samba server. So I guess it is connected to the system load too. To be able to automate my VM test I need to create an unattended setup for Windows 10 first, which takes some time. Sorry for the delay. The current workaround for me is switching off compression; the achieved compression ratio is negligible compared to my available space.

@stale

stale bot commented Feb 23, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Feb 23, 2022
@stale stale bot closed this as completed Jun 19, 2022