Input/output errors and QAT dc_fails #9784
Normally when the QAT request fails, ZFS falls back to the CPU. This is usually very rare on our production systems, maybe like 1 : 100M requests, and an error is logged to dmesg by the QAT driver:
My guess is that in your case the data for some reason is corrupt and cannot be decompressed by the QAT or the CPU gzip implementation; the checksum (post-compression) can still be valid. Just a guess. I hope someone else can chime in on that.
If the data is corrupt then that is also concerning, since there's no previous indication of corruption. This is a new ZFS volume and the data was copied from a backup with rsync. I can check the secondary backups to see if restoring these files has any effect.
Restoring the file from the secondary backup has worked fine, so I guess the data was corrupted somehow.
That is worrisome. I hope someone can chime in and elaborate on what could be causing such corruption. Are you sure the hardware is all OK? No bus or memory errors?
I've just looked over last week's system log files, and there aren't any messages about memory errors or other errors. All the scrubs, consistency checks, patrol reads, and verifications have completed successfully. But in the previous week's log files, from when I was copying the data, I do see several "page allocation failure" messages from processes like swapper, rpciod, and nfsd. That sounds concerning, but I didn't really find anything negative about it out there; it seems to be more of a warning. For example,
Any idea what that's about?
Still getting read errors from bacula when it's trying to read (back up) the same files I just restored (from its backup from several months ago), the "secondary backup" I was speaking about previously. For example, rsync gets an I/O error as well:
But it's not all of the files. For example, the "X" file from yesterday that I restored from backup is still reading fine. dc_fails is up to 8905 now… nothing new in dmesg or the syslog.
I'm trying to restore the same files from the backup again, and I notice that dc_fails is steadily going up; it's over 9000 now and increasing. So at this point it seems to be a problem with ZFS sending data to the QAT card, as well as reading data with the QAT card. Here is one of the files causing problems, restored from the backup to our web server (they are all 2.4 GB, sorry): O_nivara/canu2nd/trimming/O_nivara.ovlStore/0022. @luki-mbi, perhaps you could see if your QAT setup has problems with this file? I could try stopping the qat_service too and see what happens, but that would slow down the system too much, so it isn't a viable workaround.
Seems to work fine for me.
No data errors, and performance is decent too for a single thread (~400 MB/sec).
So our QAT card is broken then, I guess? Are you also using ZFS 0.8.2? You compiled it from source using --with-qat, right?
When qat_compress() fails to allocate the required contiguous memory it mistakenly returns success. This prevents the fallback software compression from taking over and (un)compressing the block. Resolve the issue by correctly setting the local 'status' variable on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL to ensure qat_compress() always fails safe to guard against any similar bugs in the future. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9784
@AGI-admin I've identified the issue and have opened #9788 with the fix. The change is straightforward, but unfortunately I don't have easy access to a QAT accelerator so I wasn't able to test it myself. The issue is that when memory on the system becomes highly fragmented, some QAT-related memory allocations may fail. Those failures were not being reported to the higher layers of ZFS, which prevented the software fallback from handling the compression. This fits with the logs you posted, which clearly show that memory was highly fragmented on your system when writing these blocks, i.e.
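The fragmentation mentioned above can be observed directly on a running system: /proc/buddyinfo lists, per memory zone, how many free blocks of each order remain, and the contiguous buffers qat_compress() needs come from the higher orders. A hedged sketch in plain shell (the field positions follow the standard buddyinfo layout; `frag_summary` is a hypothetical helper, not part of ZFS or the QAT driver):

```shell
#!/bin/sh
# Hedged sketch: summarize physical-memory fragmentation from /proc/buddyinfo.
# qat_compress() needs physically contiguous (high-order) buffers; when the
# free-block counts for order >= 4 drop toward zero, such allocations start
# failing and "page allocation failure" messages like the ones above appear.
frag_summary() {
    # Each line looks like:  Node 0, zone   Normal  N0 N1 ... N10
    # where Nk is the count of free blocks of order k (fields 5..15).
    awk '{ printf "node %s zone %s free blocks (order 4..10):", $2, $4;
           for (i = 9; i <= NF; i++) printf " %s", $i; print "" }' "${1:-/proc/buddyinfo}"
}
[ -r "${1:-/proc/buddyinfo}" ] && frag_summary "$@"
```

All-zero counts on the right-hand side of the output would be consistent with the allocation failures described here.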
Thanks @behlendorf … but is it normal to see dc_fails increasing without any "page allocation failure" messages? Those page allocation failures were happening a couple of weeks ago, but not lately, while these I/O errors are still happening…
The allocation failure messages might have been suppressed by the kernel, but yes, I'd have expected to see at least one. You may also want to check the QAT accelerator firmware counters; they may indicate there's an issue on the hardware side. If the QAT were to return success when in fact there was a failure, you'd see a similar issue.
We don't have any of those counters; /sys/kernel/debug is empty:
You'll need to mount debugfs first: `mount -t debugfs none /sys/kernel/debug`.
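Once debugfs is mounted, the QAT driver exposes per-device counter files there (fw_counters, and on some driver versions cnv_errors, as shown later in this thread). A small sketch to dump whatever is present; the qat_* directory names are examples from this thread, not guaranteed on every setup:

```shell
#!/bin/sh
# Dump QAT counter files found under debugfs. The qat_*/fw_counters and
# qat_*/cnv_errors paths come from the Intel QAT driver; adjust if yours
# differ. Prints nothing if no QAT debugfs entries exist.
qat_counters() {
    dir=${1:-/sys/kernel/debug}
    for f in "$dir"/qat_*/fw_counters "$dir"/qat_*/cnv_errors; do
        [ -e "$f" ] || continue
        echo "== $f =="
        cat "$f"
    done
}
qat_counters "$@"
```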
OK, thanks. Here's what we have there now; anything in particular to look for?
Good question. I was hoping it would be more self-evident; the documentation doesn't appear to describe the individual counters in section 3.7. If you're willing to rebuild ZFS, we can patch the code to log the exact error.
Yes, I'm trying to rebuild it right now with #9788.
@behlendorf I'm getting an error when compiling qat_compress.c now:
@AGI-admin it looks like you may need to re-run
Yeah, I did that after decompressing the source archive… I have to change
Ugh, this is such a mess. I think it's back to this commit: #9268
Ahh yes, sorry I'd forgotten about that issue. We'll make sure it gets included in the next tag.
Actually I think it was already merged, since the code matches, so I'm not sure why the above error is happening :(
So is this bug corrupting our data? If so, is it just the data that we happen to see I/O errors with? In that case we can restore the individual files from our backup…
Hello @behlendorf, I applied the patch and recompiled ZFS, and I'm still getting input/output errors :( I restored files from backup and still can't read them. For example:
Then I deleted this 0017 file and restored it from a backup from 2019-08-26 (long before ZFS and QAT were installed in the system); dc_fails increased again, and the file is still unreadable:
@luki-mbi thanks for the info! Please upgrade to the latest ZFS release and see if the issue is fixed. I copied/read more than 2 TB of files overnight and no issue was found on ZFS 0.8.4.
@wli5 I am sad to report that this bug is not resolved. I did a large send (37 TiB); upon reading the results (also via zfs send, but redirecting to …) the corruption reappeared. How would you like to proceed? I'm happy to assist any way I can (information, testing, etc.). This is a high-priority issue for us, so I'm willing to make the time if I can be of assistance.
@aqw have you updated to 0.8.4? Can you please let me know your test method? What we did was copy many files to the pool and read them back to verify the contents are the same.
@wli5 Yes, as I mentioned above, I am running ZFS 0.8.4 and QAT driver 1.7.l.4.10.0. The machine is running Debian 10 on the 4.19 kernel. The method I used to test:
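The write-then-read-back check both sides describe can be sketched like this. `verify_roundtrip` is a hypothetical helper: paths are placeholders, `cp` stands in for whatever moves the data onto the QAT-backed dataset, and sha256sum stands in for whatever integrity check you prefer:

```shell
#!/bin/sh
# Hedged sketch of the round-trip test: copy a tree onto the compressed,
# QAT-backed dataset, read everything back, and compare per-file checksums.
verify_roundtrip() {
    src=$1 dst=$2
    cp -R "$src" "$dst"
    # Checksum every file relative to its tree root so the two lists compare
    a=$( (cd "$src" && find . -type f -exec sha256sum {} +) | sort )
    b=$( (cd "$dst" && find . -type f -exec sha256sum {} +) | sort )
    if [ "$a" = "$b" ]; then
        echo "OK: data read back intact"
    else
        echo "FAIL: checksum mismatch"
    fi
}
# e.g. verify_roundtrip /srv/source /tank/gzip-qat/copy
```

On a healthy system this should always print OK; a FAIL here while `zpool status` reports no errors matches the symptom described in this issue.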
I noted these numbers beforehand, so it's possible that there's a mistake here. I need to re-run the test to confirm, and capture the full contents before and after.
I have good news: I have a 500G dataset with which I can reliably reproduce this bug using the above method, even after a reboot. I tested 5 times. The point where it fails in the data stream is not consistent, suggesting that something about this data increases the probability of encountering the bug, but it isn't a simple "this file/block always corrupts". I am not sure if I can share this dataset (I need to check first), but I am happy to provide information and debugging.
@aqw when you saw the failure, could the files still be read? Just wondering if the software fallback worked. If you can continue narrowing it down to a small set of data and share it with us, that would be helpful for debugging the issue.
@wli5 The files with corrupted blocks can't be read… Please re-read what I have written above. Something, somewhere, is still generating corrupted blocks. I don't know if that's a result of the fallback failing, or the worse scenario that QAT (rarely) generates blocks that it thinks are valid but actually aren't. I also tried @luki-mbi's suggestion to limit the ARC, as it helped them. I capped it at 50% (leaving ~128GB free) and the problem is sadly still reproducible. I will see if I can get a smaller dataset that triggers this problem. I was already excited that I could reliably trigger this with just 500GB, as it's such a low-frequency problem. My suspicion is that this problem happens with many small files. The dataset that reliably triggers this bug is 1.5 million JSON files totaling ~500GB (average size ~330KB).
@aqw are you able to reduce the file size?
@wli5 I have not yet been able to generate a smaller dataset that tickles this. I'm still looking.
I came back to this, and I can only consistently reproduce it with that 500GB dataset. Other workloads do still cause this problem, but I can't make the failures reliable. @wli5 What information can I provide to help you move forward on this? Currently, all of our QAT cards remain idle due to this bug.
@aqw is it possible to figure out which piece of data caused the issue? Is it on the same block or random? Can you just test part of the data within the 500GB to see which part caused the issue?
It's not always the same block (oddly enough), but it is definitely not random. It's usually around similar blocks. I don't know how to use zdb for this.
@behlendorf maybe Brian can give some advice on the zdb usage?
@aqw is the 500GB a single file or multiple files? Can you narrow down which file caused the issue?
@wli5 I have provided much of this info before (see above). The 500G is the size of the dataset. I am primarily testing this by using zfs send. I have isolated the files it fails on (simple JSON files); sending just those does not cause the bug. This bug behaves as if the QAT card needs to be primed with a certain type of data, and then another type of data causes corrupted output. That's my best guess.
@aqw I see… thanks for the info. Can your files be shared so we can reproduce this in our lab?
Hi, here is what `cat cnv_errors` (under /sys/kernel/debug/qat_c3xxx_0000:01:00.0) and `cat qat` (under /proc/spl/kstat/zfs) show: … The affected files are kvm/qemu qcow2 files from active virtual machines. How can I help to solve this error?
Hi @meckiemac, thanks! Can you please share more details on how to reproduce the issue?
In my case it's my home server, a Supermicro SYS-E200-9A. The system is on an NVMe in the original NVMe slot, and the user data resides on another M.2 NVMe via an adapter in the PCIe slot. A third HDD is connected via SATA as a backup disk. I don't know when the error was introduced, but it seems reproducible. The virtual machine experiencing the issue was created with virt-install, and the disk (qcow2) was 30GB internally. There were 2-4 hours between creation and backup the last time I experienced it. The high error count comes from the retries of the disk subsystem in the virtual machine. With its 4 cores, the Atom is sometimes under heavy load; I also have a Nextcloud instance running, and so on. So the reproducible steps are: create a virtual machine with Windows 10 as the guest … and wait a few hours of write activity, in my case. (It happened before, but with a longer time between creation and detection.)
Thanks @meckiemac, but I don't quite understand. You have a Linux machine running ZFS on an NVMe disk and a SATA drive, and you have a Windows 10 virtual machine on this Linux server… then how do you reproduce the issue? You said during backups; how? From the Windows VM?
Currently I have disabled gzip compression on large files and virtual machines. I also do regular zfs send runs (discarding the output) to check the rest of the data. So far the issue has only been found with virtual machine disk files (qcow2). Does this help you understand the scenario better?
What I can offer is to dump out the affected block(s) so we can check the content to find out where it came from. But I need to know how to extract such a block. With this we may get an idea of whether:
~Andreas
@meckiemac Hi Andreas, thanks for the info! If you write/read the qcow2 file to ZFS with compression, can the issue be triggered?
I haven't tried to trigger the corruption on purpose yet; the system is also my home "production" system, so I'm a bit careful. I found another compression corruption on the SATA disk (backup), in the OSX "Time Machine Backup" space, which is provided via Samba to my OSX clients. The big difference here is that there are a lot of smaller files compared to the virtual machines. Currently I'm trying to clear this corruption, since Time Machine is very sensitive to disk errors, and finally also switch off compression here. What I can try to set up is a script which installs Windows virtual machines and then checks the written data in a loop until we trigger the corruption, since this is the most obvious area where this error occurred. With the script, if successful, we could try to collect more data or modify the system and observe the behavior.
If you can provide the script, it would also help us reproduce the issue. Thanks!
We have tried to reproduce this issue by creating VMs and backing up snapshots regularly to a compressed ZFS pool (QAT enabled), running repeatedly for more than 2 weeks, without any issue.
@wli5 my process would look exactly like that, but I haven't found the time to implement it yet; family and job currently have priority. The interesting part: it is not only bound to VMs, since I have the same corruption in backup data written through a Samba server. So I guess it is connected to the system load too. To be able to automate my VM I need to create an unattended setup for Windows 10 first, which takes some time; sorry for the delay. The current workaround for me is switching off compression. The achieved compression ratio is negligible compared to my available space.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Getting some Input/output errors when trying to read some data. "zpool status" shows all disks are ONLINE and "No known data errors", last scrub just finished earlier today. Noticed that the dc_fails counter is increasing when the I/O errors happen, so I'm guessing it's related to the QAT code in ZFS or the QAT card itself? The QAT card status is:
and there aren't any errors popping up in the system log when the I/O errors happen. I could open a case with Intel if you think it would be helpful.
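To correlate the I/O errors with QAT, the dc_fails counter from /proc/spl/kstat/zfs/qat can be sampled around a failing read. A hedged sketch (`qat_delta` is a hypothetical helper; the kstat path is the one shown in this issue):

```shell
#!/bin/sh
# Hedged sketch: sample the dc_fails counter from the ZFS QAT kstat before
# and after running a command that triggers the read errors, and report the
# delta. A nonzero delta ties the I/O error to a QAT decompression failure.
dc_fails() {
    # kstat rows are "name  type  value"; take the last field of dc_fails
    awk '$1 == "dc_fails" { print $NF }' "$1"
}
qat_delta() {
    kstat=$1; shift
    before=$(dc_fails "$kstat")
    "$@" || true                  # run the reproducer, ignore its exit status
    after=$(dc_fails "$kstat")
    echo "dc_fails: $before -> $after (delta $((after - before)))"
}
# e.g. qat_delta /proc/spl/kstat/zfs/qat md5sum /tank/path/to/bad-file
```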
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
There are no warnings/errors/backtraces in the system logs; just the dc_fails counter is incremented.
Thanks