Fix QAT allocation failure return value #9788
Codecov Report
@@ Coverage Diff @@
## master #9788 +/- ##
========================================
+ Coverage 79% 80% +<1%
========================================
Files 385 385
Lines 121481 121481
========================================
+ Hits 96461 96600 +139
+ Misses 25020 24881 -139
When qat_compress() fails to allocate the required contiguous memory it mistakenly returns success. This prevents the fallback software compression from taking over and (un)compressing the block. Resolve the issue by correctly setting the local 'status' variable on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL to ensure qat_compress() always fails safe to guard against any similar bugs in the future. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9784
85faabe to bd2a933
@behlendorf thanks for finding and fixing the issue. We are on vacation and will validate it once we are back. |
Thanks but I have a problem building on CentOS 6.10 Linux 2.6.32-754.23.1.el6,
I ran I tried to use newer GCC 7 from Software Collections but got an error that it can't build modules,
Well I know it's an old system, but it is our most important and most expensive, plus CentOS 6's EOL isn't happening for 11 more months. We are planning to replace it within a couple months, though, and will put the latest CentOS on it. So if you can still help out with getting this old machine working it would be appreciated. |
If I change line 544 to use |
This is something we'll want to include in master and 0.8.3 (which does still support CentOS 6). So we'll make sure to resolve any qat build issues. |
Thanks, we would like to implement the fix ASAP though so we can read our data. If you know a definitive fix please let us know. |
Manually, setting
Assuming your systems encountered the low-memory error handling problem fixed by this PR, this will prevent any future damaged blocks. Blocks written prior to applying this fix may not have been compressed correctly on disk, resulting in the errors reported. |
Thanks @behlendorf, but what do you mean specifically by "Manually, setting zfs_kernel_param for your build"? Sorry, I am not very familiar with the code. After restoring data to the ZFS volume, I am taking a new full backup (which reads all the data), so if I see I/O errors I would know that data is corrupted and should be restored from the old backup, while the rest of the data is still OK. |
Specifically, I was referencing the change you made to resolve the compilation failure. That change shouldn't cause any problems; we'll just need to investigate why it was needed at all. |
Thanks for clarifying that, we will give it a shot. |
Thanks for fixing the issue. I have run some tests to verify this fix, and it works well. |
@cfzhu what tests did you run to tell it works? It doesn't work for us. We copy a file to QAT-compressed ZFS storage and dc_fails counter increases, there are no messages in the system log indicating the CPU was used to compress the file, and it gets corrupted--when we try to read the file we get input/output errors. |
@AGI-admin just to be sure, did you reboot so that all modules are loaded correctly? |
@Ornias1993 Yes rebooted and updated the kernel, recompiled ZFS against the new kernel and rebooted again |
@AGI-admin I also copied a test file to the ZFS storage with QAT compression, and it works well,
In addition, I have run test when
|
@AGI-admin Are you certain your QAT card or chipset is (still) good? |
@Ornias1993 It is a new QAT card only 1-2 months old but the server is over 5 years old, still running fine though as far as I can tell... is there a way to know if chipset or other hardware is going bad? I don't notice any other major problems like this. We are planning to replace the server in the next 1-2 months and will move the QAT card+storage to it. I also filed a case with Intel for the QAT card so will see what they say. It could be a different bug that we are experiencing, according to recent notes on #9784 |
@AGI-admin if you can run QAT sample code successfully that means QAT hardware is fine. Have you ever run ZFS with QAT successfully before? |
@wli5 I've tried |
Interesting... 1-2 months is within the DOA range for hardware... |
Yes, and it doesn't seem anyone can re-create the error with the same data. |
When qat_compress() fails to allocate the required contiguous memory it mistakenly returns success. This prevents the fallback software compression from taking over and (un)compressing the block. Resolve the issue by correctly setting the local 'status' variable on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL to ensure qat_compress() always fails safe to guard against any similar bugs in the future. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#9784 Closes openzfs#9788
Motivation and Context
Issue #9784
Description
When qat_compress() fails to allocate the required contiguous memory
it mistakenly returns success. This prevents the fallback software
compression from taking over and (un)compressing the block.
Resolve the issue by correctly setting the local 'status' variable
on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.
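The fail-safe pattern described above can be sketched roughly as follows. This is not the actual qat_compress() code: the `sketch_*` names, the status enum, and the `sketch_fail_alloc` switch are all hypothetical stand-ins, and plain `memcpy` takes the place of both the accelerator and the software compressor. It only illustrates why starting from a failure status and setting it explicitly on each exit path lets the caller fall back correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the QAT status codes (not the real CpaStatus). */
typedef enum {
	SKETCH_STATUS_SUCCESS = 0,
	SKETCH_STATUS_FAIL = -1
} sketch_status_t;

/* Hypothetical switch that lets a caller simulate allocation failure. */
static int sketch_fail_alloc = 0;

static void *
sketch_alloc(size_t n)
{
	return (sketch_fail_alloc ? NULL : malloc(n));
}

/* Stand-in for the hardware path; a memcpy replaces real compression. */
static sketch_status_t
sketch_hw_compress(const char *src, size_t len, char *dst, size_t *out_len)
{
	/* Start from failure so a forgotten assignment cannot report success. */
	sketch_status_t status = SKETCH_STATUS_FAIL;
	char *buf;

	assert(src != NULL && dst != NULL && out_len != NULL);

	buf = sketch_alloc(len);
	if (buf == NULL) {
		/* Set the status explicitly on the error path as well. */
		status = SKETCH_STATUS_FAIL;
		goto out;
	}

	memcpy(buf, src, len);
	memcpy(dst, buf, len);
	*out_len = len;
	status = SKETCH_STATUS_SUCCESS;
out:
	free(buf);	/* free(NULL) is a no-op */
	return (status);
}

/*
 * Stand-in for the caller: returns 1 if the hardware path handled the
 * block, 0 if the software fallback did.
 */
static int
sketch_compress(const char *src, size_t len, char *dst, size_t *out_len)
{
	if (sketch_hw_compress(src, len, dst, out_len) ==
	    SKETCH_STATUS_SUCCESS)
		return (1);

	/* Software fallback; only reached when failure is reported. */
	memcpy(dst, src, len);
	*out_len = len;
	return (0);
}
```

If the hardware path returned success after a failed allocation, the caller in this sketch would skip the fallback and hand back an unfilled destination buffer, which is the kind of silent damage the description is guarding against.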
How Has This Been Tested?
Visually inspected. Unfortunately, I don't currently have access to
a QAT accelerator to verify the fix.
@wli5 @cfzhu would it be possible for you to review and test this?