
Enable SIMD for encryption #9296

Merged (1 commit into openzfs:master on Sep 10, 2019)

Conversation

behlendorf
Contributor

@behlendorf behlendorf commented Sep 7, 2019

Motivation and Context

Issue #9215.

Description

When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context. This was done
intentionally since it's a relatively infrequent operation. However,
this also meant that the encryption context templates were initialized
using the generic operations. Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
execute by an I/O pipeline thread.

Resolve the issue by initializing the context templates in an I/O
pipeline thread. And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context. For example, when performing a read from the ARC.
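As an illustration only (a Python model, not the actual C code; every name besides zio_do_crypt_uio is hypothetical), the dispatch rule described above amounts to:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the I/O pipeline taskq; thread names let us detect context.
pipeline = ThreadPoolExecutor(max_workers=2, thread_name_prefix="z_pipeline")

def in_pipeline_thread():
    return threading.current_thread().name.startswith("z_pipeline")

def do_crypt(data):
    # The real implementation may use SIMD instructions, which is only safe
    # in a pipeline thread context; assert we never get here from elsewhere.
    assert in_pipeline_thread(), "SIMD crypto attempted outside pipeline"
    return bytes(b ^ 0x5A for b in data)  # placeholder "encryption"

def zio_do_crypt_uio(data):
    # Mirrors the patch: when called from user context (e.g. a read
    # served from the ARC), hand the operation to a pipeline thread.
    if in_pipeline_thread():
        return do_crypt(data)
    return pipeline.submit(do_crypt, data).result()
```

Called from the main (user) thread, zio_do_crypt_uio() transparently round-trips through the pool; called from a pipeline thread it runs inline.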

How Has This Been Tested?

Manually by instrumenting the code to ensure the encryption SIMD operations
were never performed in an incorrect thread context. Verified the expected
implementation was being used (aesni).

Submitted to the CI for a full run of the test suite.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context.  This was done
intentionally since it's a relatively infrequent operation.  However,
this also meant that the encryption context templates were initialized
using the generic operations.  Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.

Resolve the issue by initializing the context templates in an I/O
pipeline thread.  And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context.  For example, when performing a read from the ARC.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Sep 7, 2019
@codecov

codecov bot commented Sep 7, 2019

Codecov Report

Merging #9296 into master will increase coverage by 0.17%.
The diff coverage is 93.54%.


@@            Coverage Diff             @@
##           master    #9296      +/-   ##
==========================================
+ Coverage   79.12%    79.3%   +0.17%     
==========================================
  Files         401      401              
  Lines      122063   122071       +8     
==========================================
+ Hits        96587    96812     +225     
+ Misses      25476    25259     -217
| Flag | Coverage | Δ |
|---|---|---|
| #kernel | 79.81% <83.33%> | +0.06% ⬆️ |
| #user | 67.41% <93.54%> | +0.64% ⬆️ |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bbf047...a7b4f25. Read the comment docs.

@AttilaFueloep
Contributor

Thanks for doing this. I'll need a couple of days to build a version with your commit cherry picked. Once I'm done I'll report back.

@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Sep 9, 2019
@behlendorf
Contributor Author

@AttilaFueloep that would be great, thanks!

@behlendorf behlendorf requested a review from tcaputi September 9, 2019 16:34
Contributor

@tcaputi tcaputi left a comment


LGTM. Do we have any performance numbers on this? On older kernels (where we have access to the fpu normally) do we always avoid using the taskq?

@behlendorf
Contributor Author

> Do we have any performance numbers on this?

I'm working on running those tests now, I should have some numbers to share soon.

> On older kernels (where we have access to the fpu normally) do we always avoid using the taskq?

That's right. For any kernel where the fpu symbols are available we don't do the dispatch in zio_do_crypt_uio(). The one exception is zio_crypt_key_unwrap(), where we always dispatch as a simplification to initialize the context templates.
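That policy can be captured in a tiny predicate (an illustrative sketch; `should_dispatch` and its arguments are hypothetical names, not the real code):

```python
def should_dispatch(op, fpu_usable_in_current_context):
    """Decide whether a crypto operation is handed to an I/O pipeline thread.

    Per the comment above: if the kernel FPU symbols are available, SIMD is
    safe in any context and no dispatch is needed -- except key unwrapping,
    which is always dispatched so the encryption context templates are
    initialized from a pipeline thread.
    """
    if op == "key_unwrap":
        return True  # simplification: always dispatch
    return not fpu_usable_in_current_context
```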

@lovesegfault

@behlendorf If you tell me how to generate results I'd be more than happy to try this on my machine

@behlendorf
Contributor Author

> Do we have any performance numbers on this?

The results are as expected. The performance gain from using the optimized versions (40-50%) significantly outweighs the negligible cost of dispatching the operation to the I/O pipeline (<0.5%).

| Impl | Seq Read (MiB/s) | Seq Write (MiB/s) |
|---|---|---|
| generic | 722 | 675 |
| x86_64 (no-dispatch) | 1031 | 1012 |
| x86_64 (dispatched) | 1026 | 1007 |
  • fio results averaged over 3 runs
  • tested using an AMD EPYC 7351 16-core system (using x86_64 instead of aesni)
  • manually modified the code to test the dispatch overhead

@lovesegfault additional testing would be welcome. Benchmarking your performance both before and after applying this commit should be enough to determine if it's working as intended.
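For anyone wanting to run comparable numbers, a fio job file along these lines would do. Treat this as a hedged sketch: the directory, size, and runtime are placeholders, not the exact settings used for the table above.

```ini
; seq.fio - sequential read/write against a file on the encrypted dataset
[global]
; placeholder mountpoint of an encrypted dataset
directory=/tank/encrypted
bs=1M
size=4G
ioengine=psync
runtime=60
time_based

[seq-write]
rw=write

[seq-read]
; wait for the write job to finish before reading
stonewall
rw=read
```

Run with `fio seq.fio`, repeating and averaging over a few runs as was done for the table above.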

@daniel-mcclintock

I am also running encryption on my root filesystem (aes-128-gcm) and am experiencing pretty poor performance. I would be happy to perform some testing if required. I am running this branch now but performance is still low.

Let me know if I can assist at all.

@behlendorf
Contributor Author

behlendorf commented Sep 10, 2019

@dmimpact one thing worth trying if you're running an AMD system is to set the icp_aes_impl implementation to x86_64. We've found it can outperform the aesni version depending on the system, as was the case in the testing above. I'd suggest applying this patch, loading the modules, running echo x86_64 >/sys/module/icp/parameters/icp_aes_impl, then importing your pool to see if there's an improvement.
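Spelled out as commands (guarded so it is safe to paste on a system where the icp module is not loaded; the parameter path is the one given above):

```shell
#!/bin/sh
# Select the hand-optimized x86_64 AES implementation in the ICP.
param=/sys/module/icp/parameters/icp_aes_impl
if [ -w "$param" ]; then
    echo x86_64 > "$param"
    # The active implementation is shown in [brackets].
    cat "$param"
else
    echo "skipping: $param not present or not writable (icp loaded? root?)"
fi
```

Then import the pool and re-run whatever workload was slow.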

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 10, 2019
@AttilaFueloep
Contributor

I can confirm that a7b4f25 fixes the issue for me. Using the first part of the reproducer given in #9215 I now see the SIMD implementation getting used. Nice job!

Performance-wise I see no noticeable difference, if any.

@AttilaFueloep
Contributor

@behlendorf

> one thing worth trying if you're running an amd system is to set the aes implementation to x86_64. We've found it can outperform the aesni version

I'm also seeing slightly (<10%) better performance with the x86_64 implementation on a six-core Intel Coffee Lake. But the performance is still well below the 3 GB/s a single core can deliver. It seems that the GCM implementation is the limiting factor.

@behlendorf
Contributor Author

@AttilaFueloep thanks for the additional data point. It sounds as if adding a micro benchmark to dynamically select the fastest available algorithms for the icp would be worthwhile.
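Such a micro-benchmark might be sketched like this (Python for illustration, with toy stand-in implementations; the real selection would live in the icp module, in C, much like the existing fletcher_4 and raidz benchmarks):

```python
import time

def bench(impl, payload, iterations=200):
    """Return the measured throughput of one candidate, in bytes/sec."""
    start = time.perf_counter()
    for _ in range(iterations):
        impl(payload)
    elapsed = time.perf_counter() - start
    return len(payload) * iterations / elapsed

def pick_fastest(candidates, payload):
    """Benchmark every candidate once and return the winner's name."""
    scores = {name: bench(fn, payload) for name, fn in candidates.items()}
    return max(scores, key=scores.get)

# Toy stand-ins: a slow per-byte "generic" path vs. a fast bulk operation.
candidates = {
    "generic": lambda buf: bytes(b ^ 0x5A for b in buf),
    "x86_64": lambda buf: buf[::-1],
}

print("fastest:", pick_fastest(candidates, b"\x00" * 4096))
```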

@behlendorf behlendorf merged commit b88ca2a into openzfs:master Sep 10, 2019
@lovesegfault

lovesegfault commented Sep 10, 2019

I just rebooted my system with this patch applied. While there is a very noticeable difference (`git status` on a very large repo, for example, takes 0.3s as opposed to ~3s before), the /sys information seems to be unchanged:

/sys/module/zfs/parameters/zfs_vdev_raidz_impl:
cycle [fastest] original scalar sse2 ssse3 avx2

/sys/module/zcommon/parameters/zfs_fletcher_4_impl:
[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2

/sys/module/icp/parameters/icp_aes_impl:
cycle [fastest] generic x86_64 aesni

/sys/module/icp/parameters/icp_gcm_impl:
cycle [fastest] generic pclmulqdq

/proc/spl/kstat/zfs/fletcher_4_bench:
5 0 0x01 -1 0 4080525647 575538761624
implementation   native         byteswap
scalar           7466579952     5359121972
superscalar      9629088852     7305813396
superscalar4     7201263074     7211906464
sse2             16978895502    9423597361
ssse3            16983024489    12816781828
avx2             26302488205    23665293062
fastest          avx2           avx2

/proc/spl/kstat/zfs/vdev_raidz_bench:
18 0 0x01 -1 0 5395518757 575538937607
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr
original         641172714       336736048       130317721       1502946388      331059342       49029629        129053073       28065549        27558190        18877923
scalar           1906158709      521174068       209778522       1806821780      613740429       417343287       267505400       218054777       136317558       118446716
sse2             3134425887      1448093032      714514580       3148147815      1096501832      875633505       573197891       530942863       321856232       147878595
ssse3            3185570225      1407155294      676799814       3170711232      1818589971      1344820674      1088712666      948039950       663061825       512357970
avx2             6223049054      2410269039      1224448763      6124180716      3583297155      2613174841      2000104658      1714679687      1272038091      966609623
fastest          avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2

@lovesegfault

@behlendorf Could /proc be misreporting somehow? How is it that it is faster but looks exactly the same otherwise?

@behlendorf
Contributor Author

How is it that it is faster but looks exactly the same otherwise?

Thanks for the additional testing of the patch, I'm glad to hear it helps as intended. This change won't affect the output of /sys/module/icp/parameters/icp_aes_impl; the issue was that internally the code wasn't using the "fastest" implementation but was instead falling back to the generic one. It will now correctly use the optimized implementation when available.

@lovesegfault

@behlendorf Oh! I'm surprised that aesni isn't the fastest impl!

mattmacy pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Sep 10, 2019
When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context.  This was done
intentionally since it's a relatively infrequent operation.  However,
this also meant that the encryption context templates were initialized
using the generic operations.  Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.

Resolve the issue by initializing the context templates in an I/O
pipeline thread.  And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context.  For example, when performing a read from the ARC.

Tested-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9215
Closes openzfs#9296
@daniel-mcclintock

@behlendorf Thanks. Unfortunately I am not running AMD but rather Intel Coffee Lake.

Performance does feel a bit better, but it is not what I would expect from my CPU + storage config (i7 + PCIe NVMe).

Regardless I am glad to see this issue has received attention and will test this out.
Thanks.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 17, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 18, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 18, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 18, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 19, 2019
behlendorf added a commit to behlendorf/zfs that referenced this pull request Oct 4, 2019
stevijo pushed a commit to stevijo/zfs that referenced this pull request Jan 12, 2020
@behlendorf behlendorf deleted the simd-crypt-compat branch April 19, 2021 18:20
Labels
Status: Accepted Ready to integrate (reviewed, tested)
5 participants