FPU detected but no SIMD optimized encryption #9215

Closed
AttilaFueloep opened this issue Aug 26, 2019 · 14 comments
Labels
Type: Performance Performance improvement or performance problem

Comments

@AttilaFueloep
Contributor

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version N/A
Linux Kernel 5.2.9-arch1-1-ARCH
Architecture x86_64
ZFS Version git commit f09fda5 (2019-08-16)
SPL Version ditto

Describe the problem you're observing

It's my understanding that with the integration of #8965, SIMD support should work again, and indeed I see avx2 as the fastest implementation in e.g. fletcher_4_bench, indicating that ZFS is able to use the FPU.

$ cat /proc/spl/kstat/zfs/fletcher_4_bench
5 0 0x01 -1 0 2720293623 156491491640954
implementation   native         byteswap       
scalar           7029091918     5551234944     
superscalar      9433541338     7039884753     
superscalar4     8806767231     7058469765     
sse2             15540142458    9209733432     
ssse3            15699311619    14793781457    
avx2             26057814841    23605675674    
fastest          avx2           avx2           

But reading a file off of an encrypted filesystem peaks at 500 MB/s and uses all CPU, clearly indicating AES-NI isn't used. Otherwise I'd expect well above 1 GB/s (NVMe drive, Intel i7-8750H) at a much lower CPU load. I can't tell whether SIMD optimizations are really used for checksum calculations, but it seems to me that reading from an unencrypted filesystem produces more CPU load than before, when SIMD support was still working (pre 5.0); not sure though.

How would I debug this problem? Is there anything I have to tweak to get SIMD accelerated encryption back again?

I already asked on zfs-discuss but got no enlightening input.

Thanks

Attila

PS
Please refrain from starting any "evil Linux devs" discussion, all has been said in this regard already.

Describe how to reproduce the problem

Read a large file off of an encrypted filesystem and monitor throughput and CPU load.

Include any warning/errors/backtraces from the system logs

N/A

@DeHackEd
Contributor

To test, I'd suggest finding (or making) a large encrypted file, bringing the machine to idle, and doing cat $BIG_ENCRYPTED_FILE > /dev/null while running perf top in another window. Within perf, find the aes_encrypt_intel function near the top if you can; that should be the accelerated AES routine, and you can even annotate it to verify from the disassembly.

If you can't find that function in the output, or it's named something different, then yeah, you're probably not using AES-NI.

Avoiding the ARC for this test is up to you.
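
A minimal sketch of that test, assuming an encrypted dataset mounted at /tank/enc and perf installed (path and size are placeholders):

# dd if=/dev/urandom of=/tank/enc/bigfile bs=1M count=8192   # write a test file, ideally larger than your ARC
# cat /tank/enc/bigfile > /dev/null &                        # read it back so decryption runs in the IO pipeline
# perf top                                                   # look for aes_encrypt_intel / aes_aesni_encrypt vs. aes_generic_encrypt

If the file is still cached in the ARC the read never reaches the decryption path, so either use a file larger than the ARC or drop it from the cache first (e.g. by remounting the dataset).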

@AttilaFueloep
Contributor Author

AttilaFueloep commented Aug 26, 2019

Thanks, I'm seeing aes_generic_encrypt() in perf output, so AES-NI is definitely not used. Any idea what the problem might be?

@AttilaFueloep
Contributor Author

Fletcher 4 is indeed SIMD optimized; cat /unencrypted-ds/largefile >/dev/null gives fletcher_4_avx2_native() in perf output.

@AttilaFueloep
Contributor Author

AttilaFueloep commented Aug 26, 2019

To add to this, cat /proc/crypto gives, among others,

name         : gcm(aes)
driver       : generic-gcm-aesni
module       : aesni_intel
priority     : 400
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 1
ivsize       : 12
maxauthsize  : 16
geniv        : <none>

Since I use encryption=aes-256-gcm, I'd expect AES-NI to work.

The ZFS module is dkms compiled, do I need any special library for configure to pick up AES-NI support?

@behlendorf added the Type: Performance label Aug 26, 2019
@behlendorf
Contributor

@AttilaFueloep if you're running f09fda5 then no other changes should be needed. You can check which optimized versions are available by checking the contents of the following files.

  • RAIDZ - /sys/module/zfs/parameters/zfs_vdev_raidz_impl
  • Fletcher 4 - /sys/module/zcommon/parameters/zfs_fletcher_4_impl
  • AES - /sys/module/icp/parameters/icp_aes_impl
  • GCM - /sys/module/icp/parameters/icp_gcm_impl

Micro-benchmarks indicating which version was determined to be the fastest are also available for Fletcher 4 and RAIDZ.

  • Fletcher 4 - cat /proc/spl/kstat/zfs/fletcher_4_bench
  • RAIDZ - cat /proc/spl/kstat/zfs/vdev_raidz_bench

Encryption is a slightly different story: no micro-benchmarks are run for AES or GCM. When an optimized version is available it is assumed to perform better than the generic code and is preferentially used. There is one caveat: the accelerated version won't be used when first decrypting the wrapping key as part of zfs load-key. However, it will be used by the IO pipeline after this initial setup, so you may still see aes_generic_encrypt() appear in the perf top output. You should primarily see aes_encrypt_amd64 or aes_aesni_encrypt for the accelerated versions.

If that's not the case, we'll certainly want to dig deeper.

Note: /proc/crypto provides statistics for the kernel's implementations; ZFS provides its own.
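
As an aside, those parameter files are also writable, so a specific implementation can be pinned for testing; the value in brackets is the currently selected one. A minimal sketch, assuming the implementation names are the ones listed when reading the same file:

# cat /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 aesni
# echo aesni > /sys/module/icp/parameters/icp_aes_impl       # pin the AES-NI implementation
# echo pclmulqdq > /sys/module/icp/parameters/icp_gcm_impl   # pin the PCLMULQDQ GCM implementation
# echo fastest > /sys/module/icp/parameters/icp_aes_impl     # revert to automatic selection

Note that this only affects contexts created afterwards; as the reproducer and fix below show, encryption context templates that were already initialized with the generic code keep using it until the dataset is remounted.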

@AttilaFueloep
Contributor Author

AttilaFueloep commented Aug 27, 2019

@behlendorf First of all, thank you for your detailed explanations. I've no idea how I managed to miss the fact that ZoL uses the illumos crypto code; usually I do know that. The icp module was the bit I missed.

It took me a while to sort this out, but I have a reproducer now.

On a freshly booted system with a mostly idle desktop, run the following ($pool has mountpoint=none; not sure if this matters):

# cat  /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 aesni
# cat  /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq
# zfs create -o encryption=aes-256-gcm -o keyformat=passphrase $pool/foo
Enter passphrase:
Re-enter passphrase:
# zfs set mountpoint=/foo $pool/foo
# dd if=/dev/urandom of=/foo/bar bs=1M count=$((1024*32)) (my arc size is 20G)
# cat /foo/bar >/dev/null
^C

While cat is running, do a perf top in another terminal:

Overhead  Shared Object                       Symbol
  37.91%  [icp]                               [k] aes_generic_encrypt
  30.44%  [icp]                               [k] gcm_pclmulqdq_mul
   6.81%  [icp]                               [k] aes_xor_block
   5.31%  [icp]                               [k] aes_encrypt_block
   3.21%  [icp]                               [k] gcm_mul_pclmulqdq
   2.39%  [kernel]                            [k] __x86_indirect_thunk_rax
   2.36%  [icp]                               [k] gcm_decrypt_final

aes_aesni_encrypt() doesn't show up, aes_generic_encrypt() stays on top all the time.

If you do an unmount/mount cycle of /foo, the implementation used changes from generic to aesni:

# zfs unmount $pool/foo
# zfs mount $pool/foo
# cat /foo/bar >/dev/null

Overhead  Shared Object                       Symbol
  31.35%  [icp]                               [k] gcm_pclmulqdq_mul
  29.12%  [icp]                               [k] aes_aesni_encrypt
   6.78%  [icp]                               [k] aes_xor_block
   5.62%  [icp]                               [k] aes_encrypt_intel
   4.21%  [icp]                               [k] aes_encrypt_block
   3.33%  [icp]                               [k] gcm_mul_pclmulqdq
   3.25%  [kernel]                            [k] preempt_count_add
   2.87%  [kernel]                            [k] preempt_count_sub
   2.55%  [kernel]                            [k] __x86_indirect_thunk_rax
   2.21%  [icp]                               [k] gcm_decrypt_final

GCM seems to pick up the expected implementation.

@behlendorf
Contributor

@AttilaFueloep I was able to reproduce this locally and understand the cause. I'll see about putting together a patch.

@lovesegfault

I've had encrypted ZFS slow down very significantly, and I think I have the same issue:

File: /sys/module/zfs/parameters/zfs_vdev_raidz_impl
cycle [fastest] original scalar sse2 ssse3 avx2

File: /sys/module/zcommon/parameters/zfs_fletcher_4_impl
[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2

File: /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 aesni

File: /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq

File: /proc/spl/kstat/zfs/fletcher_4_bench
5 0 0x01 -1 0 3279387634 342555257913
implementation   native         byteswap
scalar           7035679076     2998336463
superscalar      8965934583     6416057333
superscalar4     8414687198     6858908607
sse2             14925732425    8871957784
ssse3            14383096130    14117282701
avx2             24660391165    22193686536
fastest          avx2           avx2

File: /proc/spl/kstat/zfs/vdev_raidz_bench
18 0 0x01 -1 0 4599695826 342555694538
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr
original         520363481       306862956       123794189       1338949457      301620445       45436110        116458955       26789302        26037614        18109890
scalar           1912009164      469128628       198106814       1759170367      529138724       397300883       243088353       203142018       133218276       114297165
sse2             3088572787      1313742388      688781527       3002038366      992581132       835532305       545011486       523304898       315135892       150529417
ssse3            3093255318      1300630695      676483229       3063879250      1628912263      1308276558      995395737       886620336       646762789       491185059
avx2             5792809931      2198491448      1191114729      5569530285      3160246027      2545778233      1794212965      1572169415      1191238109      910534839
fastest          avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2

@AttilaFueloep
Contributor Author

@lovesegfault The issue here is that newly created datasets do not pick up the fastest AES implementation until the dataset is remounted or the machine is rebooted.

What implementation do you see getting used if you follow DeHackEd's suggestion or the first part of my reproducer? What bandwidth are you observing, and on what hardware? I'm seeing 500 MB/s with all cores at 100% regardless of the AES implementation used. It does seem that the GCM calculations are the limiting factor. Currently I'm taking a stab at speeding things up; let's see how this goes.

@behlendorf
Contributor

Thanks for commenting, I should have a patch ready for testing by the end of the week.

@AttilaFueloep
Contributor Author

AttilaFueloep commented Sep 4, 2019

Take your time, it's easy to work around.

@lovesegfault

@AttilaFueloep I use ZFS as my root drive, so I can't just unmount/remount; this issue basically means my whole system is always slow. It's annoying, but at least everything continues to work :)

@AttilaFueloep
Contributor Author

@lovesegfault

I use ZFS as my root drive

I do as well, and still I'm seeing the SIMD versions getting used. It's hard to tell more without knowing any details.

@lovesegfault

@behlendorf I mentioned this in the PR, but I figured this is a better place: it seems to me that somehow the SIMD algos are still not being picked up.
#9296 (comment)

mattmacy pushed a commit to zfsonfreebsd/ZoF that referenced this issue Sep 10, 2019
When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context.  This was done
intentionally since it's a relatively infrequent operation.  However,
this also meant that the encryption context templates were initialized
using the generic operations.  Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.

Resolve the issue by initializing the context templates in an I/O
pipeline thread.  And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context.  For example, when performing a read from the ARC.

Tested-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9215
Closes openzfs#9296
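
To confirm the fix on a patched build, a quick check along the lines of the reproducer above (using the same hypothetical dataset and file, with the key already loaded) would be:

# zfs mount $pool/foo            # a fresh mount re-creates the encryption context templates via the fixed path
# cat /foo/bar > /dev/null &
# perf top                       # aes_aesni_encrypt and gcm_pclmulqdq_mul should dominate instead of aes_generic_encrypt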