
ICP: Improve AES-GCM performance #9749

Merged
2 commits merged on Feb 10, 2020

Conversation

AttilaFueloep
Contributor

Motivation and Context

Currently SIMD accelerated AES-GCM performance is limited by two factors:

a. The need to disable preemption and interrupts and save the FPU state before using it, and to do the reverse when done. Due to the way the code is organized (see (b) below), we have to pay this price twice for each 16-byte GCM block processed.

b. Most processing is done in C, operating on single GCM blocks. The use of SIMD instructions is limited to the AES encryption of the counter block (AES-NI) and the Galois multiplication (PCLMULQDQ). This leads to the FPU not being fully utilized for crypto operations.

Description

To solve (a) we do crypto processing in larger chunks while owning the FPU. An icp_gcm_avx_chunk_size module parameter was introduced to make this chunk size tweakable. It defaults to 32 KiB. This step alone roughly doubles performance. (b) is tackled by porting and using the highly optimized openssl AES-GCM assembler routines, which do all the processing (CTR, AES, GMULT) in a single routine. Both steps together result in up to a 32x reduction of the time spent in the en/decryption routines, leading to an approximately 12x throughput increase for large (128 KiB) blocks.
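
For illustration, the chunked processing loop then looks roughly like the sketch below. Apart from kfpu_begin()/kfpu_end() and the icp_gcm_avx_chunk_size parameter, all identifiers are placeholders for this description and do not mirror the actual ICP code.

#include <sys/types.h>
#include <stdint.h>

#ifndef MIN
#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#endif

/* kfpu_begin()/kfpu_end() come from the SPL SIMD headers. */
extern void kfpu_begin(void);
extern void kfpu_end(void);

/* Placeholder for the ported openssl routine doing CTR, AES and GHASH in one pass. */
extern void aesni_gcm_process_chunk(const uint8_t *in, uint8_t *out, size_t len);

/* Tweakable via the icp_gcm_avx_chunk_size module parameter; defaults to 32 KiB. */
static size_t icp_gcm_avx_chunk_size = 32 * 1024;

/*
 * Sketch only: process the data in large chunks so the cost of saving and
 * restoring the FPU state is paid once per chunk instead of twice per
 * 16-byte GCM block.
 */
static void
gcm_avx_process(const uint8_t *in, uint8_t *out, size_t len)
{
	while (len > 0) {
		size_t chunk = MIN(len, icp_gcm_avx_chunk_size);

		kfpu_begin();	/* save FPU state, disable preemption */
		aesni_gcm_process_chunk(in, out, chunk);
		kfpu_end();	/* restore FPU state, re-enable preemption */

		in += chunk;
		out += chunk;
		len -= chunk;
	}
}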

How Has This Been Tested?

During development a special version was prepared which ran the old and new routines in succession and compared the results. This proved quite helpful in finding bugs.

The final version was tested with the ZTS, a six-hour run of zloop, rsyncing an Arch Linux installation (60 GiB) across multiple updates, and running scrubs with the old and new code in between. Finally, I'm running this version on the encrypted-ZFS-root box I'm typing this on.

Performance figures

The tables below show the speed improvements compared to the original implementation (pclmulqdq). The data was captured using bpftrace scripts, which can be found here, and manually post-processed.

Table 1:

Time spent in gcm_decrypt_final() and in gcm_mode_encrypt_contiguous_blocks(), in nanoseconds. This basically covers only the AES-GCM processing in the decrypt case, and processing plus write-out in the encrypt case. Only some exemplary data sizes are shown.

+-------+--------------------+-------+--------------------+-------+
| Size  | Decrypt time [ns]  | Fact. | Encrypt time [ns]  | Fact. |
| [KiB] +-----------+--------+       +-----------+--------+       |
|       | pclmulqdq |  avx   |       | pclmulqdq |  avx   |       |
+-------+-----------+--------+-------+-----------+--------+-------+
|   0.5 |     8,164 |  1,252 |   6.5 |    12,999 |  4,818 |   2.7 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     1 |    16,040 |  1,427 |  11.2 |    24,846 |  5,392 |   4.6 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     2 |    29,727 |  1,710 |  17.4 |    49,764 |  6,214 |   8.0 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     4 |    57,583 |  2,364 |  24.4 |    96,362 |  7,444 |  12.9 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     8 |   137,249 |  3,890 |  35.5 |   186,323 |  9,284 |  20.1 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    16 |   238,991 |  6,028 |  39.6 |   390,489 | 16,117 |  24.2 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    32 |   546,365 | 11,990 |  45.9 |   768,231 | 29,139 |  26.4 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    64 | 1,094,133 | 22,768 |  48.1 | 1,447,619 | 48,879 |  29.6 |
+-------+-----------+--------+-------+-----------+--------+-------+
|   128 | 2,038,599 | 42,521 |  47.9 | 2,797,580 | 83,720 |  33.4 |
+-------+-----------+--------+-------+-----------+--------+-------+

For small sizes I could only capture encryption times as follows:

+---------+--------------------+-------+
|  Size   | Encrypt time [ns]  | Fact. |
| [Bytes] +-----------+--------+       +
|         | pclmulqdq |  avx   |       |
+---------+-----------+--------+-------+
|     16  |       819 |    900 |   0.9 |
+---------+-----------+--------+-------+
|     32  |     1,074 |  1,091 |   1.0 |
+---------+-----------+--------+-------+
|    128  |     3,015 |    954 |   3.2 |
+---------+-----------+--------+-------+

Table 2:

Time spent processing a gcm_ctx, in nanoseconds. This starts with the call to gcm_init_ctx() and ends with the return from gcm_{en,de}crypt_final(), thus covering context initialization, final tag calculation and write-out. Again, only some exemplary data sizes are shown.

+-------+--------------------+-------+--------------------+-------+
| Size  | Decrypt time [ns]  | Fact. | Encrypt time [ns]  | Fact. |
| [KiB] +-----------+--------+       +-----------+--------+       |
|       | pclmulqdq |  avx   |       | pclmulqdq |  avx   |       |
+-------+-----------+--------+-------+-----------+--------+-------+
|   0.5 |    10,869 |  3,066 |   3.6 |    13,809 |  5,451 |   2.5 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     1 |    17,798 |  3,107 |   5.7 |    25,400 |  5,929 |   4.3 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     2 |    34,107 |  3,365 |  10.1 |    48,947 |  6,579 |   7.4 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     4 |    66,349 |  4,233 |  15.7 |    94,665 |  8,235 |  11.5 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     8 |   153,649 |  7,150 |  21.5 |   182,939 | 11,122 |  16.4 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    16 |   287,996 | 11,126 |  25.9 |   373,888 | 16,842 |  22.2 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    32 |   605,206 | 21,532 |  28.1 |   751,444 | 31,245 |  24.1 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    64 | 1,268,601 | 39,598 |  32.0 | 1,401,151 | 51,788 |  27.1 |
+-------+-----------+--------+-------+-----------+--------+-------+
|   128 | 2,037,635 | 73,167 |  27.8 | 2,712,798 | 92,379 |  29.4 |
+-------+-----------+--------+-------+-----------+--------+-------+

Here the number of captured small blocks was too low to show meaningful figures.

Fio results:

The user-visible speedup is about 12x. This was measured with fio runs (280 MiB/s / 22 MiB/s ≈ 12.7) and by manually copying and dd'ing files while measuring the time taken.

# fio --name="Test ICP PCLMUL" --randrepeat=1 --ioengine=libaio
--gtod_reduce=1 --directory=/var/tmp --bs=512 --iodepth=64 --filesize=10M-100m
--readwrite=randrw --rwmixread=50 --percentage_random=50
--blocksize_range=512-128k --blocksize_unaligned --unlink=1 --nrfiles=250

Test ICP: (g=0): rw=randrw, bs=(R) 512B-128KiB, (W) 512B-128KiB, (T) 512B-128KiB, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
Test ICP: Laying out IO files (250 files / total 13552MiB)
Jobs: 1 (f=168): [m(1)][100.0%][r=27.1MiB/s,w=28.4MiB/s][r=484,w=521 IOPS][eta 00m:00s]
Test ICP: (groupid=0, jobs=1): err= 0: pid=453146: Tue Dec 17 19:41:51 2019
  read: IOPS=390, BW=22.2MiB/s (23.3MB/s)(6763MiB/303985msec)
   bw (  KiB/s): min= 9460, max=33184, per=99.90%, avg=22758.74, stdev=3955.12, samples=607
   iops        : min=  142, max=  582, avg=389.86, stdev=75.40, samples=607
  write: IOPS=389, BW=22.3MiB/s (23.4MB/s)(6790MiB/303985msec); 0 zone resets
   bw (  KiB/s): min= 8943, max=31526, per=99.90%, avg=22849.19, stdev=4078.35, samples=607
   iops        : min=  146, max=  550, avg=389.24, stdev=76.27, samples=607
  cpu          : usr=0.25%, sys=88.38%, ctx=22414, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=118633,118465,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=6763MiB (7091MB), run=303985-303985msec
  WRITE: bw=22.3MiB/s (23.4MB/s), 22.3MiB/s-22.3MiB/s (23.4MB/s-23.4MB/s), io=6790MiB (7120MB), run=303985-303985msec

# fio --name="Test ICP AVX" --randrepeat=1 --ioengine=libaio
--gtod_reduce=1  --directory=/var/tmp --bs=512 --iodepth=64 --filesize=10M-100m
--readwrite=randrw --rwmixread=50 --percentage_random=50
--blocksize_range=512-128k --blocksize_unaligned --unlink=1 --nrfiles=250

Test ICP: (g=0): rw=randrw, bs=(R) 512B-128KiB, (W) 512B-128KiB, (T) 512B-128KiB, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
Test ICP: Laying out IO files (250 files / total 13552MiB)
Jobs: 1 (f=169): [m(1)][100.0%][r=219MiB/s,w=225MiB/s][r=4078,w=4147 IOPS][eta 00m:00s]
Test ICP: (groupid=0, jobs=1): err= 0: pid=62831: Tue Dec 17 18:48:44 2019
  read: IOPS=4890, BW=279MiB/s (292MB/s)(6763MiB/24256msec)
   bw (  KiB/s): min=115495, max=420832, per=100.00%, avg=286461.67, stdev=86508.74, samples=48
   iops        : min= 2110, max= 7236, avg=4906.65, stdev=1447.97, samples=48
  write: IOPS=4883, BW=280MiB/s (294MB/s)(6790MiB/24256msec); 0 zone resets
   bw (  KiB/s): min=118692, max=415220, per=100.00%, avg=287503.60, stdev=86732.52, samples=48
   iops        : min= 2090, max= 7242, avg=4896.92, stdev=1444.53, samples=48
  cpu          : usr=3.20%, sys=77.05%, ctx=5272, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=118633,118465,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=279MiB/s (292MB/s), 279MiB/s-279MiB/s (292MB/s-292MB/s), io=6763MiB (7091MB), run=24256-24256msec
  WRITE: bw=280MiB/s (294MB/s), 280MiB/s-280MiB/s (294MB/s-294MB/s), io=6790MiB (7120MB), run=24256-24256msec

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@AttilaFueloep
Contributor Author

Some further remarks:

  • Since this is crypto, I tried to keep my changes to the existing code to a bare minimum. This should also make it easier to merge possible future upstream changes.

  • I left in a couple of fixmes in places where I would appreciate reviewer feedback.

  • The changes made to the openssl assembler routines during porting can be found here (aesni-gcm-x86_64.S.diff) and here (ghash-x86_64.S.diff).

  • The ICP module parameters don't seem to be documented, so I have no place to add documentation.

  • My testing resources are quite limited so I would really appreciate any further testing.

@behlendorf added the Component: Encryption and Type: Performance labels Dec 19, 2019
@behlendorf
Contributor

Very impressive! This is a huge performance improvement, thank you for tackling this work.

cc: @jasonbking @Fabian-Gruenbichler

@AttilaFueloep
Contributor Author

I'm really sorry for the build failures, but I can't reproduce them locally. Do I need to add anything besides --enable-debug --enable-debuginfo to configure?

@behlendorf
Contributor

behlendorf commented Dec 19, 2019

No problem, that's why we have the CI, to test across a wide range of kernel, compiler, and library versions. It looks like you've managed to sort it out. Using --enable-debug --enable-debuginfo should be sufficient.

[edit] I'd also suggest running make checkstyle and make cppcheck locally. You will need to install cppcheck for your distribution of choice.

@AttilaFueloep
Contributor Author

Yes, I forgot to run make cstyle after my last change, and my editor converted four spaces to two. make lint gives me tons of errors even in files I did not touch; it seems the Arch Linux cppcheck is special. But I think I can fix the errors.

@behlendorf
Contributor

make lint gives me tons of errors even in files I did not touch

Try rebasing the PR on the master branch from yesterday. A collection of cppcheck patches was merged yesterday for issues only detected by newer versions of cppcheck; versions up to cppcheck v1.88 should now be clean.

@AttilaFueloep
Contributor Author

Well, the PR is based on master@a3640486fffc, which is from today, but Arch has cppcheck 1.89-1. I'll try downgrading to 1.88 and see if that helps.

@behlendorf
Contributor

The CI is still using cppcheck 1.82 and is reporting these warnings which do look related to this change.

@AttilaFueloep
Contributor Author

Right. Although I didn't fully understand the errors, I tried to fix them with commit f488b0e, which didn't help. I've now downgraded cppcheck to 1.86 and it's silent with and without f488b0e. Could that be a false positive? Paxcheck complained about an executable stack in the assembler files, which I've already fixed but not yet pushed. Should I revert f488b0e and add a /* LINTED */ comment?

@AttilaFueloep
Contributor Author

Well, I see ztest failing in the new code. I'll have to take a look tomorrow.

@behlendorf
Contributor

That's strange; I wasn't able to reproduce the issue with cppcheck 1.88 either. Let's revert f488b0e from the PR for now; we'll be updating the version of cppcheck used pretty shortly, so this may not be an issue.

@AttilaFueloep
Contributor Author

Ok, will do tomorrow. Looks like there is no support for the MOVBE instruction in the testers. I'm a bit puzzled here. What virtual CPU do they use?

@behlendorf
Contributor

@AttilaFueloep nothing exotic, they're currently using ec2 m3.large instances and will soon be switched to m5.large instances.

@AttilaFueloep
Contributor Author

Ok, thanks. I'll have to do some research regarding MOVBE support.

@AttilaFueloep
Contributor Author

Ok, M3 instances use Ivy/Sandy Bridge CPUs, which do have AVX but no MOVBE. I'd need to raise the requirement from AVX to AVX2 (or test explicitly for MOVBE), which would mean we can't use the testers to test this PR. That would lift the CPU requirements to at least Haswell or Excavator. I have no idea how to proceed from here.

@behlendorf
Contributor

Actually, it sounds like this went pretty well, since we caught this MOVBE issue. What you're going to want to do is add a check for X86_FEATURE_MOVBE in include/os/linux/kernel/linux/simd_x86.h. This way we'll be able to detect at run time whether it's available; take a look at the zfs_*_available() functions. Are there other specific instructions we should be concerned about?

As for how to test this, I don't think it'll be long before we switch to m5 instances, which according to the documentation use Skylake.

@AttilaFueloep
Contributor Author

Luckily, yes. I'll have to go over the openssl requirements again; obviously I missed something there.

This also raises the question of whether this code should always be compiled in, regardless of the build environment detected by configure, and only called on supported platforms at run time. The pclmulqdq code gets compiled in only if building on supported hardware (defined(HAVE_PCLMULQDQ)); I just mirrored that behaviour.

@behlendorf
Contributor

behlendorf commented Dec 20, 2019

should always be compiled in, regardless of the build environment detected by configure and only called on supported platforms at run time.

Yup. And it looks like that is the case for the pclmulqdq code, so your code is pretty close. There are a couple of levels of checks going on here, which can be a bit confusing. But to summarize with an example: the HAVE_PCLMULQDQ check detects whether the host build tool chain understands the instruction, not whether the host hardware supports it. This is coupled with zfs_pclmulqdq_available(), which is used for the run-time check.

You'll want to add similar checks for MOVBE to config/toolchain-simd.m4, include/os/linux/kernel/linux/simd_x86.h, and lib/libspl/include/sys/simd.h. Then always compile the code in when HAVE_MOVBE (and HAVE_AVX) is defined, but only enable it on module load when zfs_movbe_available() (and whatever else is needed) reports support.
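
For reference, a minimal sketch of such a run-time check, modeled on the existing zfs_*_available() helpers; this is an assumption about the shape of the check, not necessarily the exact code that ends up in include/os/linux/kernel/linux/simd_x86.h:

/*
 * Sketch of a run-time MOVBE check in the style of the existing
 * zfs_*_available() helpers (Linux kernel side); illustrative only.
 */
static inline boolean_t
zfs_movbe_available(void)
{
#if defined(X86_FEATURE_MOVBE)
	return (!!boot_cpu_has(X86_FEATURE_MOVBE));
#else
	return (B_FALSE);
#endif
}

The HAVE_MOVBE define from config/toolchain-simd.m4 would then gate compilation only, so the assembler routines are always built when the tool chain understands the instructions and are only selected at module load when the CPU actually reports MOVBE support.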

@Fabian-Gruenbichler
Contributor

I probably won't have time until the new year to take a closer look, but from a quick glance it seems like the proper preconditions are in place to avoid long kfpu_begin/end blocks which is probably all I can really contribute here ;)

@AttilaFueloep
Contributor Author

@behlendorf All right, got it. Checking for tool chain support definitely makes more sense. I'll try to find some time tomorrow.

@AttilaFueloep
Contributor Author

The "signed integer overflow" lint error is a false positive and will go away with cppcheck >= 0.86. Since the testers do not run the new code now, (they are missing the MOVBE instruction), I don't think the test failure is related to my changes. All other tests pass, but they are not very meaningful until the testers will be updated to newer ec2 instances which will be able to test the added code.

@behlendorf
Contributor

@AttilaFueloep thanks for addressing the review feedback so quickly. I agree, the testing failures you hit were unrelated to this change.

The CI has been updated to use m5d.large instances, which should support the needed instruction. I've gone ahead and resubmitted this PR for a fresh test run, but you may want to rebase it as well to clean out any old results. Additionally, the cppcheck version has been updated, so that issue should be resolved.

@AttilaFueloep
Contributor Author

@behlendorf

The CI has been updated to use m5d.large instances

Fantastic!

Squashed and rebased, it should be ready for review now.

@behlendorf added the Status: Code Review Needed label Dec 23, 2019
@rlaager
Member

rlaager commented Dec 24, 2019

@AttilaFueloep Any chance you have a comparison between AES-CCM and AES-GCM in ZFS? (It's not strictly related to this PR, but you may be in a good position to easily test it, which would be helpful to me in a different context.)

@nickcmaynard

Thank you @behlendorf, much appreciated!

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 15, 2020
Currently SIMD accelerated AES-GCM performance is limited by two
factors:

a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.

b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.

To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.

Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes openzfs#9749
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 15, 2020
There are a couple of x86_64 architectures which support all needed
features to make the accelerated GCM implementation work but the
MOVBE instruction. Those are mainly Intel Sandy- and Ivy-Bridge
and AMD Bulldozer, Piledriver, and Steamroller.

By using MOVBE only if available and replacing it with a MOV
followed by a BSWAP if not, those architectures now benefit from
the new GCM routines and performance is considerably better
compared to the original implementation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Followup openzfs#9749 
Closes openzfs#10029
AttilaFueloep added a commit to AttilaFueloep/zfs that referenced this pull request Oct 23, 2020
While preparing openzfs#9749 some .cfi_{start,end}proc directives
were missed. Add the missing ones.

See upstream openssl/openssl@275a048f

Signed-off-by: Attila Fülöp <attila@fueloep.org>
behlendorf pushed a commit that referenced this pull request Oct 30, 2020
While preparing #9749 some .cfi_{start,end}proc directives
were missed. Add the missing ones.

See upstream openssl/openssl@275a048f

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #11101