
ICP: Improve AES-GCM performance #9749

Merged
2 commits merged on Feb 10, 2020

Conversation

AttilaFueloep
Contributor

Motivation and Context

Currently SIMD accelerated AES-GCM performance is limited by two factors:

a. The need to disable preemption and interrupts and save the FPU state before using it, and to do the reverse when done. Due to the way the code is organized (see (b) below), we have to pay this price twice for each 16-byte GCM block processed.

b. Most processing is done in C, operating on single GCM blocks. The use of SIMD instructions is limited to the AES encryption of the counter block (AES-NI) and the Galois multiplication (PCLMULQDQ). This leads to the FPU not being fully utilized for crypto operations.

Description

To solve (a) we do crypto processing in larger chunks while owning the FPU. An icp_gcm_avx_chunk_size module parameter was introduced to make this chunk size tweakable. It defaults to 32 KiB. This step alone roughly doubles performance. (b) is tackled by porting and using the highly optimized openssl AES-GCM assembler routines, which do all the processing (CTR, AES, GMULT) in a single routine. Both steps together result in up to a 32x reduction of the time spent in the en/decryption routines, leading to an approximately 12x throughput increase for large (128 KiB) blocks.
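
For illustration, the chunked processing loop then looks roughly like the sketch below. Apart from kfpu_begin()/kfpu_end() and the icp_gcm_avx_chunk_size parameter, all identifiers are placeholders for this description and do not mirror the actual ICP code.

#include <sys/types.h>
#include <stdint.h>

#ifndef MIN
#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#endif

/* kfpu_begin()/kfpu_end() come from the SPL SIMD headers. */
extern void kfpu_begin(void);
extern void kfpu_end(void);

/* Placeholder for the ported openssl routine doing CTR, AES and GHASH in one pass. */
extern void aesni_gcm_process_chunk(const uint8_t *in, uint8_t *out, size_t len);

/* Tweakable via the icp_gcm_avx_chunk_size module parameter; defaults to 32 KiB. */
static size_t icp_gcm_avx_chunk_size = 32 * 1024;

/*
 * Sketch only: process the data in large chunks so the cost of saving and
 * restoring the FPU state is paid once per chunk instead of twice per
 * 16-byte GCM block.
 */
static void
gcm_avx_process(const uint8_t *in, uint8_t *out, size_t len)
{
	while (len > 0) {
		size_t chunk = MIN(len, icp_gcm_avx_chunk_size);

		kfpu_begin();	/* save FPU state, disable preemption */
		aesni_gcm_process_chunk(in, out, chunk);
		kfpu_end();	/* restore FPU state, re-enable preemption */

		in += chunk;
		out += chunk;
		len -= chunk;
	}
}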

How Has This Been Tested?

During development a special version was prepared which ran the old and new routines in succession and compared the results. This proved quite helpful in finding bugs.

The final version was tested with the ZTS, a six-hour run of zloop, rsyncing an Arch Linux installation (60 GiB) across multiple updates, and running scrubs with the old and new code in between. Finally, I'm running this version on the encrypted-ZFS-root box I'm typing this on.

Performance figures

The tables below show the speed improvements compared to the original implementation (pclmulqdq). The data was captured using bpftrace scripts, which can be found here, and manually post-processed.

Table 1:

Time spent in gcm_decrypt_final() and in gcm_mode_encrypt_contiguous_blocks(), in nanoseconds. This basically covers only the AES-GCM processing in the decrypt case, and processing plus write-out in the encrypt case. Only some exemplary data sizes are shown.

+-------+--------------------+-------+--------------------+-------+
| Size  | Decrypt time [ns]  | Fact. | Encrypt time [ns]  | Fact. |
| [KiB] +-----------+--------+       +-----------+--------+       |
|       | pclmulqdq |  avx   |       | pclmulqdq |  avx   |       |
+-------+-----------+--------+-------+-----------+--------+-------+
|   0.5 |     8,164 |  1,252 |   6.5 |    12,999 |  4,818 |   2.7 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     1 |    16,040 |  1,427 |  11.2 |    24,846 |  5,392 |   4.6 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     2 |    29,727 |  1,710 |  17.4 |    49,764 |  6,214 |   8.0 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     4 |    57,583 |  2,364 |  24.4 |    96,362 |  7,444 |  12.9 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     8 |   137,249 |  3,890 |  35.5 |   186,323 |  9,284 |  20.1 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    16 |   238,991 |  6,028 |  39.6 |   390,489 | 16,117 |  24.2 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    32 |   546,365 | 11,990 |  45.9 |   768,231 | 29,139 |  26.4 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    64 | 1,094,133 | 22,768 |  48.1 | 1,447,619 | 48,879 |  29.6 |
+-------+-----------+--------+-------+-----------+--------+-------+
|   128 | 2,038,599 | 42,521 |  47.9 | 2,797,580 | 83,720 |  33.4 |
+-------+-----------+--------+-------+-----------+--------+-------+

For small sizes I could only capture encryption times as follows:

+---------+--------------------+-------+
|  Size   | Encrypt time [ns]  | Fact. |
| [Bytes] +-----------+--------+       +
|         | pclmulqdq |  avx   |       |
+---------+-----------+--------+-------+
|     16  |       819 |    900 |   0.9 |
+---------+-----------+--------+-------+
|     32  |     1,074 |  1,091 |   1.0 |
+---------+-----------+--------+-------+
|    128  |     3,015 |    954 |   3.2 |
+---------+-----------+--------+-------+

Table 2:

Time spent processing a gcm_ctx, in nanoseconds. This starts with the call to gcm_init_ctx() and ends with the return from gcm_{en,de}crypt_final(), thus covering context initialization, final tag calculation and write-out. Again, only some exemplary data sizes are shown.

+-------+--------------------+-------+--------------------+-------+
| Size  | Decrypt time [ns]  | Fact. | Encrypt time [ns]  | Fact. |
| [KiB] +-----------+--------+       +-----------+--------+       |
|       | pclmulqdq |  avx   |       | pclmulqdq |  avx   |       |
+-------+-----------+--------+-------+-----------+--------+-------+
|   0.5 |    10,869 |  3,066 |   3.6 |    13,809 |  5,451 |   2.5 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     1 |    17,798 |  3,107 |   5.7 |    25,400 |  5,929 |   4.3 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     2 |    34,107 |  3,365 |  10.1 |    48,947 |  6,579 |   7.4 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     4 |    66,349 |  4,233 |  15.7 |    94,665 |  8,235 |  11.5 |
+-------+-----------+--------+-------+-----------+--------+-------+
|     8 |   153,649 |  7,150 |  21.5 |   182,939 | 11,122 |  16.4 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    16 |   287,996 | 11,126 |  25.9 |   373,888 | 16,842 |  22.2 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    32 |   605,206 | 21,532 |  28.1 |   751,444 | 31,245 |  24.1 |
+-------+-----------+--------+-------+-----------+--------+-------+
|    64 | 1,268,601 | 39,598 |  32.0 | 1,401,151 | 51,788 |  27.1 |
+-------+-----------+--------+-------+-----------+--------+-------+
|   128 | 2,037,635 | 73,167 |  27.8 | 2,712,798 | 92,379 |  29.4 |
+-------+-----------+--------+-------+-----------+--------+-------+

Here the number of captured small blocks was too low to show meaningful figures.

Fio results:

The user-visible speedup is about 12x. This was measured with fio runs (280 MiB/s / 22 MiB/s ≈ 12.7) and by manually copying and dd'ing files while measuring the time taken.

# fio --name="Test ICP PCLMUL" --randrepeat=1 --ioengine=libaio
--gtod_reduce=1 --directory=/var/tmp --bs=512 --iodepth=64 --filesize=10M-100m
--readwrite=randrw --rwmixread=50 --percentage_random=50
--blocksize_range=512-128k --blocksize_unaligned --unlink=1 --nrfiles=250

Test ICP: (g=0): rw=randrw, bs=(R) 512B-128KiB, (W) 512B-128KiB, (T) 512B-128KiB, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
Test ICP: Laying out IO files (250 files / total 13552MiB)
Jobs: 1 (f=168): [m(1)][100.0%][r=27.1MiB/s,w=28.4MiB/s][r=484,w=521 IOPS][eta 00m:00s]
Test ICP: (groupid=0, jobs=1): err= 0: pid=453146: Tue Dec 17 19:41:51 2019
  read: IOPS=390, BW=22.2MiB/s (23.3MB/s)(6763MiB/303985msec)
   bw (  KiB/s): min= 9460, max=33184, per=99.90%, avg=22758.74, stdev=3955.12, samples=607
   iops        : min=  142, max=  582, avg=389.86, stdev=75.40, samples=607
  write: IOPS=389, BW=22.3MiB/s (23.4MB/s)(6790MiB/303985msec); 0 zone resets
   bw (  KiB/s): min= 8943, max=31526, per=99.90%, avg=22849.19, stdev=4078.35, samples=607
   iops        : min=  146, max=  550, avg=389.24, stdev=76.27, samples=607
  cpu          : usr=0.25%, sys=88.38%, ctx=22414, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=118633,118465,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=6763MiB (7091MB), run=303985-303985msec
  WRITE: bw=22.3MiB/s (23.4MB/s), 22.3MiB/s-22.3MiB/s (23.4MB/s-23.4MB/s), io=6790MiB (7120MB), run=303985-303985msec

# fio --name="Test ICP AVX" --randrepeat=1 --ioengine=libaio
--gtod_reduce=1  --directory=/var/tmp --bs=512 --iodepth=64 --filesize=10M-100m
--readwrite=randrw --rwmixread=50 --percentage_random=50
--blocksize_range=512-128k --blocksize_unaligned --unlink=1 --nrfiles=250

Test ICP: (g=0): rw=randrw, bs=(R) 512B-128KiB, (W) 512B-128KiB, (T) 512B-128KiB, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
Test ICP: Laying out IO files (250 files / total 13552MiB)
Jobs: 1 (f=169): [m(1)][100.0%][r=219MiB/s,w=225MiB/s][r=4078,w=4147 IOPS][eta 00m:00s]
Test ICP: (groupid=0, jobs=1): err= 0: pid=62831: Tue Dec 17 18:48:44 2019
  read: IOPS=4890, BW=279MiB/s (292MB/s)(6763MiB/24256msec)
   bw (  KiB/s): min=115495, max=420832, per=100.00%, avg=286461.67, stdev=86508.74, samples=48
   iops        : min= 2110, max= 7236, avg=4906.65, stdev=1447.97, samples=48
  write: IOPS=4883, BW=280MiB/s (294MB/s)(6790MiB/24256msec); 0 zone resets
   bw (  KiB/s): min=118692, max=415220, per=100.00%, avg=287503.60, stdev=86732.52, samples=48
   iops        : min= 2090, max= 7242, avg=4896.92, stdev=1444.53, samples=48
  cpu          : usr=3.20%, sys=77.05%, ctx=5272, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=118633,118465,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=279MiB/s (292MB/s), 279MiB/s-279MiB/s (292MB/s-292MB/s), io=6763MiB (7091MB), run=24256-24256msec
  WRITE: bw=280MiB/s (294MB/s), 280MiB/s-280MiB/s (294MB/s-294MB/s), io=6790MiB (7120MB), run=24256-24256msec

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@AttilaFueloep
Contributor Author

Some further remarks:

  • Since this is crypto, I tried to keep my changes to the existing code to a bare minimum. This should also make it easier to merge possible future upstream changes.

  • I left in a couple of fixmes in places where I would appreciate reviewer feedback.

  • The changes made to the openssl assembler routines during porting can be found here (aesni-gcm-x86_64.S.diff) and here (ghash-x86_64.S.diff).

  • The ICP module parameters don't seem to be documented, so I have no place to add documentation.

  • My testing resources are quite limited so I would really appreciate any further testing.

@behlendorf added the Component: Encryption and Type: Performance labels Dec 19, 2019
@behlendorf
Contributor

Very impressive! This is a huge performance improvement, thank you for tackling this work.

cc: @jasonbking @Fabian-Gruenbichler

@AttilaFueloep
Contributor Author

I'm really sorry for the build failures, but I can't reproduce them locally. Do I need to add anything besides --enable-debug --enable-debuginfo to configure?

@behlendorf
Contributor

behlendorf commented Dec 19, 2019

No problem, that's why we have the CI, to test across a wide range of kernel, compiler, and library versions. It looks like you've managed to sort it out. Using --enable-debug --enable-debuginfo should be sufficient.

[edit] I'd also suggest running make checkstyle and make cppcheck locally. You will need to install cppcheck for your distribution of choice.

@AttilaFueloep
Contributor Author

Yes, I forgot to run make cstyle after my last change, and my editor converted four spaces to two. make lint gives me tons of errors even in files I did not touch; it seems the Arch Linux cppcheck is special. But I think I can fix the errors.

@behlendorf
Contributor

make lint gives me tons of errors even in files I did not touch

Try rebasing the PR on the master branch from yesterday. A collection of cppcheck patches was merged yesterday for issues only detected by newer versions of cppcheck; versions up to cppcheck v1.88 should now be clean.

@AttilaFueloep
Contributor Author

Well, the PR is based on master@a3640486fffc, which is from today, but Arch has cppcheck 1.89-1. I'll try downgrading to 1.88 and see if that helps.

@behlendorf
Contributor

The CI is still using cppcheck 1.82 and is reporting these warnings which do look related to this change.

@AttilaFueloep
Contributor Author

Right. Although I didn't fully understand the errors, I tried to fix them with commit f488b0e, which didn't help. I've now downgraded cppcheck to 1.86 and it's silent with and without f488b0e. Could that be a false positive? Paxcheck complained about an executable stack in the assembler files, which I've already fixed but not yet pushed. Should I revert f488b0e and add a /* LINTED */ comment?

@AttilaFueloep
Contributor Author

Well, I see ztest failing in the new code. I'll have to take a look tomorrow.

@behlendorf
Contributor

That's strange; I wasn't able to reproduce the issue with cppcheck 1.88 either. Let's revert f488b0e from the PR for now; we'll be updating the version of cppcheck used pretty shortly, so this may not be an issue.

@AttilaFueloep
Contributor Author

Ok, will do tomorrow. Looks like there is no support for the MOVBE instruction in the testers. I'm a bit puzzled here. What virtual CPU do they use?

@behlendorf
Contributor

@AttilaFueloep nothing exotic, they're currently using ec2 m3.large instances and will soon be switched to m5.large instances.

@AttilaFueloep
Contributor Author

Ok, thanks. I'll have to do some research regarding MOVBE support.

@AttilaFueloep
Contributor Author

Ok, M3 instances use Ivy/Sandy Bridge CPUs, which do have AVX but no MOVBE. I'd need to raise the requirement from AVX to AVX2 (or test explicitly for MOVBE), which would mean we can't use the testers to test this PR. That would lift the CPU requirements to at least Haswell or Excavator. I have no idea how to proceed from here.

@behlendorf
Contributor

Actually, it sounds like this went pretty well, since we caught this MOVBE issue. What you're going to want to do is add a check for X86_FEATURE_MOVBE in include/os/linux/kernel/linux/simd_x86.h. This way we'll be able to detect at run time whether it's available; take a look at the zfs_*_available() functions. Are there other specific instructions we should be concerned about?

As for how to test this, I don't think it'll be long before we switch to m5 instances, which according to the documentation use Skylake.

@AttilaFueloep
Contributor Author

Luckily, yes. I'll have to go over the openssl requirements again; obviously I missed something there.

This also raises the question of whether this code should always be compiled in, regardless of the build environment detected by configure, and only called on supported platforms at run time. The pclmulqdq code gets compiled in only if building on supported hardware (defined(HAVE_PCLMULQDQ)); I just mirrored that behaviour.

@behlendorf
Contributor

behlendorf commented Dec 20, 2019

should always be compiled in, regardless of the build environment detected by configure and only called on supported platforms at run time.

Yup. And it looks like that is the case for the pclmulqdq code, so your code is pretty close. There are a couple of levels of checks going on here, which can be a bit confusing. But to summarize with an example: the HAVE_PCLMULQDQ check detects whether the host build tool chain understands the instruction, not whether the host hardware supports it. This is coupled with zfs_pclmulqdq_available(), which is used for the run-time check.

You'll want to add similar checks for MOVBE to config/toolchain-simd.m4, include/os/linux/kernel/linux/simd_x86.h, and lib/libspl/include/sys/simd.h. Then always compile the code in when HAVE_MOVBE (and HAVE_AVX) is defined, but only enable it on module load when zfs_movbe_available() (and whatever else is needed) reports support.
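
For reference, a minimal sketch of such a run-time check, modeled on the existing zfs_*_available() helpers; this is an assumption about the shape of the check, not necessarily the exact code that ends up in include/os/linux/kernel/linux/simd_x86.h:

/*
 * Sketch of a run-time MOVBE check in the style of the existing
 * zfs_*_available() helpers (Linux kernel side); illustrative only.
 */
static inline boolean_t
zfs_movbe_available(void)
{
#if defined(X86_FEATURE_MOVBE)
	return (!!boot_cpu_has(X86_FEATURE_MOVBE));
#else
	return (B_FALSE);
#endif
}

The HAVE_MOVBE define from config/toolchain-simd.m4 would then gate compilation only, so the assembler routines are always built when the tool chain understands the instructions and are only selected at module load when the CPU actually reports MOVBE support.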

@Fabian-Gruenbichler
Contributor

I probably won't have time until the new year to take a closer look, but from a quick glance it seems like the proper preconditions are in place to avoid long kfpu_begin/end blocks which is probably all I can really contribute here ;)

@AttilaFueloep
Contributor Author

@behlendorf All right, got it. Checking for tool chain support definitely makes more sense. I'll try to find some time tomorrow.

@AttilaFueloep
Contributor Author

The "signed integer overflow" lint error is a false positive and will go away with cppcheck >= 0.86. Since the testers do not run the new code now, (they are missing the MOVBE instruction), I don't think the test failure is related to my changes. All other tests pass, but they are not very meaningful until the testers will be updated to newer ec2 instances which will be able to test the added code.

@behlendorf
Contributor

@AttilaFueloep thanks for addressing the review feedback so quickly. I agree, the testing failures you hit were unrelated to this change.

The CI has been updated to use m5d.large instances, which should support the needed instruction. I've gone ahead and resubmitted this PR for a fresh test run, but you may want to rebase it as well to clean out any old results. Additionally, the cppcheck version has been updated, so that issue should be resolved.

@AttilaFueloep
Contributor Author

@behlendorf

The CI has been updated to use m5d.large instances

Fantastic!

Squashed and rebased, it should be ready for review now.

@behlendorf added the Status: Code Review Needed label Dec 23, 2019
@rlaager
Member

rlaager commented Dec 24, 2019

@AttilaFueloep Any chance you have a comparison between AES-CCM and AES-GCM in ZFS? (It's not strictly related to this PR, but you may be in a good position to easily test it, which would be helpful to me in a different context.)

@nickcmaynard

Thank you @behlendorf, much appreciated!

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 15, 2020
Currently SIMD accelerated AES-GCM performance is limited by two
factors:

a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.

b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.

To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.

Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes openzfs#9749
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 15, 2020
There are a couple of x86_64 architectures which support all needed
features to make the accelerated GCM implementation work but the
MOVBE instruction. Those are mainly Intel Sandy- and Ivy-Bridge
and AMD Bulldozer, Piledriver, and Steamroller.

By using MOVBE only if available and replacing it with a MOV
followed by a BSWAP if not, those architectures now benefit from
the new GCM routines and performance is considerably better
compared to the original implementation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Followup openzfs#9749 
Closes openzfs#10029
AttilaFueloep added a commit to AttilaFueloep/zfs that referenced this pull request Oct 23, 2020
While preparing openzfs#9749 some .cfi_{start,end}proc directives
were missed. Add the missing ones.

See upstream openssl/openssl@275a048f

Signed-off-by: Attila Fülöp <attila@fueloep.org>
behlendorf pushed a commit that referenced this pull request Oct 30, 2020
While preparing #9749 some .cfi_{start,end}proc directives
were missed. Add the missing ones.

See upstream openssl/openssl@275a048f

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #11101