AES-CTR implementation using AES-NI intrinsics and SIMD operations.
After reading Understanding Cryptography, A Textbook for Students and Practitioners by Christof Paar and Jan Pelzl, I tried to implement AES in C++. At the end of Chapter 4, the book says that it is possible to implement AES in such a way that it requires fewer than 4 CPU cycles to encrypt a byte. So, I gave it a shot.
Below is the tabulated percentual differences between openssl
and tautastic/aes-ctr
:
Metric | openssl Value | tautastic/aes-ctr Value | Percentual Difference |
---|---|---|---|
CPU Utilization | 0.413 | 0.999 | +141.89% |
Cycles | 2,183,036,314 | 932,415,861 | -57.29% |
Instructions | 2,544,915,007 | 1,014,823,687 | -60.11% |
Cycles per Byte | 5.4576 | 2.3310 | -57.29% |
Throughput in GB/s | 0.325 | 0.712 | +119.08% |
Note: The comparison is done to illustrate the performance characteristics of
tautastic/aes-ctr
relative toopenssl
. Both libraries have their own set of features and optimizations which may suit different use cases.
Performance counter stats for 'openssl enc -aes-128-ctr -in song.webm -out song128.webm -K 2b7e151628aed2a6abf7158809cf4f3c -iv f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff':
508,18 msec task-clock # 0,413 CPUs utilized
117 context-switches # 230,232 /sec
1 cpu-migrations # 1,968 /sec
333 page-faults # 655,275 /sec
2.183.036.314 cycles # 4,296 GHz
2.544.915.007 instructions # 1,17 insn per cycle
416.065.929 branches # 818,732 M/sec
3.988.395 branch-misses # 0,96% of all branches
1,230478489 seconds time elapsed
0,129072000 seconds user
0,380437000 seconds sys
Performance counter stats for 'aes-128-ctr -in song.webm -out song128.webm -K 2b7e151628aed2a6abf7158809cf4f3c -iv f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff':
212,55 msec task-clock # 0,999 CPUs utilized
0 context-switches # 0,000 /sec
0 cpu-migrations # 0,000 /sec
103.563 page-faults # 487,245 K/sec
932.415.861 cycles # 4,387 GHz
1.014.823.687 instructions # 1,09 insn per cycle
129.065.235 branches # 607,229 M/sec
550.080 branch-misses # 0,43% of all branches
0,212803631 seconds time elapsed
0,087524000 seconds user
0,124655000 seconds sys
Note: The file
song.webm
has a size of400MB
.
Intel's whitepaper, "Breakthrough AES Performance with Intel® AES New Instructions", states in its conclusion: "We are able to achieve excellent AES performance on the Intel® Core™ i7 Processor Extreme Edition (i7-980X) using the new instructions. With optimized code, it is possible to achieve ~0.24 cycles/byte on 6 cores for AES128 on parallel modes for large buffers."
So there is still plenty of room for further optimization.
This implementation is intended for academic purposes only and should not be used in a production environment.