Go crypto faster than OpenSSL on AES-NI systems #23

Closed
lxp opened this issue May 8, 2016 · 19 comments

Comments

@lxp
Contributor

lxp commented May 8, 2016

On my system Go crypto seems to be a lot faster than OpenSSL crypto.
I started to investigate this with gocryptfs 0.9 and perf on Linux 4.4. Under heavy load (multiple rsync processes running), perf attributed 60% overhead to the Go runtime's native call checks (runtime.cgoCheckArg), which were triggered by the OpenSSL calls.
I will provide proper benchmarks with gocryptfs 0.10-rc1, once my system is idle again.

$ cat /proc/cpuinfo 
[...]
model name  : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
[...]
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts
[...]
@rfjakob
Owner

rfjakob commented May 8, 2016

That would definitely be interesting. You can run the built-in benchmark using

cd gocryptfs/internal/stupidgcm
go test -bench .

On my machine, I get this (StupidGCM = simple OpenSSL wrapper, GoGCM = built-in Go crypto):

Benchmark4kEncStupidGCM-2      50000         24774 ns/op     165.33 MB/s
Benchmark4kEncGoGCM-2          10000        120745 ns/op      33.92 MB/s

My CPU does not have AES-NI:

$ cat /proc/cpuinfo
[...]
model name  : Intel(R) Pentium(R) CPU G630 @ 2.70GHz
[...]
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave lahf_lm arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt
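
For reading the benchmark lines above: the MB/s column follows from ns/op and the number of bytes processed per operation. The following is a minimal sketch, assuming the 4096-byte block size implied by the "4k" in the benchmark names and 1 MB = 10^6 bytes. The names mbPerSec and speedup are illustrative and not part of gocryptfs.

```go
package main

import "fmt"

// mbPerSec converts a benchmark result into throughput in MB/s
// (1 MB = 1e6 bytes). bytes/ns equals GB/s, hence the factor 1e3.
func mbPerSec(bytesPerOp int, nsPerOp float64) float64 {
	return float64(bytesPerOp) / nsPerOp * 1e3
}

// speedup reports how many times faster the "fast" result is
// compared to the "slow" one, based on ns/op.
func speedup(slowNsPerOp, fastNsPerOp float64) float64 {
	return slowNsPerOp / fastNsPerOp
}

func main() {
	const blockSize = 4096 // assumption: one 4 kB block per operation

	// ns/op values taken from the G630 (no AES-NI) results above.
	openssl := 24774.0
	goGCM := 120745.0

	fmt.Printf("OpenSSL: %.2f MB/s\n", mbPerSec(blockSize, openssl))
	fmt.Printf("Go GCM:  %.2f MB/s\n", mbPerSec(blockSize, goGCM))
	fmt.Printf("OpenSSL is %.1fx faster on this CPU\n", speedup(goGCM, openssl))
}
```

Because ns/op is printed as a rounded integer, recomputed MB/s values can differ from the reported ones in the last digits.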

@lxp
Contributor Author

lxp commented May 8, 2016

My machine (i5-4690K) is still not fully idle, but I think the results are clear enough:

$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          7123 ns/op     575.03 MB/s
Benchmark4kEncGoGCM-4         500000          2512 ns/op    1629.95 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 2.867s
$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          6949 ns/op     589.37 MB/s
Benchmark4kEncGoGCM-4         500000          2480 ns/op    1651.41 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 2.803s
$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          6985 ns/op     586.37 MB/s
Benchmark4kEncGoGCM-4         500000          2480 ns/op    1651.13 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 2.813s

Results from the old openssl_benchmark.bash from v0.9:

$ ./openssl_benchmark.bash 
+ go test -bench=.
Benchmarking AES-GCM-256 with 4kB block size
testing: warning: no tests to run
PASS
BenchmarkGoEnc4K-4       1000000          1493 ns/op    2743.30 MB/s
BenchmarkGoDec4K-4       1000000          1481 ns/op    2764.83 MB/s
BenchmarkOpensslEnc4K-4   200000          7624 ns/op     537.24 MB/s
BenchmarkOpensslDec4K-4   100000         20524 ns/op     199.56 MB/s
ok      github.com/rfjakob/gocryptfs/openssl_benchmark  6.878s
$ ./openssl_benchmark.bash 
+ go test -bench=.
Benchmarking AES-GCM-256 with 4kB block size
testing: warning: no tests to run
PASS
BenchmarkGoEnc4K-4       1000000          1497 ns/op    2734.83 MB/s
BenchmarkGoDec4K-4       1000000          1487 ns/op    2754.54 MB/s
BenchmarkOpensslEnc4K-4   200000          7648 ns/op     535.54 MB/s
BenchmarkOpensslDec4K-4   100000         20577 ns/op     199.05 MB/s
ok      github.com/rfjakob/gocryptfs/openssl_benchmark  6.901s
$ ./openssl_benchmark.bash 
+ go test -bench=.
Benchmarking AES-GCM-256 with 4kB block size
testing: warning: no tests to run
PASS
BenchmarkGoEnc4K-4       1000000          1500 ns/op    2729.13 MB/s
BenchmarkGoDec4K-4       1000000          1490 ns/op    2747.32 MB/s
BenchmarkOpensslEnc4K-4   200000          7690 ns/op     532.61 MB/s
BenchmarkOpensslDec4K-4   100000         20579 ns/op     199.03 MB/s
ok      github.com/rfjakob/gocryptfs/openssl_benchmark  6.941s

I am not sure what causes the difference in Go crypto performance between the two benchmarks (I haven't looked into the code, though).
What I also find interesting in the old benchmark is that OpenSSL decryption is significantly slower than encryption.

@rfjakob
Owner

rfjakob commented May 8, 2016

The old benchmarks use a 12-byte IV, which is Go's default. Since v0.7, gocryptfs actually uses 16 bytes and the new benchmarks reflect that.
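
To make the IV size difference concrete, here is a rough sketch of how the two variants can be set up and timed with the Go standard library. It is not the gocryptfs benchmark code: newGCM and bench4k are hypothetical helper names, the all-zero 32-byte key and the fixed nonce are placeholders that are only acceptable in a benchmark, and the measured numbers will depend on the Go version and CPU.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"testing"
)

// newGCM returns AES-GCM with the given nonce (IV) size in bytes.
// cipher.NewGCM would give the standard 12-byte nonce; here the size
// is passed explicitly so that 16 bytes can be requested as well.
func newGCM(key []byte, nonceSize int) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCMWithNonceSize(block, nonceSize)
}

// bench4k measures sealing one 4096-byte block per iteration.
func bench4k(nonceSize int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		aead, err := newGCM(make([]byte, 32), nonceSize) // AES-256, dummy key
		if err != nil {
			b.Fatal(err)
		}
		// Fixed nonce: for benchmarking only, never reuse a nonce for real data.
		nonce := make([]byte, nonceSize)
		plain := make([]byte, 4096)
		dst := make([]byte, 0, len(plain)+aead.Overhead())
		b.SetBytes(int64(len(plain))) // enables the MB/s figure
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			dst = aead.Seal(dst[:0], nonce, plain, nil)
		}
	})
}

func main() {
	for _, n := range []int{12, 16} {
		fmt.Printf("nonce=%2d bytes: %s\n", n, bench4k(n))
	}
}
```

Running both sizes on the same machine is one way to check how much of the gap between the old and new Go numbers comes from the IV size alone.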

@rfjakob
Owner

rfjakob commented May 8, 2016

In any case, the performance difference between Go and OpenSSL is huge. I will add autodetection that switches to Go crypto if AES-NI is available.

@lxp
Contributor Author

lxp commented May 8, 2016

Ah okay, that explains it.
For me, the current situation is no problem, as I just use -openssl=false during mounting.
Yeah, autodetection was exactly what I wanted to recommend :)
I think the Go crypto code already does it. I am just not sure if it is easily accessible from outside.
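
One simple way to check for AES-NI from outside the crypto package on Linux is to look for the "aes" flag in /proc/cpuinfo, as in the listings in this thread. The sketch below only illustrates that idea and is not necessarily what gocryptfs ends up doing. hasCPUFlag is a hypothetical helper, and the flag only says what the CPU offers, not whether the Go version in use takes advantage of it.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasCPUFlag reports whether the first "flags" line of the given
// /proc/cpuinfo text lists the flag as a whole word.
func hasCPUFlag(cpuinfo, flag string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[0]) != "flags" {
			continue
		}
		for _, f := range strings.Fields(parts[1]) {
			if f == flag {
				return true
			}
		}
		// Assumption: all cores report the same flags, so the
		// first "flags" line is enough.
		return false
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo") // Linux only
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	fmt.Println("aes flag present:", hasCPUFlag(string(data), "aes"))
}
```

Matching whole fields rather than substrings avoids false positives from other flags that merely contain the letters "aes".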

@lxp
Contributor Author

lxp commented May 8, 2016

I am rather new to Go. Do you know if there is an easy way to compile the benchmark as a binary?
Then I could also test it on one of the first Intel processors supporting AES-NI (Xeon E5620).
I know it has worse AES-NI performance than newer processors, but it would be interesting to know if Go crypto is still faster.

@alphazo

alphazo commented May 9, 2016

Similar results here on an i5 core that has AES-NI instructions.

$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          8815 ns/op     464.65 MB/s
Benchmark4kEncGoGCM-4         300000          3796 ns/op    1078.98 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 3.147s

$ cat /proc/cpuinfo 
[...]
model name  : Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
[...]
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts

@rfjakob
Owner

rfjakob commented May 10, 2016

@lxp Run go test -c to get the stupidgcm.test binary. The benchmark is then run using

./stupidgcm.test -test.bench .

@rfjakob
Owner

rfjakob commented May 10, 2016

Ugh. Looks like it is going to be more complicated than checking for the "aes" flag.

$ go test -bench .
PASS
Benchmark4kEncStupidGCM-2     200000         10611 ns/op     385.99 MB/s
Benchmark4kEncGoGCM-2          30000         44999 ns/op      91.02 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 4.429s

$ cat /proc/cpuinfo | grep -e "model name\|flags" | head -2
model name  : Intel Xeon E312xx (Sandy Bridge)
flags       : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm xsaveopt

$ go version
go version go1.5.1 linux/amd64

@rfjakob
Owner

rfjakob commented May 10, 2016

OK, here we go: Go seems to use the AES instructions starting with v1.6. This is on the same box as above.

$ ~/go/bin/go test -bench .
PASS
Benchmark4kEncStupidGCM-2     100000         16528 ns/op     247.81 MB/s
Benchmark4kEncGoGCM-2         300000          5014 ns/op     816.86 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 3.603s

$ ~/go/bin/go version
go version go1.6.2 linux/amd64

@alphazo

alphazo commented May 10, 2016

Hi guys, in case you are interested: I ran some benchmarks on my desktop machine with a fresh SSD, comparing plain, gocryptfs (openssl on/off), encfs, securefs, truecrypt & dm-crypt.
Keep in mind that truecrypt & dm-crypt play in a different league, since they are not file-based encryption tools.
https://gist.github.com/alphazo/09a2e523e22e7aa00d491ab67678dd80

@lxp
Contributor Author

lxp commented May 10, 2016

@rfjakob Thank you, I didn't expect such a simple solution :)
I compiled a version with Go 1.6 and used the same binary on all machines.
I think the benchmarks draw a pretty clear picture.
AES-NI + Go 1.6+ -> Go Crypto
Otherwise -> OpenSSL

$ go version
go version go1.6 linux/amd64

AES-NI

Skylake (Launch: Q3'15)

$ cat /proc/cpuinfo
model name  : Intel(R) Core(TM) i3-6100U CPU @ 2.30GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm hwp hwp_notify hwp_act_window hwp_epp intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     200000         10688 ns/op     383.22 MB/s
Benchmark4kEncGoGCM-4         300000          4073 ns/op    1005.57 MB/s

Haswell (Launch: Q2'14)

$ cat /proc/cpuinfo
model name  : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     200000          6710 ns/op     610.43 MB/s
Benchmark4kEncGoGCM-4         500000          2422 ns/op    1690.86 MB/s

Ivy Bridge (Launch: Q2'12)

$ cat /proc/cpuinfo 
model name  : Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     200000         14684 ns/op     278.94 MB/s
Benchmark4kEncGoGCM-4         300000          7792 ns/op     525.62 MB/s

Sandy Bridge (Launch: Q1'11)

$ cat /proc/cpuinfo 
model name  : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     100000         19070 ns/op     214.78 MB/s
Benchmark4kEncGoGCM-4         200000         10981 ns/op     373.01 MB/s

Westmere (Launch: Q1'10)

$ cat /proc/cpuinfo 
model name  : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-16    100000         18297 ns/op     223.85 MB/s
Benchmark4kEncGoGCM-16        200000          9579 ns/op     427.58 MB/s

no AES-NI

Ivy Bridge (Launch: Q1'13)

$ cat /proc/cpuinfo 
model name  : Intel(R) Pentium(R) CPU G2130 @ 3.20GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-2     100000         22691 ns/op     180.51 MB/s
Benchmark4kEncGoGCM-2          20000         92810 ns/op      44.13 MB/s

Nehalem (Launch: Q3'09)

$ cat /proc/cpuinfo 
model name  : Intel(R) Xeon(R) CPU           X3460  @ 2.80GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-8      50000         35247 ns/op     116.21 MB/s
Benchmark4kEncGoGCM-8          20000         92230 ns/op      44.41 MB/s

Core (Launch: Q1'08)

$ cat /proc/cpuinfo 
model name  : Intel(R) Core(TM)2 Duo CPU     E7400  @ 2.80GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-2      30000         46697 ns/op      87.71 MB/s
Benchmark4kEncGoGCM-2          10000        194095 ns/op      21.10 MB/s

I may add two older AMD processors (without AES-NI) when I have time.

@rfjakob
Owner

rfjakob commented May 10, 2016 via email

@alphazo

alphazo commented May 10, 2016

@rfjakob While most gocryptfs operations outperformed encfs (even in standard mode) in the quick benchmark I posted earlier, why is the rm operation a bit behind?

@rfjakob
Owner

rfjakob commented May 10, 2016

Hi @alphazo, I read your comparison with great interest, thank you! Yes, we are 15% behind EncFS for rm, hmm. To be honest, I'm not sure why. I'll have to profile this!

rfjakob added a commit that referenced this issue May 11, 2016
Go GCM is faster than OpenSSL if the CPU has AES instructions
and you are running Go 1.6+.

See #23 for details.
rfjakob added a commit that referenced this issue May 11, 2016
Go GCM is faster than OpenSSL if the CPU has AES instructions
and you are running Go 1.6+.

Run "gocryptfs -debug -version" to display the result of the
autodetection.

See #23 for details and
benchmarks.
rfjakob added a commit that referenced this issue May 11, 2016
Go GCM is faster than OpenSSL if the CPU has AES instructions
and you are running Go 1.6+.

The "-openssl" option now defaults to "auto".

"gocryptfs -debug -version" displays the result of the autodetection.

See #23 for details and
benchmarks.
@rfjakob
Owner

rfjakob commented May 11, 2016

Autodetection has been added to master in 49b597f; the -openssl option now defaults to "auto". It can be overridden by passing true or false.

You can run "gocryptfs -debug -version" to see the result of the autodetection. I get

$ ./gocryptfs -debug -version
openssl=true
gocryptfs v0.10-rc2-7-g49b597f-dirty; on-disk format 2; go-fuse a01ba14

because my CPU does not support AES-NI.
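
As an illustration of the decision rule from this thread (AES-NI plus Go 1.6+ selects Go crypto, everything else selects OpenSSL, and an explicit true or false overrides the detection), here is a small sketch. It is not the code from 49b597f. resolveOpenSSL, goAtLeast and leadingInt are hypothetical names, and the hasAES input is assumed to come from some separate CPU feature check.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// leadingInt parses the decimal digits at the start of s ("6rc1" -> 6).
func leadingInt(s string) (int, bool) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(s[:end])
	return n, err == nil
}

// goAtLeast reports whether a runtime.Version()-style string such as
// "go1.6.2" denotes at least Go major.minor. Strings that do not parse
// (for example development builds) are conservatively treated as too old.
func goAtLeast(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimPrefix(version, "go"), ".", 3)
	if len(parts) < 2 {
		return false
	}
	maj, ok1 := leadingInt(parts[0])
	mnr, ok2 := leadingInt(parts[1])
	if !ok1 || !ok2 {
		return false
	}
	return maj > major || (maj == major && mnr >= minor)
}

// resolveOpenSSL maps the option value ("auto", "true", "false") to
// the backend choice: true means OpenSSL, false means Go crypto.
func resolveOpenSSL(opt string, hasAES bool, goVersion string) (bool, error) {
	switch opt {
	case "true":
		return true, nil
	case "false":
		return false, nil
	case "auto":
		fastGoGCM := hasAES && goAtLeast(goVersion, 1, 6)
		return !fastGoGCM, nil
	default:
		return false, fmt.Errorf("invalid -openssl value %q", opt)
	}
}

func main() {
	hasAES := false // assumption: filled in by a real CPU feature check
	useOpenSSL, err := resolveOpenSSL("auto", hasAES, runtime.Version())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("openssl=%v\n", useOpenSSL)
}
```

Treating unparseable version strings as too old makes the sketch fall back to OpenSSL, which is the safer default according to the non-AES-NI numbers above.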

@lxp
Contributor Author

lxp commented May 12, 2016

Great! Thank you for integrating it so quickly 👍
I added a Skylake CPU to my benchmark post above.
It looks good. On the 4 AES-NI CPUs (I am not sure when I will be able to test it on Skylake) I get:

$ ./gocryptfs -debug -version
openssl=false
gocryptfs v0.10-rc1-16-g4ad9d4e; on-disk format 2; go-fuse ed84134

While on the 3 non-AES-NI CPUs I get:

$ ./gocryptfs -debug -version
openssl=true
gocryptfs v0.10-rc1-16-g4ad9d4e; on-disk format 2; go-fuse ed84134

I compiled again with Go 1.6 and all systems are running on amd64.

@rfjakob
Owner

rfjakob commented May 12, 2016

Great! Do you want to put the benchmarks into the wiki? Something like https://github.com/rfjakob/gocryptfs/wiki/CPU-Benchmarks? I think it's valuable information and deserves some visibility.

Same thing for you, @alphazo! Maybe https://github.com/rfjakob/gocryptfs/wiki/Performance-Comparison?

@rfjakob
Owner

rfjakob commented May 13, 2016

Released as v0.10-rc3.
