This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

Improve memory usage and overall performance #8

Merged
merged 20 commits into from
Aug 21, 2017

Conversation

anitgandhi
Contributor

@anitgandhi anitgandhi commented Aug 4, 2017

What's in this PR?

This builds on the changes in #3 by applying the memory usage / allocation / performance optimizations.

Changes to FF1:

  1. Use a common backing array buffer to hold the variable-sized Q, PQ, Y, R, and XORed buffers; P remains a single blockSize array on the stack. (The original comment included a reference screenshot for long inputs.)
  2. numA, numB, and numC are now swapped via assignment and SetBytes rather than repeated SetString and Text calls inside the Feistel round loop
  3. Pre-calculate the two moduli based on m=u (i is even) and m=v (i is odd), since only those two are ever used, in alternating fashion
  4. Added some test cases/benchmarks for long inputs
  5. General cleanup, some manual function inlining
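Changes 1 and 3 can be sketched roughly as below. This is not the library's actual code: the buffer sizes, radix, and u/v values are illustrative stand-ins, and `precomputeMods` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math/big"
)

// precomputeMods computes the two alternating Feistel moduli radix^u and
// radix^v once, instead of recomputing them inside the round loop (change 3).
func precomputeMods(radix, u, v int64) (*big.Int, *big.Int) {
	r := big.NewInt(radix)
	modU := new(big.Int).Exp(r, big.NewInt(u), nil) // used when i is even (m = u)
	modV := new(big.Int).Exp(r, big.NewInt(v), nil) // used when i is odd (m = v)
	return modU, modV
}

func main() {
	// Change 1: one allocation backs all the variable-sized scratch buffers;
	// the sizes here are made up for illustration.
	const blockSize = 16
	numQ, numPQ, numY := 2*blockSize, 3*blockSize, blockSize
	buf := make([]byte, numQ+numPQ+numY)
	Q, PQ, Y := buf[:numQ], buf[numQ:numQ+numPQ], buf[numQ+numPQ:]
	_, _, _ = Q, PQ, Y // each sub-slice views a region of the same array

	modU, modV := precomputeMods(10, 6, 5)
	fmt.Println(modU, modV)
}
```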

On average, FF1 performance improves quite a bit over the previous master HEAD (3d2762d):

benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample7-8     20627         5172          -74.93%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample7-8     299            43             -85.62%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample7-8     11248         1424          -87.34%

Changes to FF3:

  1. Utility function cleanup
    1. Manually inlined some function logic and removed the redundant functions.
    2. rev now wraps revB with string/[]byte conversions.
    3. Removed cbcEncryptor: input blocks are always 16 bytes with iv=0, so plain AES block encryption (effectively ECB) suffices.
    4. ciph has been replaced with direct calls to AES encrypt.
  2. General cleanup, some similar changes from FF1

MINOR BREAKING CHANGE TO BOTH:

ff1 and ff3 NewCipher now return a Cipher rather than a *Cipher, and methods are now on value receivers instead of pointer receivers, for consistency. Because of Go's transparent indirection, and since the struct fields are not exported, this shouldn't cause issues for most practical uses: the only way to interact with the cipher is through Encrypt and Decrypt, which remain unchanged.

This was done because in situations where many NewCipher calls must be made (e.g. when the tweak differs for each piece of data), value semantics cause fewer allocations.
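A toy illustration of the receiver change; the type name, fields, and method body here are hypothetical stand-ins for the real Cipher, kept only to show the value-semantics pattern:

```go
package main

import "fmt"

// cipherish stands in for the library's Cipher struct; its fields are
// unexported, mirroring the real API.
type cipherish struct {
	tweak []byte
}

// newCipher returns a value, not a pointer: when many ciphers are created
// (e.g. a distinct tweak per record), the value need not escape to the heap.
func newCipher(tweak []byte) cipherish {
	return cipherish{tweak: tweak}
}

// Encrypt is on a value receiver; Go's transparent indirection means callers
// holding either a value or a pointer invoke it the same way.
func (c cipherish) Encrypt(x string) string {
	return fmt.Sprintf("enc(%s, tweak=%x)", x, c.tweak)
}

func main() {
	c := newCipher([]byte{0x01})
	fmt.Println(c.Encrypt("12345"))
}
```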

TODOs

  • Apply same to FF1 Decrypt
  • Apply same to FF3 Encrypt and Decrypt

@anitgandhi
Contributor Author

Preliminary benchmark improvements


benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample1-8     19549         10515         -46.21%
BenchmarkEncrypt/Sample2-8     18519         10811         -41.62%
BenchmarkEncrypt/Sample3-8     21497         13406         -37.64%
BenchmarkEncrypt/Sample4-8     19548         10321         -47.20%
BenchmarkEncrypt/Sample5-8     18363         10577         -42.40%
BenchmarkEncrypt/Sample6-8     21806         13281         -39.09%
BenchmarkEncrypt/Sample7-8     19591         10600         -45.89%
BenchmarkEncrypt/Sample8-8     18453         10507         -43.06%
BenchmarkEncrypt/Sample9-8     21819         13529         -37.99%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample1-8     301            100            -66.78%
BenchmarkEncrypt/Sample2-8     296            103            -65.20%
BenchmarkEncrypt/Sample3-8     289            101            -65.05%
BenchmarkEncrypt/Sample4-8     303            100            -67.00%
BenchmarkEncrypt/Sample5-8     296            103            -65.20%
BenchmarkEncrypt/Sample6-8     289            101            -65.05%
BenchmarkEncrypt/Sample7-8     299            100            -66.56%
BenchmarkEncrypt/Sample8-8     292            100            -65.75%
BenchmarkEncrypt/Sample9-8     289            101            -65.05%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample1-8     11264         3040          -73.01%
BenchmarkEncrypt/Sample2-8     10976         3040          -72.30%
BenchmarkEncrypt/Sample3-8     11584         3408          -70.58%
BenchmarkEncrypt/Sample4-8     11280         3040          -73.05%
BenchmarkEncrypt/Sample5-8     10976         3040          -72.30%
BenchmarkEncrypt/Sample6-8     11584         3408          -70.58%
BenchmarkEncrypt/Sample7-8     11248         3040          -72.97%
BenchmarkEncrypt/Sample8-8     10960         3040          -72.26%
BenchmarkEncrypt/Sample9-8     11584         3408          -70.58%

@cstockton cstockton left a comment

Good work overall. I just wanted to illustrate an additional optimization that would save a lot of allocs and didn't apply to the prior pull request.

ff1/ff1.go Outdated
// temp must be 16 bytes including j
// This will only be needed if maxJ > 1, for the inner for loop
if maxJ > 1 {
temp = make([]byte, 16)
@cstockton cstockton Aug 6, 2017


Line 153, 173, 174 and this one could all share sections of a single N-byte buf. That saves allocations not only at entry but across loop iterations in the general case, since chances are you can derive the length of some of these as a best guess from the input. To illustrate:

	const (
		// Every buffer here needs at least aes.BlockSize bytes, excluding the
		// primary buffer, which gets a larger input-relative backing.
		minSliceBackings   = 6
		minPerSliceBufSize = aes.BlockSize * minSliceBackings

		// Average additional scratch space needed relative to input; no idea
		// what the real figure is :) aes.BlockSize * iteration count looks close?
		avgInputRelativeSize = aes.BlockSize * 10
	)

	// These are re-used in the for loop below
	var (
		relSize = avgInputRelativeSize * len(X)
		bufSize = relSize + minPerSliceBufSize
		buf     = make([]byte, 0, bufSize)

		// Q's length is a multiple of 16; start it with a length of 16.
		// Q seems to be the main buffer here at a quick glance, so it gets
		// the whole input-relative region to grow into.
		Q     = buf[0:16:relSize]
		QOrig = buf[relSize : relSize : relSize+16]

		// R is guaranteed to be 16 bytes since it holds the output of PRF
		R = buf[relSize+16 : relSize+32 : relSize+32]

		// TODO: rename this
		temp = buf[relSize+32 : relSize+32 : relSize+48]

		numB      big.Int
		numBBytes = buf[relSize+48 : relSize+48 : relSize+64]

		br, bm, mod big.Int
		y, c        big.Int

		Y = buf[relSize+64 : relSize+64 : relSize+80]
	)

The obnoxiously long var names are for illustration, of course; you could put a clear derivative in a const to help document the size of the backings. Setting the capacity of these backings lets us keep appending as we were before, with the benefit of saving our minimum allocation count. You can play with the sections you give each slice relative to input length until the common best case is a single make.

The offset math for the capacity and length portions only has to be written once, and you could circumvent it with a counter and short variable declarations.
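For reference, the three-index (full) slice expressions used above behave like this. A minimal standalone sketch, not tied to the FF1 variables:

```go
package main

import "fmt"

func main() {
	// a[low:high:max] sets len = high-low and cap = max-low, so each
	// sub-slice can be appended to without growing into its neighbor.
	buf := make([]byte, 32)
	a := buf[0:0:16]   // len 0, cap 16: appends land in buf[0:16]
	b := buf[16:16:32] // len 0, cap 16: appends land in buf[16:32]

	a = append(a, 1, 2, 3) // no allocation; writes buf[0..2]
	b = append(b, 9)       // no allocation; writes buf[16]

	fmt.Println(cap(a), cap(b), buf[0], buf[16]) // 16 16 1 9
}
```

Because append only allocates once a sub-slice exceeds its capped capacity, the shared backing survives the whole loop in the common case.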

Contributor Author

This is a good suggestion, thanks! I was initially a little apprehensive because I was worried about readability, but the variables remain as-is; it's just the underlying backing array that would change.

Contributor Author

I think another thing I should be able to do is change A, B, and C to byte slices, and only convert them to strings when needed, such as when passing into SetString or at the final return.

for large inputs where d > 16, maxJ becomes >1, which causes the code
to enter the inner for loop. previously there was an issue where an
input could be encrypted but the decrypted form didn't match the
original plaintext. this commit fixes that.

note that correctness of the ciphertext in these scenarios is still
unknown, since no test vector covers it
this reduces only a few allocations for short inputs, but improves
allocations on long inputs quite a bit
benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample7-8     10105         8913          -11.80%
BenchmarkEncryptLong-8         42491         40936         -3.66%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample7-8     90             84             -6.67%
BenchmarkEncryptLong-8         109            99             -9.17%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample7-8     2752          2464          -10.47%
BenchmarkEncryptLong-8         6979          6306          -9.64%
benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample7-8     9309          5123          -44.97%
BenchmarkEncryptLong-8         42819         18058         -57.83%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample7-8     84             43             -48.81%
BenchmarkEncryptLong-8         99             58             -41.41%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample7-8     2464          1424          -42.21%
BenchmarkEncryptLong-8         6306          3905          -38.07%
rev now just wraps revB | 20% improvement in speed

xorBytes is now inlined depending on use

numRadix is replaced with just SetString directly
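The rev/revB relationship described in that commit amounts to something like the following; the signatures are guesses from the description, not the library's exact code:

```go
package main

import "fmt"

// revB reverses a byte slice into a freshly allocated slice.
func revB(b []byte) []byte {
	out := make([]byte, len(b))
	for i := range b {
		out[len(b)-1-i] = b[i]
	}
	return out
}

// rev now just wraps revB, paying only the string/[]byte conversion cost.
func rev(s string) string {
	return string(revB([]byte(s)))
}

func main() {
	fmt.Println(rev("12345")) // 54321
}
```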
there's no situation in which the input to the aes block will not be 16 bytes, so there's no need for cbc
@anitgandhi anitgandhi changed the title [WIP] Improve memory usage and overall performance Improve memory usage and overall performance Aug 21, 2017
@anitgandhi anitgandhi added this to the 1.1 milestone Aug 21, 2017
@anitgandhi anitgandhi merged commit e98a868 into master Aug 21, 2017
@anitgandhi anitgandhi mentioned this pull request Aug 21, 2017
@anitgandhi anitgandhi deleted the memory branch May 17, 2018 21:52