Improve memory usage and overall performance #8
Conversation
Preliminary benchmark improvements
Good work overall! I just wanted to illustrate an additional optimization that would save a lot of allocations and didn't apply to the prior pull request.
ff1/ff1.go
Outdated
// temp must be 16 bytes including j
// This will only be needed if maxJ > 1, for the inner for loop
if maxJ > 1 {
	temp = make([]byte, 16)
Lines 153, 173, 174 and this one could all share sections of a single N-byte buf. That saves on allocations not only at entry but also across loop iterations in the general case, since chances are you can derive the length of some of those as a best guess from the input. To illustrate:
const (
	// Every buffer we have here needs at least aes.BlockSize bytes, excluding the
	// primary buffer, since that will be given a larger input-relative backing.
	minSliceBackings   = 6
	minPerSliceBufSize = aes.BlockSize * minSliceBackings
	// Average additional scratch space needed relative to input; no idea what the
	// figure is :) BlockSize * iteration count looks close?
	avgInputRelativeSize = aes.BlockSize * 10
)
// These are re-used in the for loop below
var (
	relSize = avgInputRelativeSize * len(X)
	bufSize = relSize + minPerSliceBufSize
	buf     = make([]byte, 0, bufSize)
	// Q's length is a multiple of 16; start it with a length of 16.
	// Q seems to be the main buffer here at a quick glance? If so we can spill
	// our relSize into it.
	Q     = buf[0:16:relSize]
	QOrig = buf[relSize : relSize+16 : relSize+16]
	// R is guaranteed to be 16 bytes since it holds the output of the PRF
	R = buf[relSize+16 : relSize+32 : relSize+32]
	// TODO: rename this
	temp      = buf[relSize+32 : relSize+48 : relSize+48]
	numBBytes = buf[relSize+48 : relSize+64 : relSize+64]
	Y         = buf[relSize+64 : relSize+80 : relSize+80]

	numB, br, bm, mod, y, c big.Int
)
The obnoxious long names are for illustration of course; you could put a clear derivative in a const to help document the size of the backings. Setting the capacity of these backings lets us keep appending just as before, with the benefit of saving our minimum allocation count. You can play with the sections you give each slice relative to input length until you can get away with a single make in the common best case.
Oh, and I've omitted the math to calculate offsets for the capacity and length portions. You only have to write it once, and you could circumvent it with a counter and short variable declarations.
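A minimal, compilable sketch of the shared-backing idea (function and variable names here are illustrative, not from the actual ff1.go; assumes a non-empty input):

```go
package main

import (
	"crypto/aes"
	"fmt"
)

// scratchBuffers carves several fixed-size scratch slices out of one backing
// array, so the whole setup costs a single make. The third (capacity) index
// caps each slice so an append past its region cannot clobber its neighbour.
// Assumes inputLen >= 1.
func scratchBuffers(inputLen int) (q, r, temp []byte) {
	const perSlice = aes.BlockSize  // 16 bytes per scratch slice
	relSize := perSlice * inputLen  // input-relative region for the main buffer
	buf := make([]byte, relSize+3*perSlice)

	q = buf[0:perSlice:relSize]                                            // main buffer, may grow up to relSize
	r = buf[relSize : relSize+perSlice : relSize+perSlice]                 // fixed 16-byte region
	temp = buf[relSize+perSlice : relSize+2*perSlice : relSize+2*perSlice] // fixed 16-byte region
	return q, r, temp
}

func main() {
	q, r, temp := scratchBuffers(4)
	fmt.Println(len(q), cap(q), len(r), len(temp)) // 16 64 16 16
}
```

Because each slice's capacity is pinned to its own region, `append` on one of them reallocates (rather than silently overwriting a neighbour) if it ever outgrows its section.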
This is a good suggestion, thanks! I was initially a little apprehensive because I was worried about readability, but the variables remain as is; it's just the underlying backing array that changes.
I think another thing I should be able to do is change A, B, and C to byte slices, and only convert them to strings when needed, like when passing into SetString or at the final return.
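A hypothetical sketch of that idea (the helper name is invented for illustration; the real code keeps the numerals in the round variables):

```go
package main

import (
	"fmt"
	"math/big"
)

// numeralToInt keeps the numeral as a []byte through the rounds and converts
// to a string only at the big.Int boundary, since SetString wants a string.
func numeralToInt(numeral []byte, radix int) *big.Int {
	n := new(big.Int)
	n.SetString(string(numeral), radix)
	return n
}

func main() {
	A := []byte("0123") // stays a byte slice until the conversion below
	fmt.Println(numeralToInt(A, 10)) // 123
}
```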
For large inputs where d > 16, maxJ becomes > 1, which causes the code to enter the inner for loop. Previously, an input could be encrypted but the decrypted form didn't match the original plaintext; this commit fixes that. Note that correctness of the ciphertext in these scenarios is still unknown, since no test vector covers it.
This removes only a few allocations for short inputs, but reduces allocations on long inputs quite a bit.
benchmark                    old ns/op   new ns/op   delta
BenchmarkEncrypt/Sample7-8   10105       8913        -11.80%
BenchmarkEncryptLong-8       42491       40936       -3.66%

benchmark                    old allocs  new allocs  delta
BenchmarkEncrypt/Sample7-8   90          84          -6.67%
BenchmarkEncryptLong-8       109         99          -9.17%

benchmark                    old bytes   new bytes   delta
BenchmarkEncrypt/Sample7-8   2752        2464        -10.47%
BenchmarkEncryptLong-8       6979        6306        -9.64%
benchmark                    old ns/op   new ns/op   delta
BenchmarkEncrypt/Sample7-8   9309        5123        -44.97%
BenchmarkEncryptLong-8       42819       18058       -57.83%

benchmark                    old allocs  new allocs  delta
BenchmarkEncrypt/Sample7-8   84          43          -48.81%
BenchmarkEncryptLong-8       99          58          -41.41%

benchmark                    old bytes   new bytes   delta
BenchmarkEncrypt/Sample7-8   2464        1424        -42.21%
BenchmarkEncryptLong-8       6306        3905        -38.07%
rev now just wraps revB (20% improvement in speed). xorBytes is now inlined depending on use. numRadix is replaced with direct SetString calls.
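One plausible shape of that wrapper (the actual signatures in the repo may differ):

```go
package main

import "fmt"

// revB reverses b in place and returns it; operating on bytes avoids a
// string/[]byte round trip per call. Byte-wise reversal is safe here because
// FPE numeral strings are ASCII.
func revB(b []byte) []byte {
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return b
}

// rev is now just a thin wrapper for callers that still hold strings.
func rev(s string) string {
	return string(revB([]byte(s)))
}

func main() {
	fmt.Println(rev("abcd")) // dcba
}
```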
There's no situation in which the input to the AES block will not be 16 bytes, so there's no need for CBC.
What's in this PR?
This builds on the changes in #3 by applying the memory usage / allocation / performance optimizations.
Changes to FF1:
- Q, PQ, Y, R, xored buffers. P remains a single blockSize array on the stack. Example for reference on long inputs:
- numA, numB, and numC are swapped via assignment and SetBytes rather than repetitive SetString and Text calls inside the Feistel round for loop
- m=u (i is even) and m=v (i is odd) since there are only 2 that are used in an alternating fashion.

On average, FF1 performance improves quite a bit from the previous master HEAD (3d2762d).

Changes to FF3:
- rev now uses revB, which avoids string/[]byte conversions
- cbcEncryptor has been removed since input blocks are always 16 bytes with iv=0, so normal AES (ECB) is used
- ciph has been replaced with direct calls to aes encrypt

MINOR BREAKING CHANGE TO BOTH:
ff1 and ff3 NewCipher now return a Cipher rather than a *Cipher, and methods are now on value receivers instead of pointer receivers for consistency. Because of Go's transparent indirection, and since the struct fields are not exported, this shouldn't cause issues for most practical uses, as the only way to interact with the cipher is via Encrypt and Decrypt, which remain unchanged.

This was done because in situations where many NewCipher calls must be made (ex. the tweak is different for each piece of data), using value semantics causes fewer allocations.

TODOs
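To illustrate the value-receiver point above, here is a miniature sketch (type and function names are invented, not the real fpe API): the constructor returns the struct by value so short-lived ciphers need not be heap-allocated, while transparent indirection keeps existing call sites working.

```go
package main

import "fmt"

// miniCipher stands in for the real Cipher: unexported fields only.
type miniCipher struct {
	radix int
}

// Returning the struct by value (rather than *miniCipher) means callers that
// create many short-lived ciphers don't pay a heap allocation per constructor
// call when the value doesn't escape.
func newMiniCipher(radix int) miniCipher {
	return miniCipher{radix: radix}
}

// Value receiver: Go's transparent indirection lets both values and pointers
// call this, so most callers never notice the change.
func (c miniCipher) Encrypt(s string) string {
	return s // placeholder for the real Feistel rounds
}

func main() {
	c := newMiniCipher(10)
	p := &c
	// Both forms compile and behave identically.
	fmt.Println(c.Encrypt("123") == p.Encrypt("123")) // true
}
```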