This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

Improve memory usage and overall performance #8

Merged
merged 20 commits into from
Aug 21, 2017

Conversation

anitgandhi
Contributor

@anitgandhi anitgandhi commented Aug 4, 2017

What's in this PR?

This builds on the changes in #3 by applying the memory usage / allocation / performance optimizations.

Changes to FF1:

  1. Use a common backing array buffer to hold the variable-sized Q, PQ, Y, R, and XORed buffers; P remains a single blockSize array on the stack. (The original comment included a reference screenshot for long inputs.)
  2. numA, numB, and numC are now swapped via assignment and SetBytes rather than repeated SetString and Text calls inside the Feistel round loop
  3. Pre-calculate the two moduli based on m=u (i is even) and m=v (i is odd), since only those two are ever used, in alternating fashion
  4. Added some test cases/benchmarks for long inputs
  5. General cleanup, some manual function inlining
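Changes 1 and 3 can be sketched roughly as below. This is not the library's actual code: the buffer sizes, radix, and u/v values are illustrative stand-ins, and `precomputeMods` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math/big"
)

// precomputeMods computes the two alternating Feistel moduli radix^u and
// radix^v once, instead of recomputing them inside the round loop (change 3).
func precomputeMods(radix, u, v int64) (*big.Int, *big.Int) {
	r := big.NewInt(radix)
	modU := new(big.Int).Exp(r, big.NewInt(u), nil) // used when i is even (m = u)
	modV := new(big.Int).Exp(r, big.NewInt(v), nil) // used when i is odd (m = v)
	return modU, modV
}

func main() {
	// Change 1: one allocation backs all the variable-sized scratch buffers;
	// the sizes here are made up for illustration.
	const blockSize = 16
	numQ, numPQ, numY := 2*blockSize, 3*blockSize, blockSize
	buf := make([]byte, numQ+numPQ+numY)
	Q, PQ, Y := buf[:numQ], buf[numQ:numQ+numPQ], buf[numQ+numPQ:]
	_, _, _ = Q, PQ, Y // each sub-slice views a region of the same array

	modU, modV := precomputeMods(10, 6, 5)
	fmt.Println(modU, modV)
}
```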

On average, FF1 performance improves quite a bit over the previous master HEAD (3d2762d):

benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample7-8     20627         5172          -74.93%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample7-8     299            43             -85.62%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample7-8     11248         1424          -87.34%

Changes to FF3:

  1. Utility function cleanup
    1. Manually inlined some function logic and removed the redundant functions.
    2. rev now wraps revB with string/[]byte conversions.
    3. Removed cbcEncryptor: input blocks are always 16 bytes with iv=0, so plain AES block encryption (effectively ECB) suffices.
    4. ciph has been replaced with direct calls to AES encrypt.
  2. General cleanup, some similar changes from FF1

MINOR BREAKING CHANGE TO BOTH:

ff1 and ff3 NewCipher now return a Cipher rather than a *Cipher, and methods are now on value receivers instead of pointer receivers, for consistency. Because of Go's transparent indirection, and since the struct fields are not exported, this shouldn't cause issues for most practical uses: the only way to interact with the cipher is through Encrypt and Decrypt, which remain unchanged.

This was done because in situations where many NewCipher calls must be made (e.g. when the tweak differs for each piece of data), value semantics cause fewer allocations.
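A toy illustration of the receiver change; the type name, fields, and method body here are hypothetical stand-ins for the real Cipher, kept only to show the value-semantics pattern:

```go
package main

import "fmt"

// cipherish stands in for the library's Cipher struct; its fields are
// unexported, mirroring the real API.
type cipherish struct {
	tweak []byte
}

// newCipher returns a value, not a pointer: when many ciphers are created
// (e.g. a distinct tweak per record), the value need not escape to the heap.
func newCipher(tweak []byte) cipherish {
	return cipherish{tweak: tweak}
}

// Encrypt is on a value receiver; Go's transparent indirection means callers
// holding either a value or a pointer invoke it the same way.
func (c cipherish) Encrypt(x string) string {
	return fmt.Sprintf("enc(%s, tweak=%x)", x, c.tweak)
}

func main() {
	c := newCipher([]byte{0x01})
	fmt.Println(c.Encrypt("12345"))
}
```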

TODOs

  • Apply same to FF1 Decrypt
  • Apply same to FF3 Encrypt and Decrypt

@anitgandhi
Contributor Author

Preliminary benchmark improvements


benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample1-8     19549         10515         -46.21%
BenchmarkEncrypt/Sample2-8     18519         10811         -41.62%
BenchmarkEncrypt/Sample3-8     21497         13406         -37.64%
BenchmarkEncrypt/Sample4-8     19548         10321         -47.20%
BenchmarkEncrypt/Sample5-8     18363         10577         -42.40%
BenchmarkEncrypt/Sample6-8     21806         13281         -39.09%
BenchmarkEncrypt/Sample7-8     19591         10600         -45.89%
BenchmarkEncrypt/Sample8-8     18453         10507         -43.06%
BenchmarkEncrypt/Sample9-8     21819         13529         -37.99%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample1-8     301            100            -66.78%
BenchmarkEncrypt/Sample2-8     296            103            -65.20%
BenchmarkEncrypt/Sample3-8     289            101            -65.05%
BenchmarkEncrypt/Sample4-8     303            100            -67.00%
BenchmarkEncrypt/Sample5-8     296            103            -65.20%
BenchmarkEncrypt/Sample6-8     289            101            -65.05%
BenchmarkEncrypt/Sample7-8     299            100            -66.56%
BenchmarkEncrypt/Sample8-8     292            100            -65.75%
BenchmarkEncrypt/Sample9-8     289            101            -65.05%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample1-8     11264         3040          -73.01%
BenchmarkEncrypt/Sample2-8     10976         3040          -72.30%
BenchmarkEncrypt/Sample3-8     11584         3408          -70.58%
BenchmarkEncrypt/Sample4-8     11280         3040          -73.05%
BenchmarkEncrypt/Sample5-8     10976         3040          -72.30%
BenchmarkEncrypt/Sample6-8     11584         3408          -70.58%
BenchmarkEncrypt/Sample7-8     11248         3040          -72.97%
BenchmarkEncrypt/Sample8-8     10960         3040          -72.26%
BenchmarkEncrypt/Sample9-8     11584         3408          -70.58%

@cstockton cstockton left a comment

Good work overall. I just wanted to illustrate an additional optimization that would save a lot of allocs and didn't apply to the prior pull request.

ff1/ff1.go Outdated
// temp must be 16 bytes including j
// This will only be needed if maxJ > 1, for the inner for loop
if maxJ > 1 {
temp = make([]byte, 16)
@cstockton cstockton Aug 6, 2017


Line 153, 173, 174 and this one could all share sections of a single N-byte buf. That saves allocations not only at entry but across loop iterations in the general case, since chances are you can derive the length of some of these as a best guess from the input. To illustrate:

	const (
		// Every buffer here needs at least aes.BlockSize bytes, excluding the
		// primary buffer, which gets a larger input-relative backing.
		minSliceBackings   = 6
		minPerSliceBufSize = aes.BlockSize * minSliceBackings

		// Average additional scratch space needed relative to input; no idea
		// what the real figure is :) aes.BlockSize * iteration count looks close?
		avgInputRelativeSize = aes.BlockSize * 10
	)

	// These are re-used in the for loop below
	var (
		relSize = avgInputRelativeSize * len(X)
		bufSize = relSize + minPerSliceBufSize
		buf     = make([]byte, 0, bufSize)

		// Q's length is a multiple of 16; start it with a length of 16.
		// Q seems to be the main buffer here at a quick glance, so it gets
		// the whole input-relative region to grow into.
		Q     = buf[0:16:relSize]
		QOrig = buf[relSize : relSize : relSize+16]

		// R is guaranteed to be 16 bytes since it holds the output of PRF
		R = buf[relSize+16 : relSize+32 : relSize+32]

		// TODO: rename this
		temp = buf[relSize+32 : relSize+32 : relSize+48]

		numB      big.Int
		numBBytes = buf[relSize+48 : relSize+48 : relSize+64]

		br, bm, mod big.Int
		y, c        big.Int

		Y = buf[relSize+64 : relSize+64 : relSize+80]
	)

The obnoxiously long var names are for illustration, of course; you could put a clear derivative in a const to help document the size of the backings. Setting the capacity of these backings lets us keep appending as we were before, with the benefit of saving our minimum allocation count. You can play with the sections you give each slice relative to input length until the common best case is a single make.

The offset math for the capacity and length portions only has to be written once, and you could circumvent it with a counter and short variable declarations.
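For reference, the three-index (full) slice expressions used above behave like this. A minimal standalone sketch, not tied to the FF1 variables:

```go
package main

import "fmt"

func main() {
	// a[low:high:max] sets len = high-low and cap = max-low, so each
	// sub-slice can be appended to without growing into its neighbor.
	buf := make([]byte, 32)
	a := buf[0:0:16]   // len 0, cap 16: appends land in buf[0:16]
	b := buf[16:16:32] // len 0, cap 16: appends land in buf[16:32]

	a = append(a, 1, 2, 3) // no allocation; writes buf[0..2]
	b = append(b, 9)       // no allocation; writes buf[16]

	fmt.Println(cap(a), cap(b), buf[0], buf[16]) // 16 16 1 9
}
```

Because append only allocates once a sub-slice exceeds its capped capacity, the shared backing survives the whole loop in the common case.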

Contributor Author

This is a good suggestion, thanks! I was initially a little apprehensive because I was worried about readability, but the variables remain as-is; it's just the underlying backing array that would change.

Contributor Author

I think another thing I should be able to do is change A, B, and C to byte slices, and only convert them to strings when needed, such as when passing into SetString or at the final return.

for large inputs where d > 16, maxJ becomes >1, which causes the code
to enter the inner for loop. previously there was an issue where an
input could be encrypted but the decrypted form didn't match the
original plaintext. this commit fixes that.

note that correctness of the ciphertext in these scenarios is still
unknown, since no test vector covers it
this reduces only a few allocations for short inputs, but improves
allocations on long inputs quite a bit
benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample7-8     10105         8913          -11.80%
BenchmarkEncryptLong-8         42491         40936         -3.66%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample7-8     90             84             -6.67%
BenchmarkEncryptLong-8         109            99             -9.17%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample7-8     2752          2464          -10.47%
BenchmarkEncryptLong-8         6979          6306          -9.64%
benchmark                      old ns/op     new ns/op     delta
BenchmarkEncrypt/Sample7-8     9309          5123          -44.97%
BenchmarkEncryptLong-8         42819         18058         -57.83%

benchmark                      old allocs     new allocs     delta
BenchmarkEncrypt/Sample7-8     84             43             -48.81%
BenchmarkEncryptLong-8         99             58             -41.41%

benchmark                      old bytes     new bytes     delta
BenchmarkEncrypt/Sample7-8     2464          1424          -42.21%
BenchmarkEncryptLong-8         6306          3905          -38.07%
rev now just wraps revB | 20% improvement in speed

xorBytes is now inlined depending on use

numRadix is replaced with just SetString directly
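The rev/revB relationship described in that commit amounts to something like the following; the signatures are guesses from the description, not the library's exact code:

```go
package main

import "fmt"

// revB reverses a byte slice into a freshly allocated slice.
func revB(b []byte) []byte {
	out := make([]byte, len(b))
	for i := range b {
		out[len(b)-1-i] = b[i]
	}
	return out
}

// rev now just wraps revB, paying only the string/[]byte conversion cost.
func rev(s string) string {
	return string(revB([]byte(s)))
}

func main() {
	fmt.Println(rev("12345")) // 54321
}
```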
there's no situation in which the input to the aes block will not be 16 bytes, so there's no need for cbc
@anitgandhi anitgandhi changed the title [WIP] Improve memory usage and overall performance Improve memory usage and overall performance Aug 21, 2017
@anitgandhi anitgandhi added this to the 1.1 milestone Aug 21, 2017
@anitgandhi anitgandhi merged commit e98a868 into master Aug 21, 2017
@anitgandhi anitgandhi mentioned this pull request Aug 21, 2017
@anitgandhi anitgandhi deleted the memory branch May 17, 2018 21:52