Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: optimize XOR masking code #31586

Closed
nhooyr opened this issue Apr 20, 2019 · 30 comments
Closed

cmd/compile: optimize XOR masking code #31586

nhooyr opened this issue Apr 20, 2019 · 30 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@nhooyr
Copy link
Contributor

nhooyr commented Apr 20, 2019

In the WebSocket protocol, clients have to generate a random masking key for every frame sent to the server and XOR mask the frame payload with the key. The server will unmask it on retrieval.

The mask/unmask function looks like this https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask_safe.go#L9-L15

This is masking byte by byte but can be made significantly faster if the masking was done word by word. Unfortunately, the Go compiler does not do this automatically so you have to use unsafe and write this yourself: https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54

Benchmarks at gorilla/websocket#31

It would be nice if the Go compiler did this automatically.

@ALTree
Copy link
Member

ALTree commented Apr 20, 2019

The two links you posted are the same (copy-paste mistake?). The 2nd one in particular does not use unsafe.

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 20, 2019

@ALTree ALTree added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 20, 2019
@ALTree ALTree added this to the Unplanned milestone Apr 20, 2019
@odeke-em odeke-em changed the title cmd/compiler: optimize XOR masking code cmd/compile: optimize XOR masking code Apr 21, 2019
@odeke-em
Copy link
Member

Thank you for filing this issue @nhooyr and @ALTree for the triaging!

Kindly paging @josharian @randall77 @martisch.

@crvv
Copy link
Contributor

crvv commented Apr 22, 2019

Please see #28113 #30553 and #28465

And

func fastXORBytes(dst, a, b []byte, n int) {

If you want fast xor, there are 4 different implementations in crypto/cipher.
I don't know whether the compiler can do so much optimization.

@josharian
Copy link
Contributor

The compiler already recognizes sequences of byte loads and coalesces them into a single multi-byte loads. See e.g. the amd64 rules starting at

// Combining byte loads into larger (unaligned) loads.
. We could in theory write similar rules that convert a sequence of byte-wise load/xor/store operations to an XORQmodify.

In fact, I believe there is a way of writing this that would allow the compiler to do this optimization now. This code:

import "encoding/binary"

func f(key [4]byte, b []byte) {
	k := binary.LittleEndian.Uint32(key[:])
	v := binary.LittleEndian.Uint32(b)
	binary.LittleEndian.PutUint32(b, v^k)
}

Generates (at its core, surrounded by other stuff):

MOVQ	"".b+40(SP), AX
XORL	(AX), DX
MOVL	DX, (AX)

That seems pretty good.

What do you think, @nhooyr ?

@josharian
Copy link
Contributor

(And if this is seriously performance-critical, you might want to manually inline the LittleEndian rountines and also duplicate key into a uint64 so that you can do 8 bytes at a time.)

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 22, 2019

That seems very reasonable @josharian

Will give this is a shot and let you know how the performance compares.

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 24, 2019

So benchmarked and here are the results on my machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):

$ go test -bench=BenchmarkMaskBytes
goos: darwin
goarch: amd64
pkg: github.com/gorilla/websocket
BenchmarkMaskBytes/size-2/align-4/byte-8         	500000000	         3.74 ns/op	 535.10 MB/s
BenchmarkMaskBytes/size-2/align-4/gorilla-8      	300000000	         3.93 ns/op	 508.44 MB/s
BenchmarkMaskBytes/size-2/align-4/gobwas-8       	300000000	         4.67 ns/op	 428.21 MB/s
BenchmarkMaskBytes/size-2/align-4/nhooyr-8       	300000000	         4.73 ns/op	 422.43 MB/s
BenchmarkMaskBytes/size-2/align-4/crypto/cipher-8         	100000000	        16.9 ns/op	 118.23 MB/s
BenchmarkMaskBytes/size-4/align-4/byte-8                  	300000000	         5.58 ns/op	 716.61 MB/s
BenchmarkMaskBytes/size-4/align-4/gorilla-8               	300000000	         5.56 ns/op	 719.80 MB/s
BenchmarkMaskBytes/size-4/align-4/gobwas-8                	200000000	         7.27 ns/op	 550.31 MB/s
BenchmarkMaskBytes/size-4/align-4/nhooyr-8                	200000000	         7.04 ns/op	 567.92 MB/s
BenchmarkMaskBytes/size-4/align-4/crypto/cipher-8         	100000000	        19.6 ns/op	 204.31 MB/s
BenchmarkMaskBytes/size-16/align-4/byte-8                 	100000000	        13.7 ns/op	1170.31 MB/s
BenchmarkMaskBytes/size-16/align-4/gorilla-8              	100000000	        20.2 ns/op	 793.15 MB/s
BenchmarkMaskBytes/size-16/align-4/gobwas-8               	200000000	         6.59 ns/op	2426.43 MB/s
BenchmarkMaskBytes/size-16/align-4/nhooyr-8               	100000000	        13.6 ns/op	1174.09 MB/s
BenchmarkMaskBytes/size-16/align-4/crypto/cipher-8        	100000000	        16.2 ns/op	 985.94 MB/s
BenchmarkMaskBytes/size-32/align-4/byte-8                 	100000000	        23.9 ns/op	1336.39 MB/s
BenchmarkMaskBytes/size-32/align-4/gorilla-8              	100000000	        20.6 ns/op	1551.76 MB/s
BenchmarkMaskBytes/size-32/align-4/gobwas-8               	200000000	         8.13 ns/op	3934.47 MB/s
BenchmarkMaskBytes/size-32/align-4/nhooyr-8               	100000000	        15.6 ns/op	2054.30 MB/s
BenchmarkMaskBytes/size-32/align-4/crypto/cipher-8        	100000000	        24.7 ns/op	1298.15 MB/s
BenchmarkMaskBytes/size-512/align-4/byte-8                	 5000000	       345 ns/op	1480.68 MB/s
BenchmarkMaskBytes/size-512/align-4/gorilla-8             	30000000	        53.8 ns/op	9525.00 MB/s
BenchmarkMaskBytes/size-512/align-4/gobwas-8              	50000000	        39.9 ns/op	12840.94 MB/s
BenchmarkMaskBytes/size-512/align-4/nhooyr-8              	20000000	        83.6 ns/op	6126.90 MB/s
BenchmarkMaskBytes/size-512/align-4/crypto/cipher-8       	10000000	       241 ns/op	2122.82 MB/s
BenchmarkMaskBytes/size-4096/align-4/byte-8               	  500000	      2674 ns/op	1531.63 MB/s
BenchmarkMaskBytes/size-4096/align-4/gorilla-8            	 5000000	       285 ns/op	14322.91 MB/s
BenchmarkMaskBytes/size-4096/align-4/gobwas-8             	 5000000	       282 ns/op	14478.90 MB/s
BenchmarkMaskBytes/size-4096/align-4/nhooyr-8             	 3000000	       539 ns/op	7596.43 MB/s
BenchmarkMaskBytes/size-4096/align-4/crypto/cipher-8      	 1000000	      1818 ns/op	2252.51 MB/s
PASS
ok  	github.com/gorilla/websocket	58.645s

Source code at https://github.com/nhooyr/websocket/blob/40b4f235f2e2730ad1d8b81852e7610a8251d080/mask_test.go#L136-L174

Interestingly, the crypto/cipher assembly XOR was the slowest but I might be using it wrong.

Your solution is 2x slower than gorilla and gobwas but fast enough for me as I do not want to use unsafe.

Thanks @josharian.

I'll leave this issue open in case you think there is still improvement that can be made.

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 24, 2019

How I implemented what you described:

// xor applies the WebSocket masking algorithm to p
// with the given key where the first 3 bits of pos
// are the starting position in the key.
// See https://tools.ietf.org/html/rfc6455#section-5.3
//
// The returned value is the position of the next byte
// to be used for masking in the key. This is so that
// unmasking can be performed without the entire frame.
func nhooyr(key [4]byte, keyPos int, b []byte) int {
	// If the payload is greater than 16 bytes, then it's worth
	// masking 8 bytes at a time.
	// Optimization from https://github.com/golang/go/issues/31586#issuecomment-485530859
	if len(b) > 16 {
		// We first create a key that is 8 bytes long
		// and is aligned on the keyPos correctly.
		var alignedKey [8]byte
		for i := range alignedKey {
			alignedKey[i] = key[(i+keyPos)&3]
		}
		k := binary.LittleEndian.Uint64(alignedKey[:])

		// Then we xor until b is less than 8 bytes.
		for len(b) >= 8 {
			v := binary.LittleEndian.Uint64(b)
			binary.LittleEndian.PutUint64(b, v^k)
			b = b[8:]
		}
	}

	// xor remaining bytes.
	for i := range b {
		b[i] ^= key[keyPos&3]
		keyPos++
	}
	return keyPos & 3
}

@josharian
Copy link
Contributor

@nhooyr this is a case in which a little unrolling helps. Add something along these lines:

		for len(b) >= 32 {
			v := binary.LittleEndian.Uint64(b)
			binary.LittleEndian.PutUint64(b, v^k)
			v = binary.LittleEndian.Uint64(b[8:])
			binary.LittleEndian.PutUint64(b[8:], v^k)
			v = binary.LittleEndian.Uint64(b[16:])
			binary.LittleEndian.PutUint64(b[16:], v^k)
			v = binary.LittleEndian.Uint64(b[24:])
			binary.LittleEndian.PutUint64(b[24:], v^k)
			b = b[32:]
		}

I see 30%+ improvements from that. You may want to experiment a bit to find the sweet spot.

What's going on here is that reslicing (b = b[8:]) is expensive because the compiler needs to take care to avoid creating a pointer past the end of b. Unrolling dilutes that cost. Some of the techniques discussed in #27857 might help reduce that cost as well, but those don't appear likely to land before Go 1.14 at the earliest.

And with that, I think I'll go ahead and close this issue. Thanks very much for filing it--issues like these are helpful in finding opportunities to improve the compiler and runtime.

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 29, 2019

Sweet, will use that.

Thank you for the tips.

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 29, 2019

@josharian out of curiosity but do you have any idea why the crypto/cipher xor was so slow even though its written in assembly?

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 29, 2019

Unrolling the loop to 128 makes it about as fast as gobwas/ws and gorilla/websocket for smaller sizes and faster for larger sizes 🎉

$ go test -bench=XOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8         	300000000	         4.07 ns/op	 491.49 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast-8          	300000000	         4.73 ns/op	 423.15 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/basic-8        	100000000	        12.6 ns/op	1273.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast-8         	100000000	        13.4 ns/op	1197.26 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/basic-8        	100000000	        23.1 ns/op	1388.07 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast-8         	100000000	        23.8 ns/op	1343.96 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/basic-8       	 5000000	       349 ns/op	1466.36 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast-8        	30000000	        44.4 ns/op	11531.75 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/basic-8      	  500000	      2582 ns/op	1585.85 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast-8       	 5000000	       274 ns/op	14920.34 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/basic-8     	  200000	     10239 ns/op	1600.04 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast-8      	 1000000	      1073 ns/op	15267.46 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	nhooyr.io/websocket	20.611s

@crvv
Copy link
Contributor

crvv commented Apr 29, 2019

crypto/cipher is fast for

largeBuffer := make([]byte, size)
xor(largeBuffer, ...)

But what you did is

largeBuffer := make([]byte, size)
for i := 0; i < 16; i += 16 {
    xor(largeBuffer[i:i+16], ...)
}

This is slow.

If you want that fast SIMD implementation, I think the assembly code needs to be changed for the WebSocket mask algorithm.

@nhooyr
Copy link
Contributor Author

nhooyr commented Apr 29, 2019

@crvv makes sense got it.

@josharian final code looks like https://github.com/nhooyr/websocket/blob/cfdbe6819dba18b345df68fa06ed1e499874cad8/xor.go#L28-L115

So just tiered loops going getting smaller in unrolled size, this doesn't seem to have any negative affect on performance at low or high byte sizes. However I noticed something interesting. if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.

$ benchcmp /tmp/if.txt /tmp/for.txt 
benchmark                      old ns/op     new ns/op     delta
BenchmarkXOR/2/basic-8         4.24          3.99          -5.90%
BenchmarkXOR/2/fast-8          4.49          4.89          +8.91%
BenchmarkXOR/16/basic-8        12.6          12.6          +0.00%
BenchmarkXOR/16/fast-8         12.6          12.6          +0.00%
BenchmarkXOR/32/basic-8        22.6          22.7          +0.44%
BenchmarkXOR/32/fast-8         13.4          13.2          -1.49%
BenchmarkXOR/512/basic-8       340           327           -3.82%
BenchmarkXOR/512/fast-8        62.9          44.2          -29.73%
BenchmarkXOR/4096/basic-8      2576          2588          +0.47%
BenchmarkXOR/4096/fast-8       409           280           -31.54%
BenchmarkXOR/16384/basic-8     10700         10291         -3.82%
BenchmarkXOR/16384/fast-8      1722          1089          -36.76%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkXOR/2/basic-8         471.71       501.50       1.06x
BenchmarkXOR/2/fast-8          444.95       409.10       0.92x
BenchmarkXOR/16/basic-8        1274.44      1269.48      1.00x
BenchmarkXOR/16/fast-8         1265.40      1272.86      1.01x
BenchmarkXOR/32/basic-8        1417.10      1410.06      1.00x
BenchmarkXOR/32/fast-8         2388.94      2425.89      1.02x
BenchmarkXOR/512/basic-8       1502.64      1565.02      1.04x
BenchmarkXOR/512/fast-8        8134.34      11575.80     1.42x
BenchmarkXOR/4096/basic-8      1589.47      1582.41      1.00x
BenchmarkXOR/4096/fast-8       10009.68     14615.04     1.46x
BenchmarkXOR/16384/basic-8     1531.21      1591.97      1.04x
BenchmarkXOR/16384/fast-8      9513.69      15044.05     1.58x

@josharian
Copy link
Contributor

if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.

Interesting. Branch misprediction, perhaps? Loop head unrolling? It might be instructive to look at the generated code and at what perf counters tell you. If you learn more, please share, in case it is something we can teach the compiler to fix automatically.

as fast as gobwas/ws and gorilla/websocket

Hooray!

One important caveat: This is all tuned for amd64, which is little-endian and supports unaligned loads and stores. It'll work correctly on all architectures (because it is pure Go), but it might not be fast on them.

You could probably make it fast on a per-architecture basis (using build flags), if that matters to you. You might even be able to write a useful standalone package that lets you efficiently (per-architecture) take a byte slice and process in turn some leading bytes, a bunch of uint64s, and then some trailing bytes.

@renthraysk
Copy link

Just to note the aligning key loop can be reduced to a bits.RotateLeft64

@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 6, 2019

@renthraysk To clarify, that would involve changing the key's type to uint64 from [4]byte right? i.e there is no way to just replace the loop with bits.RotateLeft64 directly.

@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 6, 2019

$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8         	298493414	         3.95 ns/op	 506.25 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast-8          	273740404	         4.43 ns/op	 451.88 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast2-8         	256999929	         4.63 ns/op	 432.01 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/basic-8        	96471366	        12.5 ns/op	1280.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast-8         	96853706	        12.0 ns/op	1330.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast2-8        	222916975	         5.42 ns/op	2951.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/basic-8        	53923512	        21.8 ns/op	1469.46 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast-8         	95436625	        12.2 ns/op	2621.93 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast2-8        	189894780	         6.17 ns/op	5187.79 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/basic-8       	 3593703	       369 ns/op	1388.04 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast-8        	32837448	        36.0 ns/op	14223.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast2-8       	40974423	        27.6 ns/op	18531.62 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/basic-8      	  474385	      2628 ns/op	1558.64 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast-8       	 5921575	       212 ns/op	19308.44 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast2-8      	 6145724	       191 ns/op	21422.74 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/basic-8     	  118580	      9822 ns/op	1668.03 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast-8      	 1565948	       767 ns/op	21355.19 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast2-8     	 1560382	       760 ns/op	21569.24 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	nhooyr.io/websocket	26.730s

fast2 is with bits.RotateLeft64, pretty significant speedup, especially from 32 till 512, where the speed pretty much doubled.

🎊

@renthraysk
Copy link

Yeah, might have misremembered, and using binary.X.Uint32() to read 4 bytes in, perform the alignment rotation with RotateLeft32() and then expanded to 64 bits x<<32|x.

Also the last remain byte loop can use RotateLeft() instead of reading from the key array.

nhooyr added a commit to coder/websocket that referenced this issue Nov 6, 2019
See golang/go#31586 (comment)

Thanks @renthraysk

Now its faster than basic XOR at every byte size greater than 3 on
little endian amd64 machines.
@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 6, 2019

coder/websocket@646e967

https://github.com/nhooyr/websocket/blob/646e967341551f499e7fc5a2286d02c9c0c238b7/frame.go#L325-L443

$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8         	344487918	         3.43 ns/op	 582.99 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast-8          	312039567	         3.91 ns/op	 511.00 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/3/basic-8         	285057960	         4.31 ns/op	 696.14 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/3/fast-8          	295787498	         4.04 ns/op	 742.36 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4/basic-8         	244778098	         4.75 ns/op	 841.51 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4/fast-8          	361466797	         3.31 ns/op	1209.12 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/8/basic-8         	166173404	         7.09 ns/op	1129.10 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/8/fast-8          	262743075	         4.55 ns/op	1757.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/basic-8        	100000000	        11.8 ns/op	1361.28 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast-8         	260987524	         4.57 ns/op	3500.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/basic-8        	56211588	        21.1 ns/op	1515.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast-8         	216847578	         5.46 ns/op	5859.79 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/128/basic-8       	13744023	        83.5 ns/op	1532.62 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/128/fast-8        	128491503	         9.30 ns/op	13764.28 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/basic-8       	 3784251	       320 ns/op	1601.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast-8        	41965264	        26.9 ns/op	19009.42 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/basic-8      	  458898	      2460 ns/op	1665.01 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast-8       	 5955132	       198 ns/op	20699.90 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/basic-8     	  117724	      9961 ns/op	1644.87 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast-8      	 1532142	       760 ns/op	21545.42 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	nhooyr.io/websocket	31.076s

Faster than basic XOR at even 3 bytes :)

Thanks @renthraysk

nhooyr added a commit to coder/websocket that referenced this issue Nov 6, 2019
See golang/go#31586 (comment)

Thanks @renthraysk

Now its faster than basic XOR at every byte size greater than 2 on
little endian amd64 machines.
nhooyr added a commit to coder/websocket that referenced this issue Nov 6, 2019
See golang/go#31586 (comment)

Thanks @renthraysk

benchmark                      old MB/s     new MB/s     speedup
BenchmarkXOR/2/basic-8         504.71       569.72       1.13x
BenchmarkXOR/2/fast-8          470.88       492.61       1.05x
BenchmarkXOR/3/basic-8         570.57       731.94       1.28x
BenchmarkXOR/3/fast-8          602.24       719.25       1.19x
BenchmarkXOR/4/basic-8         488.38       831.70       1.70x
BenchmarkXOR/4/fast-8          718.82       1186.64      1.65x
BenchmarkXOR/8/basic-8         954.59       1116.38      1.17x
BenchmarkXOR/8/fast-8          1027.60      1718.71      1.67x
BenchmarkXOR/16/basic-8        1185.95      1340.08      1.13x
BenchmarkXOR/16/fast-8         1413.31      3430.46      2.43x
BenchmarkXOR/32/basic-8        1496.85      1447.83      0.97x
BenchmarkXOR/32/fast-8         2701.81      5585.42      2.07x
BenchmarkXOR/128/basic-8       1538.28      1467.95      0.95x
BenchmarkXOR/128/fast-8        7757.97      13432.37     1.73x
BenchmarkXOR/512/basic-8       1637.51      1539.67      0.94x
BenchmarkXOR/512/fast-8        15155.03     18797.79     1.24x
BenchmarkXOR/4096/basic-8      1677.37      1636.90      0.98x
BenchmarkXOR/4096/fast-8       20689.95     20334.61     0.98x
BenchmarkXOR/16384/basic-8     1688.46      1624.81      0.96x
BenchmarkXOR/16384/fast-8      21687.87     21613.94     1.00x

Now its faster than basic XOR at every byte size greater than 2 on
little endian amd64 machines.
nhooyr added a commit to coder/websocket that referenced this issue Nov 6, 2019
See golang/go#31586 (comment)

Thanks @renthraysk

benchmark                      old MB/s     new MB/s     speedup
BenchmarkXOR/2/fast-8          470.88       492.61       1.05x
BenchmarkXOR/3/fast-8          602.24       719.25       1.19x
BenchmarkXOR/4/fast-8          718.82       1186.64      1.65x
BenchmarkXOR/8/fast-8          1027.60      1718.71      1.67x
BenchmarkXOR/16/fast-8         1413.31      3430.46      2.43x
BenchmarkXOR/32/fast-8         2701.81      5585.42      2.07x
BenchmarkXOR/128/fast-8        7757.97      13432.37     1.73x
BenchmarkXOR/512/fast-8        15155.03     18797.79     1.24x
BenchmarkXOR/4096/fast-8       20689.95     20334.61     0.98x
BenchmarkXOR/16384/fast-8      21687.87     21613.94     1.00x

Now its faster than basic XOR at every byte size greater than 2 on
little endian amd64 machines.
@renthraysk
Copy link

Another optimisation. Current code seems to be producing a bounds check per PutUint64(). Instead of passing an open ended slice b[x:] to PutUint64() using b[x:x+8] will eliminate them.
Eg
https://gist.github.com/renthraysk/5617b6aea3852c80abfa662d8ca2ff37

@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 7, 2019

Thanks. Will give it a shot at some point. coder/websocket#171 (comment)

@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 7, 2019

@renthraysk I wonder if the bound checks could be eliminated if I reverse the order in which the uint64s are xored in each loop. Thus after the first bounds check, the rest of the PutUint64's in each loop would not need a bounds check.

@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 7, 2019

Did not have any effect. Going to go with what you suggested instead.

@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 7, 2019

About a 20% speedup with values >= 128 bytes with your suggestion implemented. 🚀 🔥

See coder/websocket@15d0a18

Thanks again.

@renthraysk
Copy link

Yeah, reordering wouldn't have made a difference. It was the _ = b[7] line in PutUint64() that was cause the bounds checks. The compiler failed to prove it's not needed with PutUint64(b[8:], x) however it does eliminate the first check in PutUint64(b, x).

Looks like deficiency in the bounds checking elimination.

@josharian
Copy link
Contributor

cc @zdjones

@nhooyr
Copy link
Contributor Author

nhooyr commented Nov 9, 2019

Opened #35483

@zdjones
Copy link
Contributor

zdjones commented Nov 11, 2019

see #24876 and #19126

@golang golang locked and limited conversation to collaborators Nov 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

8 participants