cmd/compile: optimize XOR masking code #31586

nhooyr · 2019-04-20T17:22:18Z

In the WebSocket protocol, clients have to generate a random masking key for every frame sent to the server and XOR mask the frame payload with the key. The server will unmask it on retrieval.

The mask/unmask function looks like this https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask_safe.go#L9-L15

This is masking byte by byte but can be made significantly faster if the masking was done word by word. Unfortunately, the Go compiler does not do this automatically so you have to use unsafe and write this yourself: https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54

Benchmarks at gorilla/websocket#31

It would be nice if the Go compiler did this automatically.

ALTree · 2019-04-20T17:44:20Z

The two links you posted are the same (copy-paste mistake?). The 2nd one in particular does not use unsafe.

nhooyr · 2019-04-20T17:47:04Z

My bad, fixed.

The second link was supposed to be https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54

odeke-em · 2019-04-21T19:54:49Z

Thank you for filing this issue @nhooyr and @ALTree for the triaging!

Kindly paging @josharian @randall77 @martisch.

crvv · 2019-04-22T07:39:41Z

Please see #28113 #30553 and #28465

And

go/src/crypto/cipher/xor_generic.go

Line 45 in 68d4b12

func fastXORBytes(dst, a, b []byte, n int) {

If you want fast xor, there are 4 different implementations in crypto/cipher.
I don't know whether the compiler can do so much optimization.

josharian · 2019-04-22T19:54:35Z

The compiler already recognizes sequences of byte loads and coalesces them into a single multi-byte loads. See e.g. the amd64 rules starting at

go/src/cmd/compile/internal/ssa/gen/AMD64.rules

Line 1531 in e6ae4e8

// Combining byte loads into larger (unaligned) loads.

. We could in theory write similar rules that convert a sequence of byte-wise load/xor/store operations to an XORQmodify.

In fact, I believe there is a way of writing this that would allow the compiler to do this optimization now. This code:

import "encoding/binary"

func f(key [4]byte, b []byte) {
	k := binary.LittleEndian.Uint32(key[:])
	v := binary.LittleEndian.Uint32(b)
	binary.LittleEndian.PutUint32(b, v^k)
}

Generates (at its core, surrounded by other stuff):

MOVQ	"".b+40(SP), AX
XORL	(AX), DX
MOVL	DX, (AX)

That seems pretty good.

What do you think, @nhooyr ?

josharian · 2019-04-22T20:05:14Z

(And if this is seriously performance-critical, you might want to manually inline the LittleEndian rountines and also duplicate key into a uint64 so that you can do 8 bytes at a time.)

nhooyr · 2019-04-22T22:30:03Z

That seems very reasonable @josharian

Will give this is a shot and let you know how the performance compares.

nhooyr · 2019-04-24T23:57:10Z

So benchmarked and here are the results on my machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):

$ go test -bench=BenchmarkMaskBytes
goos: darwin
goarch: amd64
pkg: github.com/gorilla/websocket
BenchmarkMaskBytes/size-2/align-4/byte-8         	500000000	         3.74 ns/op	 535.10 MB/s
BenchmarkMaskBytes/size-2/align-4/gorilla-8      	300000000	         3.93 ns/op	 508.44 MB/s
BenchmarkMaskBytes/size-2/align-4/gobwas-8       	300000000	         4.67 ns/op	 428.21 MB/s
BenchmarkMaskBytes/size-2/align-4/nhooyr-8       	300000000	         4.73 ns/op	 422.43 MB/s
BenchmarkMaskBytes/size-2/align-4/crypto/cipher-8         	100000000	        16.9 ns/op	 118.23 MB/s
BenchmarkMaskBytes/size-4/align-4/byte-8                  	300000000	         5.58 ns/op	 716.61 MB/s
BenchmarkMaskBytes/size-4/align-4/gorilla-8               	300000000	         5.56 ns/op	 719.80 MB/s
BenchmarkMaskBytes/size-4/align-4/gobwas-8                	200000000	         7.27 ns/op	 550.31 MB/s
BenchmarkMaskBytes/size-4/align-4/nhooyr-8                	200000000	         7.04 ns/op	 567.92 MB/s
BenchmarkMaskBytes/size-4/align-4/crypto/cipher-8         	100000000	        19.6 ns/op	 204.31 MB/s
BenchmarkMaskBytes/size-16/align-4/byte-8                 	100000000	        13.7 ns/op	1170.31 MB/s
BenchmarkMaskBytes/size-16/align-4/gorilla-8              	100000000	        20.2 ns/op	 793.15 MB/s
BenchmarkMaskBytes/size-16/align-4/gobwas-8               	200000000	         6.59 ns/op	2426.43 MB/s
BenchmarkMaskBytes/size-16/align-4/nhooyr-8               	100000000	        13.6 ns/op	1174.09 MB/s
BenchmarkMaskBytes/size-16/align-4/crypto/cipher-8        	100000000	        16.2 ns/op	 985.94 MB/s
BenchmarkMaskBytes/size-32/align-4/byte-8                 	100000000	        23.9 ns/op	1336.39 MB/s
BenchmarkMaskBytes/size-32/align-4/gorilla-8              	100000000	        20.6 ns/op	1551.76 MB/s
BenchmarkMaskBytes/size-32/align-4/gobwas-8               	200000000	         8.13 ns/op	3934.47 MB/s
BenchmarkMaskBytes/size-32/align-4/nhooyr-8               	100000000	        15.6 ns/op	2054.30 MB/s
BenchmarkMaskBytes/size-32/align-4/crypto/cipher-8        	100000000	        24.7 ns/op	1298.15 MB/s
BenchmarkMaskBytes/size-512/align-4/byte-8                	 5000000	       345 ns/op	1480.68 MB/s
BenchmarkMaskBytes/size-512/align-4/gorilla-8             	30000000	        53.8 ns/op	9525.00 MB/s
BenchmarkMaskBytes/size-512/align-4/gobwas-8              	50000000	        39.9 ns/op	12840.94 MB/s
BenchmarkMaskBytes/size-512/align-4/nhooyr-8              	20000000	        83.6 ns/op	6126.90 MB/s
BenchmarkMaskBytes/size-512/align-4/crypto/cipher-8       	10000000	       241 ns/op	2122.82 MB/s
BenchmarkMaskBytes/size-4096/align-4/byte-8               	  500000	      2674 ns/op	1531.63 MB/s
BenchmarkMaskBytes/size-4096/align-4/gorilla-8            	 5000000	       285 ns/op	14322.91 MB/s
BenchmarkMaskBytes/size-4096/align-4/gobwas-8             	 5000000	       282 ns/op	14478.90 MB/s
BenchmarkMaskBytes/size-4096/align-4/nhooyr-8             	 3000000	       539 ns/op	7596.43 MB/s
BenchmarkMaskBytes/size-4096/align-4/crypto/cipher-8      	 1000000	      1818 ns/op	2252.51 MB/s
PASS
ok  	github.com/gorilla/websocket	58.645s

Source code at https://github.com/nhooyr/websocket/blob/40b4f235f2e2730ad1d8b81852e7610a8251d080/mask_test.go#L136-L174

Interestingly, the crypto/cipher assembly XOR was the slowest but I might be using it wrong.

Your solution is 2x slower than gorilla and gobwas but fast enough for me as I do not want to use unsafe.

Thanks @josharian.

I'll leave this issue open in case you think there is still improvement that can be made.

nhooyr · 2019-04-24T23:57:52Z

How I implemented what you described:

// xor applies the WebSocket masking algorithm to p
// with the given key where the first 3 bits of pos
// are the starting position in the key.
// See https://tools.ietf.org/html/rfc6455#section-5.3
//
// The returned value is the position of the next byte
// to be used for masking in the key. This is so that
// unmasking can be performed without the entire frame.
func nhooyr(key [4]byte, keyPos int, b []byte) int {
	// If the payload is greater than 16 bytes, then it's worth
	// masking 8 bytes at a time.
	// Optimization from https://github.com/golang/go/issues/31586#issuecomment-485530859
	if len(b) > 16 {
		// We first create a key that is 8 bytes long
		// and is aligned on the keyPos correctly.
		var alignedKey [8]byte
		for i := range alignedKey {
			alignedKey[i] = key[(i+keyPos)&3]
		}
		k := binary.LittleEndian.Uint64(alignedKey[:])

		// Then we xor until b is less than 8 bytes.
		for len(b) >= 8 {
			v := binary.LittleEndian.Uint64(b)
			binary.LittleEndian.PutUint64(b, v^k)
			b = b[8:]
		}
	}

	// xor remaining bytes.
	for i := range b {
		b[i] ^= key[keyPos&3]
		keyPos++
	}
	return keyPos & 3
}

josharian · 2019-04-29T02:14:00Z

@nhooyr this is a case in which a little unrolling helps. Add something along these lines:

		for len(b) >= 32 {
			v := binary.LittleEndian.Uint64(b)
			binary.LittleEndian.PutUint64(b, v^k)
			v = binary.LittleEndian.Uint64(b[8:])
			binary.LittleEndian.PutUint64(b[8:], v^k)
			v = binary.LittleEndian.Uint64(b[16:])
			binary.LittleEndian.PutUint64(b[16:], v^k)
			v = binary.LittleEndian.Uint64(b[24:])
			binary.LittleEndian.PutUint64(b[24:], v^k)
			b = b[32:]
		}

I see 30%+ improvements from that. You may want to experiment a bit to find the sweet spot.

What's going on here is that reslicing (b = b[8:]) is expensive because the compiler needs to take care to avoid creating a pointer past the end of b. Unrolling dilutes that cost. Some of the techniques discussed in #27857 might help reduce that cost as well, but those don't appear likely to land before Go 1.14 at the earliest.

And with that, I think I'll go ahead and close this issue. Thanks very much for filing it--issues like these are helpful in finding opportunities to improve the compiler and runtime.

nhooyr · 2019-04-29T02:53:57Z

Sweet, will use that.

Thank you for the tips.

nhooyr · 2019-04-29T02:55:59Z

@josharian out of curiosity but do you have any idea why the crypto/cipher xor was so slow even though its written in assembly?

nhooyr · 2019-04-29T03:19:58Z

Unrolling the loop to 128 makes it about as fast as gobwas/ws and gorilla/websocket for smaller sizes and faster for larger sizes 🎉

$ go test -bench=XOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8         	300000000	         4.07 ns/op	 491.49 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast-8          	300000000	         4.73 ns/op	 423.15 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/basic-8        	100000000	        12.6 ns/op	1273.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast-8         	100000000	        13.4 ns/op	1197.26 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/basic-8        	100000000	        23.1 ns/op	1388.07 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast-8         	100000000	        23.8 ns/op	1343.96 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/basic-8       	 5000000	       349 ns/op	1466.36 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast-8        	30000000	        44.4 ns/op	11531.75 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/basic-8      	  500000	      2582 ns/op	1585.85 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast-8       	 5000000	       274 ns/op	14920.34 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/basic-8     	  200000	     10239 ns/op	1600.04 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast-8      	 1000000	      1073 ns/op	15267.46 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	nhooyr.io/websocket	20.611s

crvv · 2019-04-29T03:23:07Z

crypto/cipher is fast for

largeBuffer := make([]byte, size)
xor(largeBuffer, ...)

But what you did is

largeBuffer := make([]byte, size)
for i := 0; i < 16; i += 16 {
    xor(largeBuffer[i:i+16], ...)
}

This is slow.

If you want that fast SIMD implementation, I think the assembly code needs to be changed for the WebSocket mask algorithm.

nhooyr · 2019-04-29T11:52:32Z

@crvv makes sense got it.

@josharian final code looks like https://github.com/nhooyr/websocket/blob/cfdbe6819dba18b345df68fa06ed1e499874cad8/xor.go#L28-L115

So just tiered loops going getting smaller in unrolled size, this doesn't seem to have any negative affect on performance at low or high byte sizes. However I noticed something interesting. if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.

$ benchcmp /tmp/if.txt /tmp/for.txt 
benchmark                      old ns/op     new ns/op     delta
BenchmarkXOR/2/basic-8         4.24          3.99          -5.90%
BenchmarkXOR/2/fast-8          4.49          4.89          +8.91%
BenchmarkXOR/16/basic-8        12.6          12.6          +0.00%
BenchmarkXOR/16/fast-8         12.6          12.6          +0.00%
BenchmarkXOR/32/basic-8        22.6          22.7          +0.44%
BenchmarkXOR/32/fast-8         13.4          13.2          -1.49%
BenchmarkXOR/512/basic-8       340           327           -3.82%
BenchmarkXOR/512/fast-8        62.9          44.2          -29.73%
BenchmarkXOR/4096/basic-8      2576          2588          +0.47%
BenchmarkXOR/4096/fast-8       409           280           -31.54%
BenchmarkXOR/16384/basic-8     10700         10291         -3.82%
BenchmarkXOR/16384/fast-8      1722          1089          -36.76%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkXOR/2/basic-8         471.71       501.50       1.06x
BenchmarkXOR/2/fast-8          444.95       409.10       0.92x
BenchmarkXOR/16/basic-8        1274.44      1269.48      1.00x
BenchmarkXOR/16/fast-8         1265.40      1272.86      1.01x
BenchmarkXOR/32/basic-8        1417.10      1410.06      1.00x
BenchmarkXOR/32/fast-8         2388.94      2425.89      1.02x
BenchmarkXOR/512/basic-8       1502.64      1565.02      1.04x
BenchmarkXOR/512/fast-8        8134.34      11575.80     1.42x
BenchmarkXOR/4096/basic-8      1589.47      1582.41      1.00x
BenchmarkXOR/4096/fast-8       10009.68     14615.04     1.46x
BenchmarkXOR/16384/basic-8     1531.21      1591.97      1.04x
BenchmarkXOR/16384/fast-8      9513.69      15044.05     1.58x

josharian · 2019-04-29T17:46:33Z

if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.

Interesting. Branch misprediction, perhaps? Loop head unrolling? It might be instructive to look at the generated code and at what perf counters tell you. If you learn more, please share, in case it is something we can teach the compiler to fix automatically.

as fast as gobwas/ws and gorilla/websocket

Hooray!

One important caveat: This is all tuned for amd64, which is little-endian and supports unaligned loads and stores. It'll work correctly on all architectures (because it is pure Go), but it might not be fast on them.

You could probably make it fast on a per-architecture basis (using build flags), if that matters to you. You might even be able to write a useful standalone package that lets you efficiently (per-architecture) take a byte slice and process in turn some leading bytes, a bunch of uint64s, and then some trailing bytes.

renthraysk · 2019-11-05T21:58:18Z

Just to note the aligning key loop can be reduced to a bits.RotateLeft64

nhooyr · 2019-11-06T00:02:23Z

@renthraysk To clarify, that would involve changing the key's type to uint64 from [4]byte right? i.e there is no way to just replace the loop with bits.RotateLeft64 directly.

nhooyr · 2019-11-06T00:32:11Z

$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8         	298493414	         3.95 ns/op	 506.25 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast-8          	273740404	         4.43 ns/op	 451.88 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast2-8         	256999929	         4.63 ns/op	 432.01 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/basic-8        	96471366	        12.5 ns/op	1280.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast-8         	96853706	        12.0 ns/op	1330.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast2-8        	222916975	         5.42 ns/op	2951.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/basic-8        	53923512	        21.8 ns/op	1469.46 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast-8         	95436625	        12.2 ns/op	2621.93 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast2-8        	189894780	         6.17 ns/op	5187.79 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/basic-8       	 3593703	       369 ns/op	1388.04 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast-8        	32837448	        36.0 ns/op	14223.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast2-8       	40974423	        27.6 ns/op	18531.62 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/basic-8      	  474385	      2628 ns/op	1558.64 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast-8       	 5921575	       212 ns/op	19308.44 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast2-8      	 6145724	       191 ns/op	21422.74 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/basic-8     	  118580	      9822 ns/op	1668.03 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast-8      	 1565948	       767 ns/op	21355.19 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast2-8     	 1560382	       760 ns/op	21569.24 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	nhooyr.io/websocket	26.730s

fast2 is with bits.RotateLeft64, pretty significant speedup, especially from 32 till 512, where the speed pretty much doubled.

🎊

renthraysk · 2019-11-06T01:02:07Z

Yeah, might have misremembered, and using binary.X.Uint32() to read 4 bytes in, perform the alignment rotation with RotateLeft32() and then expanded to 64 bits x<<32|x.

Also the last remain byte loop can use RotateLeft() instead of reading from the key array.

@renthraysk

See golang/go#31586 (comment) Thanks @renthraysk Now its faster than basic XOR at every byte size greater than 3 on little endian amd64 machines.

nhooyr · 2019-11-06T04:45:00Z

coder/websocket@646e967

https://github.com/nhooyr/websocket/blob/646e967341551f499e7fc5a2286d02c9c0c238b7/frame.go#L325-L443

$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8         	344487918	         3.43 ns/op	 582.99 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/2/fast-8          	312039567	         3.91 ns/op	 511.00 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/3/basic-8         	285057960	         4.31 ns/op	 696.14 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/3/fast-8          	295787498	         4.04 ns/op	 742.36 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4/basic-8         	244778098	         4.75 ns/op	 841.51 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4/fast-8          	361466797	         3.31 ns/op	1209.12 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/8/basic-8         	166173404	         7.09 ns/op	1129.10 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/8/fast-8          	262743075	         4.55 ns/op	1757.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/basic-8        	100000000	        11.8 ns/op	1361.28 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16/fast-8         	260987524	         4.57 ns/op	3500.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/basic-8        	56211588	        21.1 ns/op	1515.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/32/fast-8         	216847578	         5.46 ns/op	5859.79 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/128/basic-8       	13744023	        83.5 ns/op	1532.62 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/128/fast-8        	128491503	         9.30 ns/op	13764.28 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/basic-8       	 3784251	       320 ns/op	1601.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/512/fast-8        	41965264	        26.9 ns/op	19009.42 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/basic-8      	  458898	      2460 ns/op	1665.01 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/4096/fast-8       	 5955132	       198 ns/op	20699.90 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/basic-8     	  117724	      9961 ns/op	1644.87 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR/16384/fast-8      	 1532142	       760 ns/op	21545.42 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	nhooyr.io/websocket	31.076s

Faster than basic XOR at even 3 bytes :)

Thanks @renthraysk

@renthraysk

See golang/go#31586 (comment) Thanks @renthraysk Now its faster than basic XOR at every byte size greater than 2 on little endian amd64 machines.

@renthraysk

See golang/go#31586 (comment) Thanks @renthraysk benchmark old MB/s new MB/s speedup BenchmarkXOR/2/basic-8 504.71 569.72 1.13x BenchmarkXOR/2/fast-8 470.88 492.61 1.05x BenchmarkXOR/3/basic-8 570.57 731.94 1.28x BenchmarkXOR/3/fast-8 602.24 719.25 1.19x BenchmarkXOR/4/basic-8 488.38 831.70 1.70x BenchmarkXOR/4/fast-8 718.82 1186.64 1.65x BenchmarkXOR/8/basic-8 954.59 1116.38 1.17x BenchmarkXOR/8/fast-8 1027.60 1718.71 1.67x BenchmarkXOR/16/basic-8 1185.95 1340.08 1.13x BenchmarkXOR/16/fast-8 1413.31 3430.46 2.43x BenchmarkXOR/32/basic-8 1496.85 1447.83 0.97x BenchmarkXOR/32/fast-8 2701.81 5585.42 2.07x BenchmarkXOR/128/basic-8 1538.28 1467.95 0.95x BenchmarkXOR/128/fast-8 7757.97 13432.37 1.73x BenchmarkXOR/512/basic-8 1637.51 1539.67 0.94x BenchmarkXOR/512/fast-8 15155.03 18797.79 1.24x BenchmarkXOR/4096/basic-8 1677.37 1636.90 0.98x BenchmarkXOR/4096/fast-8 20689.95 20334.61 0.98x BenchmarkXOR/16384/basic-8 1688.46 1624.81 0.96x BenchmarkXOR/16384/fast-8 21687.87 21613.94 1.00x Now its faster than basic XOR at every byte size greater than 2 on little endian amd64 machines.

@renthraysk

See golang/go#31586 (comment) Thanks @renthraysk benchmark old MB/s new MB/s speedup BenchmarkXOR/2/fast-8 470.88 492.61 1.05x BenchmarkXOR/3/fast-8 602.24 719.25 1.19x BenchmarkXOR/4/fast-8 718.82 1186.64 1.65x BenchmarkXOR/8/fast-8 1027.60 1718.71 1.67x BenchmarkXOR/16/fast-8 1413.31 3430.46 2.43x BenchmarkXOR/32/fast-8 2701.81 5585.42 2.07x BenchmarkXOR/128/fast-8 7757.97 13432.37 1.73x BenchmarkXOR/512/fast-8 15155.03 18797.79 1.24x BenchmarkXOR/4096/fast-8 20689.95 20334.61 0.98x BenchmarkXOR/16384/fast-8 21687.87 21613.94 1.00x Now its faster than basic XOR at every byte size greater than 2 on little endian amd64 machines.

renthraysk · 2019-11-06T20:14:49Z

Another optimisation. Current code seems to be producing a bounds check per PutUint64(). Instead of passing an open ended slice b[x:] to PutUint64() using b[x:x+8] will eliminate them.
Eg
https://gist.github.com/renthraysk/5617b6aea3852c80abfa662d8ca2ff37

nhooyr · 2019-11-07T01:46:58Z

Thanks. Will give it a shot at some point. coder/websocket#171 (comment)

nhooyr · 2019-11-07T01:48:54Z

@renthraysk I wonder if the bound checks could be eliminated if I reverse the order in which the uint64s are xored in each loop. Thus after the first bounds check, the rest of the PutUint64's in each loop would not need a bounds check.

nhooyr · 2019-11-07T01:55:38Z

Did not have any effect. Going to go with what you suggested instead.

nhooyr · 2019-11-07T02:07:13Z

About a 20% speedup with values >= 128 bytes with your suggestion implemented. 🚀 🔥

See coder/websocket@15d0a18

Thanks again.

renthraysk · 2019-11-07T13:08:57Z

Yeah, reordering wouldn't have made a difference. It was the _ = b[7] line in PutUint64() that was cause the bounds checks. The compiler failed to prove it's not needed with PutUint64(b[8:], x) however it does eliminate the first check in PutUint64(b, x).

Looks like deficiency in the bounds checking elimination.

josharian · 2019-11-07T15:30:07Z

cc @zdjones

nhooyr · 2019-11-09T22:52:09Z

Opened #35483

zdjones · 2019-11-11T09:36:54Z

see #24876 and #19126

ALTree added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 20, 2019

ALTree added this to the Unplanned milestone Apr 20, 2019

nhooyr mentioned this issue Apr 21, 2019

Super fast mask/unmask coder/websocket#64

Closed

odeke-em changed the title ~~cmd/compiler: optimize XOR masking code~~ cmd/compile: optimize XOR masking code Apr 21, 2019

josharian mentioned this issue Apr 22, 2019

cmd/compile: apparently extraneous inlmarks #31618

Closed

josharian mentioned this issue Apr 29, 2019

cmd/compile: unexpected performance difference accessing slices with different caps #27857

Open

josharian closed this as completed Apr 29, 2019

nhooyr mentioned this issue Apr 29, 2019

Optimize xor function further coder/websocket#76

Closed

nhooyr mentioned this issue Jun 7, 2019

Benchmark FastXOR on non amd64 platforms coder/websocket#96

Closed

josharian mentioned this issue Nov 5, 2019

crypto/cipher: optimize safeXORBytes #35381

Open

nhooyr added a commit to coder/websocket that referenced this issue Nov 6, 2019

Optimize fastXOR with math/bits

646e967

See golang/go#31586 (comment) Thanks @renthraysk Now its faster than basic XOR at every byte size greater than 3 on little endian amd64 machines.

nhooyr added a commit to coder/websocket that referenced this issue Nov 6, 2019

Optimize fastXOR with math/bits

64223e4

See golang/go#31586 (comment) Thanks @renthraysk Now its faster than basic XOR at every byte size greater than 2 on little endian amd64 machines.

nhooyr mentioned this issue Nov 6, 2019

Optimize masking with math/bits coder/websocket#171

Merged

josharian mentioned this issue Nov 6, 2019

proposal: encoding/binary: add NativeEndian #35398

Closed

nhooyr mentioned this issue Nov 9, 2019

cmd/compile: improve BCE to detect safe unbounded slicing #35483

Closed

golang locked and limited conversation to collaborators Nov 10, 2020

gopherbot added the FrozenDueToAge label Nov 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: optimize XOR masking code #31586

cmd/compile: optimize XOR masking code #31586

nhooyr commented Apr 20, 2019 •

edited

Loading

ALTree commented Apr 20, 2019

nhooyr commented Apr 20, 2019

odeke-em commented Apr 21, 2019

crvv commented Apr 22, 2019

josharian commented Apr 22, 2019

josharian commented Apr 22, 2019

nhooyr commented Apr 22, 2019

nhooyr commented Apr 24, 2019 •

edited

Loading

nhooyr commented Apr 24, 2019

josharian commented Apr 29, 2019

nhooyr commented Apr 29, 2019 •

edited

Loading

nhooyr commented Apr 29, 2019

nhooyr commented Apr 29, 2019 •

edited

Loading

crvv commented Apr 29, 2019

nhooyr commented Apr 29, 2019 •

edited

Loading

josharian commented Apr 29, 2019

renthraysk commented Nov 5, 2019

nhooyr commented Nov 6, 2019

nhooyr commented Nov 6, 2019

renthraysk commented Nov 6, 2019

nhooyr commented Nov 6, 2019

renthraysk commented Nov 6, 2019

nhooyr commented Nov 7, 2019

nhooyr commented Nov 7, 2019

nhooyr commented Nov 7, 2019

nhooyr commented Nov 7, 2019

renthraysk commented Nov 7, 2019

josharian commented Nov 7, 2019

nhooyr commented Nov 9, 2019

zdjones commented Nov 11, 2019

cmd/compile: optimize XOR masking code #31586

cmd/compile: optimize XOR masking code #31586

Comments

nhooyr commented Apr 20, 2019 • edited Loading

ALTree commented Apr 20, 2019

nhooyr commented Apr 20, 2019

odeke-em commented Apr 21, 2019

crvv commented Apr 22, 2019

josharian commented Apr 22, 2019

josharian commented Apr 22, 2019

nhooyr commented Apr 22, 2019

nhooyr commented Apr 24, 2019 • edited Loading

nhooyr commented Apr 24, 2019

josharian commented Apr 29, 2019

nhooyr commented Apr 29, 2019 • edited Loading

nhooyr commented Apr 29, 2019

nhooyr commented Apr 29, 2019 • edited Loading

crvv commented Apr 29, 2019

nhooyr commented Apr 29, 2019 • edited Loading

josharian commented Apr 29, 2019

renthraysk commented Nov 5, 2019

nhooyr commented Nov 6, 2019

nhooyr commented Nov 6, 2019

renthraysk commented Nov 6, 2019

nhooyr commented Nov 6, 2019

renthraysk commented Nov 6, 2019

nhooyr commented Nov 7, 2019

nhooyr commented Nov 7, 2019

nhooyr commented Nov 7, 2019

nhooyr commented Nov 7, 2019

renthraysk commented Nov 7, 2019

josharian commented Nov 7, 2019

nhooyr commented Nov 9, 2019

zdjones commented Nov 11, 2019

nhooyr commented Apr 20, 2019 •

edited

Loading

nhooyr commented Apr 24, 2019 •

edited

Loading

nhooyr commented Apr 29, 2019 •

edited

Loading

nhooyr commented Apr 29, 2019 •

edited

Loading

nhooyr commented Apr 29, 2019 •

edited

Loading