Optimized setbytes-functions #70

holiman · 2020-06-01T07:45:23Z

This was a little experiment that turned out pretty nicely.

Go-ethereum branch for experimentation: https://github.com/holiman/go-ethereum/tree/fixedbig_optimizations

Using size-specific setbyte methods, things got a lot faster, 1.34µs vs 165ns:

name                          old time/op    new time/op    delta
SetBytes/SetBytes-generic-6     1.34µs ± 1%    1.34µs ± 3%    ~     (p=0.465 n=8+9)
SetBytes/SetBytes-specific-6     178ns ± 5%     165ns ± 3%  -7.39%  (p=0.000 n=10+9)

By generating code for PUSHX instructions, I got the following improvements in geth:

name           old time/op    new time/op    delta
PushPopLoop-6    209ms ±101%      81ms ± 7%  -61.44%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
PushPopLoop-6    42.9kB ± 0%    42.6kB ± 0%   -0.66%  (p=0.000 n=8+10)

name           old allocs/op  new allocs/op  delta
PushPopLoop-6      73.4 ± 2%      71.0 ± 0%   -3.24%  (p=0.000 n=8+8)

Further (but unrelated to this PR), by also generating code for SWAPX/DUPX, I got the following:

name           old time/op    new time/op    delta
SwapDupLoop-6    65.6ms ± 5%    58.0ms ± 5%  -11.60%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
SwapDupLoop-6    48.5kB ± 0%    42.5kB ± 0%  -12.36%  (p=0.000 n=9+10)

name           old allocs/op  new allocs/op  delta
SwapDupLoop-6      69.0 ± 0%      70.5 ± 1%   +2.17%  (p=0.000 n=9+10)

codecov-commenter · 2020-06-01T11:51:10Z

Codecov Report

Merging #70 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master       #70    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            3         3            
  Lines          674       937   +263     
==========================================
+ Hits           674       937   +263

chfast · 2020-06-04T15:00:09Z

I did inlining so none SetBytesN calls other SetBytesN (most of these calls were not inlined).

name                          old time/op    new time/op    delta
SetBytes/SetBytes-generic-8      801ns ± 1%     799ns ± 1%   -0.24%  (p=0.028 n=20+20)
SetBytes/SetBytes-specific-8     127ns ± 0%      92ns ± 0%  -27.86%  (p=0.000 n=18+18)

This requires final round of cleanups.

I like this pattern
```
z[3] = ... 
z[2] = ...
z[1] = ...
z[0] = ...
```
but z[3], z[2], z[1], z[0] = ... can be used in some places. Surprisingly, they generate different assembly. Still I'd apply the first form everywhere.
Helpers like bigEndianUint40 can be implement as combination of binary.BigEndian calls. But Go compiler generates the same assembly in both cases. So we can also get rid of using binary.BigEndian entirely. Any comments?

uint256_test.go

holiman · 2020-06-04T16:38:39Z

Nice improvement!

Helpers like bigEndianUint40 can be implement as combination of binary.BigEndian calls. But Go compiler generates the same assembly in both cases. So we can also get rid of using binary.BigEndian entirely. Any comments?

Not sure -- are you saying we should copy-paste the remaining functions from binary.BigEndian too? I'd rather not, if there's no gain in it.

And another things, since the "specific" is ~10x faster than `generic", it might make sense to replace the "generic" with a "switch + specific" ?

holiman · 2020-06-04T16:42:49Z

Oh - Using a switch:

BenchmarkSetBytes/generic-6         	 4944483	       216 ns/op	       0 B/op	       0 allocs/op

vs

BenchmarkSetBytes/generic-6         	  805371	      1450 ns/op	       0 B/op	       0 allocs/op

holiman · 2020-06-04T16:51:02Z

Pushed a naive version now, along with a failing testcase as a reminder to fix it :)

chfast · 2020-06-04T17:08:00Z

Not sure -- are you saying we should copy-paste the remaining functions from binary.BigEndian too? I'd rather not, if there's no gain in it.

Both variants generate the same assembly (https://godbolt.org/z/wJE3K_):

func bigEndianUint40_1(b []byte) uint64 {
	_ = b[4] // bounds check hint to compiler; see golang.org/issue/14808
	return uint64(b[4]) | uint64(b[3])<<8 | uint64(b[2])<<16 | uint64(b[1])<<24 |
		uint64(b[0])<<32
}

func bigEndianUint40_2(b []byte) uint64 {
	_ = b[4] // bounds check hint to compiler; see golang.org/issue/14808
	return uint64(binary.BigEndian.Uint32(b[1:5])) | uint64(b[0])<<32
}

So there are 3 options here:

Keep it as it is (mixture of binary.BigEndian and single byte shifts).
Use binary.BigEndian also in bigEndianUintXX helpers.
Add more helpers based on single byte shifts - import of "encoding/binary" not needed.

chfast · 2020-06-04T17:09:00Z

And another things, since the "specific" is ~10x faster than `generic", it might make sense to replace the "generic" with a "switch + specific" ?

Yes, make sense. I actually thought this is already done, but it was only code in TestSetBytes...

holiman · 2020-06-04T19:16:21Z

So there are 3 options here:

I don't really have a strong opinion... All else being equal, I think we should take whichever looks cleanest, so that the code is easy to understand and easy for a reviewer to vet.

I personally think bigEndianUint40_1 is easier to understand than bigEndianUint40_2. I don't think we should explicitly avoid binary (option 3), since it's a (hopefully) very stable native dependency.

holiman · 2020-06-04T19:19:37Z

Oh right, another thing. In SetBytes, which behaviour do we want?

	bytearr := hex2Bytes("AAAA12131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031BBBB")
	z.SetBytes(bytearr)
	fmt.Printf("%x\n", z) // 1415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031bbbb
	z2.SetBytesx(bytearr)
	fmt.Printf("%x\n", z2) // aaaa12131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f

The current master uses the last bytes, whereas my poc used the first bytes. Which makes more sense, and why?

holiman · 2020-06-04T19:20:49Z

I guess, if we want to align with big.Int, then we should stick with master. Then SetBytes is identical to bigInt.SetBytes() followed by conversion to uint256.Int.

chfast

Looks good, but I'd like to rebase to test on big-endian. I can take care of that.

chfast · 2020-06-05T10:51:04Z

uint256.go

@@ -23,6 +23,7 @@ func NewInt() *Int {

 // SetBytes interprets buf as the bytes of a big-endian unsigned
 // integer, sets z to that value, and returns z.
+/*


yeah, why not. You or me?

holiman · 2020-06-05T11:28:19Z

uint256.go

@@ -111,7 +189,7 @@ func (z *Int) Uint64() uint64 {

 // Uint64WithOverflow returns the lower 64-bits of z and bool whether overflow occurred
 func (z *Int) Uint64WithOverflow() (uint64, bool) {
-	return z[0], z[1] != 0 || z[2] != 0 || z[3] != 0
+	return z[0], (z[1] | z[2] | z[3]) != 0


Either this isn't go-formatted, or the lines at 800 , 805 aren't go-formatted

I double checked. It looks this is different depending if part of longer boolean expression (&&, ||) or not.

also improve SetBytes-methods via compiler hints name old time/op new time/op delta SetBytes/SetBytes-generic-6 1.34µs ± 1% 1.34µs ± 3% ~ (p=0.465 n=8+9) SetBytes/SetBytes-specific-6 178ns ± 5% 165ns ± 3% -7.39% (p=0.000 n=10+9)

holiman · 2020-06-05T12:26:11Z

@chfast I rebased it, squashed it lightly and removed the junk

chfast · 2020-06-05T12:32:32Z

Merge!

holiman requested a review from chfast June 1, 2020 07:45

chfast reviewed Jun 4, 2020

View reviewed changes

uint256_test.go Outdated Show resolved Hide resolved

uint256_test.go Outdated Show resolved Hide resolved

uint256_test.go Outdated Show resolved Hide resolved

chfast reviewed Jun 5, 2020

View reviewed changes

holiman commented Jun 5, 2020

View reviewed changes

holiman and others added 4 commits June 5, 2020 14:24

conversion: length-specific setbyte methods

e2931d2

also improve SetBytes-methods via compiler hints name old time/op new time/op delta SetBytes/SetBytes-generic-6 1.34µs ± 1% 1.34µs ± 3% ~ (p=0.465 n=8+9) SetBytes/SetBytes-specific-6 178ns ± 5% 165ns ± 3% -7.39% (p=0.000 n=10+9)

Inline implementations of SetBytesN

ce6a5d5

review concerns + move tests to conversion_tests + generic dispatcher

052e292

Bring some consistency to SetBytesN implementations

cf7da23

holiman force-pushed the setbytesN branch from cb20882 to cf7da23 Compare June 5, 2020 12:25

chfast approved these changes Jun 5, 2020

View reviewed changes

holiman merged commit 12e0fb4 into master Jun 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized setbytes-functions #70

Optimized setbytes-functions #70

holiman commented Jun 1, 2020

codecov-commenter commented Jun 1, 2020 •

edited

Loading

chfast commented Jun 4, 2020

holiman commented Jun 4, 2020

holiman commented Jun 4, 2020 •

edited

Loading

holiman commented Jun 4, 2020

chfast commented Jun 4, 2020

chfast commented Jun 4, 2020

holiman commented Jun 4, 2020

holiman commented Jun 4, 2020

holiman commented Jun 4, 2020

chfast left a comment

chfast Jun 5, 2020

holiman Jun 5, 2020

holiman Jun 5, 2020

chfast Jun 5, 2020

holiman commented Jun 5, 2020

chfast commented Jun 5, 2020

Optimized setbytes-functions #70

Optimized setbytes-functions #70

Conversation

holiman commented Jun 1, 2020

codecov-commenter commented Jun 1, 2020 • edited Loading

Codecov Report

chfast commented Jun 4, 2020

holiman commented Jun 4, 2020

holiman commented Jun 4, 2020 • edited Loading

holiman commented Jun 4, 2020

chfast commented Jun 4, 2020

chfast commented Jun 4, 2020

holiman commented Jun 4, 2020

holiman commented Jun 4, 2020

holiman commented Jun 4, 2020

chfast left a comment

Choose a reason for hiding this comment

chfast Jun 5, 2020

Choose a reason for hiding this comment

holiman Jun 5, 2020

Choose a reason for hiding this comment

holiman Jun 5, 2020

Choose a reason for hiding this comment

chfast Jun 5, 2020

Choose a reason for hiding this comment

holiman commented Jun 5, 2020

chfast commented Jun 5, 2020

codecov-commenter commented Jun 1, 2020 •

edited

Loading

holiman commented Jun 4, 2020 •

edited

Loading