Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized setbytes-functions #70

Merged
merged 4 commits into from
Jun 5, 2020
Merged

Optimized setbytes-functions #70

merged 4 commits into from
Jun 5, 2020

Conversation

holiman
Copy link
Owner

@holiman holiman commented Jun 1, 2020

This was a little experiment that turned out pretty nicely.

Go-ethereum branch for experimentation: https://github.com/holiman/go-ethereum/tree/fixedbig_optimizations

Using size-specific setbyte methods, things got a lot faster, 1.34µs vs 165ns:

name                          old time/op    new time/op    delta
SetBytes/SetBytes-generic-6     1.34µs ± 1%    1.34µs ± 3%    ~     (p=0.465 n=8+9)
SetBytes/SetBytes-specific-6     178ns ± 5%     165ns ± 3%  -7.39%  (p=0.000 n=10+9)

By generating code for PUSHX instructions, I got the following improvements in geth:

name           old time/op    new time/op    delta
PushPopLoop-6    209ms ±101%      81ms ± 7%  -61.44%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
PushPopLoop-6    42.9kB ± 0%    42.6kB ± 0%   -0.66%  (p=0.000 n=8+10)

name           old allocs/op  new allocs/op  delta
PushPopLoop-6      73.4 ± 2%      71.0 ± 0%   -3.24%  (p=0.000 n=8+8)

Further (but unrelated to this PR), by also generating code for SWAPX/DUPX, I got the following:

name           old time/op    new time/op    delta
SwapDupLoop-6    65.6ms ± 5%    58.0ms ± 5%  -11.60%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
SwapDupLoop-6    48.5kB ± 0%    42.5kB ± 0%  -12.36%  (p=0.000 n=9+10)

name           old allocs/op  new allocs/op  delta
SwapDupLoop-6      69.0 ± 0%      70.5 ± 1%   +2.17%  (p=0.000 n=9+10)

@holiman holiman requested a review from chfast June 1, 2020 07:45
@codecov-commenter
Copy link

codecov-commenter commented Jun 1, 2020

Codecov Report

Merging #70 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master       #70    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            3         3            
  Lines          674       937   +263     
==========================================
+ Hits           674       937   +263     

@chfast
Copy link
Collaborator

chfast commented Jun 4, 2020

I did inlining so none SetBytesN calls other SetBytesN (most of these calls were not inlined).

name                          old time/op    new time/op    delta
SetBytes/SetBytes-generic-8      801ns ± 1%     799ns ± 1%   -0.24%  (p=0.028 n=20+20)
SetBytes/SetBytes-specific-8     127ns ± 0%      92ns ± 0%  -27.86%  (p=0.000 n=18+18)

This requires final round of cleanups.

  1. I like this pattern

    z[3] = ... 
    z[2] = ...
    z[1] = ...
    z[0] = ...
    

    but z[3], z[2], z[1], z[0] = ... can be used in some places. Surprisingly, they generate different assembly. Still I'd apply the first form everywhere.

  2. Helpers like bigEndianUint40 can be implement as combination of binary.BigEndian calls. But Go compiler generates the same assembly in both cases. So we can also get rid of using binary.BigEndian entirely. Any comments?

uint256_test.go Outdated Show resolved Hide resolved
uint256_test.go Outdated Show resolved Hide resolved
uint256_test.go Outdated Show resolved Hide resolved
@holiman
Copy link
Owner Author

holiman commented Jun 4, 2020

Nice improvement!

Helpers like bigEndianUint40 can be implement as combination of binary.BigEndian calls. But Go compiler generates the same assembly in both cases. So we can also get rid of using binary.BigEndian entirely. Any comments?

Not sure -- are you saying we should copy-paste the remaining functions from binary.BigEndian too? I'd rather not, if there's no gain in it.

And another things, since the "specific" is ~10x faster than `generic", it might make sense to replace the "generic" with a "switch + specific" ?

@holiman
Copy link
Owner Author

holiman commented Jun 4, 2020

Oh - Using a switch:

BenchmarkSetBytes/generic-6         	 4944483	       216 ns/op	       0 B/op	       0 allocs/op

vs

BenchmarkSetBytes/generic-6         	  805371	      1450 ns/op	       0 B/op	       0 allocs/op

@holiman
Copy link
Owner Author

holiman commented Jun 4, 2020

Pushed a naive version now, along with a failing testcase as a reminder to fix it :)

@chfast
Copy link
Collaborator

chfast commented Jun 4, 2020

Not sure -- are you saying we should copy-paste the remaining functions from binary.BigEndian too? I'd rather not, if there's no gain in it.

Both variants generate the same assembly (https://godbolt.org/z/wJE3K_):

func bigEndianUint40_1(b []byte) uint64 {
	_ = b[4] // bounds check hint to compiler; see golang.org/issue/14808
	return uint64(b[4]) | uint64(b[3])<<8 | uint64(b[2])<<16 | uint64(b[1])<<24 |
		uint64(b[0])<<32
}

func bigEndianUint40_2(b []byte) uint64 {
	_ = b[4] // bounds check hint to compiler; see golang.org/issue/14808
	return uint64(binary.BigEndian.Uint32(b[1:5])) | uint64(b[0])<<32
}

So there are 3 options here:

  1. Keep it as it is (mixture of binary.BigEndian and single byte shifts).
  2. Use binary.BigEndian also in bigEndianUintXX helpers.
  3. Add more helpers based on single byte shifts - import of "encoding/binary" not needed.

@chfast
Copy link
Collaborator

chfast commented Jun 4, 2020

And another things, since the "specific" is ~10x faster than `generic", it might make sense to replace the "generic" with a "switch + specific" ?

Yes, make sense. I actually thought this is already done, but it was only code in TestSetBytes...

@holiman
Copy link
Owner Author

holiman commented Jun 4, 2020

So there are 3 options here:

I don't really have a strong opinion... All else being equal, I think we should take whichever looks cleanest, so that the code is easy to understand and easy for a reviewer to vet.

I personally think bigEndianUint40_1 is easier to understand than bigEndianUint40_2. I don't think we should explicitly avoid binary (option 3), since it's a (hopefully) very stable native dependency.

@holiman
Copy link
Owner Author

holiman commented Jun 4, 2020

Oh right, another thing. In SetBytes, which behaviour do we want?

	bytearr := hex2Bytes("AAAA12131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031BBBB")
	z.SetBytes(bytearr)
	fmt.Printf("%x\n", z) // 1415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031bbbb
	z2.SetBytesx(bytearr)
	fmt.Printf("%x\n", z2) // aaaa12131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f

The current master uses the last bytes, whereas my poc used the first bytes. Which makes more sense, and why?

@holiman
Copy link
Owner Author

holiman commented Jun 4, 2020

I guess, if we want to align with big.Int, then we should stick with master. Then SetBytes is identical to bigInt.SetBytes() followed by conversion to uint256.Int.

Copy link
Collaborator

@chfast chfast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I'd like to rebase to test on big-endian. I can take care of that.

uint256.go Outdated
@@ -23,6 +23,7 @@ func NewInt() *Int {

// SetBytes interprets buf as the bytes of a big-endian unsigned
// integer, sets z to that value, and returns z.
/*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop it?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, why not. You or me?

@@ -111,7 +189,7 @@ func (z *Int) Uint64() uint64 {

// Uint64WithOverflow returns the lower 64-bits of z and bool whether overflow occurred
func (z *Int) Uint64WithOverflow() (uint64, bool) {
return z[0], z[1] != 0 || z[2] != 0 || z[3] != 0
return z[0], (z[1] | z[2] | z[3]) != 0
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either this isn't go-formatted, or the lines at 800 , 805 aren't go-formatted

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked. It looks this is different depending if part of longer boolean expression (&&, ||) or not.

holiman and others added 4 commits June 5, 2020 14:24
also improve SetBytes-methods via compiler hints

name                          old time/op    new time/op    delta
SetBytes/SetBytes-generic-6     1.34µs ± 1%    1.34µs ± 3%    ~     (p=0.465 n=8+9)
SetBytes/SetBytes-specific-6     178ns ± 5%     165ns ± 3%  -7.39%  (p=0.000 n=10+9)
@holiman
Copy link
Owner Author

holiman commented Jun 5, 2020

@chfast I rebased it, squashed it lightly and removed the junk

@chfast
Copy link
Collaborator

chfast commented Jun 5, 2020

Merge!

@holiman holiman merged commit 12e0fb4 into master Jun 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants