
p2p/rlpx: reduce allocation and syscalls #22899

Merged 7 commits into ethereum:master on May 27, 2021
Conversation

@fjl (Contributor) commented May 18, 2021

This change significantly improves the performance of RLPx message reads
and writes. In the previous implementation, reading and writing of
message frames performed multiple reads and writes on the underlying
network connection, and allocated a new []byte buffer for every read.

In the new implementation, reads and writes re-use buffers, and perform
far fewer system calls on the underlying connection. This doubles the
theoretically achievable throughput on a single connection, as shown by
the benchmark result:

name             old speed      new speed       delta
Throughput-8     70.3MB/s ± 0%  155.4MB/s ± 0%  +121.11%  (p=0.000 n=9+8)

The change also removes support for the legacy, pre-EIP-8 handshake encoding.
As of May 2021, no actively maintained client sends this format.

@fjl fjl requested a review from zsfelfoldi as a code owner May 18, 2021 14:24
@fjl fjl added this to the 1.10.4 milestone May 18, 2021
@holiman (Contributor) commented May 21, 2021

Some charts from a snap-sync, where this PR is yellow, master is green.
[Four Grafana screenshots from 2021-05-21, "Dual Geth" dashboard]

No clearly visible effects, but nothing broken, so LGTM

@fjl (Contributor, author) commented May 21, 2021

I'm also still testing this, and it seems the snappy buffer management doesn't fully work yet.

fjl added 6 commits May 21, 2021 10:54
This is the minimal necessary change to make the next commit work.
This change significantly improves the performance of RLPx message reads
and writes. In the previous implementation, reading and writing of
message frames performed multiple reads and writes on the underlying
network connection, and allocated a new []byte buffer for every read.

In the new implementation, reads and writes re-use buffers, and perform
far fewer system calls on the underlying connection. This effectively
doubles the theoretically achievable throughput on a single connection,
as shown by the benchmark result:

    name             old speed      new speed       delta
    Throughput-8     66.5MB/s ± 0%  139.7MB/s ± 1%  +110.16%  (p=0.001 n=7+7)

The change also removes support for the legacy, pre-EIP-8 handshake encoding.
As of May 2021, no actively maintained client sends this format.
The snappy functions only check the length of the destination buffer and
ignore its capacity. When message sizes alternated between large and
small, we would still allocate a new buffer for every other message.
This fixes it.
Comment on lines +222 to +223
c.snappyWriteBuffer = growslice(c.snappyWriteBuffer, snappy.MaxEncodedLen(len(data)))
data = snappy.Encode(c.snappyWriteBuffer, data)
Review comment (Contributor):
Wouldn't it be good to truncate it again before this method returns? as it is, it will grow to the "largest packet you ever sent" and just stick around forever..?

@fjl (author) replied May 25, 2021:

This is intended. The idea is to keep allocating the buffer until it is large enough for the largest message we will ever send. RLPx messages are limited to 16MB, so it won't grow much larger than that.

@holiman (Contributor) commented May 25, 2021

The computeFrame function calculates the hash, then calls m.compute which does it all over again, costing an extra alloc. If that is removed:

[user@work rlpx]$ benchstat before.txt after1.txt 
name          old time/op    new time/op    delta
Throughput-6    18.4µs ±12%    18.4µs ±13%     ~     (p=0.573 n=8+10)

name          old speed      new speed      delta
Throughput-6  54.7MB/s ± 4%  55.9MB/s ±14%     ~     (p=0.315 n=7+10)

name          old alloc/op   new alloc/op   delta
Throughput-6    4.81kB ± 0%    3.84kB ± 0%  -20.08%  (p=0.000 n=8+8)

name          old allocs/op  new allocs/op  delta
Throughput-6      20.0 ± 0%      16.0 ± 0%  -20.00%  (p=0.000 n=10+10)

Also, unrolling the loop in that new compute function had a non-negligible impact, it seems:

[user@work rlpx]$ benchstat before.txt after2.txt 
name          old time/op    new time/op    delta
Throughput-6    18.4µs ±12%    17.0µs ± 6%   -7.55%  (p=0.008 n=8+9)

name          old speed      new speed      delta
Throughput-6  54.7MB/s ± 4%  60.3MB/s ± 6%  +10.20%  (p=0.000 n=7+9)

name          old alloc/op   new alloc/op   delta
Throughput-6    4.81kB ± 0%    3.84kB ± 0%  -20.07%  (p=0.000 n=8+8)

name          old allocs/op  new allocs/op  delta
Throughput-6      20.0 ± 0%      16.0 ± 0%  -20.00%  (p=0.000 n=10+10)

Hacks:

diff --git a/p2p/rlpx/rlpx.go b/p2p/rlpx/rlpx.go
index 72ba28d2ca..9132129ea9 100644
--- a/p2p/rlpx/rlpx.go
+++ b/p2p/rlpx/rlpx.go
@@ -272,7 +272,6 @@ func (m *hashMAC) compute(seed []byte) []byte {
 	if len(seed) != len(m.aesBuffer) {
 		panic("invalid MAC seed")
 	}
-
 	sum1 := m.hash.Sum(m.hashBuffer[:0])
 	m.cipher.Encrypt(m.aesBuffer[:], sum1)
 	for i := range m.aesBuffer {
@@ -283,11 +282,42 @@ func (m *hashMAC) compute(seed []byte) []byte {
 	return sum2[:16]
 }
 
+func (m *hashMAC) compute2(seed []byte) []byte {
+	if len(seed) != len(m.aesBuffer) {
+		panic("invalid MAC seed")
+	}
+	copy(m.hashBuffer[:], seed) // copy into the full buffer; [:0] has length 0 and would copy nothing
+	sum1 := m.hashBuffer[0:16]
+	m.cipher.Encrypt(m.aesBuffer[:], sum1)
+
+	m.aesBuffer[0] ^= seed[0]
+	m.aesBuffer[1] ^= seed[1]
+	m.aesBuffer[2] ^= seed[2]
+	m.aesBuffer[3] ^= seed[3]
+	m.aesBuffer[4] ^= seed[4]
+	m.aesBuffer[5] ^= seed[5]
+	m.aesBuffer[6] ^= seed[6]
+	m.aesBuffer[7] ^= seed[7]
+	m.aesBuffer[8] ^= seed[8]
+	m.aesBuffer[9] ^= seed[9]
+	m.aesBuffer[10] ^= seed[10]
+	m.aesBuffer[11] ^= seed[11]
+	m.aesBuffer[12] ^= seed[12]
+	m.aesBuffer[13] ^= seed[13]
+	m.aesBuffer[14] ^= seed[14]
+	m.aesBuffer[15] ^= seed[15]
+
+	m.hash.Write(m.aesBuffer[:])
+
+	sum2 := m.hash.Sum(m.hashBuffer[:0])
+	return sum2[:16]
+}
+
 // computeFrame computes the MAC of framedata.
 func (m *hashMAC) computeFrame(framedata []byte) []byte {
 	m.hash.Write(framedata)
 	seed := m.hash.Sum(m.seedBuffer[:0])
-	return m.compute(seed[:16])
+	return m.compute2(seed[:16])
 }
 
 // Handshake performs the handshake. This must be called before any data is written

@holiman (Contributor) commented May 26, 2021

Current status:

50% is sha3:

   20420ms 34.94% 34.94%    20420ms 34.94%  golang.org/x/crypto/sha3.keccakF1600
     970ms  1.66% 75.07%     8590ms 14.70%  golang.org/x/crypto/sha3.(*state).padAndPermute

20% syscalls:

 11890ms 20.34% 55.28%    12280ms 21.01%  syscall.Syscall

20% AES:

    4990ms  8.54% 63.82%     4990ms  8.54%  crypto/aes.encryptBlockAsm
    1040ms  1.78% 73.41%     6550ms 11.21%  crypto/cipher.(*ctr).refill

Remaining 10% is mainly test-overhead:

    3240ms  5.54% 69.36%     4220ms  7.22%  runtime.suspendG
    1330ms  2.28% 71.63%     1370ms  2.34%  runtime.nanotime (inline)
     720ms  1.23% 76.30%      720ms  1.23%  runtime.memclrNoHeapPointers
700ms  1.20% 77.50%      840ms  1.44%  github.com/golang/snappy.encodeBlock
     700ms  1.20% 78.70%      700ms  1.20%  runtime.memmove

@holiman (Contributor) commented May 26, 2021

Another potential tiny thing -- this isn't covered by any existing benchmark though, so hard to evaluate:

diff --git a/p2p/rlpx/rlpx.go b/p2p/rlpx/rlpx.go
index 326c7c4941..99fd69a452 100644
--- a/p2p/rlpx/rlpx.go
+++ b/p2p/rlpx/rlpx.go
@@ -34,6 +34,7 @@ import (
 	"net"
 	"time"
 
+	"github.com/ethereum/go-ethereum/common/bitutil"
 	"github.com/ethereum/go-ethereum/crypto"
 	"github.com/ethereum/go-ethereum/crypto/ecies"
 	"github.com/ethereum/go-ethereum/rlp"
@@ -486,11 +487,14 @@ func (h *handshakeState) secrets(auth, authResp []byte) (Secrets, error) {
 	}
 
 	// setup sha3 instances for the MACs
+	xorSum := make([]byte, len(s.MAC))
 	mac1 := sha3.NewLegacyKeccak256()
-	mac1.Write(xor(s.MAC, h.respNonce))
+	bitutil.XORBytes(xorSum, s.MAC, h.respNonce)
+	mac1.Write(xorSum)
 	mac1.Write(auth)
 	mac2 := sha3.NewLegacyKeccak256()
-	mac2.Write(xor(s.MAC, h.initNonce))
+	bitutil.XORBytes(xorSum, s.MAC, h.initNonce)
+	mac2.Write(xorSum)
 	mac2.Write(authResp)
 	if h.initiator {
 		s.EgressMAC, s.IngressMAC = mac1, mac2

@fjl fjl merged commit 7194c84 into ethereum:master May 27, 2021
atif-konasl pushed a commit to frozeman/pandora-execution-engine that referenced this pull request Oct 15, 2021