
p2p/rlpx: reduce allocation and syscalls #22899

Merged 7 commits into ethereum:master on May 27, 2021
Conversation

@fjl (Contributor) commented May 18, 2021

This change significantly improves the performance of RLPx message reads
and writes. In the previous implementation, reading and writing of
message frames performed multiple reads and writes on the underlying
network connection, and allocated a new []byte buffer for every read.

In the new implementation, reads and writes re-use buffers, and perform
far fewer system calls on the underlying connection. This doubles the
theoretically achievable throughput on a single connection, as shown by
the benchmark result:

name             old speed      new speed       delta
Throughput-8     70.3MB/s ± 0%  155.4MB/s ± 0%  +121.11%  (p=0.000 n=9+8)

The change also removes support for the legacy, pre-EIP-8 handshake encoding.
As of May 2021, no actively maintained client sends this format.

@fjl fjl requested a review from zsfelfoldi as a code owner May 18, 2021 14:24
@fjl fjl added this to the 1.10.4 milestone May 18, 2021
@holiman (Contributor) commented May 21, 2021

Some charts from a snap-sync, where this PR is yellow, master is green.
[Four Grafana screenshots from 2021-05-21, "Dual Geth" dashboard]

No clearly visible effects, but nothing broken, so LGTM

@fjl (Contributor, author) commented May 21, 2021

I'm also still testing this, and it seems the snappy buffer management doesn't fully work yet.

fjl added 6 commits May 21, 2021 10:54
This is the minimal necessary change to make the next commit work.
This change significantly improves the performance of RLPx message reads
and writes. In the previous implementation, reading and writing of
message frames performed multiple reads and writes on the underlying
network connection, and allocated a new []byte buffer for every read.

In the new implementation, reads and writes re-use buffers, and perform
far fewer system calls on the underlying connection. This effectively
doubles the theoretically achievable throughput on a single connection,
as shown by the benchmark result:

    name             old speed      new speed       delta
    Throughput-8     66.5MB/s ± 0%  139.7MB/s ± 1%  +110.16%  (p=0.001 n=7+7)

The change also removes support for the legacy, pre-EIP-8 handshake encoding.
As of May 2021, no actively maintained client sends this format.
The snappy functions only check the length of the destination buffer and
ignore its capacity. When message sizes alternated between large and
small, we would still allocate a new buffer for every other message.
This fixes it.
Comment on lines +222 to +223
c.snappyWriteBuffer = growslice(c.snappyWriteBuffer, snappy.MaxEncodedLen(len(data)))
data = snappy.Encode(c.snappyWriteBuffer, data)
Review comment (Contributor):
Wouldn't it be good to truncate it again before this method returns? as it is, it will grow to the "largest packet you ever sent" and just stick around forever..?

@fjl (author) replied May 25, 2021:

This is intended. The idea is to keep allocating the buffer until it is large enough for the largest message we will ever send. RLPx messages are limited to 16MB, so it won't grow much larger than that.

@holiman (Contributor) commented May 25, 2021

The computeFrame function calculates the hash, then calls m.compute which does it all over again, costing an extra alloc. If that is removed:

[user@work rlpx]$ benchstat before.txt after1.txt 
name          old time/op    new time/op    delta
Throughput-6    18.4µs ±12%    18.4µs ±13%     ~     (p=0.573 n=8+10)

name          old speed      new speed      delta
Throughput-6  54.7MB/s ± 4%  55.9MB/s ±14%     ~     (p=0.315 n=7+10)

name          old alloc/op   new alloc/op   delta
Throughput-6    4.81kB ± 0%    3.84kB ± 0%  -20.08%  (p=0.000 n=8+8)

name          old allocs/op  new allocs/op  delta
Throughput-6      20.0 ± 0%      16.0 ± 0%  -20.00%  (p=0.000 n=10+10)

Also, unrolling the loop in that new compute function had a non-negligible impact, it seems:

[user@work rlpx]$ benchstat before.txt after2.txt 
name          old time/op    new time/op    delta
Throughput-6    18.4µs ±12%    17.0µs ± 6%   -7.55%  (p=0.008 n=8+9)

name          old speed      new speed      delta
Throughput-6  54.7MB/s ± 4%  60.3MB/s ± 6%  +10.20%  (p=0.000 n=7+9)

name          old alloc/op   new alloc/op   delta
Throughput-6    4.81kB ± 0%    3.84kB ± 0%  -20.07%  (p=0.000 n=8+8)

name          old allocs/op  new allocs/op  delta
Throughput-6      20.0 ± 0%      16.0 ± 0%  -20.00%  (p=0.000 n=10+10)

Hacks:

diff --git a/p2p/rlpx/rlpx.go b/p2p/rlpx/rlpx.go
index 72ba28d2ca..9132129ea9 100644
--- a/p2p/rlpx/rlpx.go
+++ b/p2p/rlpx/rlpx.go
@@ -272,7 +272,6 @@ func (m *hashMAC) compute(seed []byte) []byte {
 	if len(seed) != len(m.aesBuffer) {
 		panic("invalid MAC seed")
 	}
-
 	sum1 := m.hash.Sum(m.hashBuffer[:0])
 	m.cipher.Encrypt(m.aesBuffer[:], sum1)
 	for i := range m.aesBuffer {
@@ -283,11 +282,42 @@ func (m *hashMAC) compute(seed []byte) []byte {
 	return sum2[:16]
 }
 
+func (m *hashMAC) compute2(seed []byte) []byte {
+	if len(seed) != len(m.aesBuffer) {
+		panic("invalid MAC seed")
+	}
+	copy(m.hashBuffer[:], seed) // copy into the full buffer; [:0] has length 0 and would copy nothing
+	sum1 := m.hashBuffer[0:16]
+	m.cipher.Encrypt(m.aesBuffer[:], sum1)
+
+	m.aesBuffer[0] ^= seed[0]
+	m.aesBuffer[1] ^= seed[1]
+	m.aesBuffer[2] ^= seed[2]
+	m.aesBuffer[3] ^= seed[3]
+	m.aesBuffer[4] ^= seed[4]
+	m.aesBuffer[5] ^= seed[5]
+	m.aesBuffer[6] ^= seed[6]
+	m.aesBuffer[7] ^= seed[7]
+	m.aesBuffer[8] ^= seed[8]
+	m.aesBuffer[9] ^= seed[9]
+	m.aesBuffer[10] ^= seed[10]
+	m.aesBuffer[11] ^= seed[11]
+	m.aesBuffer[12] ^= seed[12]
+	m.aesBuffer[13] ^= seed[13]
+	m.aesBuffer[14] ^= seed[14]
+	m.aesBuffer[15] ^= seed[15]
+
+	m.hash.Write(m.aesBuffer[:])
+
+	sum2 := m.hash.Sum(m.hashBuffer[:0])
+	return sum2[:16]
+}
+
 // computeFrame computes the MAC of framedata.
 func (m *hashMAC) computeFrame(framedata []byte) []byte {
 	m.hash.Write(framedata)
 	seed := m.hash.Sum(m.seedBuffer[:0])
-	return m.compute(seed[:16])
+	return m.compute2(seed[:16])
 }
 
 // Handshake performs the handshake. This must be called before any data is written

@holiman (Contributor) commented May 26, 2021

Current status:

50% is sha3:

   20420ms 34.94% 34.94%    20420ms 34.94%  golang.org/x/crypto/sha3.keccakF1600
     970ms  1.66% 75.07%     8590ms 14.70%  golang.org/x/crypto/sha3.(*state).padAndPermute

20% syscalls:

 11890ms 20.34% 55.28%    12280ms 21.01%  syscall.Syscall

20% AES:

    4990ms  8.54% 63.82%     4990ms  8.54%  crypto/aes.encryptBlockAsm
    1040ms  1.78% 73.41%     6550ms 11.21%  crypto/cipher.(*ctr).refill

Remaining 10% is mainly test-overhead:

    3240ms  5.54% 69.36%     4220ms  7.22%  runtime.suspendG
    1330ms  2.28% 71.63%     1370ms  2.34%  runtime.nanotime (inline)
     720ms  1.23% 76.30%      720ms  1.23%  runtime.memclrNoHeapPointers
700ms  1.20% 77.50%      840ms  1.44%  github.com/golang/snappy.encodeBlock
     700ms  1.20% 78.70%      700ms  1.20%  runtime.memmove

@holiman (Contributor) commented May 26, 2021

Another potential tiny thing -- this isn't covered by any existing benchmark though, so hard to evaluate:

diff --git a/p2p/rlpx/rlpx.go b/p2p/rlpx/rlpx.go
index 326c7c4941..99fd69a452 100644
--- a/p2p/rlpx/rlpx.go
+++ b/p2p/rlpx/rlpx.go
@@ -34,6 +34,7 @@ import (
 	"net"
 	"time"
 
+	"github.com/ethereum/go-ethereum/common/bitutil"
 	"github.com/ethereum/go-ethereum/crypto"
 	"github.com/ethereum/go-ethereum/crypto/ecies"
 	"github.com/ethereum/go-ethereum/rlp"
@@ -486,11 +487,14 @@ func (h *handshakeState) secrets(auth, authResp []byte) (Secrets, error) {
 	}
 
 	// setup sha3 instances for the MACs
+	xorSum := make([]byte, len(s.MAC))
 	mac1 := sha3.NewLegacyKeccak256()
-	mac1.Write(xor(s.MAC, h.respNonce))
+	bitutil.XORBytes(xorSum, s.MAC, h.respNonce)
+	mac1.Write(xorSum)
 	mac1.Write(auth)
 	mac2 := sha3.NewLegacyKeccak256()
-	mac2.Write(xor(s.MAC, h.initNonce))
+	bitutil.XORBytes(xorSum, s.MAC, h.initNonce)
+	mac2.Write(xorSum)
 	mac2.Write(authResp)
 	if h.initiator {
 		s.EgressMAC, s.IngressMAC = mac1, mac2

@fjl fjl merged commit 7194c84 into ethereum:master May 27, 2021
atif-konasl pushed a commit to frozeman/pandora-execution-engine that referenced this pull request Oct 15, 2021