rlp: use atomic.Value for type cache #22902

fjl · 2021-05-18T20:17:10Z

All encoding/decoding operations read the type cache to find the
writer/decoder function responsible for a type. When analyzing CPU
profiles of geth during sync, I found that the use of sync.RWMutex in
cache lookups appears in the profiles. It seems we are running into
CPU cache contention problems when package rlp is heavily used
on all CPU cores during sync.

This change makes the type cache use atomic.Value + a writer lock instead
of sync.RWMutex. In the common case where the typeinfo entry is present in
the cache, we simply fetch the map and look up the type.

Unfortunately, it is very hard to prove whether this is an improvement
because single-threaded benchmarks won't hit the slowdown.

fjl · 2021-05-19T09:58:13Z

I have now added a benchmark to show the performance difference. The benchmark function encodes an []interface{} value on all cores simultaneously, which is the worst case for the old RWMutex-based design.

Here's the difference:

name                         old time/op  new time/op  delta
EncodeConcurrentInterface-8  4.92µs ± 0%  1.07µs ± 2%  -78.33%  (p=0.000 n=10+9)
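
The shape of such a parallel benchmark can be sketched with the standard library alone. The following is not the benchmark added in this PR and does not use package rlp: rwCache, atomicCache and benchParallel are hypothetical names, and the "cache" is a one-entry map. It only shows how testing.B.RunParallel drives the same read path from all cores at once, which is where an RWMutex read lock and an atomic.Value load start to differ. Timings depend on the machine, so no expected numbers are given.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// table is a stand-in for the cached lookup map.
type table map[string]int

// rwCache guards the map with sync.RWMutex, like the old design.
type rwCache struct {
	mu sync.RWMutex
	m  table
}

func (c *rwCache) get(k string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[k]
}

// atomicCache publishes the map through atomic.Value; reads take no lock.
type atomicCache struct {
	cur atomic.Value // always holds a table
}

func (c *atomicCache) get(k string) int {
	return c.cur.Load().(table)[k]
}

// benchParallel runs get concurrently on GOMAXPROCS goroutines.
func benchParallel(get func(string) int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				if get("uint64") != 1 {
					panic("unexpected cache value")
				}
			}
		})
	})
}

func main() {
	data := table{"uint64": 1}
	rw := &rwCache{m: data}
	at := new(atomicCache)
	at.cur.Store(data)

	fmt.Println("RWMutex:     ", benchParallel(rw.get))
	fmt.Println("atomic.Value:", benchParallel(at.get))
}
```

A single-goroutine loop over the same two get functions would show little difference, which matches the earlier remark that single-threaded benchmarks do not hit the slowdown.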

The RWMutex is clearly visible in the profile taken before the change.

fjl added 2 commits May 19, 2021 11:52

rlp: add concurrent encode benchmark

fe0202a

fjl force-pushed the rlp-typecache-atomic branch from ec41660 to b63a365 on May 19, 2021 09:54

karalabe reviewed May 21, 2021

rlp/typecache.go (resolved)

rlp: check current typecache after taking lock

4ee540d

fjl merged commit 0d076d9 into ethereum:master May 22, 2021

This was referenced Sep 23, 2022

Metadium to master METADIUM/go-metadium#24

Closed

Metadium to master METADIUM/go-metadium#25

Merged

gzliudan mentioned this pull request May 15, 2024

upgarde package rlp to 2024-05-15 XinFinOrg/XDPoSChain#542

Merged
