
Declarative RLP Encoding/Decoding #7975

Draft · wants to merge 115 commits into base: master
Conversation

@emlautarom1 (Contributor) commented Dec 26, 2024

Changes

  • Introduce an alternative approach to RLP encoding and decoding, based on a declarative API with support for code generation through Source Generators

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Notes on testing

The core library has 100% test coverage. Source generated code might not be fully covered.

Documentation

Requires documentation update

  • Yes
  • No

Requires explanation in Release Notes

  • Yes
  • No

Remarks

When we started working on refactoring our TxDecoder, one thing that came up was how unergonomic it is to work with our current RLP API. We even have comments in the code itself mentioning these difficulties, for example:

/// <summary>
/// We pay a high code quality tax for the performance optimization on RLP.
/// Adding more RLP decoders is costly (time wise) but the path taken saves a lot of allocations and GC.
/// Shall we consider code generation for this? We could potentially generate IL from attributes for each
/// RLP serializable item and keep it as a compiled call available at runtime.
/// It would be slightly slower but still much faster than what we would get from using dynamic serializers.
/// </summary>

/// <summary>
/// We pay a big copy-paste tax to maintain ValueDecoders but we believe that the amount of allocations saved
/// make it worth it. To be reviewed periodically.
/// Question to Lukasz here -> would it be fine to always use ValueDecoderContext only?
/// I believe it cannot be done for the network items decoding and is only relevant for the DB loads.
/// </summary>

This PR introduces a new RLP API based on #7334 (comment) with several improvements:

  • Describe the structure of a record and get encoding and decoding for free. No code duplication required.
  • Records can be described in terms of other records. Supports conditionals, exceptions, function calls, etc.
  • Decoding and encoding are extensible through classes that can be defined anywhere, plus some extension methods.
  • Minimal core library with 100% code coverage.
  • Supports backtracking.
  • All function calls are known ahead of time (no virtual or override). Interfaces are only used to enforce implementations.
  • Despite the extensive usage of lambdas, no closures are required (all lambdas are static). You can still use them if you want to, but overloads are provided to avoid them.
  • Automatically generate the required code through Source Generators.

@emlautarom1 (Contributor Author):
I've added a benchmark that encodes and decodes an AccessList as defined in:

public class AccessList : IEnumerable<(Address Address, AccessList.StorageKeysEnumerable StorageKeys)>

Results on my machine are the following:

| Method  | Mean     | Error   | StdDev  | Ratio |
|-------- |---------:|--------:|--------:|------:|
| Current | 343.9 us | 1.43 us | 1.34 us |  1.00 |
| Fluent  | 834.9 us | 2.34 us | 2.19 us |  2.43 |

There is room for a possible optimization: some records, like Address, have a known, fixed byte size, which we can leverage to avoid processing the bytes twice: once to figure out the length and again to actually copy them.
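The fixed-size idea can be sketched as follows (a hedged Python illustration, not the PR's code): when a type's payload size is a constant, such as the 20 bytes of an address, its RLP length is a constant too, so the measuring pass can be skipped entirely:

```python
ADDRESS_SIZE = 20  # an Ethereum address is always 20 bytes

def generic_length(payload: bytes) -> int:
    # Generic path: build (or walk) the encoding just to measure it.
    encoded = bytes([0x80 + len(payload)]) + payload  # RLP short-string rule
    return len(encoded)

def fixed_address_length() -> int:
    # Fixed-size path: 1 prefix byte + 20 payload bytes, known statically,
    # so no serialization pass is needed to compute the length.
    return 1 + ADDRESS_SIZE
```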

@emlautarom1 (Contributor Author):
Replacing Marshal.SizeOf<T>() with sizeof(T) and some unsafe annotations gives quite the boost at no cost:

| Method  | Mean     | Error   | StdDev  | Ratio | RatioSD |
|-------- |---------:|--------:|--------:|------:|--------:|
| Current | 359.8 us | 5.03 us | 4.70 us |  1.00 |    0.02 |
| Fluent  | 626.2 us | 2.90 us | 2.42 us |  1.74 |    0.02 |

var size = sizeof(T);
Span<byte> bigEndian = stackalloc byte[size];
value.WriteBigEndian(bigEndian);
bigEndian = bigEndian.TrimStart((byte)0);
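The snippet's intent, sketched in Python (hypothetical helper name; the C# version writes into a stackalloc span and trims with TrimStart):

```python
def to_trimmed_big_endian(value: int, size: int) -> bytes:
    # Write the full fixed-size big-endian representation first...
    big_endian = value.to_bytes(size, "big")
    # ...then drop leading zero bytes, mirroring TrimStart((byte)0).
    return big_endian.lstrip(b"\x00")
```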
emlautarom1 (Contributor Author):
TrimStart does not seem to be heavily optimized. There might be something better we can use, especially considering that we're removing leading zeros.

Member:
Looks like .NET doesn't use SIMD for it!
We could write a Vector-based way to find the start index.
@benaadams
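The chunked-scan idea behind a Vector-based search can be sketched in Python (illustrative only; a .NET version would compare Vector<byte> chunks against zero rather than 8-byte integers):

```python
def first_nonzero_index(data: bytes, word: int = 8) -> int:
    # Scan word-sized chunks first (the idea behind a Vector/SIMD scan):
    # a whole chunk of zero bytes compares equal to integer 0 in one step.
    i = 0
    while i + word <= len(data) and int.from_bytes(data[i:i + word], "big") == 0:
        i += word
    # Fall back to a byte-wise scan inside the first non-zero chunk.
    while i < len(data) and data[i] == 0:
        i += 1
    return i  # equals len(data) when every byte is zero
```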

@emlautarom1 emlautarom1 requested review from Scooletz and LukaszRozmej and removed request for Scooletz January 2, 2025 19:17
@Scooletz (Contributor) commented Jan 3, 2025:

> Replacing Marshal.SizeOf<T>() with sizeof(T) and some unsafe annotations gives quite the boost at no cost: [benchmark table quoted above]

2x slower. Quite a lot. Can you add the ASM diagnoser and memory diagnoser? Would be nice to compare it more.

@emlautarom1 (Contributor Author):
After running some benchmarks I found that UInt256 was getting boxed due to the use of default interface method implementations. Fixing that issue improves performance and drastically reduces memory allocations (added [MemoryDiagnoser] as requested by @Scooletz):

| Method  | Mean     | Error   | StdDev  | Ratio | Gen0     | Gen1     | Gen2     | Allocated  | Alloc Ratio |
|-------- |---------:|--------:|--------:|------:|---------:|---------:|---------:|-----------:|------------:|
| Current | 359.5 us | 0.98 us | 0.86 us |  1.00 | 166.5039 | 166.5039 | 166.5039 | 1033.87 KB |        1.00 |
| Fluent  | 530.1 us | 4.67 us | 4.14 us |  1.47 |  51.7578 |  51.7578 |  51.7578 |   617.5 KB |        0.60 |

var decoder = Eip2930.AccessListDecoder.Instance;

var length = decoder.GetLength(_current, RlpBehaviors.None);
var stream = new RlpStream(length);
Contributor:
This line is responsible for all the allocations, as it allocates a new array underneath.

Is this the case we want to benchmark, or should a reused RlpStream be used here?

emlautarom1 (Contributor Author):
To be comparable with the FluentRlp approach we should allocate a new buffer. Since both allocate a buffer of the same size, it should not matter.

@Scooletz (Contributor) commented Jan 4, 2025:
I see. If they allocate the same buffer, what makes the current approach allocate over 400 kB more, then? Is it the different return type (Nethermind.Core.Eip2930.AccessList vs. AccessList in the new one), or something else? With 400 kB more, the current approach is greatly penalized.

emlautarom1 (Contributor Author):

Interestingly, they're not allocating the same buffer size: the fluent approach uses a buffer of 170,850 bytes while the current one uses 172,845. That does not account for the 400 kB you mention, though. Rider's profiler is not giving me anything useful, so I'm kind of stuck now.

Maybe we should add other objects (ex. LogEntry, BlockInfo, etc.) to get more accurate benchmarks.

Member:
You can use NettyRlpStream, which uses arena memory.

@emlautarom1 emlautarom1 force-pushed the feature/declarative-rlp branch from dcd6d85 to 6dc21f4 Compare January 6, 2025 18:35
- Return value is now `ReadOnlyMemory<byte>`
- Add overloads for reading `ReadOnlyMemory<byte>`
- Add `FluentAssertions` extensions
…ature/declarative-rlp

# Conflicts:
#	src/Nethermind/Nethermind.Serialization.FluentRlp/Rlp.cs
/// <param name="capacity">The capacity of the underlying buffer.</param>
public FixedArrayBufferWriter(int capacity)
{
_buffer = new T[capacity];
Member:
ArrayPool?

emlautarom1 (Contributor Author):
By default I think we should go with a plain array to match the ArrayBufferWriter behavior. At the end of the day, the RLP static class is like a "safe default" API.

If we want more control over the buffers we use we can write a custom IBufferWriter as you suggested earlier while using RlpReader and RlpWriter directly.
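As a rough illustration of the pooling trade-off discussed above (a toy Python analogue, not .NET's actual ArrayPool<T> semantics — the real pool uses size buckets and does not clear buffers by default):

```python
from collections import defaultdict

class SimpleArrayPool:
    """Toy analogue of renting/returning buffers so hot paths
    stop allocating a fresh array on every encode."""

    def __init__(self):
        self._free = defaultdict(list)  # size -> list of free buffers

    def rent(self, size: int) -> bytearray:
        bucket = self._free[size]
        return bucket.pop() if bucket else bytearray(size)

    def return_buffer(self, buf: bytearray) -> None:
        buf[:] = bytes(len(buf))  # clear before reuse
        self._free[len(buf)].append(buf)
```

A plain array remains the simpler "safe default" matching ArrayBufferWriter; a pool like this only pays off when the same sizes are rented repeatedly on a hot path.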
