
cmd/blsync, light/beacon: standalone beacon light sync tool (refactored version, WIP) #26874

Closed

Conversation

@zsfelfoldi (Contributor) commented Mar 13, 2023

This PR is the new, refactored version of #25901. It is still WIP (no tests, no comments on the new parts) and some cosmetic changes are also planned, but it is already working.
Please note that my Lodestar test node is resyncing at the moment, so you can try it with the Nimbus node:

./blsync --sepolia --beacon.api "http://104.248.139.137:5052" --blsync.engine.api "http://127.0.0.1:8551" --blsync.jwtsecret "jwt.hex"

See also this top-level overview doc intended to help with the review:
https://gist.github.com/zsfelfoldi/254d9356632d384a05a56905fa401db6

}
}
for _, server := range servers {
	if server, ok := server.(signedHeadServer); ok {

Contributor:

Suggested change

-	if server, ok := server.(signedHeadServer); ok {
+	if server, ok := server.(signedHeadServer); !ok {
+		continue
+	}
+	...

Please unindent.

}

// call before starting the scheduler
func (s *Scheduler) RegisterModule(m syncModule) {
Contributor:

Please document what a module represents, and why we need several of them.

func (s *Scheduler) syncLoop() {
	s.lock.Lock()
	for {
		wait := true
Contributor:

This sync loop is not very straightforward, with the modules abstraction, the logic regarding 'changed', the async wait for reqDone and the combination with s.trigger.

What is the goal here?

I have the feeling that there are many possible "not happy paths" that can happen if some internal assumption about consistency is broken.

Contributor Author:

Indeed, it was a quick and dirty first version; I think it's a lot better now.

@Lou2is left a comment:

Nice

Comment on lines +78 to +81

// SerializedCommitteeRoot calculates the root hash of the binary tree representation
// of a sync committee provided in serialized format
func (s *SerializedCommittee) Root() common.Hash {

Contributor:

What hash format is this? Is it a guerilla ssz hasher you implemented here?

Contributor Author:

Basically yes :) I wrote some simple tools for binary merkle trees because I needed some rather special things, but then I started using them for hashing everything :) Now I have started using tools from github.com/protolambda/zrnt for hashing beacon blocks; I'll check if I can easily use them here too.
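For context on the hashing approach being discussed: SSZ-style merkleization hashes 32-byte chunks pairwise with SHA-256 until a single root remains. The sketch below illustrates that general idea only; the merkleRoot helper is made up for illustration and is not the PR's actual SerializedCommittee.Root implementation.

package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot computes the root of a binary merkle tree whose leaves are
// 32-byte chunks, hashing sibling pairs with SHA-256 until one node is left.
// The leaf count is assumed to be a power of two, as it would be for a
// serialized committee padded to a fixed size.
func merkleRoot(leaves [][32]byte) [32]byte {
	level := leaves
	for len(level) > 1 {
		next := make([][32]byte, len(level)/2)
		for i := range next {
			next[i] = sha256.Sum256(append(level[2*i][:], level[2*i+1][:]...))
		}
		level = next
	}
	return level[0]
}

func main() {
	leaves := make([][32]byte, 4) // four zero chunks, just to exercise the helper
	fmt.Printf("%x\n", merkleRoot(leaves))
}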

@fjl (Contributor) commented Nov 10, 2023

@lightclient and I did a longer review session on this PR today, and it became clear that there is a big issue with the overall structure of the sync implementation.

The change introduces the beacon/light/request framework, and builds out the sync mechanism on top of it. The overall feedback from @holiman and @lightclient has been that the framework can be a hindrance when trying to understand how the whole thing works.

If we leave the framework in, then nobody but @zsfelfoldi will be able to maintain the light client code. The core idea is pretty neat, however. Having reviewed a lot of LES protocol code, also written by Zsolt, in the past, I can totally see where he is coming from with this, and would like to figure out a way forward.

Let me first try to explain why the request.Scheduler / Module abstraction exists.

active vs. passive components

We tend to distinguish 'active' and 'passive' components in our designs.

Examples of active components are eth.Downloader or miner.worker. These types spawn and manage goroutines by themselves, and interactions with the type's methods will cause communication with those goroutines. An active component will typically have Start and Stop methods managing its lifecycle. It will often have a 'main loop' driving the action.

On the other end, we have passive components, which are essentially just data structures. Examples are trie.Sync or dnsdisc.clientTree. These objects are meant to be 'driven' through their cycle by the mainloop of another component. Sticking with the example, trie.Sync decides upon the requests that will be made when syncing a trie, and handles committing the result nodes to disk, but does not actually perform the network requests itself.

The motivation for splitting up a problem into active and passive components is that active components with their goroutines and channels can be hard to test. This is especially true when timers are involved or the logic becomes complex. The best you can hope for are end-to-end tests on toy data, which will never be exhaustive and usually take kind of long to run.

If you have a complex 'passive component' driven by a simpler 'active component', it is much easier to come up with exhaustive test cases that will run in almost no time at all. Over time, we have added tools such as package common/mclock to help with testing of such components and make it independent of the system clock, so even timeout scenarios can be tested.
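As a toy illustration of the split (the fetchState and driveLoop names below are made up, not geth APIs): the passive component is plain data with deterministic methods, and a thin active loop drives it.

package main

import "fmt"

// fetchState is a passive component: plain data plus deterministic methods, no
// goroutines, no channels. It decides what to request next and how to commit
// results, but never performs network I/O itself.
type fetchState struct {
	next    int
	results map[int]string
}

// NextRequest returns the next item to fetch, if any.
func (f *fetchState) NextRequest() (int, bool) {
	if f.next >= 3 {
		return 0, false
	}
	return f.next, true
}

// Deliver commits a response and advances the state.
func (f *fetchState) Deliver(id int, data string) {
	f.results[id] = data
	f.next = id + 1
}

// driveLoop is the thin active component: it owns the loop and the I/O, and
// simply drives the passive state through its cycle.
func driveLoop(f *fetchState, fetch func(int) string) {
	for {
		id, ok := f.NextRequest()
		if !ok {
			return
		}
		f.Deliver(id, fetch(id))
	}
}

func main() {
	f := &fetchState{results: make(map[int]string)}
	driveLoop(f, func(id int) string { return fmt.Sprintf("payload-%d", id) })
	fmt.Println(f.results)
}

A unit test for fetchState needs no goroutines or clock at all; only driveLoop touches the outside world.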

request.Scheduler/Module, the good parts

So what about the structure of sync in this PR? What @zsfelfoldi has realized here is a system where the whole mechanism is defined using mostly passive components. In his design, the client is split into Modules, which are invoked by the Scheduler.

Each Module can be tested independently and, to some degree, also understood independently. When you go looking at how header syncing works, you don't have to care too much about the other modules. All you need to know is that the Process function will run whenever anything interesting has happened, and you can verify that it will update its state correctly. As long as the module presents a consistent view to any other part that interacts with it, not much can go wrong.

Since modules run sequentially and not concurrently, they don't need to be safe for concurrent use.

What's also nice is that the scheduler abstracts away most of the complexity that comes with having multiple servers. The modules don't have to care about peers too much; all they need to do is react to changes of the 'best chain head' and then use TryRequest to dispatch their requests.

Finally, the abstraction allows having optional modules defined across multiple packages. This PR uses this to good effect by defining the block-syncing logic and engine API driver in cmd/blsync, and these modules are not very large at all!
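To make that shape concrete, here is a deliberately reduced sketch of the pattern described above. The names Module, Process and TryRequest mirror the terms used in this review, but the signatures and types are made up for illustration; this is not the PR's actual beacon/light/request API.

package main

import "fmt"

// Module is the (illustrative) passive unit of sync logic: the scheduler calls
// Process whenever something interesting may have happened, and the module
// decides whether it has any work to do.
type Module interface {
	Process(env *Environment)
}

// Environment stands in for whatever the scheduler hands to modules:
// the current best head and a way to dispatch requests.
type Environment struct {
	BestHead   uint64
	TryRequest func(req string)
}

// headSync is a toy module: it requests every header between its own progress
// and the best signed head, one request per Process round.
type headSync struct{ synced uint64 }

func (h *headSync) Process(env *Environment) {
	if h.synced >= env.BestHead {
		return // cheap check: nothing to do, so it is always safe to run
	}
	h.synced++
	env.TryRequest(fmt.Sprintf("header %d", h.synced))
}

func main() {
	env := &Environment{BestHead: 3, TryRequest: func(req string) { fmt.Println("request:", req) }}
	mods := []Module{&headSync{}}
	for round := 0; round < 4; round++ { // the scheduler loop, reduced to its essence
		for _, m := range mods {
			m.Process(env)
		}
	}
}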

the bad parts: trigger system

As mentioned in the beginning, the main issue with the proposed design is additional complexity for code reviewers. When you look at this thing for the first time, it is very hard to see how sync actually works, because there is no obvious place where 'the steps happen'. Rather, it's the interaction between modules 'triggering' each other that makes it all work.

I kind of feel @zsfelfoldi has taken the abstraction a bit too far. Scheduler will become NodeStateMachine all over again. It's a network of events somehow causing other events, based on a graph that only really materializes at program runtime. The scheduler is not super straightforward now, and I can already see it getting more complex when the next module comes along and needs another special kind of trigger or something.

The triggers themselves are a bit weird too. They are identified by name, and can be subscribed or non-subscribed. Modules are chained together by registering a trigger of the same name. So any module can trigger any other module and you won't really know why, unless you look at trace logs. Wait, there is no logging of scheduler triggers? How did you ever debug this, Zsolt!?

the bad parts: servers and requests

Request sending is heavily integrated within the Scheduler, so a special kind of 'server trigger' will be processed when a request returns. A named module trigger can be registered to run when a specific request times out.

Modules must discover the available request capabilities of servers at runtime by testing for interface implementation. It's a neat use of Go interfaces, but is it really necessary? Can't we just commit to a fixed interface provided by all servers?

In the current implementation, modules can store per-server state by putting it into an *any pointer on the server. What kind of system is that? C vibes. @zsfelfoldi probably added this to avoid tracking servers in a map in each module. A map per module is how it should be done, though.

And finally there is the issue of handling responses. While a lot of the logic runs within Process functions on the scheduler goroutine, the request methods are invoked with a callback for the response, and the callback runs on another goroutine. So the modules need to lock their state after all, and then get 'triggered' to process the updates to their state made by the callbacks. It's not great.

how to fix it

I like the idea of structuring the beacon client code as independent simple passive components. It will definitely help with testing and future extensibility. We can make this work.

Here are some ideas for improving the trigger system.

  • The Process functions already contain cheap checks whether any work needs to be performed, so it's basically always OK to run them. I propose simplifying the trigger system down to a single 'universal trigger' that runs all modules.

  • Alternatively, we could pre-define events in the scheduler. Any number of modules could be attached to run when an event happens, but we'd be in control of all available events. With this structure it's possible to gain a high-level understanding of the sync process based on the events/stages and memorize it. (A rough sketch of this option follows below.)
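A rough sketch of that second option, with made-up event and type names: the point is only that the set of events lives in the scheduler, so the trigger graph is visible in code rather than emerging at runtime.

package main

import "fmt"

// Event is an (illustrative) pre-defined scheduler event; the full set would be
// fixed in the scheduler package rather than created ad hoc by modules.
type Event int

const (
	EvNewSignedHead Event = iota
	EvResponse
	EvTimeout
)

// module stands in for a sync module reacting to events.
type module interface {
	Process(ev Event)
}

type scheduler struct {
	handlers map[Event][]module
}

// on attaches a module to one of the fixed events.
func (s *scheduler) on(ev Event, m module) {
	if s.handlers == nil {
		s.handlers = make(map[Event][]module)
	}
	s.handlers[ev] = append(s.handlers[ev], m)
}

// emit runs every module attached to the event, sequentially.
func (s *scheduler) emit(ev Event) {
	for _, m := range s.handlers[ev] {
		m.Process(ev)
	}
}

type logger string

func (l logger) Process(ev Event) { fmt.Printf("%s handles event %d\n", l, ev) }

func main() {
	var s scheduler
	s.on(EvNewSignedHead, logger("headSync"))
	s.on(EvResponse, logger("headSync"))
	s.emit(EvNewSignedHead)
	s.emit(EvResponse)
}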

An idea for the requests:

  • The beacon API exposes a limited number of operations. We should just define all possible requests and responses as plain types and make the modules emit them when they want to send. Responses should be given to all modules through an argument in Process and the module can check if it was waiting for the request.

    This sounds more complicated at first, but we can paper over the details with a request queue helper type that can be added to a module (see the sketch below).
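A sketch of what 'plain request/response types plus routing through Process' could look like; all names here are hypothetical and not taken from the PR.

package main

import "fmt"

// Illustrative plain request types for the handful of beacon API operations.
type (
	ReqUpdateRange struct{ FirstPeriod, Count uint64 }
	ReqHeader      struct{ Slot uint64 }
)

// Response carries the original request so a module can recognise its own.
type Response struct {
	Req  any
	Data string
}

// headerSync emits requests and later recognises the matching responses when
// they are fed back in through Process.
type headerSync struct {
	pending map[uint64]bool
}

func (h *headerSync) Process(resp *Response, send func(req any)) {
	if resp != nil {
		if r, ok := resp.Req.(ReqHeader); ok && h.pending[r.Slot] {
			delete(h.pending, r.Slot)
			fmt.Println("got header for slot", r.Slot, ":", resp.Data)
		}
		return
	}
	// No response to handle: decide whether to send something new.
	if len(h.pending) == 0 {
		h.pending[42] = true
		send(ReqHeader{Slot: 42})
	}
}

func main() {
	h := &headerSync{pending: make(map[uint64]bool)}
	var sent any
	h.Process(nil, func(req any) { sent = req })                     // module emits a request
	h.Process(&Response{Req: sent, Data: "header#42"}, func(any) {}) // response is routed back through Process
}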

the alternative

The alternative is rewriting the sync completely from scratch and just throwing the framework away. The lower-level light client types like CommitteeChain could be reused.

I'm personally not drawn to this alternative, because I think the current approach results in a more robust and flexible codebase if done properly. Even if we ship something very simple now, we will need to add support for multiple servers and strange edge cases later. It won't stay simple.

However, it is up to the people who will actually work on it to decide.

@zsfelfoldi (Contributor Author):
Closed in favor of #28822

@zsfelfoldi closed this Feb 1, 2024
This pull request was closed.