Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/blsync, light/beacon: standalone beacon light sync tool (refactored version, WIP) #26874

Closed
wants to merge 15 commits into from

Conversation

zsfelfoldi
Copy link
Contributor

@zsfelfoldi zsfelfoldi commented Mar 13, 2023

This PR is the new refactored version of #25901 and is still WIP (no tests, no comments on the new parts) and some cosmetic changes are also planned but it is already working.
Please note that my Lodestar test node is resyncing atm so you can try it with the Nimbus node:

./blsync --sepolia --beacon.api "http://104.248.139.137:5052" --blsync.engine.api "http://127.0.0.1:8551" --blsync.jwtsecret "jwt.hex"

See also this top level overview doc intended to help the review:
https://gist.github.com/zsfelfoldi/254d9356632d384a05a56905fa401db6

}
}
for _, server := range servers {
if server, ok := server.(signedHeadServer); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if server, ok := server.(signedHeadServer); ok {
if server, ok := server.(signedHeadServer); !ok {
continue
}
...

Please unindent

}

// call before starting the scheduler
func (s *Scheduler) RegisterModule(m syncModule) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document what a module represents, and why we need several of them

func (s *Scheduler) syncLoop() {
s.lock.Lock()
for {
wait := true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sync loop is not very straight-forward, with the modules-abstraction, the logic regarding 'changed', the async wait for reqDone and combination with s.trigger.

What is the goal here?

I have the feeling that there are many possible "not happy paths" that can happen, if some internal assumption about consistency is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it was a quick and dirty first version, I think it's a lot better now.

Copy link

@Lou2is Lou2is left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Comment on lines +78 to +81
// SerializedCommitteeRoot calculates the root hash of the binary tree representation
// of a sync committee provided in serialized format
func (s *SerializedCommittee) Root() common.Hash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What hash format is this? Is it a guerilla ssz hasher you implemented here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically yes :) I wrote some simple tools for binary merkle trees because I needed kinda special things but then I started using it for hashing everything :) Now I started using tools from github.com/protolambda/zrnt for hashing beacon blocks, I'll check if I can easily use it here too.

@fjl
Copy link
Contributor

fjl commented Nov 10, 2023

@lightclient and I did a longer review session on this PR today, and it became clear that there is a big issue with the overall structure of the sync implementation.

The change introduces the beacon/light/request framework, and builds out the sync mechanism on top of it. The overall feedback from @holiman and @lightclient has been that the framework can be a hindrance when trying to understand how the whole thing works.

If we leave the framework in, then nobody but @zsfelfoldi will be able to maintain the light client code. The core idea is pretty neat however. Having reviewed a lot of LES protocol code also written by Zsolt in the past, I can totally see where he is coming from with this, and would like to figure out a way forward.

Let me first try to explain why the request.Scheduler / Module abstraction exists.

active vs. passive components

We tend to distinguish 'active' and 'passive' components in our designs.

Examples of active components are eth.Downloader or miner.worker. These types spawn and manage goroutines by themselves, and interactions with the type's methods will cause communication with those goroutines. An active component will typically have Start and Stop methods manageing its lifecycle. It will often have a 'main loop' driving the action.

On the other end, we have passive components, which are essentially just data structures. Examples are trie.Sync or dnsdisc.clientTree. These objects are meant to be 'driven' through their cycle by the mainloop of another component. Sticking with the example, trie.Sync decides upon the requests that will be made when syncing a trie, and handles committing the result nodes to disk, but does not actually perform the network requests itself.

The motivation for splitting up a problem into active and passive components is that active components with their goroutines and channels can be hard to test. This is especially true when timers are involved or the logic becomes complex. The best you can hope for are end-to-end tests on toy data, which will never be exhaustive and usually take kind of long to run.

If you have a complex 'passive component' driven by a simpler 'active component', it is much easier to come up with exhaustive test cases that will run in almost no time at all. Over time, we have added tools such as package common/mclock to help with testing of such components and make it independent of the system clock, so even timeout scenarios can be tested.

request.Scheduler/Module, the good parts

So what about the structure of sync in this PR? What @zsfelfoldi has realized here, is a system where whole mechanism is defined using mostly passive components. In his design, the client is split into Modules, which are invoked by the Scheduler.

Each Module can be tested independently and to some degree also be understood independently. When you go looking how header syncing works, you don't have to care too much about the other modules. All you need to know is that the Process function will run whenever anything interesting has happened, and you can verify that it will update its state correctly. As long as the module presents a consistent view to any other part that interacts with it, not much can go wrong.

Since modules run sequentially and not concurrently, they don't need to be safe for concurrent use.

What's also nice, is that the scheduler abstracts away most of the complexity that comes with having multiple servers. The modules don't have to care about peers too much, all they need to do is react to changes of the 'best chain head' and then use TryRequest to dispatch it.

Finally, the abstraction allows having optional modules defined across multiple packages. This PR uses this to good effect by defining the block-syncing logic and engine API driver in cmd/blsync, and these modules are not very large at all!

the bad parts: trigger system

As mentioned in the beginning, the main issue with the proposed design is additional complexity for code reviewers. When you look at this thing for the first time, it is very hard to see how sync actually works, because there is no obvious place where 'the steps happen'. Rather, it's the interaction between modules 'triggering' each other that makes it all work.

I kind of feel @zsfelfoldi has taken the abstraction a bit too far. Scheduler will become NodeStateMachine all over again. It's a network of events somehow causing other events, based on a graph that only really materializes at program runtime. The scheduler is not super straightforward now, and I can already see it getting more complex when the next module comes along and needs another special kind of trigger or something.

The triggers themselves are a bit weird too. They are identified by name, and can be subscribed or non-subscribed. Modules are chained together by registering a trigger of the same name. So any module can trigger any other module and you won't really know why, unless you look at trace logs. Wait, there is no logging of scheduler triggers? How did you ever debug this Zsolt!?

the bad parts: servers and requests

Request sending is heavily integrated within the Scheduler, so a special kind of 'server trigger' will be processed when a request returns. A named module trigger can be registered to run when a specific request times out.

Modules must discover the available request capabilities of servers at runtime by testing for interface implementation. It's a neat use of Go interfaces, but is it really necessary? Can't we just commit to a fixed interface provided by all servers?

In the current implementation, modules can store per-server state by putting it into an *any pointer on the server. What kind of system is that? C vibes. @zsfelfoldi probably added this to avoid tracking servers in a map in each module. That's how it should be done though.

And finally there is the issue of handling responses. While a lot of the logic runs within Process functions on the scheduler goroutine, the request methods are invoked with a callback for the response, and the callback runs on another goroutine. So the modules need to lock their state after all, and then get 'triggered' to process the updates to their state made by the callbacks. It's not great.

how to fix it

I like the idea of structuring the beacon client code as independent simple passive components. It will definitely help with testing and future extensibility. We can make this work.

Here are some ideas for improving the trigger system.

  • The Process functions already contain cheap checks whether any work needs to be performed. So it's basically always OK to run them. I propose simplifying the trigger system down into a single 'universal trigger' that runs all modules.

  • Alternatively could pre-define events in the scheduler. Any number of modules could be attached to run when an event happens, but we'd be in control of all available events. With this structure it's possible to gain a high-level understanding about the sync process based on the events/stages and memorize it.

An idea for the requests:

  • The beacon API exposes a limited number of operations. We should just define all possible requests and responses as plain types and make the modules emit them when they want to send. Responses should be given to all modules through an argument in Process and the module can check if it was waiting for the request.

    This sounds more complicated at first but we can paper over the details with a request queue helper type that can be added to a module.

the alternative

The alternative is rewriting the sync completely from scratch and just throwing the framework away. The lower-level light client types like CommitteeChain could be reused.

I'm personally not drawn to this alternative, because I think the current approach results in a more robust and flexible codebase if done properly. Even if we ship something very simple now, we will need to add support for multiple servers and strange edge cases later. It won't stay simple.

However, it is up to the people who will actually work on it to decide.

@zsfelfoldi
Copy link
Contributor Author

Closed in favor of #28822

@zsfelfoldi zsfelfoldi closed this Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants