Beat [2/4]: implement blockbeat #8894

Conversation
Force-pushed from a0393a0 to 9d6f8e7.
Did a first pass to load context.
The sweeper commits towards the end seem to make sense, but I'm missing all the review context from the sweeper PR saga to make a good judgement call.
But I really like the concept of the block beat! This should help us quite a bit in synchronizing all subsystems.
I assume a follow-up PR that refactors all subsystems to use this will come once this lands?
Force-pushed from 16e2353 to 7183080.
Really liking this! Left some questions so long (pretty much all just questions for my understanding). Gonna need another round to really grok things
Force-pushed from 755c7f3 to 9ec032a.
Force-pushed from 1b9497c to 842d88f.
A couple of high-level concerns:
- The timeout might be too short, causing complete node shutdown under transient load when previously recovery would have been possible.
- `TxPublisher` should receive beats before `UtxoSweeper`, as discussed previously.
contractcourt/chain_arbitrator.go (Outdated)
c.Unlock()

// Iterate all the copied channels and send the blockbeat to them.
err := beat.DispatchConcurrent(channels)
I worry about timeouts in extreme cases here. `handleBlockbeat` has 30 seconds to complete, but every single channel arbitrator needs to process the block in that time. If there are thousands of channels, how long does the dispatch actually take?
If there's no force close it takes microseconds; otherwise it's milliseconds.
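For readers without the diff context, a hedged sketch of the fan-out shape such a helper could have; the real `DispatchConcurrent` in the `chainio` package isn't shown in this thread, so the names and the trimmed interfaces below are stand-ins, not the actual source:

```go
package sketch

import "sync"

// Trimmed stand-ins for the chainio types in this PR; the real
// interfaces have more methods.
type Blockbeat interface {
	Height() int32
}

type Consumer interface {
	ProcessBlock(b Blockbeat) error
}

// dispatchConcurrent fans a beat out to every consumer at once, so a
// single slow channel arbitrator does not serialize the others: the
// 30s budget then bounds the slowest consumer rather than the sum of
// all of them.
func dispatchConcurrent(b Blockbeat, consumers []Consumer) error {
	errChan := make(chan error, len(consumers))

	var wg sync.WaitGroup
	for _, c := range consumers {
		wg.Add(1)
		go func(c Consumer) {
			defer wg.Done()

			// The channel is buffered, so this send never
			// blocks even if the caller has stopped reading.
			errChan <- c.ProcessBlock(b)
		}(c)
	}
	wg.Wait()
	close(errChan)

	// Surface the first error, if any.
	for err := range errChan {
		if err != nil {
			return err
		}
	}

	return nil
}
```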
Force-pushed from 7b05026 to d76a26e.
Force-pushed from 9ec032a to f6c5de6.
@yyforyongyu, remember to re-request review from reviewers when ready
Force-pushed from f6c5de6 to e4c6bd4, 4e00941 to 7b1b444, e4c6bd4 to d7d2036, 7b1b444 to d900f98, d900f98 to 0ad97ec, e40d7aa to 736edda, and dd1d406 to b6c7165.
I think it's very close. We still need to decide on the following:
- Will we add a ctx where possible, and also get the benefit of structured logging where possible?
- Maybe fix the lingering processBlock goroutine, e.g. via an errgroup with a timeout ctx? => was not taking the buffered errChan into account
// NOTE: The consumer must try its best to NOT return an error. If an
// error is returned from processing the block, it means the subsystem
// cannot react to onchain state changes and lnd will shutdown.
ProcessBlock(b Blockbeat) error
The design change, however, will keep piling up goroutines when we time out for whatever reason. So if the consumer fails to report back within 60 sec even once, this goroutine will linger around forever until the consumer closes.
}

// Create a customized logger for the blockbeat.
logPrefix := fmt.Sprintf("Height[%6d]:", b.Height())
I think we should go with structured logging from the beginning; this might also facilitate the structure, and we would not need a logger object in the blockbeat.
errChan := make(chan error, 1)
go func() {
	errChan <- c.ProcessBlock(b)
Structured logging is also an argument for the ctx.
return fmt.Errorf("%s got err in ProcessBlock: %w", c.Name(),
	err)

case <-time.After(timeout):
I don't think it's a good idea to use a timeout here rather than cancelling a context, because the spawned goroutine might linger around if it does not return in time.
Ok, I misunderstood: the errChan is buffered, so there is no way the spawned ProcessBlock worker gets stuck in case we prematurely exit because of a timeout.
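To spell out why the buffered channel resolves the concern, here is a sketch of the pattern quoted above (the wrapper name `notifyAndWait` is hypothetical, and the interfaces are trimmed stand-ins):

```go
package sketch

import (
	"fmt"
	"time"
)

type Blockbeat interface {
	Height() int32
}

type Consumer interface {
	Name() string
	ProcessBlock(b Blockbeat) error
}

// notifyAndWait mirrors the quoted pattern. Because errChan has
// capacity 1, the worker's send always succeeds, so even when we give
// up after the timeout the goroutine exits as soon as ProcessBlock
// returns instead of lingering.
func notifyAndWait(c Consumer, b Blockbeat, timeout time.Duration) error {
	errChan := make(chan error, 1)
	go func() {
		errChan <- c.ProcessBlock(b)
	}()

	select {
	case err := <-errChan:
		if err != nil {
			return fmt.Errorf("%s got err in ProcessBlock: %w",
				c.Name(), err)
		}
		return nil

	case <-time.After(timeout):
		// We stop waiting, but the worker is not leaked: its
		// buffered send completes whenever ProcessBlock returns.
		return fmt.Errorf("%s timed out in ProcessBlock", c.Name())
	}
}
```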
Yeah - ideally we want everything to be attached to the overall LND parent context in some way, so that a cancel cancels things like this.
But I think implementing this in a contained way is easier said than done. I think we may need a big LND overhaul that makes sure the same parent context is shared by all subprocesses before we can do this properly.
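For comparison, a sketch of the context-based alternative discussed here. The ctx-taking `ProcessBlock` signature is an assumption, not the PR's actual interface:

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

type Blockbeat interface {
	Height() int32
}

// CtxConsumer is a hypothetical ctx-aware variant of the Consumer
// interface; the PR's actual ProcessBlock does not take a context.
type CtxConsumer interface {
	Name() string
	ProcessBlock(ctx context.Context, b Blockbeat) error
}

// processWithCtx shows how the per-beat timeout and an overall parent
// shutdown could share one cancellation mechanism: a well-behaved
// consumer observes ctx.Done() and returns early instead of running
// on unobserved.
func processWithCtx(parent context.Context, c CtxConsumer, b Blockbeat,
	timeout time.Duration) error {

	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	errChan := make(chan error, 1)
	go func() {
		errChan <- c.ProcessBlock(ctx, b)
	}()

	select {
	case err := <-errChan:
		return err

	case <-ctx.Done():
		// Fires on both the timeout and a parent shutdown.
		return fmt.Errorf("%s: %w", c.Name(), ctx.Err())
	}
}
```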
// Otherwise, when living in the same queue, the subsystems are notified of the
// new blocks sequentially, which means it's critical to understand the
// relationship of these systems to properly handle the order.
type BlockbeatDispatcher struct {
You mean that `NotifyBlockProcessed` should be a callback when calling `ProcessBlock`? I think that makes sense.
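A hypothetical sketch of the callback shape floated here, purely to illustrate the idea (names assumed):

```go
package sketch

type Blockbeat interface {
	Height() int32
}

// Consumer, callback-style: the dispatcher hands ProcessBlock a done
// callback, so the consumer signals completion without a separate
// NotifyBlockProcessed call.
type Consumer interface {
	Name() string

	// ProcessBlock must invoke done(err) exactly once when the
	// beat has been fully handled.
	ProcessBlock(b Blockbeat, done func(err error))
}
```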
LGTM 🎊
As discussed via other means, the structured logging will not be part of this series and needs a greater change for the whole LND design.
Really great work & fantastic commit structure 🥇
// NOTE: The consumer must try its best to NOT return an error. If an
// error is returned from processing the block, it means the subsystem
// cannot react to onchain state changes and lnd will shutdown.
ProcessBlock(b Blockbeat) error
Are you suggesting we should have the ability to cancel the state change made from calling the methods of a subsystem?
We can have this discussion offline too, cause I don't think we need to do this in this PR.
But yes: I think any API interface method should take a context, so that if the caller is quitting, the method they are calling into doesn't just hang while it is busy.
}

// Create a customized logger for the blockbeat.
logPrefix := fmt.Sprintf("Height[%6d]:", b.Height())
it's tricky cause we need to do proper context usage first which is hard to contain to just a single system...
}

// Create a customized logger for the blockbeat.
logPrefix := fmt.Sprintf("Height[%6d]:", b.Height())
Does it mean we don't need the prefix logger anymore?
Yeah, we can eventually remove the prefix logger. I think we just need to get used to the k=v pairs at the end of the log line that we can filter on instead of the prefix. The prefix, I think, should just be the subsystem.
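To illustrate the k=v shape being discussed (using the standard library's log/slog here purely for the example; lnd's own logging package would look different):

```go
package main

import "log/slog"

func main() {
	// Prefix style this thread wants to move away from:
	//   SWEEP: Height[123456]: sweeping 2 inputs
	//
	// Structured style: the height becomes a filterable k=v pair
	// at the end of the line, and the prefix is just the subsystem.
	logger := slog.Default().With("subsystem", "SWEEP")
	logger.Info("sweeping inputs", "height", 123456, "num_inputs", 2)
	// Output resembles:
	//   INFO sweeping inputs subsystem=SWEEP height=123456 num_inputs=2
}
```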
// Otherwise, when living in the same queue, the subsystems are notified of the
// new blocks sequentially, which means it's critical to understand the
// relationship of these systems to properly handle the order.
type BlockbeatDispatcher struct {
Yeah, could be a `Processed` method on `BlockBeat`.
Force-pushed from 736edda to 1e17c5c.
Force-pushed from b6c7165 to 616399e.
This commit initializes the package `chainio` and defines the interfaces `Blockbeat` and `Consumer`. The `Consumer` must be implemented by other subsystems if they require block epoch subscription.
In this commit, a minimal implementation of `Blockbeat` is added to synchronize block heights, which will be used in `ChainArb`, `Sweeper`, and `TxPublisher` so blocks are processed sequentially among them.
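Pieced together from the snippets quoted in this review, the two interfaces look roughly like this (a sketch; doc comments abridged and the exact source may differ):

```go
package chainio

// Blockbeat wraps a new block and carries the height that consumers
// synchronize on.
type Blockbeat interface {
	// Height returns the height of the block this beat wraps.
	Height() int32
}

// Consumer must be implemented by any subsystem that needs block
// epoch subscription via the dispatcher.
type Consumer interface {
	// Name returns the subsystem's name, used in logs and errors.
	Name() string

	// ProcessBlock handles the new blockbeat.
	//
	// NOTE: The consumer must try its best to NOT return an error.
	// If an error is returned from processing the block, it means
	// the subsystem cannot react to onchain state changes and lnd
	// will shutdown.
	ProcessBlock(b Blockbeat) error
}
```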
This commit adds two methods to handle dispatching beats. These are exported methods so other systems can send beats to their managed subinstances.
This commit adds a blockbeat dispatcher which handles sending new blocks to all subscribed consumers.
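A hedged sketch of the dispatcher's core loop, inferred from the `BlockbeatDispatcher` doc comment quoted earlier (consumers in the same queue are notified sequentially, independent queues concurrently); structure and names are illustrative, not the actual implementation:

```go
package sketch

import "sync"

type Blockbeat interface {
	Height() int32
}

type Consumer interface {
	Name() string
	ProcessBlock(b Blockbeat) error
}

// blockbeatDispatcher keeps its consumers in queues: ordering is only
// guaranteed within a queue, so subsystems whose relative order
// matters must share one.
type blockbeatDispatcher struct {
	queues map[uint32][]Consumer
}

// notifyQueues delivers a beat to every queue concurrently, but to
// the consumers inside each queue strictly one after another.
func (d *blockbeatDispatcher) notifyQueues(b Blockbeat) error {
	var wg sync.WaitGroup
	errChan := make(chan error, len(d.queues))

	for _, queue := range d.queues {
		wg.Add(1)
		go func(queue []Consumer) {
			defer wg.Done()

			// Sequential within the queue: a consumer only
			// sees the beat after its predecessor is done.
			for _, c := range queue {
				if err := c.ProcessBlock(b); err != nil {
					errChan <- err
					return
				}
			}
		}(queue)
	}
	wg.Wait()
	close(errChan)

	// In lnd an error here is fatal and triggers shutdown.
	for err := range errChan {
		if err != nil {
			return err
		}
	}
	return nil
}
```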
This commit implements `Consumer` on `TxPublisher`, `UtxoSweeper`, `ChainArbitrator` and `ChannelArbitrator`.
This commit removes the independent block subscriptions in `UtxoSweeper` and `TxPublisher`. These subsystems now listen to the `BlockbeatChan` for new blocks.
This commit removes the hack introduced in #4851. Previously we had this issue because the chain notifier was stopped before the sweeper; that was changed a while back, and we now always stop the chain notifier last. In addition, since we no longer subscribe to the block epoch chan directly, this issue can no longer happen.
The sweeper can handle the waiting, so there's no need to wait for blocks inside the resolvers. Offering the inputs prior to their mature heights also guarantees that inputs with the same deadline are aggregated.
This commit removes the block subscriptions used in `ChainArbitrator` and replaces them with the blockbeat managed by `BlockbeatDispatcher`.
This commit removes the block subscriptions used in `ChannelArbitrator`, replacing them with the blockbeat managed by `BlockbeatDispatcher`.
This `immediate` flag was added as a hack so that during a restart, the pending resolvers would offer the inputs to the sweeper and ask it to sweep them immediately. This is no longer needed due to `blockbeat`, as now during restart a block is always sent to all subsystems via the flow `ChainArb` -> `ChannelArb` -> resolvers -> sweeper. Thus, when there are pending inputs offered, they will be processed by the sweeper immediately.
To avoid calling GetBestBlock again.
This is needed so the consumers have an initial state about the current block.
In this commit we start to break up the starting process into smaller pieces, which is needed in the following commit to initialize blockbeat consumers.
Refactor the `Start` method to fix the linter error:
```
contractcourt/chain_arbitrator.go:568: Function 'Start' is too long (242 > 200) (funlen)
```
Force-pushed from 1e17c5c to 9ab2e53.
Depends on #8893.
NOTE: itest is fixed in the final PR #9260
This PR introduces a minimal version of the new service `Blockbeat`, as described in #7952, to handle block synchronization among different subsystems. During startup, blockbeat consumers are registered in the `BlockbeatDispatcher`, a subservice that's responsible for dispatching new blockbeats to its consumers and waiting for them to finish processing the blocks. If any of the consumers fails to process the block within 30s, or encounters an error during block processing, the system will shut down, as it's critical to handle blocks.

This PR focuses on implementing the blockbeat `Consumer` interface for `ChainArb`, `UtxoSweeper` and `TxPublisher`; the following PR focuses on finalizing blockbeat processing in `ChainArb`'s subservices - `ChannelArbitrator`, `chainWatcher`, and `ContractResolver`.

Overview
The flow of the blockbeat process is shown below. Whenever a new block arrives, it goes through the waterfall like this (a wiring sketch follows the list):
1. The `BlockbeatDispatcher` receives the block, wraps it in a `blockbeat`, and sends it to its consumers sequentially.
2. `ChainArb` receives the blockbeat, processes it and signals back when done.
3. `UtxoSweeper` receives the blockbeat, processes it and signals back when done.
4. `TxPublisher` receives the blockbeat, processes it and signals back when done.
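The ordering in this waterfall would be enforced by registering all three consumers into a single dispatcher queue. A hypothetical wiring sketch; the constructor and method names below are assumptions for illustration, not verified against this PR:

```go
// Hypothetical registration sketch: one queue means beats flow
// ChainArb -> UtxoSweeper -> TxPublisher, each consumer signaling
// back before the next one is notified.
dispatcher := chainio.NewBlockbeatDispatcher(notifier)
dispatcher.RegisterQueue([]chainio.Consumer{
	chainArb,
	utxoSweeper,
	txPublisher,
})

if err := dispatcher.Start(); err != nil {
	return err
}
```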
TODO