routing: shutdown chanrouter correctly. #8497

ziggie1984 · 2024-02-22T12:26:56Z

~~Fixes #8489~~
EDIT: Fixes #8721

In the issue linked above, the channel graph could not be synced correctly, so the ChanRouter failed to start:

2024-02-20 11:18:39.217 [INF] CRTR: Syncing channel graph from height=830127 (hash=00000000000000000003a7ed3b7a5f5fd5571a658972e9db0af2a650f6ade198) to height=831246 (hash=00000000000000000002946973960d53538a7d93333ff7d4653a37a577ba4b58)

...

2024-02-20 11:19:11.325 [WRN] LNWL: Query(34) from peer 142.132.193.144:8333 failed, rescheduling: did not get response before timeout
2024-02-20 11:19:11.326 [DBG] BTCN: Sending getdata (witness block 000000000000000000028352e09a42f6d26d0514a3d483f7f1fb56b2c2954361) to 142.132.193.144:8333 (outbound)
2024-02-20 11:19:15.327 [WRN] LNWL: Query(34) from peer 142.132.193.144:8333 failed and reached maximum number of retries, not rescheduling: did not get response before timeout
2024-02-20 11:19:15.327 [DBG] LNWL: Canceled batch 34
2024-02-20 11:19:15.328 [INF] DISC: Authenticated gossiper shutting down
2024-02-20 11:19:15.328 [INF] DISC: Authenticated Gossiper is stopping

So query 34 failed, and therefore the startup of the ChanRouter failed as well.

We fail here and never call the Stop function of the channel router.
https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L628

However, when cleaning up all the other subsystems, we get stuck:

goroutine 1652 [select]:
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).UpdateEdge(0xc0002190e0, 0xc00028fea0, {0x0, 0x0, 0x0})
        github.com/lightningnetwork/lnd/routing/router.go:2605 +0x155
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).updateChannel(0xc0004a2790, 0xc0004f0580, 0xc00028fea0)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:2182 +0x1f1
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).retransmitStaleAnns(0xc0004a2790, {0x0?, 0x100c004d10870?, 0x31f6c60?})
        github.com/lightningnetwork/lnd/discovery/gossiper.go:1643 +0x272
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).networkHandler(0xc0004a2790)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:1342 +0x19d
created by github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).start in goroutine 1
        github.com/lightningnetwork/lnd/discovery/gossiper.go:599 +0x145

This happens because we don't close the quit channel of the channel router, so the Authenticated Gossiper cannot stop either. The cleanup process gets stuck, holding up the shutdown of all subsystems and causing side effects because other subsystems are still running.

2024-02-20 11:19:15.328 [INF] DISC: Authenticated gossiper shutting down
2024-02-20 11:19:15.328 [INF] DISC: Authenticated Gossiper is stopping

Goroutine Dump:

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc0039a30e0?)
        runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0008ff7a0?)
        sync/waitgroup.go:116 +0x48
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).stop(0xc0004a2790)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:746 +0x115
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).Stop.func1()
        github.com/lightningnetwork/lnd/discovery/gossiper.go:732 +0x69
sync.(*Once).doSlow(0x3?, 0xc00030e6a0?)
        sync/once.go:74 +0xbf
sync.(*Once).Do(...)
        sync/once.go:65
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).Stop(0xc0039a32d8?)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:730 +0x3c
github.com/lightningnetwork/lnd.cleaner.run({0xc001c1bc00, 0x1e209aa?, 0x4?})
        github.com/lightningnetwork/lnd/server.go:1858 +0x42
github.com/lightningnetwork/lnd.(*server).Start(0xcb01c?)
        github.com/lightningnetwork/lnd/server.go:2248 +0x8e
github.com/lightningnetwork/lnd.Main(0xc0001d0100, {{0x0?, 0x7f4703ac7c40?, 0x101c0000b2000?}}, 0xc000104f60, {0xc0000b3e60, 0xc000222180, 0xc0002221e0, 0xc000222240, {0x0}})
        github.com/lightningnetwork/lnd/lnd.go:684 +0x3be5
main.main()
        github.com/lightningnetwork/lnd/cmd/lnd/main.go:38 +0x1ee

So we need to think about how to prevent these situations, because I think we don't close the quit channel for almost all subsystems when the start fails.
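The hang described above can be reproduced in miniature. The following is a minimal sketch, not the real `ChannelRouter`: the `router` type, its `updates` channel, and `errShuttingDown` are hypothetical stand-ins. It assumes only what the trace shows, namely that `UpdateEdge` blocks in a `select` that can be released by the router's quit channel.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errShuttingDown is a hypothetical sentinel error used only in this sketch.
var errShuttingDown = errors.New("router shutting down")

// router is a stripped-down stand-in for a subsystem such as the
// ChannelRouter. It is NOT the real lnd type.
type router struct {
	updates  chan int // unbuffered; nobody reads it if Start failed
	quit     chan struct{}
	stopOnce sync.Once
}

func newRouter() *router {
	return &router{
		updates: make(chan int),
		quit:    make(chan struct{}),
	}
}

// Start simulates a startup that fails before the goroutine that would
// read from r.updates is launched.
func (r *router) Start() error {
	return errors.New("unable to sync channel graph")
}

// Stop closes the quit channel exactly once.
func (r *router) Stop() {
	r.stopOnce.Do(func() { close(r.quit) })
}

// UpdateEdge blocks until the update is accepted or quit is closed.
func (r *router) UpdateEdge(u int) error {
	select {
	case r.updates <- u:
		return nil
	case <-r.quit:
		return errShuttingDown
	}
}

func main() {
	r := newRouter()
	if err := r.Start(); err != nil {
		// Without this Stop call, the UpdateEdge call below would
		// block forever, which mirrors the hang in the dump.
		r.Stop()
	}
	fmt.Println(r.UpdateEdge(1))
}
```

In this sketch, any caller that selects on `r.quit` is released as soon as `Stop` runs, even though `Start` never completed.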

coderabbitai · 2024-02-22T12:27:02Z

Walkthrough

The recent changes introduce robust error handling and state management in various components of the Lightning Network Daemon (LND). Key enhancements include ensuring that certain methods are only executed when their corresponding components have been initialized, preventing nil pointer dereferences and managing lifecycle states with atomic boolean flags. These improvements enhance the stability and reliability of the system during startup and shutdown processes.

Changes

| Files | Change Summary |
|---|---|
| `chainntnfs/.../bitcoind.go` | Added nil checks for `txNotifier` in `Stop` method to prevent nil pointer dereference. |
| `chainntnfs/.../neutrino.go` | Similar nil checks for `txNotifier` added in `Stop` method. |
| `chanfitness/.../chaneventstore.go` | Introduced `started` and `stopped` atomic boolean fields; modified `Start` and `Stop` methods for state management and error handling. |
| `discovery/.../gossiper.go` | Added nil check for `blockEpochs` in `stop` method to prevent panics. |
| `docs/release-notes/.../release-notes-0.18.3.md` | Fixed bugs related to fee rates during batch channel openings and improved shutdown handling, enhancing overall stability. |
| `graph/.../builder.go` | Improved logging in `Start` and `Stop` methods for better monitoring and debugging. |
| `htlcswitch/.../interceptable_switch.go` | Added state tracking with `started` and `stopped` flags in `InterceptableSwitch`; enhanced method checks to prevent multiple invocations. |
| `invoices/.../invoiceregistry.go` | Similar state tracking enhancements for `InvoiceRegistry` methods to ensure proper lifecycle management. |
| `lnd.go` | Modified server startup to be asynchronous, allowing better error handling and graceful shutdowns. |
| `server.go` | Enhanced cleanup logic in `Start` and improved error handling in `Stop` for better lifecycle management. |
| `sweep/.../fee_bumper.go` | Added state management with atomic booleans in `TxPublisher`, changing `Stop` method to return errors for better control flow. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Server
    participant Component

    User->>Server: Start
    Server->>Component: Initialize
    Component-->>Server: Initialized
    Server->>User: Success

    User->>Server: Stop
    Server->>Component: Cleanup
    Component-->>Server: Cleaned
    Server->>User: Success

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Ensure node remains synced and channels reconnect after restart (#8489) | ❌ | Changes don't directly address the sync issue. |
| Allow `ChannelRouter` to be shutdown while `syncGraphWithChain` runs (#8721) | ✅ | Introduced checks to handle shutdown during operations. |

Possibly related issues

[bug]: On restart LND attempts to broadcast different FC tx for already closing channel, shuts down #8850: The changes to error handling during force close transactions may mitigate issues related to rebroadcasting failed transactions.

🐇 In the meadow, we leap and play,
Fixing bugs, keeping chaos at bay.
With atomic states we hop and bound,
In our code, no errors are found.
So let's celebrate this fine endeavor,
Together we'll make LND clever! 🌼

ziggie1984 · 2024-02-22T12:46:09Z

But this is only part of the fix for why the other node is not able to sync the graph to the chain. We definitely need to retry the block fetch rather than fail immediately when we cannot get the block from the first peer. This is already tracked in the following issue:

btcsuite/btcwallet#904
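As a rough illustration of "retry instead of failing on the first peer", here is a minimal sketch. It is not how btcwallet or neutrino implement block fetching; `fetchFunc`, `fetchBlock`, `errNoPeers`, and the peer names are all hypothetical and exist only to show the control flow.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoPeers is a hypothetical error returned when there is nobody to ask.
var errNoPeers = errors.New("no peers available")

// fetchFunc is a hypothetical callback that asks one peer for a block.
type fetchFunc func(peer string) (string, error)

// fetchBlock tries each peer in turn and returns the first success.
// It only fails once every peer has been tried.
func fetchBlock(peers []string, fetch fetchFunc) (string, error) {
	lastErr := errNoPeers
	for _, p := range peers {
		blk, err := fetch(p)
		if err == nil {
			return blk, nil
		}
		// Remember the failure and move on to the next peer.
		lastErr = fmt.Errorf("peer %s: %w", p, err)
	}
	return "", lastErr
}

func main() {
	fetch := func(peer string) (string, error) {
		if peer == "peer-a" {
			return "", errors.New("did not get response before timeout")
		}
		return "block-from-" + peer, nil
	}
	fmt.Println(fetchBlock([]string{"peer-a", "peer-b"}, fetch))
}
```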

ziggie1984 · 2024-02-22T12:53:14Z

cc @yyforyongyu @Roasbeef

ziggie1984 · 2024-04-04T18:29:19Z

Swapped the order when we add the cleanup stop function to the garbage-collector. Let's see if tests pass.

routing/router.go

server.go
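One way to read the swapped order is that each subsystem's Stop is registered before its Start is called, so a Start that fails part-way still gets its Stop during cleanup. The sketch below assumes that reading. The `cleaner` type is only loosely modeled on the helper in server.go, and `subsystem` and `startAll` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// cleaner is a simplified stack of stop functions. Its details here are
// assumptions, not a copy of the lnd implementation.
type cleaner []func() error

func (c cleaner) add(stop func() error) cleaner { return append(c, stop) }

// run calls the registered stop functions in reverse order.
func (c cleaner) run() {
	for i := len(c) - 1; i >= 0; i-- {
		if err := c[i](); err != nil {
			fmt.Println("cleanup error:", err)
		}
	}
}

// subsystem is a hypothetical service that records whether Stop ran.
type subsystem struct {
	name      string
	failStart bool
	stopped   bool
}

func (s *subsystem) Start() error {
	if s.failStart {
		return errors.New(s.name + ": start failed")
	}
	return nil
}

func (s *subsystem) Stop() error {
	s.stopped = true
	return nil
}

// startAll registers each Stop BEFORE calling Start, so a subsystem whose
// Start fails is still stopped when the cleanup runs.
func startAll(subs []*subsystem) error {
	var cleanup cleaner
	for _, s := range subs {
		cleanup = cleanup.add(s.Stop)
		if err := s.Start(); err != nil {
			cleanup.run()
			return err
		}
	}
	return nil
}

func main() {
	a := &subsystem{name: "gossiper"}
	b := &subsystem{name: "router", failStart: true}
	err := startAll([]*subsystem{a, b})
	fmt.Println(err, a.stopped, b.stopped)
}
```

With the opposite order (add after a successful Start), `b.Stop` would never run in this sketch, which corresponds to the router's quit channel staying open.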

ziggie1984 · 2024-05-08T19:44:36Z

@yyforyongyu while adding interruptibility to the startup of the server, I figured out that we need to make sure that each stop call is atomic (only happens once). Otherwise we first call it in the cleanup method, and when we return an error it's also called in the server.Stop() function. While testing I hit such a panic with the invoice registry and also with the txpublisher.

But I think that, if the tests pass, switching the cleanup order should have no side effects, and it can prevent cases where subsystems depend on each other and therefore cannot shut down correctly when one of them does not close its quit channel.
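A minimal sketch of the "only happens once" guard, assuming the `started`/`stopped` `atomic.Bool` pattern used in this PR. The `service` type and the two error values are hypothetical. Note that the zero value of `atomic.Bool` is false, so the flags need no explicit initialization in a constructor.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical errors for this sketch.
var (
	errAlreadyStarted = errors.New("service already started")
	errAlreadyStopped = errors.New("service already stopped")
)

// service is a hypothetical subsystem with a quit channel.
type service struct {
	started atomic.Bool // zero value is false
	stopped atomic.Bool // zero value is false
	quit    chan struct{}
}

func newService() *service {
	return &service{quit: make(chan struct{})}
}

// Start returns an error if it has been called before.
func (s *service) Start() error {
	if s.started.Swap(true) {
		return errAlreadyStarted
	}
	return nil
}

// Stop closes quit on the first call and returns an error on later calls,
// so a second call can never close the channel twice and panic.
func (s *service) Stop() error {
	if s.stopped.Swap(true) {
		return errAlreadyStopped
	}
	close(s.quit)
	return nil
}

func main() {
	s := newService()
	fmt.Println(s.Start())
	fmt.Println(s.Stop()) // first call, e.g. from the cleanup path
	fmt.Println(s.Stop()) // second call, e.g. from server.Stop()
}
```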

lightninglabs-deploy · 2024-07-04T00:39:14Z

@ziggie1984, remember to re-request review from reviewers when ready

yyforyongyu

Left some comments, will check the itest logs to understand more about the new behavior.

chanfitness/chaneventstore.go

docs/release-notes/release-notes-0.18.1.md

invoices/invoiceregistry.go

lnd.go

server.go

ziggie1984 · 2024-07-11T23:27:46Z

Let's see whether all the itests pass after the change to error out when a start/stop is called twice.

yyforyongyu

Looking good, just a few nits and needs a rebase - think there's a new subserver added, we may need to change that here too.

chanfitness/chaneventstore.go

invoices/invoiceregistry.go

server.go

sweep/fee_bumper.go

docs/release-notes/release-notes-0.18.1.md

ellemouton · 2024-07-25T09:02:25Z

Just going to make a note of the ones I've run into:

1. The authgossiper one mentioned in my review.
2. Ran into this panic.
3. The InterceptableSwitch `s.blockEpochStream.Cancel()` call in Stop panics.
4. A nil pointer dereference at `close(n.quit)` in `(n *TxNotifier) TearDown()`, caused because the TxNotifier constructors are called in the various notifier Start methods (and not in the notifier constructors). So TearDown is called on a nil txNotifier.

Found these by basically commenting out all the Start calls & thus only calling Stop calls

yyforyongyu · 2024-07-25T14:32:32Z

I back the change. However, I think it will end up revealing some panics though in the cases where Stop methods depend on certain pointer members being set which are only set in Start methods. But in those cases, we should anyways either always set variables in the service constructors or we should do nil checks in the Stop methods where appropriate.

Good observation - I think it means if we want to safely move Stop before Start, we need to check all the Start methods and see if there are any struct initialization that should happen instead in New.
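The "initialize in New instead of Start" option could look like the sketch below. All types here are hypothetical stand-ins for the real notifier and `TxNotifier`, and the approach only works when everything the member needs is already available at construction time, which may not hold for every subsystem.

```go
package main

import "fmt"

// txNotifier is a hypothetical stand-in for a member whose TearDown
// dereferences the receiver.
type txNotifier struct {
	quit chan struct{}
}

func newTxNotifier() *txNotifier {
	return &txNotifier{quit: make(chan struct{})}
}

// TearDown would panic with a nil pointer dereference if n were nil.
func (n *txNotifier) TearDown() { close(n.quit) }

// notifier is a hypothetical subsystem owning a txNotifier.
type notifier struct {
	txNotifier *txNotifier
	started    bool
}

// newNotifier builds txNotifier up front instead of in Start, so Stop is
// safe even when Start was never called.
func newNotifier() *notifier {
	return &notifier{txNotifier: newTxNotifier()}
}

func (n *notifier) Start() error {
	n.started = true
	return nil
}

// Stop needs no nil check because the constructor set txNotifier.
// This sketch omits the guard against calling Stop twice.
func (n *notifier) Stop() error {
	n.txNotifier.TearDown()
	return nil
}

func main() {
	n := newNotifier()
	// Stop before Start: no panic.
	fmt.Println(n.Stop())
}
```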

ziggie1984 · 2024-07-26T13:36:16Z

Thank you for this important analysis. I did not think about this; I will try to analyze all the cases and provide a proper solution.

ziggie1984 · 2024-07-27T12:51:11Z

1. The authgossiper one mentioned in my review.
2. Ran into this panic.
3. The InterceptableSwitch `s.blockEpochStream.Cancel()` call in Stop panics.
4. A nil pointer dereference at `close(n.quit)` in `(n *TxNotifier) TearDown()`, caused because the TxNotifier constructors are called in the various notifier Start methods (and not in the notifier constructors). So TearDown is called on a nil txNotifier.

I went through the list of stop/start methods and mostly added nil pointer checks in the stop() methods, because the design more or less requires the variables to be initialized in the start method (as far as I understand it).

I also went through your list of examples above and addressed them. The only exception was the panic you referred to in point 2, which was caused by the chainnotifier not running. However, we already start the chainnotifier before the SubSwapper, which is then able to subscribe to the channel events.

I don't think I covered every case in the code base where the stop method is called before the start method, but I focused on the subsystems changed by this PR.

ellemouton

Thanks for the updates 🙏

lnd.go

chanfitness/chaneventstore.go

discovery/gossiper.go

ellemouton · 2024-07-29T10:21:47Z

invoices/invoiceregistry.go

+	if i.expiryWatcher == nil {
+		return fmt.Errorf("InvoiceRegistry expiryWatcher not " +
+			"initialized")
+	}
+


I think we still want the rest of the function to run though. IIUC, the whole reason we want to call Stop before Start is so that quit channels can be closed and hence synch processes in Start methods can be stopped.

Agreed, changed it to:

if i.expiryWatcher != nil {
	i.expiryWatcher.Stop()
}

We could also change the constructor of the invoice registry to catch the case where we have a nil pointer for the expiryWatcher. Went with the above for now, but happy to change it.
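The difference between the two versions can be shown with a small sketch. The `registry` and `expiryWatcher` types are hypothetical, and the position of `close(r.quit)` after the watcher handling is an assumption made for illustration rather than a copy of `InvoiceRegistry.Stop`.

```go
package main

import (
	"errors"
	"fmt"
)

// expiryWatcher is a hypothetical stand-in for the real watcher.
type expiryWatcher struct{ stopped bool }

func (w *expiryWatcher) Stop() { w.stopped = true }

// registry is a hypothetical subsystem with an optional watcher.
type registry struct {
	expiryWatcher *expiryWatcher
	quit          chan struct{}
}

// stopEarlyReturn mirrors the first version of the diff: it bails out
// when the watcher is nil, so quit is never closed.
func (r *registry) stopEarlyReturn() error {
	if r.expiryWatcher == nil {
		return errors.New("expiryWatcher not initialized")
	}
	r.expiryWatcher.Stop()
	close(r.quit)
	return nil
}

// stopGuarded mirrors the revised version: the nil check only skips the
// watcher, and the rest of the shutdown still runs.
func (r *registry) stopGuarded() error {
	if r.expiryWatcher != nil {
		r.expiryWatcher.Stop()
	}
	close(r.quit)
	return nil
}

// isClosed reports whether ch has been closed, without blocking.
func isClosed(ch chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	a := &registry{quit: make(chan struct{})}
	fmt.Println(a.stopEarlyReturn(), isClosed(a.quit))

	b := &registry{quit: make(chan struct{})}
	fmt.Println(b.stopGuarded(), isClosed(b.quit))
}
```

In the early-return variant, anything waiting on `quit` stays blocked, which defeats the purpose of calling Stop before Start.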

ziggie1984 · 2024-07-29T21:46:07Z

@coderabbitai review

coderabbitai · 2024-07-29T21:46:17Z

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (2)

sweep/fee_bumper.go (1)
703-703: Ensure correct initialization of the new fields in the NewTxPublisher function.

The NewTxPublisher function should correctly initialize the new atomic boolean fields started and stopped. Currently, these fields are not being initialized.

File: sweep/fee_bumper.go

Function: NewTxPublisher

Please add the initialization for the started and stopped fields.


invoices/invoiceregistry.go (1)

273-273: Ensure correct initialization of the new fields in the NewRegistry function.

The NewRegistry function does not initialize the new atomic boolean fields started and stopped. Please add the initialization for these fields.

invoices/invoiceregistry.go: NewRegistry function

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b7c59b3 and 8c3abec.

Files selected for processing (11)

chainntnfs/bitcoindnotify/bitcoind.go (1 hunks)
chainntnfs/neutrinonotify/neutrino.go (1 hunks)
chanfitness/chaneventstore.go (4 hunks)
discovery/gossiper.go (1 hunks)
docs/release-notes/release-notes-0.18.3.md (1 hunks)
graph/builder.go (3 hunks)
htlcswitch/interceptable_switch.go (4 hunks)
invoices/invoiceregistry.go (2 hunks)
lnd.go (1 hunks)
server.go (6 hunks)
sweep/fee_bumper.go (3 hunks)

Additional context used

LanguageTool

docs/release-notes/release-notes-0.18.3.md

[grammar] ~40-~40: The word “shutdown” is a noun. The verb is spelled with a space.
Context: ...pull/8497) a case where LND would not shutdown properly when interrupted via e.g. SIGT...

(NOUN_VERB_CONFUSION)

[uncategorized] ~40-~40: A comma may be missing after the conjunctive/linking adverb ‘Moreover’.
Context: ...erly when interrupted via e.g. SIGTERM. Moreover LND now shutsdown correctly in case o...

(SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)

Markdownlint

docs/release-notes/release-notes-0.18.3.md

39-39: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

Additional comments not posted (57)

chanfitness/chaneventstore.go (4)

15-17: LGTM: Import of sync/atomic.

The import of sync/atomic is necessary for the atomic boolean flags used in the file.

53-55: LGTM: Addition of started and stopped fields.

The addition of started and stopped fields of type atomic.Bool enhances the lifecycle management of the ChannelEventStore.

150-154: LGTM: Prevent multiple invocations of Start.

The check using c.started.Swap(true) ensures that the Start method cannot be invoked more than once, preventing potential issues.

216-238: LGTM: Prevent multiple invocations of Stop and return an error.

The check using c.stopped.Swap(true) ensures that the Stop method cannot be invoked more than once, and returning an error adds robustness to the shutdown process.

htlcswitch/interceptable_switch.go (4)

7-7: LGTM: Import of sync/atomic.

The import of sync/atomic is necessary for the atomic boolean flags used in the file.

37-39: LGTM: Addition of started and stopped fields.

The addition of started and stopped fields of type atomic.Bool enhances the lifecycle management of the InterceptableSwitch.

208-213: LGTM: Prevent multiple invocations of Start.

The check using s.started.Swap(true) ensures that the Start method cannot be invoked more than once, preventing potential issues.

236-251: LGTM: Prevent multiple invocations of Stop and return an error.

The check using s.stopped.Swap(true) ensures that the Stop method cannot be invoked more than once, and returning an error adds robustness to the shutdown process.

lnd.go (4)

677-684: Good use of asynchronous server startup.

Starting the server within a goroutine allows for better control flow and graceful shutdown capabilities.

685-691: Good use of deferred function for graceful shutdown.

The deferred function ensures that the server is stopped gracefully, logging any errors that occur during the shutdown process.

693-702: Improved control flow with select statement.

The select statement allows for better error handling and graceful shutdown capabilities by listening for either an error from errChan or a shutdown signal from interceptor.ShutdownChannel().

700-702: Proper use of break statement in select block.

The break statement ensures that the function proceeds normally if the server starts successfully.
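The startup flow described for lnd.go can be sketched as follows. This is not the code from the PR: `runServer`, its parameters, and `errShutdown` are hypothetical, and the shutdown channel merely stands in for `interceptor.ShutdownChannel()`. The real code may handle the shutdown case differently, for example by returning nil.

```go
package main

import (
	"errors"
	"fmt"
)

// errShutdown is a hypothetical sentinel for "interrupted during startup".
var errShutdown = errors.New("shutdown requested during startup")

// runServer starts the server in a goroutine and waits for either the
// start result or a shutdown request, whichever comes first.
func runServer(start func() error, shutdown <-chan struct{}) error {
	// Buffered so the goroutine can always deliver its result, even if
	// nobody is listening any more after a shutdown.
	errChan := make(chan error, 1)
	go func() { errChan <- start() }()

	select {
	case err := <-errChan:
		if err != nil {
			return fmt.Errorf("unable to start server: %w", err)
		}
		return nil
	case <-shutdown:
		return errShutdown
	}
}

func main() {
	shutdown := make(chan struct{})
	err := runServer(func() error {
		return errors.New("unable to sync channel graph")
	}, shutdown)
	fmt.Println(err)
}
```

Because the start call no longer blocks the main goroutine, a SIGTERM-style signal can interrupt a long-running startup instead of waiting for it to finish.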

chainntnfs/bitcoindnotify/bitcoind.go (1)

154-159: Robustness improvement with nil check for txNotifier.

The conditional statement ensures that txNotifier is not nil before attempting to call TearDown, preventing runtime errors associated with dereferencing a nil pointer.

chainntnfs/neutrinonotify/neutrino.go (1)

156-160: Robustness Improvement: Conditional Check for txNotifier.

The added conditional check ensures that n.txNotifier is not nil before calling its TearDown method. This prevents potential nil pointer dereference errors and enhances the robustness of the Stop method.

sweep/fee_bumper.go (4)

264-265: LGTM! The addition of atomic boolean fields improves state management.

The started and stopped fields ensure that the TxPublisher can only be started or stopped once, enhancing robustness.

Line range hint 672-686:
LGTM! The Start method now ensures single invocation.

The use of t.started.Swap(true) prevents multiple starts, and the logging statement aligns better with the logical flow.

691-703: LGTM! The Stop method now ensures single invocation and returns an error.

The use of t.stopped.Swap(true) prevents multiple stops, and the method signature update aligns with Go best practices.

703-703: Ensure proper handling of the quit channel in the monitor method.

The monitor method should correctly handle the quit channel, which is closed in the Stop method, to ensure a graceful shutdown.
Verification successful

The monitor method correctly handles the quit channel.

The monitor method includes a select statement that properly handles the quit channel, ensuring a graceful shutdown when the quit signal is received.

The quit channel is checked within the select statement, and the method returns when a message is received on the quit channel.

Code Snippet:
for {
	select {
	case epoch, ok := <-blockEvent.Epochs:
		if !ok {
			log.Error("Block epoch channel closed, exit monitor")
			return
		}
		log.Debugf("TxPublisher received new block: %v", epoch.Height)
		t.currentHeight.Store(epoch.Height)
		t.processRecords()
	case <-t.quit:
		log.Debug("Fee bumper stopped, exit monitor")
		return
	}
}

invoices/invoiceregistry.go (4)

104-105: LGTM! The addition of atomic boolean fields improves state management.

The started and stopped fields ensure that the InvoiceRegistry can only be started or stopped once, enhancing robustness.

219-250: LGTM! The Start method now ensures single invocation.

The use of i.started.Swap(true) prevents multiple starts, and the logging statement provides clear information about the state transitions.

255-273: LGTM! The Stop method now ensures single invocation and includes logging.

The use of i.stopped.Swap(true) prevents multiple stops, and the logging statements enhance visibility into the state transitions.

273-273: Ensure proper handling of the quit channel in the invoiceEventLoop method.

The invoiceEventLoop method should correctly handle the quit channel, which is closed in the Stop method, to ensure a graceful shutdown.

Verification successful

The invoiceEventLoop method correctly handles the quit channel by returning from the loop when the quit channel is closed, ensuring a graceful shutdown.

invoices/invoiceregistry.go:348-349: The select statement listens for the quit channel and returns from the method when it receives a signal.
graph/builder.go (2)

303-303: Approved: Log statement addition in Start function.

The debug log statement log.Debug("Builder started") enhances visibility into the startup process of the builder.

329-329: Approved: Log statement modification in Stop function.

The debug log statement log.Debug("Builder shutdown complete") is now placed correctly to log immediately after the shutdown sequence is concluded.

discovery/gossiper.go (1)

756-761: Robustness Improvement: Added nil check for d.blockEpochs

The addition of the nil check before calling d.blockEpochs.Cancel() prevents potential panics if d.blockEpochs is not initialized. This enhances the robustness of the Stop method and ensures a safer shutdown process.

server.go (32)

Line range hint 1883-1893:
Initialize cleanup with the first subsystem.

The cleanup variable is initialized and the first subsystem (customMessageServer) is added to the cleanup list. This ensures that if any subsequent subsystem fails to start, the already started subsystems will be stopped in reverse order.

1900-1900: Add host announcer to cleanup list.

The hostAnn subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.

1908-1908: Add liveness monitor to cleanup list.

The livenessMonitor subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.

1920-1920: Add signature pool to cleanup list.

The sigPool subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1926-1926: Add write pool to cleanup list.

The writePool subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1932-1932: Add read pool to cleanup list.

The readPool subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1938-1938: Add chain notifier to cleanup list.

The cc.ChainNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1944-1944: Add best block tracker to cleanup list.

The cc.BestBlockTracker subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1950-1950: Add channel notifier to cleanup list.

The channelNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1956-1958: Add peer notifier to cleanup list.

The peerNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1964-1964: Add HTLC notifier to cleanup list.

The htlcNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1971-1971: Add tower client manager to cleanup list.

The towerClientMgr subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.

1978-1978: Add transaction publisher to cleanup list.

The txPublisher subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1984-1984: Add UTXO sweeper to cleanup list.

The sweeper subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1990-1990: Add UTXO nursery to cleanup list.

The utxoNursery subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

1996-1996: Add breach arbitrator to cleanup list.

The breachArbitrator subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2002-2002: Add funding manager to cleanup list.

The fundingMgr subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2011-2011: Add HTLC switch to cleanup list.

The htlcSwitch subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2017-2017: Add interceptable switch to cleanup list.

The interceptableSwitch subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2023-2023: Add chain arbitrator to cleanup list.

The chainArb subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2029-2030: Add graph builder to cleanup list.

The graphBuilder subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2035-2036: Add channel router to cleanup list.

The chanRouter subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2042-2043: Add authenticated gossiper to cleanup list.

The authGossiper subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2048-2048: Add invoices registry to cleanup list.

The invoices subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2054-2054: Add sphinx to cleanup list.

The sphinx subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2060-2060: Add channel status manager to cleanup list.

The chanStatusMgr subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2066-2066: Add channel event store to cleanup list.

The chanEventStore subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2113-2113: Add channel sub swapper to cleanup list.

The chanSubSwapper subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.

2120-2120: Add Tor controller to cleanup list.

The torController subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.

2137-2137: Start connection manager last.

The connMgr is started last to prevent connections before initialization is complete. This ensures that all necessary subsystems are up and running before accepting connections.

2324-2326: Add error handling for txPublisher.Stop.

The txPublisher.Stop method now includes error handling to log any issues encountered during the stop process.

2346-2349: Add channel event store to stop process.

The chanEventStore.Stop method is now included in the stop process, ensuring it is properly stopped and any errors are logged.
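The cleanup-list pattern that the server.go comments above describe can be sketched roughly as follows. The names `cleanupList`, `subsystem`, and `startAll` are hypothetical and chosen for illustration only. The sketch assumes that each `Stop` is registered before the matching `Start` is attempted, so that a subsystem which fails halfway through startup is also stopped; this matches the PR's note that stop may run even when start did not finish.

```go
package main

import (
	"errors"
	"fmt"
)

// cleanupList collects stop functions in the order they were registered.
type cleanupList []func() error

func (c cleanupList) add(stop func() error) cleanupList {
	return append(c, stop)
}

// run calls the registered stop functions in reverse order.
func (c cleanupList) run() {
	for i := len(c) - 1; i >= 0; i-- {
		if err := c[i](); err != nil {
			fmt.Println("cleanup error:", err)
		}
	}
}

// subsystem is a hypothetical component that records its lifecycle events.
type subsystem struct {
	name      string
	failStart bool
	events    *[]string
}

func (s subsystem) Start() error {
	if s.failStart {
		return errors.New(s.name + " failed to start")
	}
	*s.events = append(*s.events, "start "+s.name)
	return nil
}

func (s subsystem) Stop() error {
	*s.events = append(*s.events, "stop "+s.name)
	return nil
}

// startAll starts the subsystems in order. On the first failure it stops
// everything registered so far, in reverse order, and returns the error.
func startAll(subs []subsystem) error {
	var cleanup cleanupList
	for _, s := range subs {
		cleanup = cleanup.add(s.Stop)
		if err := s.Start(); err != nil {
			cleanup.run()
			return err
		}
	}
	return nil
}

func main() {
	var events []string
	subs := []subsystem{
		{name: "sigPool", events: &events},
		{name: "chanRouter", failStart: true, events: &events},
	}
	fmt.Println("error:", startAll(subs))
	fmt.Println("events:", events)
}
```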

docs/release-notes/release-notes-0.18.3.md

yyforyongyu

I think we are missing a few nil checks:

diff --git a/htlcswitch/link.go b/htlcswitch/link.go
index f39a12b2b..7b5b60295 100644
--- a/htlcswitch/link.go
+++ b/htlcswitch/link.go
@@ -533,6 +533,7 @@ func (l *channelLink) Start() error {
 		}()
 	}
 
+	// Needs to check this.
 	l.updateFeeTimer = time.NewTimer(l.randomFeeUpdateTimeout())
 
 	l.wg.Add(1)
diff --git a/lnwallet/chainfee/estimator.go b/lnwallet/chainfee/estimator.go
index d9a402964..0f291b724 100644
--- a/lnwallet/chainfee/estimator.go
+++ b/lnwallet/chainfee/estimator.go
@@ -860,6 +860,7 @@ func (w *WebAPIEstimator) Start() error {
 	log.Infof("Web API fee estimator using update timeout of %v",
 		feeUpdateTimeout)
 
+	// Needs to check this.
 	w.updateFeeTicker = time.NewTicker(feeUpdateTimeout)
 
 	w.wg.Add(1)
diff --git a/tor/controller.go b/tor/controller.go
index 47ea6e129..9c5eb13d6 100644
--- a/tor/controller.go
+++ b/tor/controller.go
@@ -164,6 +164,7 @@ func (c *Controller) Start() error {
 		return fmt.Errorf("unable to connect to Tor server: %w", err)
 	}
 
+	// Need check this.
 	c.conn = conn
 
 	return c.authenticate()

chanfitness/chaneventstore.go

chainntnfs/bitcoindnotify/bitcoind.go

discovery/gossiper.go

invoices/invoiceregistry.go

Make sure that each subsystem only starts and stops once. This makes sure we don't, for example, close quit channels twice.
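A minimal sketch of a start-once/stop-once guard is shown below. It uses `sync.Once` from the standard library, which is an assumption made for illustration; the commit itself speaks of "atomic start/stop functions" and may use a different mechanism. The `service` type and its counters are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// service is a hypothetical subsystem with a quit channel that must be
// closed exactly once.
type service struct {
	startOnce sync.Once
	stopOnce  sync.Once
	quit      chan struct{}

	// starts and stops count how often the guarded bodies actually ran.
	starts int
	stops  int
}

func newService() *service {
	return &service{quit: make(chan struct{})}
}

// Start runs its body at most once, no matter how often it is called.
func (s *service) Start() error {
	s.startOnce.Do(func() {
		s.starts++
	})
	return nil
}

// Stop closes the quit channel at most once. Without the guard, a second
// call would panic with "close of closed channel".
func (s *service) Stop() error {
	s.stopOnce.Do(func() {
		close(s.quit)
		s.stops++
	})
	return nil
}

func main() {
	s := newService()
	_ = s.Start()
	_ = s.Start()
	_ = s.Stop()
	_ = s.Stop() // no panic: the channel is not closed a second time
	fmt.Printf("starts=%d stops=%d\n", s.starts, s.stops)
}
```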

This commit does two things. First, it starts the server in a way that allows startup to be interrupted and the server to shut down gracefully. Second, it makes sure that subsystems clean up after themselves when they fail to start, so that dependent subsystems can also shut down gracefully and the shutdown process does not get stuck.

chainntnfs/bitcoindnotify/bitcoind.go

invoices/invoiceregistry.go

With this PR, the stop method of a subsystem might be called even when its start method did not finish successfully. We therefore need to guard the stop methods against potential panics in case some variables are not initialized in the constructors of the subsystems.

yyforyongyu

LGTM🙏 Would love to see some unit tests but it cannot be done atm. Have some rough ideas about how to implement #8958. I tried my best to check all the possible nil-panic cases, but it's only a pair of human eyes. Think we should proceed quickly to #8958 and add tests there.

ellemouton

🙏

ellemouton · 2024-08-01T06:36:51Z

chanfitness/chaneventstore.go

+		err = fmt.Errorf("ChannelEventStore FlapCountTicker not " +
+			"initialized")
+	} else {
+		c.cfg.FlapCountTicker.Stop()
+	}

 	log.Debugf("ChannelEventStore shutdown complete")

-	return nil
+	return err


Non-blocking: I'd say this is an error worth logging but not returning. The Stop function itself did not error here; it was just that Start never ran or completed. Returning it makes it seem like "error stopping chanEventStore" even though there wasn't really an error stopping it. Definitely not a big deal, though.

same comment for a few other spots in this commit

ziggie1984 force-pushed the shutdown-bugfix branch from 801e50f to 3223097 Compare February 22, 2024 12:47

saubyk added this to the v0.18.0 milestone Feb 25, 2024

saubyk assigned ziggie1984 Mar 24, 2024

ziggie1984 mentioned this pull request Mar 27, 2024

query: fix retry query case. lightninglabs/neutrino#297

Merged

ziggie1984 marked this pull request as ready for review April 4, 2024 15:37

ziggie1984 force-pushed the shutdown-bugfix branch from 3223097 to 85a52aa Compare April 4, 2024 18:27

ziggie1984 mentioned this pull request Apr 15, 2024

[bug]: LTND: Shutting down because error in main method: unable to start server: did not get reget response before timeout #8651

Closed

yyforyongyu reviewed Apr 24, 2024

routing/router.go (outdated)

server.go

server.go

saubyk modified the milestones: v0.18.0, v0.18.1 Apr 25, 2024

yyforyongyu mentioned this pull request May 6, 2024

[bug]: ChannelRouter cannot be shutdown while the syncGraphWithChain function is running. #8721

Closed

ziggie1984 force-pushed the shutdown-bugfix branch 2 times, most recently from f21e62c to 8c831e5 Compare May 8, 2024 19:38

saubyk added the P1 MUST be fixed or reviewed label Jun 25, 2024

yyforyongyu requested changes Jul 10, 2024

ziggie1984 force-pushed the shutdown-bugfix branch 2 times, most recently from 0ac6836 to 5508ca3 Compare July 11, 2024 23:25

ziggie1984 commented Jul 11, 2024

server.go

yyforyongyu mentioned this pull request Jul 15, 2024

refactor: move graph responsibilities from routing.ChannelRouter to new graph.Builder #8848

Merged

yyforyongyu reviewed Jul 16, 2024

ziggie1984 force-pushed the shutdown-bugfix branch 3 times, most recently from 928ef1a to 73c6a3b Compare July 23, 2024 09:55

ziggie1984 force-pushed the shutdown-bugfix branch 2 times, most recently from a488da4 to 5209f80 Compare July 27, 2024 12:43

ziggie1984 requested a review from ellemouton July 27, 2024 12:52

ellemouton reviewed Jul 29, 2024

ziggie1984 force-pushed the shutdown-bugfix branch from 5209f80 to 8c3abec Compare July 29, 2024 21:31

ziggie1984 requested a review from ellemouton July 29, 2024 21:32

coderabbitai bot reviewed Jul 29, 2024

docs/release-notes/release-notes-0.18.3.md

docs/release-notes/release-notes-0.18.3.md (outdated)

ziggie1984 force-pushed the shutdown-bugfix branch from 8c3abec to 5639468 Compare July 30, 2024 09:20

yyforyongyu reviewed Jul 30, 2024

chanfitness/chaneventstore.go

chainntnfs/bitcoindnotify/bitcoind.go

discovery/gossiper.go

invoices/invoiceregistry.go (outdated)

ziggie1984 added 4 commits July 31, 2024 13:12

multi: Add atomic start/stop functions.

08b68bb

Make sure that each subsystem only starts and stops once. This makes sure we don't, for example, close quit channels twice.

lnd: change startup order of authGossiper.

598d6e2

graph: add log lines for stop and start func.

e19f891

ziggie1984 force-pushed the shutdown-bugfix branch from 5639468 to db1b09f Compare July 31, 2024 11:12

ziggie1984 mentioned this pull request Jul 31, 2024

[feature]: Refactor the main start method of LND for robustness reasons #8958

Open

ziggie1984 requested a review from yyforyongyu July 31, 2024 11:19

yyforyongyu reviewed Jul 31, 2024

chainntnfs/bitcoindnotify/bitcoind.go

invoices/invoiceregistry.go (outdated)

ziggie1984 added 2 commits July 31, 2024 14:43

docs: add release notes for 18.3.

0adcb5c

ziggie1984 force-pushed the shutdown-bugfix branch from db1b09f to 0adcb5c Compare July 31, 2024 12:43

ziggie1984 requested a review from yyforyongyu July 31, 2024 17:37

yyforyongyu approved these changes Jul 31, 2024

ellemouton approved these changes Aug 1, 2024

Roasbeef merged commit 4a3c4e4 into lightningnetwork:master Aug 1, 2024
31 of 34 checks passed
