[bug]: LND frequently loses sync to chain with pruned backend #8250
Comments
I checked your pprof and it seems like you did not use debug level 2 for the dump; there is not a lot of useful information for catching the problem in a normal debug level 1 dump. Could you use debug level 2 instead of 1 next time? Thank you for reporting this issue.
Thank you! I will get the more detailed goroutine dump the next time I run into the issue!
I also ran into the issue a few times and added a cronjob that records the output. Everything is fine up to 10:23:00 CET (09:23:00 UTC). The then-current tip, block 820238, was received at 10:05:02 on my bitcoind.
I checked the code for related log messages and noticed that I do NOT see the "Pruning channel graph using block" message for blocks 820239 and following. Something got stuck between block 820239 and block 820240. The last log message of this kind:
My lnd also refused to shut down.
This could be related to some of the shutdown issues we fixed in 0.17.2 and 0.17.3. If certain goroutines are blocked because of a deadlock, then it can be that the sync doesn't progress. Can you try with 0.17.3?
So I could trace the bug down a little bit more, because our tunnelsats node suffered from the same behaviour. Thanks for your input @C-Otto, it got me looking in the right place.
So somehow all validationBarrier slots are used up and no new blocks can be processed, because lnd is waiting for free slots.
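To illustrate the failure mode, here is a minimal, self-contained sketch (made-up names, not lnd's actual ValidationBarrier code): if every slot is held by a goroutine that never finishes, the block notification itself can never acquire a slot.

```go
package main

import (
	"fmt"
	"time"
)

// slots models a fixed pool of validation slots, in spirit a counting
// semaphore: a job takes a token to start and returns it when done.
type slots chan struct{}

func newSlots(n int) slots {
	s := make(slots, n)
	for i := 0; i < n; i++ {
		s <- struct{}{}
	}
	return s
}

func main() {
	const numSlots = 16 // e.g. the 16 slots discussed above

	s := newSlots(numSlots)
	stuck := make(chan struct{}) // never closed: a dependency that never resolves

	// 16 gossip-validation jobs each grab a slot and then wait on a
	// dependency that never completes, so no slot is ever returned.
	for i := 0; i < numSlots; i++ {
		go func() {
			<-s     // acquire a slot
			<-stuck // blocks forever, slot is never released
		}()
	}

	time.Sleep(100 * time.Millisecond) // let the workers grab all slots

	// A new block notification now also needs a slot and can never get one,
	// which is the "synced_to_chain": false symptom.
	select {
	case <-s:
		fmt.Println("got a slot, block processed (not expected here)")
	case <-time.After(2 * time.Second):
		fmt.Println("no free slot: block processing is stuck, node falls out of sync")
	}
}
```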
@guggero I think I'm already running 0.17.3 of sorts (I manually added all relevant commits to my messy code base).
Analysing this a bit more, I think I found the problem: the ChannelRouter initializes the barrier with a fixed number of slots, see https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L1145. The pprof from the time the problem happens counts exactly 16 stuck goroutines using up the 16 slots, hence we are somehow in a deadlock in terms of advancing the block height for the router. Two of the stuck goroutines hang here:
and 14 hang here:
So I'm not sure why this happens. Wdyt @Crypt-iQ?
Here is the pprof, already deduplicated:
I'm also using a node with 4 (virtual) CPUs.
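If the barrier at the linked line is sized at four validation slots per CPU (an assumption, but it matches the 2 + 14 stuck goroutines in the dumps), the 16 slots follow directly from the 4 vCPUs:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Hypothetical sizing: four validation slots per (virtual) CPU.
	// On a 4-vCPU machine this yields the 16 slots discussed above.
	slots := runtime.NumCPU() * 4
	fmt.Printf("validation slots: %d\n", slots)
}
```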
This dump shows the following:
I'm running code that includes the commits mentioned above.
Could you provide a pprof goroutine dump the next time it happens for you, @C-Otto?
Will do.
Another node runner running 0.17.3 and pruned reached out to me. I am still trying to track down the issue in the code (I think I am close, trying to reconstruct this problem locally), but the main problem is that the pruned-block workers somehow do not report a failure or timeout back to the initial call. The main problem is here: why is this call stuck for 1305 minutes? Something is messed up with the
22 goroutines now blocked waiting for dependants:
So we are in a deadlock here. A new block would be ready to advance the block height of the ChannelRouter, but the channel is not able to receive the data, which can be seen here:
Most likely the dependants build up over time, related to the two ChannelAnnouncements which are stuck, while the 4 workers are idle waiting for work:
and the work manager is waiting for results, so it does not seem that anything is still being processed.
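To make the shape of this concrete, here is a simplified, hypothetical sketch (made-up names, not the actual neutrino work manager) of a caller dispatching a block query and waiting on a result channel; if a worker silently drops the query, only a timeout lets the caller fail upward instead of hanging for hours:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type blockQuery struct {
	height  int
	results chan error // the caller waits here for success or failure
}

// worker is supposed to always report back. The bug class discussed above is
// a path where a query is dropped and neither success nor failure is sent.
func worker(queries <-chan *blockQuery, dropResults bool) {
	for q := range queries {
		if dropResults {
			continue // result is never reported back to the caller
		}
		q.results <- nil // report success
	}
}

// getBlock dispatches a query and waits for the outcome. Without a timeout,
// a lost result means waiting forever (the "stuck for 1305 minutes" goroutine).
func getBlock(queries chan<- *blockQuery, height int, timeout time.Duration) error {
	q := &blockQuery{height: height, results: make(chan error, 1)}
	queries <- q
	select {
	case err := <-q.results:
		return err
	case <-time.After(timeout):
		return errors.New("query timed out, should be retried or failed upward")
	}
}

func main() {
	queries := make(chan *blockQuery)
	go worker(queries, true) // a misbehaving worker that never reports back

	err := getBlock(queries, 820239, 2*time.Second)
	fmt.Println("result for block 820239:", err)
}
```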
We will investigate this behaviour further to finally track it down.
godump_notsynced_dedup.log |
Thanks for the dump, it's the same in your case. You mentioned 4 virtual CPUs, so 16 slots: 2 slots hang here:
14 hang here:
And it seems like the node was out of sync for more than 10 hours because the block could not be delivered (though I'm not sure whether the 593 minutes really mean that the goroutine was stuck at this particular line, or just that the goroutine has been running for that long). Do you know when the node got out of sync?
I can confirm the 10 hours. The mails started coming in at 03:44 and I did the dump at around 13:15, which is roughly 10 hours difference.
I added logging for the
I'm now also running
So I think we have some problems in the neutrino package. I increased the logging for the mentioned case and found that we have a block which is rescheduled over 10 times, although the maximum retry count should not be greater than 2: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L345
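Roughly, the expected bookkeeping looks like the following simplified sketch (hypothetical names, not the code at the linked line): once the retry budget is spent, the query should be failed back to the caller instead of being rescheduled again.

```go
package main

import "fmt"

const maxRetries = 2 // the expected upper bound mentioned above

type query struct {
	block   int
	retries int
}

// reschedule returns true if the query may be handed to another worker, and
// false once the retry budget is exhausted, at which point the query should
// be failed upward rather than looping forever.
func reschedule(q *query) bool {
	if q.retries >= maxRetries {
		return false
	}
	q.retries++
	return true
}

func main() {
	q := &query{block: 820239}
	for attempt := 1; ; attempt++ {
		if !reschedule(q) {
			fmt.Printf("giving up on block %d after %d retries\n", q.block, q.retries)
			return
		}
		fmt.Printf("attempt %d for block %d\n", attempt, q.block)
	}
}
```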
So this explains why we are stuck here. I opened a PR which adds some more logging that might be helpful in tracking down the issue; maybe you could run it too, very much appreciated.
I just added the changes from #8267 and restarted my node (I'm not using neutrino, though?).
Yeah, this is a bit confusing, but as soon as you run in pruned mode you use the neutrino codebase to fetch blocks from peers (https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L192-L208). You would need to set LNWL=debug to see the new logs.
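For example, something along these lines should work, either in lnd.conf or at runtime via lncli (exact syntax may vary slightly between versions):
debuglevel=LNWL=debug
lncli debuglevel --level LNWL=debug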
I think your ideas fit perfectly, @Roasbeef. I will work on the implementation draft then.
Hello dear people, I'm reading this post with huge interest. We also suffer from this problem, and for a while now we can't get our new node properly online anymore. I was searching myself to death and thought that I was unable to set up our deployment. After a restart we initially have a sync to chain, and after 15 minutes we lose sync and it reports both synced_to_chain and synced_to_graph as false. If you need help, please let me know, but if it's deep-dive logs, you need to tell me what to do. Just a short question, I know it's out when it's out, but do you guys know when a fix will be out? We have been offline with our shop for a couple of weeks now because we started to have weird connection problems and then I did the most stupid thing ever...
In my experience the graph sync isn't affected. Are you sure you're hitting this issue? Could you run the profile command (see above and here: https://docs.lightning.engineering/lightning-network-tools/lnd/debugging_lnd#capturing-pprof-data-with-lnd) to see if you're hitting the same limits?
If you're affected by this issue, it might help to run a full (non-pruned) node and/or provide more CPU cores to lnd (which isn't a fix, but it might delay the need for a restart).
I am not sure that I hit this issue, but we never had such problems before, and for 2 weeks we haven't had a synced LND anymore. The only thing we did was update the deployment to 0.17.2-beta, and then after 2 or 3 new blocks we are out of sync. LND runs inside a Docker container; if you would tell me how to start it with lnd --profile=9736, then I could provide you with some info. I am sorry I am not as technically experienced as you guys.
You can edit your configuration to add the profile option.
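For example (the port is just the one mentioned above, adjust to your setup), add this line to lnd.conf and restart:
profile=9736
Then, while the node is out of sync, save the goroutine dump from http://localhost:9736/debug/pprof/goroutine?debug=2 with a browser or curl (see the debugging docs linked above for details).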
@C-Otto At least I'm good to go again. Cheers
Additional info:
Hey everyone, chiming in to say that this issue is critical for using LND in BTCPay Server because most of our instances are pruned. We're now starting to get inquiries every day about how to solve this, and it's impractical to tell people to restart LND every day. @saubyk - it's really appreciated that you guys are already looking into this. Is there clarity on the timeline here so we can forward it to our merchants and ambassadors? I see you mapped it to 0.18 - when is that planned to release? And if it's more than a month away, is it possible to go forward with 0.17.4 to fix this, since it's such a big issue for a lot of deployments?
Hi @rockstardev, we're looking into the possibility of including a fix with 0.17.4. No promises yet. @ziggie1984 is on the case.
@saubyk & @ziggie1984 godspeed, love seeing this slated for 0.17.4
Hello folks. Can anyone elaborate on why pruned nodes may cause this issue? I have been searching for more information on whether running a Bitcoin node in pruned mode is actually an issue, or whether it is something that can be fixed inside LND to deal with these scenarios better. Why is it relevant for LND to be able to read an older block that may no longer be locally available on the Bitcoin node?
lnd sometimes needs to access information from old blocks, which I think is related to validating channels that were opened a long time ago. It's rather trivial to request this from a connected full node, as the full node can just grab the block from disk and return it to lnd. However, if the connected bitcoind does not have the requested block, it needs to be downloaded from the internet. This process is rather slow and unreliable, which is why the corresponding code is a bit more complex. One bug in this complex code has been described and fixed here, but there may be other bugs that haven't been reported yet. I run my node with a pruned bitcoind and I don't think it's a large issue. It's not the perfect solution, though.
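Conceptually (a hypothetical sketch with made-up helper names, not lnd's actual pruned block dispatcher), the fallback looks like this, and it's the second branch that is slow, unreliable, and where this bug lived:

```go
package main

import (
	"errors"
	"fmt"
)

var errBlockPruned = errors.New("block not available locally (pruned)")

// fetchFromBitcoind stands in for asking the connected bitcoind for a block.
// With a pruned backend, old blocks may already have been deleted from disk.
func fetchFromBitcoind(height int) ([]byte, error) {
	if height < 820000 { // pretend everything below this height was pruned
		return nil, errBlockPruned
	}
	return []byte("block data"), nil
}

// fetchFromPeers stands in for downloading the block from other nodes on the
// P2P network, which is slower and less reliable than a local disk read.
func fetchFromPeers(height int) ([]byte, error) {
	return []byte("block data via p2p"), nil
}

// getBlock prefers the local backend and only falls back to the network when
// the block has been pruned away.
func getBlock(height int) ([]byte, error) {
	blk, err := fetchFromBitcoind(height)
	if err == nil {
		return blk, nil
	}
	if errors.Is(err, errBlockPruned) {
		return fetchFromPeers(height)
	}
	return nil, err
}

func main() {
	for _, h := range []int{820238, 700000} {
		blk, err := getBlock(h)
		fmt.Printf("height %d: %q, err=%v\n", h, blk, err)
	}
}
```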
Thanks for the explanation @C-Otto. Some things that come to mind in order to dig into this further, and perhaps help clarify the possible causes and solutions, are:
I don't think this discussion belongs here. The issue is fixed.
@C-Otto, from your previous message you don't seem to suggest that the problem is completely solved. You mentioned that one issue is fixed, and with the above I am trying to anticipate other possible scenarios and solutions.
Background
I run LND compiled from source on an Ubuntu VPS. I use Bitcoin Core as my backend, pruned to 100GB (
prune=100000
). I've noticed on various occasions that LND loses its sync to chain ("synced_to_chain": false) and does not recover by itself. A restart always fixes the issue promptly, although in some cases LND cannot be gracefully shut down. I have various peers with the same issue, all running a very similar setup. The issue seems to be related to pruning the node: the more aggressively you prune the node, the more often LND will lose sync to the chain.
Your environment
0.17.1-beta commit=lightning-terminal-v0.12.1-alpha-dirty
Linux 5.4.0-167-generic
Version: 240100
prune=100000
lnd.tor.active=true
lnd.tor.skip-proxy-for-clearnet-targets=1
Steps to reproduce
Tell us how to reproduce this issue. Please provide stacktraces and links to code in question.
Expected behaviour
I expect LND to stay synced to chain as long as the backend is synced to chain.
Actual behaviour
LND occasionally loses sync to chain and does not recover while Bitcoin Core is running and synced.
I have about 1.5h of logs and a go profile attached below.
04lnd.log
03lnd.log
02lnd.log
01lnd.log
goprofile.txt