Peer connections hang temporarily or fail permanently #1435
Comments
Yeah, the PR was not merged when I started that test earlier today. Thanks, I will reopen if I see it again.
I am seeing the same issue with an updated branch that includes #1425: no response from the sync component after 30 minutes. A restart fixes it.
@oxarbitrage if you have tens of gigabytes of spare disk space, you can help us diagnose the issue using a trace log:
If you don't have lots of disk space, using |
I was able to find the problem in debug mode, so I have some logs: https://gist.github.com/oxarbitrage/d2369fd4e5abb24e683e5a6301fcc6ff#file-sync_debug_logs_mainnet-txt I haven't researched much yet, but one possibility is that I am losing my connection locally (my ADSL changes IP every 12 hours or so) and by the amount of I will confirm this, and can post bigger logs if needed.
Yeah, it seems that is what is happening. I resumed and blocks downloaded fine, then I cut my connection on purpose while zebra was still running. I waited about 30 seconds and reconnected, but zebra was never able to download more blocks, even long after my connection was active again. I have some logs for this too:
If the cut is short enough (I tested a cut of 3-5 seconds) and there is no time to mark all the peers as failing, then the sync will continue.
The cut has to be long (at least 1 minute here) to be totally unrecoverable in the same session. With a cut longer than 1 minute, I sometimes get these warnings long after the connection is restored:
They at least indicate that something is going on with the network at the default info level. In some other runs I am not seeing these warnings; I suppose this can depend on how long the connection was off.
Hmm, I don't think that the failure of an existing connection causes a peer address to be put in the "failed candidates" list. That's a list of peers where the connection attempt failed. (It's not a permanent rejection, it's just the lowest preference for reconnection attempts.) The peer set will attempt to reconnect to peers it disconnected from, by waiting until they could no longer be connected (i.e. waiting beyond a timeout interval), then treating them as high-priority candidates for new connections. The peer set is supposed to send a demand signal when it does not have enough peers, causing attempts to connect to new peers. It's unclear why that mechanism isn't working.
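To make the candidate ordering described above concrete, here is a minimal sketch. It is not Zebra's actual code: `PeerState`, `Candidate`, `reconnect_priority` and `next_candidate` are hypothetical names, and placing never-attempted addresses in a middle tier is an assumption made only for illustration. The comment above only states that previously connected peers become high-priority candidates after a timeout, and that failed attempts are the lowest preference.

```rust
use std::time::{Duration, Instant};

/// Hypothetical peer states, named for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PeerState {
    /// We last exchanged messages with this peer at `last_seen`.
    Responded { last_seen: Instant },
    /// We know the address but never tried it (assumed middle tier).
    NeverAttempted,
    /// Our last connection attempt to this address failed.
    Failed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    addr: &'static str,
    state: PeerState,
}

/// Lower value means "try sooner". `None` means the peer is not a
/// reconnection candidate yet, because it may still be connected.
fn reconnect_priority(state: PeerState, now: Instant, timeout: Duration) -> Option<u8> {
    match state {
        PeerState::Responded { last_seen } => {
            // Only reconnect once the peer has been silent beyond the timeout.
            if now.saturating_duration_since(last_seen) > timeout {
                Some(0)
            } else {
                None
            }
        }
        PeerState::NeverAttempted => Some(1),
        // Failed attempts are not rejected forever, just tried last.
        PeerState::Failed => Some(2),
    }
}

/// Picks the best available candidate, or `None` if there are none.
fn next_candidate(peers: &[Candidate], now: Instant, timeout: Duration) -> Option<&Candidate> {
    peers
        .iter()
        .filter_map(|p| reconnect_priority(p.state, now, timeout).map(|prio| (prio, p)))
        .min_by_key(|(prio, _)| *prio)
        .map(|(_, p)| p)
}
```

Under this model, a failed address is still returned when nothing better is available, so a long outage that marks every peer as failed should not, by itself, leave the candidate list empty.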
sync
component dead during sync
I'm also seeing this issue: it looks like my machine or router stopped allowing connections for a few hours this morning. When it came back up, Zebra reconnected to 1 peer, but has failed to reconnect to any other peers. Edit: this is on testnet, so I should be seeing 5-20 active peers.
I've just seen a Testnet Zebra rapidly lose all its peers, and never reconnect:
I've taken the triage off this bug, because we already triaged some of its duplicates into the first alpha.
Between each warning, here is what I can get at trace level:
About every minute (connected or not) we get this one: https://github.com/ZcashFoundation/zebra/blob/main/zebra-network/src/peer_set/set.rs#L452 So we have a working loop, but it seems that the demand signal at the following line is not working.
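For readers unfamiliar with the idea, a demand signal of the kind discussed here can be sketched as follows. This is an illustration under stated assumptions, not the code at the linked line: `signal_demand` and `DemandError` are hypothetical names, and a bounded `std::sync::mpsc` channel stands in for whatever channel the peer set really uses, only to keep the example self-contained. The sketch reports a closed channel as an error instead of ignoring it, so a caller could log it.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical error for this sketch: the task that opens new
/// connections has dropped its end of the demand channel.
#[derive(Debug, PartialEq, Eq)]
enum DemandError {
    ConnectorGone,
}

/// Queues one demand signal per missing peer and returns how many
/// signals were actually queued.
fn signal_demand(
    demand: &SyncSender<()>,
    ready_peers: usize,
    target_peers: usize,
) -> Result<usize, DemandError> {
    // No demand is needed when we already have enough ready peers.
    let missing = target_peers.saturating_sub(ready_peers);
    let mut queued = 0;
    for _ in 0..missing {
        match demand.try_send(()) {
            Ok(()) => queued += 1,
            // The channel is full: demand is already pending, so stop here.
            Err(TrySendError::Full(())) => break,
            // Nobody is listening: surface this instead of dropping it silently.
            Err(TrySendError::Disconnected(())) => return Err(DemandError::ConnectorGone),
        }
    }
    Ok(queued)
}
```

Logging the returned count at debug level on each loop iteration would show whether demand is being generated at all while the peer set is empty.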
We just merged #1468, which should provide better diagnostics for this issue, and panics if our assumptions about the peer state machine don't hold.
We think #1531 will fix most of these issues - once it's merged we'll close this ticket. Any new hangs should have separate tickets - ideally based on the part of Zebra that's hanging.
Testing shows that #1531 doesn't fix this issue.
This issue persists after #1586, but it seems less frequent, and has a lower impact (until there are no peers at all, which still results in permanent failure).
When So there appear to be two issues here:
Analysis
Peer connection failures can result in a slow sync, temporary sync hangs, or permanent sync hangs.
These failures can happen after network interruptions, or due to normal connection churn. They might also happen because Zebra's protocol state machine gets in an invalid or unrecoverable state.
Next Steps
Here are some things we could try:
Version
zebrad 3.0.0-alpha.0
Current main branch
Edit: as of 2020-12-02
Platform
Linux oxarbitrage 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description
During sync on the mainnet, the sync component stopped responding. The program didn't crash or hang completely, as inbound requests were still being answered; however, no more blocks were downloaded. I have a log of this at https://gist.github.com/oxarbitrage/2f067fed9588c3d942e499d1252fc777 where the last message from the sync component is at https://gist.github.com/oxarbitrage/2f067fed9588c3d942e499d1252fc777#file-gistfile1-txt-L2843 and there are no further messages after 30 minutes. I tried stopping the program and starting it again. The sync resumed.
--
When syncing the Zcash blockchain using zebra, I would expect to download all the blocks from start to end without intervention. Instead, I had to stop (ctrl-c) the program and restart it to keep going.