
[TierTwo] Fix and improve several synchronization problems #2659

Merged: 17 commits, Dec 14, 2021

Conversation

furszy

@furszy furszy commented Dec 2, 2021

A never-ending story: more improvements and corrections for the tier two initial synchronization.

This time, the following points are tackled:

  1. Fix a bad workflow that causes the budget data initial synchronization to be skipped.
    If, for example, we received a proposal 50+ seconds ago while we were synchronizing another asset, then once the budget sync starts it will look as if the sync is already finished when it is not: the node declares the sync completed without waiting to receive any budget data.

  2. Decrease the mnw sync threshold to 4 peers to "minimize" the insane amount of network flood and initial tier two sync delay.
    The MNW initial sync exists mostly for historical reasons; getting 20k+ items per peer (120k+ items total with the previous threshold of 6) is a large amount of duplicated data relayed over the wire. 80k is still very high but.. it's something we have to live with until v6.0 (which removes the need for mnw messages entirely..).

  3. Fix an extra sync attempt for the mnlist asset that was wrongly incremented in the sporks sync flow.

  4. Connect the not-connected ProcessSyncStatusMsg (formerly masternodesync.ProcessMessage()). This is why the tier two sync stats counters were always at 0: the function was being skipped.

  5. The node, after triggering the internal budget data relay process (relaying budget data to every known peer), invalidly blocks every remote peer, for an hour, from follow-up budget sync requests.

  6. The orphan votes maps link a single proposal/budget parent hash to a single vote, discarding the previously stored item every time a new orphan arrives. This blocks any follow-up reception of the discarded orphan votes, preventing their connection.
 (edge case scenario: splitting risk)

  7. Added test coverage for:

    1. Single proposal sync based on an orphan proposal vote reception.
    2. Single budget finalization sync based on an orphan finalization vote reception.
    3. Orphan proposal votes and orphan budget fin votes connection when parent is received.

  8. Removed an unneeded extra network sync roundtrip for items: there is no need to trigger an inv-based sync when the peer sent us a single proposal sync request. Also stop walking through the budget and proposal maps, locking their mutexes, on every single-item proposal/bud_fin request from a peer.


  9. Generalized the budget full sync items relay process.

Let's aim to include this work in the upcoming v5.4.0.

@furszy furszy self-assigned this Dec 2, 2021
@furszy furszy added this to the 5.4.0 milestone Dec 2, 2021
@furszy
Author

furszy commented Dec 2, 2021

Added another commit, co-authored with zebra:
The node, after triggering the internal budget data relay process (relaying budget data to every known peer), invalidly blocks every remote peer, for an hour, from follow-up budget sync requests.

-- Added issue to PR description as well --

@furszy furszy changed the title [TierTwo] Fix and improve several synchronization problems [WIP][TierTwo] Fix and improve several synchronization problems Dec 3, 2021
@furszy furszy force-pushed the 2021_mnsync_more_improvements branch 2 times, most recently from c31bba9 to 84678a8 Compare December 4, 2021 01:03
@furszy
Author

furszy commented Dec 4, 2021

Final update: fixed a scenario that was causing the node to reject and not connect budget system votes, removed extra tier two sync message roundtrips (unneeded network flood), expanded test coverage, and more:

  1. The orphan votes maps link a single proposal/budget parent hash to a single vote, discarding the previously stored item every time a new orphan arrives. This blocks any follow-up reception of the discarded orphan votes, preventing their connection.
 (edge case scenario: splitting risk)

  2. Added test coverage for:

    1. Single proposal sync based on an orphan proposal vote reception.
    2. Single budget finalization sync based on an orphan finalization vote reception.
    3. Orphan proposal votes and orphan budget fin votes connection when parent is received.

  3. Removed an unneeded extra network sync roundtrip for items: there is no need to trigger an inv-based sync when the peer sent us a single proposal sync request. Also stop walking through the budget and proposal maps, locking their mutexes, on every single-item proposal/bud_fin request from a peer.


  4. Generalized the budget full sync items relay process.

@furszy furszy changed the title [WIP][TierTwo] Fix and improve several synchronization problems [TierTwo] Fix and improve several synchronization problems Dec 4, 2021

@random-zebra random-zebra left a comment


Great set of changes 👍 Code ACK with a couple of suggestions.
Will run some tests.

Review comments (outdated, resolved) on:
  • src/masternode-sync.cpp
  • src/budget/budgetmanager.cpp (×5)
@furszy
Author

furszy commented Dec 6, 2021

After further convos with zebra about the orphan maps cleanup functionality, added the following changes:

  1. Limit orphan maps to 10k entries max.
  2. Clean up orphan map entries whose parent item hasn't been received within an hour of the last received vote time.

Ready to go now.

@furszy furszy force-pushed the 2021_mnsync_more_improvements branch from da5348b to c2b8426 Compare December 7, 2021 14:04
@furszy
Author

furszy commented Dec 7, 2021

Updated per feedback: the seen maps are now also cleaned during the orphan votes cleanup process.

@furszy furszy force-pushed the 2021_mnsync_more_improvements branch 5 times, most recently from ecc3e4b to 5894675 Compare December 8, 2021 03:20
@furszy
Author

furszy commented Dec 8, 2021

Last update:

  • Fixed a serious bug that essentially blocks the node from relaying budget data to other peers during the incremental sync process.
  • Added test coverage for the budget incremental synchronization process (which is the last budget sync flow that was not covered by tests).
  • And cherry-picked zebra's 94e5228, which prevents the removal of valid votes from the seen map. This fixes another problematic budget incremental sync issue where the node isn't able to answer known valid item requests.

Let's give it a good testnet testing round now :).

@furszy furszy force-pushed the 2021_mnsync_more_improvements branch from 94e5228 to e8c6fec Compare December 8, 2021 15:25
If we received, for example, a single proposal 50+ seconds ago, then once the budget sync starts (right after this call), it will look like the sync is finished: the node will not wait to receive any budget data and will declare the sync over.
The MNW initial sync exists mostly for historical reasons; getting 20k+ items per peer (120k+ items total with the previous threshold of 6) is a large amount of duplicated data relayed over the wire. 80k is still very high but.. it's something we have to live with until v6.0.
…s flow.

As we increase `RequestedMNAttempt` right after calling `SwitchToNextAsset()`, the mn list sync starts with an invalid extra sync attempt.
Before, this function was being totally skipped because of the tiertwo_networksync dispatcher function. The dispatcher internally processes `syncstatuscount` messages and returns true, which marks the message as processed; thereby it never reached the `masternodeSync.ProcessMessage` call in net_processing.

@random-zebra random-zebra left a comment


ACK 6f79df3

@Fuzzbawls
Collaborator

Having spent the past few days stress testing this on mainnet (which has a much more realistic amount of layer 2 overall data than testnet), there were indeed some inconsistent results prior to the most recent push, some of which have stabilized when doing repeated testing.

Results so far (each test was done 50 times on linux and 50 times on macOS):

  • Incremental sync for a node already running after first syncing layer 2 data: GOOD
  • Initial sync after immediately restarting a node that had a synced layer 2 data previously: GOOD
  • Initial sync with a node that has no on-disk layer 2 cache data: INTERMITTENT ^

^ Some test runs without any on-disk layer 2 cache data resulted in a partial budget vote count, while others did not. The ones that "failed" had a noticeably extended period of time spent on the MNW portion of the sync.

This test does not yet have a large number of runs due to the extended amount of intentional downtime required:

  • Initial sync after starting a node with 130+ minutes of downtime and previous on-disk layer 2 cache data: INTERMITTENT

Mostly on macOS, encountered numerous times where the initial layer 2 sync ended with partial vote counts. An incremental sync afterwards would recoup the missing votes.

I also have observed a peer-banning issue right after the node catches up with the chain tip and starts the layer 2 sync that looks to be related to mnping messages, but unrelated to the changes here.

furszy added a commit that referenced this pull request Dec 12, 2021
800ed43 Move MESSAGE_START_SIZE into CMessageHeader (furszy)
3296b9e net_processing:PushTierTwoMsg remove unneeded chain height (furszy)
57817e5 net: do not log "notfound" message as unknown. (furszy)

Pull request description:

  Grouped a few findings discovered while testing #2659.

  * The "notfound" message is not being processed properly, logging an unnecessary "unknown message type" error.
  It became even more evident when syncing a peer running #2659 from a normal network peer, because of the non-updated peer blockage from relaying budget items described there (the `ClearSeen` bug..).

  * Then some extra refactoring: removing the unneeded chain height argument in `PushTierTwoGetDataRequest` and adapting 2c09a52 to our sources as well.

ACKs for top commit:
  random-zebra:
    utACK 800ed43
  Fuzzbawls:
    ACK 800ed43

Tree-SHA512: 9db984be2baecaf996ca37f27cb41b91d3047e9e987716f15e1d677b78aab67f4e4f9ff16af62fb68ffba41ead7cb88ccd49dc4ab0f88e55e5bdb8380d174957
@furszy
Author

furszy commented Dec 12, 2021

Are you testing this PR between updated nodes or only against the current network?

Some important points that are all correlated if you are connecting your node to the network:

  • The full incremental sync flow as currently deployed in the network does not work (only the partial sync works under certain circumstances). Clearing the seen votes maps right before the data relay prevents non-updated peers in the network from syncing anyone around them. This is solved by 34e9c23, which is why connections between updated nodes are needed in order to complete the entire process successfully.

  • For the very same reason, on the intermittency topic: as there are peers with empty seen maps, and as these maps are used to answer getdata requests, a good number of peers in the network can't respond to budget vote data requests (until someone relays the items to them again..), replying to each request with "notfound" messages (hence the detection of [Net_processing] Do not log "notfound" msg as unknown. #2676). You can verify this by enabling the 'net' logging level and looking for the "notfound" responses when you see the initial sync intermittencies; your node will receive one "notfound" per vote item request.

  • Another point on this topic: non-updated peers in the network also block budget item requests for an hour if the node asks for single items. So, when the mnw process takes a while and the node receives a single proposal item (before the budget sync step), it will ask for the proposal/budget, the remote peer will answer it and then block any follow-up budget sync request, making the next tier two full budget sync request useless as the remote peer will not answer it. (This is solved by 5f596b8 and further cleaned by 31e40a4.)

Collaborator

@Fuzzbawls Fuzzbawls left a comment


Threw more nodes running this into the connection pool, all good now.

ACK 6f79df3

@random-zebra random-zebra merged commit fd19260 into PIVX-Project:master Dec 14, 2021
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
no functional change.

Github-Pull: bitcoin#2659
Rebased-From: fd7d569
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
If we received, for example, a single proposal 50+ seconds ago, then once the budget sync starts (right after this call), it will look like the sync is finished: the node will not wait to receive any budget data and will declare the sync over.

Github-Pull: bitcoin#2659
Rebased-From: 7349750
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
The MNW initial sync exists mostly for historical reasons; getting 20k+ items per peer (120k+ items total with the previous threshold of 6) is a large amount of duplicated data relayed over the wire. 80k is still very high but.. it's something we have to live with until v6.0.

Github-Pull: bitcoin#2659
Rebased-From: 7020310
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
…s flow.

As we increase `RequestedMNAttempt` right after calling `SwitchToNextAsset()`, the mn list sync starts with an invalid extra sync attempt.

Github-Pull: bitcoin#2659
Rebased-From: 2c3b0e8
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
Before, this function was being totally skipped because of the tiertwo_networksync dispatcher function. The dispatcher internally processes `syncstatuscount` messages and returns true, which marks the message as processed; thereby it never reached the `masternodeSync.ProcessMessage` call in net_processing.

Github-Pull: bitcoin#2659
Rebased-From: ae28000
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
…ngle budget finalization.

Nor block them if it's us who started the budget sync process.

Co-authored-by: random-zebra <random.zebra@protonmail.com>

Github-Pull: bitcoin#2659
Rebased-From: 5f596b8
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
…ed to broadcast an inv if the peer is asking for the item directly..) and stop walking through the two maps (budgets and proposals), locking their mutexes, when the peer is requesting a single proposal/budfin.

Github-Pull: bitcoin#2659
Rebased-From: 31e40a4
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
…al vote reception and single finalization sync based on an orphan finalization vote.

Github-Pull: bitcoin#2659
Rebased-From: 32196b1
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
…a single vote, discarding the previously stored item every time a new orphan arrives. This blocks any follow-up reception of the discarded orphan votes, preventing their connection forever.

Github-Pull: bitcoin#2659
Rebased-From: 62e9f5e
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
…(1) single items sync requests and (2) full budget sync requests.

Github-Pull: bitcoin#2659
Rebased-From: 62ae606
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
…reset.

The seen maps are used to answer inventory requests (to get the item serialized without locking the main budget maps mutexes). If they are cleared prior to the full budget invs relay, the node will not be able to respond to any follow-up getdata..

Plus, decrease the full budget data relay interval to once per week instead of once every 14 days.

Github-Pull: bitcoin#2659
Rebased-From: 34e9c23
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
Essentially a node dropping the budget data and waiting until any of the connected peers relays the complete budget data once more (this happens every 28 blocks in regtest).

Github-Pull: bitcoin#2659
Rebased-From: d60a105
furszy referenced this pull request in furszy/bitcoin-core Dec 14, 2021
And clear the orphan maps as well.

Github-Pull: bitcoin#2659
Rebased-From: 6f79df3
furszy added a commit that referenced this pull request Dec 15, 2021
c9b1909 [net] Remove assert(nMaxInbound > 0) (MarcoFalke)
d629df0 [Refactor] Add ReloadMapSeen function to reload "seen" votes (random-zebra)
7427d44 test: add coverage for the incremental budget sync flow. (furszy)
241627c Budget incremental sync: do not clear seen maps when relay status is reset. (furszy)
61477a5 [Refactor] Budget: check mn-enabled/vote-time before adding to mapSeen (random-zebra)
318fc7c Budget: limit orphan votes maps to 10k entries max. (furszy)
55f3ef9 [Refactor] Simplify CBudgetManager::CheckOrphanVotes() Github-Pull: #2659 Rebased-From: d04f5be (random-zebra)
e5ad197 Budget: Decouple the all-in-one `Budget::Sync()` into two functions: (1) single items sync requests and (2) full budget sync requests. (furszy)
5fd74e4 tiertwo sync: fix orphans votes maps linking single prop/bud hash to a single vote, discarding the previous stored item every time that a new orphan arrives. Blocking any follow-up reception of the orphan votes that were discarded, prohibiting its connection forever. (furszy)
4cdb587 test: add coverage for single proposal sync based on an orphan proposal vote reception and single finalization sync based on an orphan finalization vote. (furszy)
979bb2c budget: generalize budget full sync items relay. (furszy)
736363d Budget sync: Remove useless extra item sync p2p msgs roundtrip (no need to broadcast an inv if the peer is asking for the item directly..) and stop walking through the two maps (budgets and proposals), locking their mutexes, when the peer is requesting a single proposal/budfin. (furszy)
fe6fe56 Tiertwo: do not block peers if they requested a single proposal or single budget finalization. (furszy)
ebbef96 masternodesync: connect the not-connected ProcessSyncStatusMsg. (furszy)
281c061 masternodesync: fix extra `RequestedMNAttempt` increment in the sporks flow. (furszy)
d95fb9b masternodesync: decrease mnw sync threshold to 4 peers. (furszy)
cfc15b6 masternodesync: reset last budget item time before start syncing it. (furszy)
94f1c52 masternodesync: re-format peer version check. (furszy)
946444e Fix ignoring tx data requests when fPauseSend is set on a peer (Matt Corallo)
10cedd0 GUI governance: add missing strings to the translations process. (furszy)
dc10e17 GUI governance: fix proposal name and URL size limit validation. (furszy)
594acc6 [GUI] Fix governance nav button hover css (Fuzzbawls)
fd8d844 GUI: governance, IBD check before opening the proposal creation wizard. (furszy)
33f5dd8 [GUI] Fix proposal tooltip menu location skewing (Fuzzbawls)
55619cc GUI: Differentiate Budget payment transaction records from MNs block reward records. (furszy)
139596e Refactor: Move MN block reward to chainparams. (furszy)
1a55752 GUI: Fix remaining item naming conflicts (Fuzzbawls)
5925150 GUI: Use sorted layout/spacer names in governance UIs (Fuzzbawls)
dfeb3c7 [GUI] Don't show UTXO locking options for shield notes (Fuzzbawls)
6dd96b9 [BUG] GUI: invalid locking of shield notes in coin control Github-Pull: #2661 Rebased-From: d281cde (random-zebra)
1d7f86e Refactor: remove circ depend. primitives/transaction <-> script/standard (random-zebra)
cb3a5b4 Refactor: remove circ dependency checkpoints <-> chainparams (random-zebra)
e02e70e Refactor: remove circ dependency checkpoints <-> validation (random-zebra)
e1a5d0a Break circular dependency: init -> * -> init by extracting shutdown.h (random-zebra)
2d29c42 Cleanup: remove unused init.h includes Github-Pull: #2646 Rebased-From: b09422f (random-zebra)
95c7f2e cleanup: fix header includes sorting and copy Github-Pull: #2642 Rebased-From: 525c2e7 (random-zebra)
fdd4045 scripted-diff: replace boost::optional with Optional<> wrapper (random-zebra)

Pull request description:

  List of PRs included:

  * #2642
  * #2646
  * #2643
  * #2644
  * #2661
  * #2662
  * #2663
  * #2671
  * #2664
  * #2670
  * #2665
  * #2672
  * #2659
  * #2677
  * #2682

ACKs for top commit:
  random-zebra:
    ACK c9b1909
  Fuzzbawls:
    ACK c9b1909

Tree-SHA512: 34871cd7d527302ec6bd3924f4ca4322116952c8b69317831b2652165fbf1c566f893dace9a2998823c6781c4e578c2eeb9bc367fa39f632235aee7f45a68306
@furszy furszy deleted the 2021_mnsync_more_improvements branch November 29, 2022 15:43