Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking: sync correctness #884

Closed
33 of 44 tasks
hdevalence opened this issue Aug 11, 2020 · 26 comments
Closed
33 of 44 tasks

Tracking: sync correctness #884

hdevalence opened this issue Aug 11, 2020 · 26 comments
Labels
C-tracking-issue Category: This is a tracking issue for other tasks

Comments

@hdevalence
Copy link
Contributor

hdevalence commented Aug 11, 2020

Tracking issue for issues around sync correctness, getting confused downloading blocks, etc.

First Alpha

Hangs and Panics:

Logging:

Performance:

Duplicate Downloads:

Cleanup:

Reduce Sync Restarts:

Work out if we need to remove duplicate downloads using:

  • the list of hashes from earlier queries in the current ObtainTips or ExtendTips (download_set) [Yes, as done in the current code]
  • the list of hashes that are being downloaded and verified in spawned futures [No, because we deduplicate in each extension and we check that each extension is an extension]
  • the list of hashes that have failed download or verify recently [No, because failures cause us to restart with clean sync state]
  • the hashes of verified blocks in the state [Only in obtain_tips, as done in the current code].

Design Questions:

Database:

Future Releases

Documentation:

  • Document minimum network requirements for Zebra
    • a minimum bandwidth of 10 Mbps
    • a maximum latency of 1 second
    • we don't support running Zebra over Tor
  • Document the peerset target size config, as a useful option for bandwidth-constrained nodes
    • test the minimum supported peerset target size, and add it to the docs (it's probably 4-12)
  • Check Zebra's memory usage during sync, and document the minimum requirement
  • Update the sync RFC to document the new sync algorithm Retcon new sync logic into RFC1 #899

Performance improvements:

Possible Future Work:

  • Consider disconnecting from peers that return bad responses

  • Consider disconnecting from peers that are sending blocks that we no longer want

  • Churn peers, using a regular timer to disconnect a random peer

  • Analyse the bandwidth and latency of current Zcash mainnet and testnet peers

  • Create a peer reputation service

  • Refactor out the core sync algorithms, and write unit tests for them Write unit tests for ObtainTips and ExtendTips #730

@hdevalence hdevalence added E-med C-tracking-issue Category: This is a tracking issue for other tasks labels Aug 11, 2020
@hdevalence
Copy link
Contributor Author

@teor2345 I think we don't need the sync to be able to go forward a block at a time, since as new blocks are generated we'll get them through gossip (to be implemented in #889).

@hdevalence
Copy link
Contributor Author

On the BLOCK_TIMEOUT question, there are actually two timeouts -- the one applied in the timeout layer and the one used internally in the peer set's state machine (set to 10 seconds). So if we want to have timeouts greater than 10 seconds we'll need to update that timeout as well. With a 1-second latency allowance that would be 1.8 Mbps. Maybe that's too high, although I don't think that it's unreasonable for the server use case that Zebra aims for.

I think it would be fine to increase the timeout to a somewhat larger value, and increase the lookahead limit to compensate. What's a reasonable target speed?

@teor2345
Copy link
Contributor

@teor2345 I think we don't need the sync to be able to go forward a block at a time, since as new blocks are generated we'll get them through gossip (to be implemented in #889).

Sounds good, and I guess it's ok to lag slightly in our first release. I'll update that TODO.

@teor2345
Copy link
Contributor

On the BLOCK_TIMEOUT question, there are actually two timeouts -- the one applied in the timeout layer and the one used internally in the peer set's state machine (set to 10 seconds). So if we want to have timeouts greater than 10 seconds we'll need to update that timeout as well. With a 1-second latency allowance that would be 1.8 Mbps. Maybe that's too high, although I don't think that it's unreasonable for the server use case that Zebra aims for.

I think it would be fine to increase the timeout to a somewhat larger value, and increase the lookahead limit to compensate. What's a reasonable target speed?

I'm not sure if we need to make any changes here.

Requiring less than 10 Mbps is reasonable for servers. And a latency of 1 second should cover most servers, wifi, and even some mobile internet. But I don't know the exact answer, because download speed also depends on peer speeds and peer latency. We should do some testing 🤔

Do we expect people to be able to run Zebra over Tor?
Because that will decrease our bandwidth, and significantly increase our latency:

For Tor via exits, it seems like 6 seconds should be enough for most blocks:
https://metrics.torproject.org/torperf.html?start=2020-05-15&end=2020-08-13&server=public&filesize=1mb

For Tor via onion services, we'd need to allow 20 seconds per block:
https://metrics.torproject.org/torperf.html?start=2020-05-15&end=2020-08-13&server=onion&filesize=1mb

(Note that these are the 1 MB graphs, there are also 5 MB graphs, but we only need 2 MB.)

@hdevalence
Copy link
Contributor Author

Sounds good, and I guess it's ok to lag slightly in our first release. I'll update that TODO.

To clarify, I don't think we'll be lagging here -- in initial block sync, we'll sync up to the current chain tip, and we'll stay in sync through block gossip.

@hdevalence
Copy link
Contributor Author

I'm not sure if we need to make any changes here.

Okay, let's do nothing for now. I don't think that we can run Zebra over Tor anyways, since the Bitcoin protocol isn't really compatible with Tor.

@hdevalence
Copy link
Contributor Author

Updated the list of checks with rationale about sync behavior.

@teor2345
Copy link
Contributor

I updated the list of outstanding issues, I'll check them off as I submit PRs.

Even after these changes, Zebra still doesn't limit the number of requests which are waiting for peer responses, or handle peer reputation. So I've added them to the "future work" section.

@hdevalence
Copy link
Contributor Author

I don't think it makes sense for us to limit the number of requests waiting for peer responses, because I don't think that there's a way for us to work out in a principled way what an appropriate limit for the number of in-flight requests should be. Instead I think it would be better to use backpressure from the peer set to block new requests when the peer set is at capacity, which is what we already do. If this isn't working well, I would prefer to try to figure out why, and how we could improve it. Perhaps this could be addressed by setting the peer set's target size to a smaller parameter.

@teor2345
Copy link
Contributor

Yeah, that was basically the conclusion I came to after thinking about it a bit more.

It might also be something we want to document for users - if they are on a slower network, try setting the target peerset size lower.

@teor2345
Copy link
Contributor

I have split this tracking issue into "First Alpha" and "Future Releases".

@mpguerra mpguerra added the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Nov 17, 2020
@mpguerra
Copy link
Contributor

What is left here for the Alpha release now?

I have removed #1183 from the Alpha milestone and #1182 is done.

Do we also want to implement the logging tickets?
How much of the testing (if any) has been done?

@teor2345
Copy link
Contributor

teor2345 commented Nov 23, 2020

These bugs make testing really difficult:

I still can't get Zebra to reliably sync to the tip. So we haven't done enough testing yet - we need to do a full sync after we've fixed all the common errors.

@teor2345
Copy link
Contributor

@hdevalence said on discord that tower-rs/tower#475 is fixed by tower-rs/tower#484.

For some reason the ticket is still open.

@mpguerra
Copy link
Contributor

Can we move this out of Alpha Release milestone now that #1183 is closed?

@teor2345
Copy link
Contributor

Can we move this out of Alpha Release milestone now that #1183 is closed?

What should we do about these first alpha testing tasks?

  • Manually run Zebra on Mainnet to verify that the sync works after recent fixes
    • Optional: also run Zebra on Testnet

We're still making changes that might affect sync behaviour.

@mpguerra
Copy link
Contributor

These sound to me more like ongoing checks that should be done by developers to validate their changes whenever they make changes to the sync code (or changes that may affect it) rather than finite one-time only tasks.

If these checks can't be automated my suggestion would be to add these to the PR template as a reminder for developers. Otherwise, if we can automate, let's create a ticket to implement these checks in our CI after the alpha.

@oxarbitrage
Copy link
Contributor

I think it will be good to document somewhere(maybe in the main README) machine specifications to sync the mainnet up to current tip. This is recommended RAM, disk space, CPU, etc. With those specs we can document an estimated time to be in sync, something like "at the moment of writing, with the recommended specs, sync up to block XXXXX will take around XX time" This way developers and users can rent appropriate services or try locally and know what to expect.

@mpguerra
Copy link
Contributor

I think it will be good to document somewhere(maybe in the main README) machine specifications to sync the mainnet up to current tip. This is recommended RAM, disk space, CPU, etc. With those specs we can document an estimated time to be in sync, something like "at the moment of writing, with the recommended specs, sync up to block XXXXX will take around XX time" This way developers and users can rent appropriate services or try locally and know what to expect.

We have some similar details in #1374

@oxarbitrage
Copy link
Contributor

I see, sorry i didn't saw that. Thanks.

@teor2345
Copy link
Contributor

teor2345 commented Dec 1, 2020

There isn't anything specific to be done for "testing on mainnet", so I removed those items, and moved this ticket out of the first alpha release.

@mpguerra
Copy link
Contributor

mpguerra commented Jan 5, 2021

adding #862 to this epic

@teor2345
Copy link
Contributor

teor2345 commented Jan 5, 2021

adding #862 to this epic

That ticket modifies the state service interface, which impacts sync and verification. So it doesn't really belong in sync correctness.

The state service cleanup in #1302 might be a better place for it.

@mpguerra
Copy link
Contributor

I would like us to review this tracking issue with a view to prioritizing outstanding tasks and potentially creating issues for those we still want to do so that we can close this issue.

If we decide not to close it yet, I think it probably should not be an epic but should be a github tracking issue and use github's task list to track progress

@mpguerra mpguerra removed the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Jan 25, 2022
@conradoplg
Copy link
Collaborator

I've added some comments below (in brackets) about the unfinished items, but most of them I'm not sure what to do. @teor2345 could you please take a look?

  • Document minimum network requirements for Zebra
    • a maximum latency of 1 second [is it?]
  • Document the peerset target size config, as a useful option for bandwidth-constrained nodes [maybe part of Document how to speed up full validation in the README #3101 ]
    • test the minimum supported peerset target size, and add it to the docs (it's probably 4-12) [do we know now?]
  • Update the sync RFC to document the new sync algorithm Retcon new sync logic into RFC1 #899 [it seems this doesn't exist anymore? We should document the sync algorithm anyway, I'd like to create an issue for this, but it's not a priority]

Performance improvements:

  • Work out how to improve genesis and post-restart sync latency, particularly on Testnet [do we need this? Seems to not be an issue anymore]

Possible Future Work:

@teor2345
Copy link
Contributor

* [x]  Document minimum network requirements for Zebra
  
  * [ ]  a maximum latency of 1 second [is it?]

It's effectively a 4 second RTT, I added this documentation task to #3101:

pub const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4);

* [ ]  Document the peerset target size config, as a useful option for bandwidth-constrained nodes [maybe part of [Document how to speed up full validation in the README #3101](https://github.com/ZcashFoundation/zebra/issues/3101) ]

I added this task to #3101

  * [ ]  test the minimum supported peerset target size, and add it to the docs (it's probably 4-12) [do we know now?]

We already have #704 for this, and the minimum size might change as we make fixes to Zebra. So we can do this task when Zebra is stable. (And we add tests for the minimum size.)

* [ ]  Update the sync RFC to document the new sync algorithm [Retcon new sync logic into RFC1 #899](https://github.com/ZcashFoundation/zebra/issues/899) [it seems this doesn't exist anymore? We should document the sync algorithm anyway, I'd like to create an issue for this, but it's not a priority]

We can do this documentation task whenever we have time.

Performance improvements:

* [ ]  Work out how to improve genesis and post-restart sync latency, particularly on Testnet [do we need this? Seems to not be an issue anymore]

It seems to be fixed, the post-restart delay is possibly too long, but restarts are very infrequent, so it doesn't matter. (And there's a tradeoff with restart loops on congested networks.)

Possible Future Work:

* [ ]  Consider disconnecting from peers that return bad responses [it seems we shouldn't do this? [Security: Stop disconnecting from nodes that send unexpected messages, to prevent disconnection attacks, Credit: Equilibrium #2107](https://github.com/ZcashFoundation/zebra/issues/2107)]

Yes, this is obsoleted and the security issue was fixed.

* [ ]  Consider disconnecting from peers that are sending blocks that we no longer want [[Security: Stop disconnecting from nodes that send unexpected messages, to prevent disconnection attacks, Credit: Equilibrium #2107](https://github.com/ZcashFoundation/zebra/issues/2107) applies too? i.e. we shouldn't do this]

Yes, this is obsoleted and the security issue was fixed.

* [ ]  Churn peers, using a regular timer to disconnect a random peer [don't know if we still need this]

I added this to:

* [ ]  Analyse the bandwidth and latency of current Zcash mainnet and testnet peers [don't know if we still need this]

If we need this, we should get someone else to do it for us. But it doesn't seem important right now.

* [ ]  Create a peer reputation service [don't know if we still need this]

I added this to:

* [ ]  Refactor out the core sync algorithms, and write unit tests for them [Write unit tests for ObtainTips and ExtendTips #730](https://github.com/ZcashFoundation/zebra/issues/730) [there's already an issue for it so no need to do anything]

After all those changes, I think we can close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: This is a tracking issue for other tasks
Projects
None yet
Development

No branches or pull requests

6 participants