-
Notifications
You must be signed in to change notification settings - Fork 707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Block production/forking issues caused by malicious/not synced peers #5624
Comments
Hey sorry for the late reply. At some point I already started to write an answer to this issue, but I forgot about it. I also think that the issue is real and should be fixed. However, maybe we can make the block production a little bit smarter. Like we should check the timestamp of the best block and if that is too much behind the current time, the node should not try to produce a block. This way we could entirely stop relying on what we think the network state is. For sure we should introduce some kind of fallback cli flag to force block production whatever the best block is. This would help when there was a bug and requires to hardfork the network or whatever. Probably this "did we imported a recent block" recently could be used everywhere to replace the |
Feels like a hack rather than a proper solution, but it should work. What I don't particularly like about this is that it assumes there is a way to derive time from a block. While both BABE and Subspace have it and many other consensus engines too, I'm not sure this is a universal enough concept to make it a hard requirement. That said the fact that it doesn't require any network-level changes is appealing. We had to launch mainnet with a fork that has
We already have |
We don't need to make it a hard requirement for the entire of Substrate, but for BABE/AURA etc. Also Substrate has other assumptions already backed in. Do you have any other ideas that would make it less hacked? Depending on the networking is IMO not a good solution because we can never really trust it and we will see similar issues. |
We know protocols like ouroboros chronos that estimate some network time from others recent block times, btw.
At least for validators, we'll eventually more this on-chain:
|
I wouldn't call it networking, this is part of the sync mechanism. Just like nodes announce to each other what their tip is so node can decide whether there is a better tip to sync to, they can also indirectly announce that they see a higher tip, but but didn't verify/import it yet, hence they are still syncing themselves and node can ignore their tip because it isn't yet accurate anyway.
I have no idea what back certificates are and we don't have Proof-of-Stake in our protocol either, so I don't think it would be applicable anyway. |
Okay sounds like a good idea as well. |
Long time ago we had an issue with forking on our permissionless network that resulted in us carrying this Substrate patch for a few years now: autonomys@9b09c04
I'd like to finally upstream a version of that change in some form. We were using that patch since August 2022 on various test networks with thousands of consensus nodes. It works and does what it is supposed to.
I tried to convince Santiago and other folks from builders program that this is an issue, but it didn't go anywhere back then.
What we observed is that when many thousands of new consensus nodes joined a relatively small network within very short period of time (hours), many of them started forking despite not being synced fully. Turned out the reason is that their peers were not synced either, so as far as they are concerned, they are at the tip of the chain they can observe, even though it is not the best chain in absolute terms.
Since in contrast to Proof-of-Stake Polkadot our network is fully permissionless, it affects us to a larger degree. It still affects Proof-of-Stake chains in a sense that this can disrupt syncing, but at least it will not result in forking most of the time due to permissioned nature of the consensus.
The solution I came up with (in above mentioned patch) was for nodes to voluntarily announce to each other if they think they are synced, such that their peers can use that information to improve decisions about sync target, ignoring those that are not synced yet. This results in a bit more deliberate bootstrapping of the new network where first node needs to be force-synced (hence added CLI option) and then at least 2 nodes on the network need to maintain connectivity at any time in order for them to remain "synced" and for block production to happen on the network.
BTW, I have never got to writing PoC, but I think the opposite extreme might be problematic even for Polkadot/Proof-of-Stake chains, where large number of "fake" peers pretending to be at a higher block level could force node into major syncing and disrupt block production that way for a period of time until nodes discover that those peers are not actually presenting correct tip. The effect from this will depend on request/response timeouts and such, but likely is a real test vector to which I do not have an easy solution.
The text was updated successfully, but these errors were encountered: