Polkadot Incident 11.03.2023 #6862
Maybe because the dispute coordinator is overloaded, selecting the best block to build on is either too slow or doesn't return at all. This isn't supported by the log messages above, but given the amount of disputes it could be a reasonable explanation.
Other logs that I haven't analyzed yet: https://www.dropbox.com/s/42uettt221yb6lt/lambos.log?dl=0 (this is the validator that initiated the disputes) and https://drive.google.com/file/d/1eZgNfrruqPGCUQrzzEJzhtHFTJ8Oy7xX/view?usp=share_link
A validator that spent 4 seconds trying to build a block: https://gist.github.com/bLd75/da95a8c51716fa4e23807fc35fb9f68d The block was discarded in the end. Maybe the inherent took too long to process?
https://gist.github.com/matherceg/42f4c6c32a5fdaa9513801155a9e93d9 The availability-store subsystem is constantly stalling.
We've witnessed availability dying during past events? We've never seen this many disputes before? We're maybe doing too many things in parallel, like running too many availability recoveries at once.

Adversaries could race against approval-checker assignment announcements, so those should progress reliably and fast, but other subsystems could typically delay their work in one way or another. In particular, we should think about sequencing for disputes: if two parachain blocks are disputed in the same relay chain block, we could handle them semi-sequentially, which slows finality much more but leaves CPU time for other activities. We could similarly delay later disputes until earlier ones are resolved. We would not, afaik, need consensus on the ordering because we'll have to do them all anyway. Also, if we have a random sampling of disputes, we could favour those with more invalid votes first, because one invalid vote triggers rewinding the chain (a rough sketch of such an ordering is below).

We'll avoid no-shows if validators can delay heavy work somewhat, which matters since we do not have consensus upon the delay. We've discussed adding delay estimates into approval-checker assignment announcements, but never really analyzed how this could be abused. We should probably add some fake no-show system to a testnet, and eventually Kusama for a while, which would provide a different sort of stress testing than, say, glutton-pallet parachains.
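A minimal sketch of the "favour disputes with more invalid votes first" idea; `QueuedDispute` and its fields are purely illustrative and not actual dispute-coordinator types:

```rust
// Illustrative only: order queued disputes so those with more "invalid" votes
// are processed first, since a single invalid vote already forces a chain
// revert; ties are broken by total participation so disputes closest to
// conclusion are resolved earliest.
struct QueuedDispute {
    session: u32,
    candidate_hash: [u8; 32],
    invalid_votes: usize,
    valid_votes: usize,
}

fn order_disputes(mut queue: Vec<QueuedDispute>) -> Vec<QueuedDispute> {
    queue.sort_by(|a, b| {
        b.invalid_votes
            .cmp(&a.invalid_votes)
            .then_with(|| {
                (b.valid_votes + b.invalid_votes).cmp(&(a.valid_votes + a.invalid_votes))
            })
    });
    queue
}
```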
The disputes are not related to availability dying. They were probably caused by the bug mentioned in the original posting.
Logs from the last 24 hours: 2023-03-13-last-24-hours-JOE-V01.log
Did you remove any log lines? I see that the node is taking too much time to produce a block, but it also doesn't print the "Pre-sealed" message or that it took too long. This is also interesting:
An honest node should not trigger this?
Something to investigate is that block time degradation continued long after the last dispute was resolved, according to Subscan: https://polkadot.subscan.io/event?address=&module=parasdisputes&event=all&startDate=&endDate=&startBlock=&endBlock=&timeType=date&version=9370.
@ordian Yes, we've seen slow but steady block-time recovery on the parachains too.
Do we also have Prometheus metrics for this node?
Reproduced the issue on Versi and then used block benchmarking to look closer. To me this seems to be the problem: overweight blocks due to dispute statements. We should profile our runtime code and/or fix our benchmarks to ensure this doesn't happen.
Edit: bench machine CPU:
Related: paritytech/polkadot-sdk#849
A quick update. Here's the problematic backtrace from @sandreim's reproduction:
This takes up 88% of the time. 76% of the time is spent inside it. So some possible actions we could take to make this better:
@sandreim could you say how you measure execution time? How can I get those numbers on a local machine, is that possible at all? Please.
Makes sense to me. If one signature is invalid, we could discard the whole dispute set to avoid duplicate checks. Currently, we're just discarding one vote (polkadot/runtime/parachains/src/disputes.rs, lines 1090 to 1100 at 5feae5d).
Otherwise, we'd have to fall back to individual signature checks in case the batch verifier fails, IIUC.
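A rough sketch of that fallback, with `batch_verify` and `verify_single` as stand-ins for whatever verification primitives the runtime actually exposes (both are assumptions, not existing APIs):

```rust
/// Hedged sketch: run a cheap all-or-nothing batch check first, and only if it
/// fails fall back to checking every signature on its own. `batch_verify` and
/// `verify_single` are placeholders for the real verifier.
fn check_votes<V>(
    votes: &[V],
    batch_verify: impl Fn(&[V]) -> bool,
    verify_single: impl Fn(&V) -> bool,
) -> Vec<bool> {
    if batch_verify(votes) {
        // Fast path: one batched check covered every vote.
        return vec![true; votes.len()];
    }
    // Slow path: identify exactly which votes carry invalid signatures.
    votes.iter().map(|v| verify_single(v)).collect()
}
```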
There shouldn't be any disputes in the first place. They happen due to either bugs or slow hardware. This would also be addressed by slashing and validator disabling. However, it does make sense to allocate not the whole block weight to disputes, but maybe half of it at most. @eskimor WDYT?
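For illustration only, the "at most half the block for disputes" idea could look roughly like this; the `Weight` type, the percentage, and the per-statement cost are placeholders, not the runtime's real types or numbers:

```rust
// Hypothetical sketch: derive how many dispute statements fit into a capped
// share of the block, assuming each statement is dominated by one signature
// check. All names and numbers here are illustrative.
#[derive(Clone, Copy)]
struct Weight(u64);

fn max_dispute_statements(
    max_block_weight: Weight,
    dispute_share_percent: u64, // e.g. 50 for "at most half the block"
    per_statement_weight: Weight,
) -> u64 {
    let dispute_budget = max_block_weight.0 * dispute_share_percent / 100;
    dispute_budget / per_statement_weight.0
}

fn main() {
    // Assumed: ~2s of block weight (in picoseconds) and ~48us per signature check.
    let limit = max_dispute_statements(Weight(2_000_000_000_000), 50, Weight(48_000_000));
    println!("dispute statements that fit in half the block: {limit}");
}
```

Under those assumed numbers, half a block would fit roughly 20k statements.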
I wonder if the 48 microseconds weight of a single signature verification check varies greatly under load:
This makes sense to me. AFAIU the only downside would be additional finality lag. We need to continue load testing on Versi to figure out how much we can fill the block with votes. 50 to 75% sounds like a good interval to me.
That would likely be the case if the host is loaded and the thread running the runtime is not getting a full CPU core. We have some support for runtime metrics (requires a compile-time feature). I think we can create a histogram of the time spent there to properly visualise the variance.
Hmm... this is somewhat of an unfortunate API omission: when batch verifying, we can only fetch an aggregate "did all of the checks succeed?" instead of being able to tell which exact signatures are okay and which ones are not. Maybe it'd make sense to add an alternative host function to
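Purely hypothetical shape for such a host interface, just to make the gap concrete; nothing like `finish_with_results` exists in `sp_io` today:

```rust
// Hypothetical: instead of an all-or-nothing "did the whole batch verify?"
// result, report one flag per queued signature. None of these names exist.
pub trait BatchVerifyExt {
    /// Queue an ed25519 signature for batched verification.
    fn queue_ed25519(&mut self, sig: [u8; 64], pub_key: [u8; 32], msg: &[u8]);
    /// Verify everything queued so far and return one flag per queued
    /// signature, in queue order, rather than a single aggregate bool.
    fn finish_with_results(&mut self) -> Vec<bool>;
}
```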
We don't have batch verification currently available in the runtime. There was some work, but it is currently disabled. We could support batch verification, but only for the relay chain itself. For Parachains we had some discussion here: https://forum.polkadot.network/t/background-signature-verification/132.
I think some finality lag is acceptable, under the assumption that we will have slashing. Nevertheless, we need to ensure that this doesn't happen again. Even if validators are slashed etc., bugs can still happen and we may see a huge amount of disputes again.
To recap, we have currently identified two issues that occurred during the incident:
The second issue can be seen in this screenshot. The top panel shows disputes being raised and voted on; the bottom one shows votes being passed to create_inherent long after the disputes ended. That being said, we need to prioritize releasing version 3 of the parachain host API to get issue 2 fixed soon, but to solve issue 1 we need to optimize and/or reduce the amount of dispute votes we include. @tdimitrov is investigating the variance of signature checking during high-load dispute tests on Versi (using the runtime-metrics feature). That information should be useful to determine the percentage of the block we fill with votes.
Sounds like we have explanations for everything we have seen, @sandreim? So we can start writing a postmortem. Would you be open to tackling this?
Yeah, I think we can explain all that has been seen except the

Regarding the postmortem: yeah, I could start a writeup as soon as we have some more data about the variance of the signature checks.
Also, one more thing regarding speeding this up: it looks like there's an easy performance win we could get here by enabling SIMD in the
Before:
After:
That's almost a free ~30% speedup right there. There are two problems with this, though:
Can we do dynamic feature detection with an ifunc trick?
For what it's worth, no validator should be running on such old hardware, and if they do then they'll be too slow anyway. Another option would be to just detect this dynamically at runtime and only use the AVX2 codepath when it's supported, I guess. (This means even more changes to
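A minimal sketch of runtime feature dispatch in Rust, assuming the ed25519 backend exposed separate AVX2 and portable entry points; the function names are placeholders and the real change would have to live inside the crate itself:

```rust
// Sketch: pick the AVX2 codepath only when the running CPU supports it,
// falling back to the portable implementation everywhere else.
fn verify_batch(data: &[u8]) -> bool {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            // Safe to use the AVX2-accelerated backend on this CPU.
            return verify_batch_avx2(data);
        }
    }
    // Portable fallback for CPUs without AVX2 and for other architectures.
    verify_batch_portable(data)
}

fn verify_batch_avx2(data: &[u8]) -> bool {
    // Placeholder: the real implementation would call the SIMD backend.
    verify_batch_portable(data)
}

fn verify_batch_portable(data: &[u8]) -> bool {
    // Placeholder: portable scalar implementation.
    !data.is_empty()
}
```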
We should not forget that there are also other nodes in the network, not just validators. So having it dynamic would be a requirement. I hope no one runs a validator on these machines, but an ordinary node should be fine!
I'll see what I can do and try to make the
We're not already using AVX2? Yikes.
This is what cryptographers mean by batch verification: we merge the verification equations, so they all succeed or fail together. It's quite useful if you've many nodes verifying blocks, but annoying if you're processing individual messages which may contain spam, hence why afaik web2 never deployed it. In principle, you could bisect batches until you find the baddies, but only if you've some node reputation system, or batch by source, or similar.
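Sketch of that bisection idea, under the stated assumption that bad sources are rare or reputation-limited; `batch_verify` is an assumed all-or-nothing primitive and the indices are just for illustration:

```rust
// Recursively split a failing batch so that a handful of bad signatures in a
// large batch are isolated with roughly O(bad * log n) batch calls.
fn find_invalid(
    items: &[usize], // indices into the original vote set, for illustration
    batch_verify: &impl Fn(&[usize]) -> bool,
    invalid: &mut Vec<usize>,
) {
    if items.is_empty() || batch_verify(items) {
        return; // the whole (sub-)batch verifies, nothing to report
    }
    if items.len() == 1 {
        invalid.push(items[0]); // narrowed down to a single bad signature
        return;
    }
    let (left, right) = items.split_at(items.len() / 2);
    find_invalid(left, batch_verify, invalid);
    find_invalid(right, batch_verify, invalid);
}
```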
The block author should not push any invalid votes! If they do this, we should reject the entire inherent, not single votes!
I just started to write the postmortem on this and arrived at the conclusion that we are still missing some information. I cannot explain, for example, why blocks such as this one have no dispute statements in them and still take a lot of time to be imported (as observed in the metrics and in relay chain block times). My reasoning was that dispute votes got pushed on chain hours after the last dispute concluded, but it seems that this is not true ...
The only thing I can find that could slow things down when there are no disputes happening is https://github.com/paritytech/polkadot/blob/master/runtime/parachains/src/disputes.rs#L918. This iterates over all disputes from the past 6 sessions.
Do you know when we imported the last dispute votes?
Unfortunately I can't find the block number anymore :( However, I have isolated the issue and a proposed fix is in #6937
I've gotten the same error when the node was active and selected as a paravalidator. Please advise what could have caused this. The node is running binary v0.9.39-1. Looking back at the logs, I see that the error appeared in the past as well. The machine passes all benchmarks, and I have a high-bandwidth connection. Cheers!
On 11.03.2023 a validator started to send out thousands of disputes. The disputes started to slow down the network and also led to degraded block times on Polkadot.
The reason for the disputes is probably the following issue: #6860
Another validator shared the following logs: