[Root cause] Finality lag and slow parachain block production immediately after runtime upgrade #5738
Comments
Apart from fixing #64, we could add timeouts for runtime APIs that are (slightly) lower than the subsystem stall timeout. Although this sounds brittle, it might be a bit better than letting the node hang and shut down.
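A rough sketch of that idea (the timeout value and function names are made up for illustration; this is not the actual subsystem code): wrap each runtime API future in a deadline slightly below the overseer's stall timeout, so a pathological call fails loudly instead of silently stalling the subsystem.

```rust
use std::time::Duration;

use tokio::time::timeout;

/// Assumed per-call budget, chosen slightly below a hypothetical 30s stall timeout.
const RUNTIME_API_TIMEOUT: Duration = Duration::from_secs(25);

/// Run a runtime API call with a deadline instead of waiting on it forever.
async fn call_runtime_api<T>(
    call: impl std::future::Future<Output = T>,
) -> Result<T, &'static str> {
    timeout(RUNTIME_API_TIMEOUT, call)
        .await
        .map_err(|_| "runtime API call exceeded its budget; failing instead of stalling the subsystem")
}
```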
Aside from the issues that caused this, we have a back-pressure limitation there too, right? If we have 1/3 bad backers and 1/3 bad relay chain block producers, then we'd still have 1/9th the candidates of a full chain, plus whatever we consider essential, so that's considerable throughput. Availability bitfields could be subject to back pressure too, maybe? I suppose that's a topic for another thread.
Generally the runtime is seen as trusted source code. Also, aborting is still not that easy to do with wasmtime.
In this case I estimate almost all the validators restarted at the same time; the recovery itself was slower than expected because of two things:
Thanks for the great writeup!

Q1: I didn't quite get why Kusama survived? The same migration is there as well.

Q2: Why does unmigrated data cause a runtime API to take an arbitrarily long amount of time? If the data is unmigrated, I would expect the API to still return immediately, and perhaps return
Check out the values used by the code below in the HostConfiguration snippet. On Kusama we had different values there, which made the API run faster.
Most likely it matters what the garbage is; on Kusama the API still returns, and you can test it with chopsticks on Kusama. On Polkadot it gets stuck because of the really high values @sandreim posted here: #5738 (comment).
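To make that failure mode concrete, here is a purely illustrative sketch (not the real ParachainHost code; the function and parameter names are hypothetical) of why garbage configuration values blow up the call: any loop bounded by a decoded field that ends up absurdly large effectively never finishes.

```rust
// Illustrative only: a claim-queue-like construction whose cost is proportional
// to decoded configuration values. With sane values this is trivial; with
// corrupted values in the hundreds of millions it runs (and allocates) for so
// long that the runtime API call never comes back in time.
fn build_claim_queue(num_cores: u32, lookahead: u32) -> Vec<Vec<u32>> {
    (0..num_cores)
        .map(|core| (0..lookahead).map(|depth| core.wrapping_add(depth)).collect())
        .collect()
}

fn main() {
    // Sane values: returns instantly.
    let _ok = build_claim_queue(100, 3);
    // Garbage values like the ones observed would make this loop for ages,
    // which is what the backing subsystem ends up stuck waiting on.
    // let _stuck = build_claim_queue(u32::MAX, u32::MAX);
}
```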
This issue has been mentioned on Polkadot Forum. There might be relevant details there: |
Closing this now; the root cause has been found and the follow-up items have their own issues:
What happened
Immediately after the runtime was updated to 1.3.0, which was enacted at block, finality started lagging and parachain blocks weren't being produced as usual.
The reason for that was that a significant number of validators were crashing with the following error:
After the restart the validators worked as expected, but it took around 40 minutes for finality to catch up:
Additionally, because parachain collators froze or crashed as well, some of them had to be restarted. For example, on Asset Hub the collators that are still not producing any blocks and need manual intervention can be seen here: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fdot-rpc.stakeworld.io%2Fassethub#/collators.
Root cause
All failure modes seem to be the same: the `candidate-backing` subsystem does not process signals, and after 64 blocks (the size of the subsystem signal channels) the overseer decides that it is stuck, so everything is terminated and restarted.
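For illustration, a minimal, hypothetical sketch (not the actual overseer code; the names and values are assumptions) of the kind of stall detection described above: signals are delivered over a bounded channel, and if a subsystem stops draining it the sender eventually gives up and treats the subsystem as stalled.

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::timeout;

/// Roughly one signal per imported block (assumed capacity for this sketch).
const SIGNAL_CHANNEL_CAPACITY: usize = 64;
/// How long the sender is willing to wait before declaring the subsystem stalled.
const SIGNAL_SEND_TIMEOUT: Duration = Duration::from_secs(10);

#[derive(Debug)]
enum Signal {
    BlockImported(u32),
}

/// Send a signal to a subsystem; if the bounded channel stays full past the
/// timeout, the subsystem is considered stalled and the node is torn down.
async fn send_signal(tx: &mpsc::Sender<Signal>, signal: Signal) -> Result<(), &'static str> {
    match timeout(SIGNAL_SEND_TIMEOUT, tx.send(signal)).await {
        Ok(Ok(())) => Ok(()),
        _ => Err("subsystem stalled: signal channel full, shutting down"),
    }
}
```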
The `candidate-backing` subsystem seems to be waiting on calls to the runtime-api subsystem, which are taking a very long time.

This data leads to the runtime call `fetch_claim_queue`, a new runtime call that started being issued after the runtime upgrade, because the ParachainHost API was bumped to `pub const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11`.

A probable explanation for what happened is that we were affected by the limitation in #64, combined with the fact that the new runtime API `claim_queue` uses storage data that is created by a migration (`parachains_configuration::migration::v12::MigrateToV12`) included in the same runtime. So the first time `claim_queue` got called it used garbage data that made it take a very long time, and the `candidate-backing` subsystem got stuck waiting on the API calls. That also explains why, after the restart, things recovered and worked as expected on the following blocks. An additional data point is that the subsystem was declared stalled by the overseer exactly 64 blocks after the block containing the runtime upgrade was imported.
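A minimal sketch of the version gating involved, under my own assumptions about the node-side helper (the trait and types here are illustrative stand-ins, not the real interface): the node only starts issuing the new call once the runtime advertises a sufficient ParachainHost API version, which is exactly why the first calls landed on the block that enacted the upgrade and its not-yet-migrated storage.

```rust
use std::collections::BTreeMap;

/// ParachainHost API version that first exposes `claim_queue`.
const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11;

/// Hypothetical claim-queue shape: core index -> queued para ids.
type ClaimQueue = BTreeMap<u32, Vec<u32>>;

/// Hypothetical trait standing in for the real runtime-api subsystem interface.
trait ParachainHostClient {
    async fn api_version(&self) -> u32;
    async fn claim_queue(&self) -> ClaimQueue;
}

/// Only issue `claim_queue` when the runtime says it supports it. Before the
/// upgrade this returns `None`; right after enactment it starts hitting the
/// new (and, in this incident, unmigrated) storage.
async fn fetch_claim_queue(api: &impl ParachainHostClient) -> Option<ClaimQueue> {
    if api.api_version().await >= CLAIM_QUEUE_RUNTIME_REQUIREMENT {
        Some(api.claim_queue().await)
    } else {
        None
    }
}
```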
Testing with chopsticks at block 22572435 confirmed that `claim_queue` takes a very long time to complete because of the ParachainHost state corruption.

Conclusion
The root cause is fairly well understood and has been verified, so no further impact is expected.
Action items
On Kusama the same runtime was enacted at block 24,786,390, but we got lucky/unlucky with the corruption and the `claim_queue` API returned faster, so there was no impact.