[Root cause] Finality lag and slow parachain block production immediately after runtime upgrade #5738
Comments
Apart from fixing #64, we could add timeouts for runtime APIs that are (slightly) lower than the subsystem stall timeout. Although this sounds brittle, it might be a bit better than letting the node hang and shut down.
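A rough sketch of that idea (the timeout value and function names are made up for illustration; this is not the actual subsystem code): wrap each runtime API future in a deadline slightly below the overseer's stall timeout, so a pathological call fails loudly instead of silently stalling the subsystem.

```rust
use std::time::Duration;

use tokio::time::timeout;

/// Assumed per-call budget, chosen slightly below a hypothetical 30s stall timeout.
const RUNTIME_API_TIMEOUT: Duration = Duration::from_secs(25);

/// Run a runtime API call with a deadline instead of waiting on it forever.
async fn call_runtime_api<T>(
    call: impl std::future::Future<Output = T>,
) -> Result<T, &'static str> {
    timeout(RUNTIME_API_TIMEOUT, call)
        .await
        .map_err(|_| "runtime API call exceeded its budget; failing instead of stalling the subsystem")
}
```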
Aside from the issues that caused this, we have a back-pressure limitation there too, right? If we have 1/3 bad backers and 1/3 bad relay chain block producers, then we'd still have 1/9th the candidates of a full chain, plus whatever we consider essential, so that's considerable throughput. Availability bitfields could be subject to back pressure too, maybe? I suppose that's a topic for another thread.
Generally the runtime is seen as trusted source code. Also, aborting is still not that easy to do with wasmtime.
In this case I estimate almost all the validators restarted at the same time; the recovery itself was slower than expected because of two things:
Thanks for the great writeup!

Q1: I didn't quite get why Kusama survived? The same migration is there as well.

Q2: Why does unmigrated data cause a runtime API to take an arbitrarily long amount of time? If the data is unmigrated, I would expect the API to still return immediately, and perhaps return
Check out the values used by the code below in the HostConfiguration snippet. On Kusama we had different values there, which made the API run faster.
Most likely it matters what the garbage is; on Kusama the API still returns, and you can test it with chopsticks on Kusama. On Polkadot it gets stuck because of the really high values @sandreim posted here: #5738 (comment).
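To make that failure mode concrete, here is a purely illustrative sketch (not the real ParachainHost code; the function and parameter names are hypothetical) of why garbage configuration values blow up the call: any loop bounded by a decoded field that ends up absurdly large effectively never finishes.

```rust
// Illustrative only: a claim-queue-like construction whose cost is proportional
// to decoded configuration values. With sane values this is trivial; with
// corrupted values in the hundreds of millions it runs (and allocates) for so
// long that the runtime API call never comes back in time.
fn build_claim_queue(num_cores: u32, lookahead: u32) -> Vec<Vec<u32>> {
    (0..num_cores)
        .map(|core| (0..lookahead).map(|depth| core.wrapping_add(depth)).collect())
        .collect()
}

fn main() {
    // Sane values: returns instantly.
    let _ok = build_claim_queue(100, 3);
    // Garbage values like the ones observed would make this loop for ages,
    // which is what the backing subsystem ends up stuck waiting on.
    // let _stuck = build_claim_queue(u32::MAX, u32::MAX);
}
```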
This issue has been mentioned on Polkadot Forum. There might be relevant details there: |
Closing this now; the root cause has been found and the follow-up items have their own issues:
What happened
Immediately after the runtime was updated to 1.3.0, which was enacted at block, finality started lagging and parachain blocks weren't being produced as usual.
The reason for that was that a significant number of validators were crashing with the following error:
After the restart the validators worked as expected, but it took around 40 minutes for finality to catch up:
Additionally, because parachain collators froze or crashed as well, some of them had to be restarted. For example, on Asset Hub the collators that are still not producing any blocks and need manual intervention can be seen here: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fdot-rpc.stakeworld.io%2Fassethub#/collators.
Root cause
All failure modes seem to be the same: the `candidate-backing` subsystem does not process signals, and after 64 blocks (the size of the subsystem signal channels) the overseer decides that it is stuck, so everything is terminated and restarted.
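For illustration, a minimal, hypothetical sketch (not the actual overseer code; the names and values are assumptions) of the kind of stall detection described above: signals are delivered over a bounded channel, and if a subsystem stops draining it the sender eventually gives up and treats the subsystem as stalled.

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::timeout;

/// Roughly one signal per imported block (assumed capacity for this sketch).
const SIGNAL_CHANNEL_CAPACITY: usize = 64;
/// How long the sender is willing to wait before declaring the subsystem stalled.
const SIGNAL_SEND_TIMEOUT: Duration = Duration::from_secs(10);

#[derive(Debug)]
enum Signal {
    BlockImported(u32),
}

/// Send a signal to a subsystem; if the bounded channel stays full past the
/// timeout, the subsystem is considered stalled and the node is torn down.
async fn send_signal(tx: &mpsc::Sender<Signal>, signal: Signal) -> Result<(), &'static str> {
    match timeout(SIGNAL_SEND_TIMEOUT, tx.send(signal)).await {
        Ok(Ok(())) => Ok(()),
        _ => Err("subsystem stalled: signal channel full, shutting down"),
    }
}
```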
The `candidate-backing` subsystem seems to be waiting on calls to the runtime-api subsystem, which are taking a very long time.

This data leads to the runtime call `fetch_claim_queue`, a new runtime call that started being issued after the runtime upgrade, because the ParachainHost API was bumped to `pub const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11`.

A probable explanation for what happened is that we were affected by the limitation in #64, combined with the fact that the new runtime API `claim_queue` uses storage data that is created by a migration (`parachains_configuration::migration::v12::MigrateToV12`) included in the same runtime. So the first time `claim_queue` got called it used garbage data that made it take a very long time, and the `candidate-backing` subsystem got stuck waiting on the API calls. That also explains why, after the restart, things recovered and worked as expected on the following blocks. An additional data point is that the subsystem was declared stalled by the overseer exactly 64 blocks after the block containing the runtime upgrade was imported.
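A minimal sketch of the version gating involved, under my own assumptions about the node-side helper (the trait and types here are illustrative stand-ins, not the real interface): the node only starts issuing the new call once the runtime advertises a sufficient ParachainHost API version, which is exactly why the first calls landed on the block that enacted the upgrade and its not-yet-migrated storage.

```rust
use std::collections::BTreeMap;

/// ParachainHost API version that first exposes `claim_queue`.
const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11;

/// Hypothetical claim-queue shape: core index -> queued para ids.
type ClaimQueue = BTreeMap<u32, Vec<u32>>;

/// Hypothetical trait standing in for the real runtime-api subsystem interface.
trait ParachainHostClient {
    async fn api_version(&self) -> u32;
    async fn claim_queue(&self) -> ClaimQueue;
}

/// Only issue `claim_queue` when the runtime says it supports it. Before the
/// upgrade this returns `None`; right after enactment it starts hitting the
/// new (and, in this incident, unmigrated) storage.
async fn fetch_claim_queue(api: &impl ParachainHostClient) -> Option<ClaimQueue> {
    if api.api_version().await >= CLAIM_QUEUE_RUNTIME_REQUIREMENT {
        Some(api.claim_queue().await)
    } else {
        None
    }
}
```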
Testing with chopsticks at block 22572435 confirmed that `claim_queue` takes a very long time to complete because of the ParachainHost state corruption.

Conclusion
The root cause is fairly well understood and has been verified, so no further impact is expected.
Action items
On Kusama the same runtime was enacted at block 24,786,390, but we got lucky/unlucky with the corruption and the `claim_queue` API returned faster, so there was no impact.