This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Polkadot Incident 11.03.2023 #6862

Closed

bkchr opened this issue Mar 12, 2023 · 42 comments · Fixed by #6937
Labels
I3-bug Fails to follow expected behavior.

Comments

@bkchr
Member

bkchr commented Mar 12, 2023

On 11.03.2023 a validator started to send out thousands of disputes. The disputes started to slow down the network and also led to degraded block times on Polkadot.

The reason for the disputes is probably the following issue: #6860

Another validator shared the following logs:

Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 Encountered error finding best chain containing 0x7633f17db42d3e860654da761b237216609ed27357a0441ba3b6fc30687d3331: couldn't find target block: Request for leaves from chain selection got canceled
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 Could not cast prevote: previously known block 0x7633f17db42d3e860654da761b237216609ed27357a0441ba3b6fc30687d3331 has disappeared
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 Essential task `overseer` failed. Shutting down service.
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="pvf-checker-subsystem" err=FromOrigin { origin: "pvf-checker", source: Generated(Context("Signal channel is terminated and empty.")) }
Mar 12 12:54:35 localhost polkadot[54572]: 2023-03-12 12:54:35 subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Mar 12 12:55:35 localhost polkadot[54572]: 2023-03-12 12:55:35 Detected running(potentially stalled) tasks on shutdown:
Mar 12 12:55:35 localhost polkadot[54572]: Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Mar 12 12:55:35 localhost polkadot[54572]: 2023-03-12 12:55:35 Task "provisioner-subsystem" (Group: provisioner) was still running after waiting 60 seconds to finish.
Mar 12 12:55:35 localhost polkadot[54572]: 2023-03-12 12:55:35 Task "availability-store-subsystem" (Group: availability-store) was still running after waiting 60 seconds to finish.
Mar 12 12:55:35 localhost polkadot[54572]: 2023-03-12 12:55:35 Task "candidate-backing-subsystem" (Group: candidate-backing) was still running after >
Mar 12 12:55:35 localhost polkadot[54572]: Error:
Mar 12 12:55:35 localhost polkadot[54572]:    0: Other: Essential task failed.
Mar 12 12:55:35 localhost polkadot[54572]: Run with RUST_BACKTRACE=full to include source snippets.
@bkchr bkchr added the I3-bug Fails to follow expected behavior. label Mar 12, 2023
@bkchr
Member Author

bkchr commented Mar 12, 2023

Maybe because the dispute coordinator is overloaded, selecting the best block to build on is either too slow or doesn't return at all. This isn't confirmed by the log messages above, but given the amount of disputes it could be a reasonable explanation.

@bkchr
Member Author

bkchr commented Mar 12, 2023

Validator that spent 4 seconds trying to build a block: https://gist.github.com/bLd75/da95a8c51716fa4e23807fc35fb9f68d

The block was discarded in the end. Maybe the inherent took too long to process?

@bkchr
Member Author

bkchr commented Mar 12, 2023

https://gist.github.com/matherceg/42f4c6c32a5fdaa9513801155a9e93d9

Availability store subsystem is constantly stalling.

@burdges
Contributor

burdges commented Mar 12, 2023

We've witnessed availability dying during past events? We've never seen this many disputes before? Maybe we're doing too many things in parallel, like running too many availability recoveries.

Adversaries could race against approval checker assignment announcements, so those should progress reliably and fast, but other subsystems could typically delay their work in one way or another.

In particular, we should think about sequencing for disputes, for example: if two parachain blocks are disputed in the same relay chain block, then we could do them semi-sequentially, which slows finality much more but leaves CPU time for other activities. We could similarly delay later disputes until earlier ones get resolved. We would not afaik need consensus on the ordering because we'll have to do them all anyways. Also, if we've a random sampling of disputes then we could favor those with more invalid votes first, because one invalid vote triggers rewinding the chain. We'll avoid no-shows if validators can delay heavy work somewhat, which matters since we do not have consensus upon the delay.

We've discussed adding delay estimates into approval checker assignment announcements, but never really analyzed how this could be abused.

We should probably add some fake no-show system to a testnet, and eventually kusama for a while, which'll provide a different sort of stress testing than say glutton pallet parachains.
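
A minimal sketch of the dispute-prioritization idea above (types are hypothetical, not from the codebase): handle disputes carrying invalid votes first, since a single invalid vote forces reverting the chain, and work through the rest semi-sequentially to leave CPU time for other subsystems.

// Hypothetical type; illustrative only, not the actual dispute-coordinator code.
struct PendingDispute {
    invalid_votes: usize,
    // candidate hash, session index, votes, ...
}

// Order disputes so the ones most likely to revert the chain are handled first.
fn order_disputes(mut disputes: Vec<PendingDispute>) -> Vec<PendingDispute> {
    disputes.sort_by(|a, b| b.invalid_votes.cmp(&a.invalid_votes));
    disputes
}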

@bkchr
Member Author

bkchr commented Mar 12, 2023

We've witnessed availability dying during past events? We've never seen this many disputes before?

Disputes are not related to availability dying. Disputes were probably related to the bug mentioned in the original posting.

@joepetrowski
Contributor

Logs from the last 24 hours: 2023-03-13-last-24-hours-JOE-V01.log

@bkchr
Member Author

bkchr commented Mar 13, 2023

Logs from the last 24 hours: 2023-03-13-last-24-hours-JOE-V01.log

Did you remove any log lines? I see that the node is taking too much time to produce a block, but it also doesn't print the "Pre-sealed" message or that it took too long.

This is also interesting:

Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))
Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))
Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))
Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))

An honest node should not trigger this?

@ordian
Member

ordian commented Mar 13, 2023

Something to investigate is that block time degradation continued long after the last dispute was resolved according to subscan: https://polkadot.subscan.io/event?address=&module=parasdisputes&event=all&startDate=&endDate=&startBlock=&endBlock=&timeType=date&version=9370.

@SimonKraus

@ordian Yes, we've seen slow but steady block time recovery on the parachains too.
A good visualization of the curve can be found in the HydraDX Collator dashboard: https://grafana.play.hydration.cloud/d/IL4WH9kVa/collators?orgId=2&from=1678402800000&to=1678748399000

@sandreim
Contributor

sandreim commented Mar 13, 2023

Logs from the last 24 hours: 2023-03-13-last-24-hours-JOE-V01.log

Do we also have Prometheus metrics for this node?

@sandreim
Contributor

Block import times from our Kusama validators during the dispute period. There are several outliers > 2s, but it might look worse on other nodes depending on hardware.

Screenshot 2023-03-13 at 13 25 50

@sandreim
Contributor

sandreim commented Mar 14, 2023

Reproduced the issue on Versi and then used block benchmarking to take a closer look. To me this seems to be the problem: overweight blocks due to dispute statements. We should profile our runtime code and/or fix our benchmarks to ensure this doesn't happen.

Running `target/production/substrate benchmark block --repeat=3 --from 8405 --to 8600 --db rocksdb --base-path /home/sandreim/versi_db/20230314-095606 --dev --execution=Wasm`

2023-03-14 12:20:00 Block 8416 with     2 tx used 161.77% of its weight ( 3,415,277,557 of  2,111,179,062 ns) - OVER WEIGHT!    
2023-03-14 12:20:01 Block 8417 with     5 tx used  79.75% of its weight (   449,651,252 of    563,853,838 ns)    
2023-03-14 12:20:03 Block 8418 with     5 tx used  66.37% of its weight (   450,878,860 of    679,292,314 ns)    
2023-03-14 12:20:04 Block 8419 with     7 tx used 276.12% of its weight (   325,270,032 of    117,798,863 ns) - OVER WEIGHT!    
2023-03-14 12:20:05 Block 8420 with     6 tx used  74.09% of its weight (   471,057,274 of    635,749,851 ns)    
2023-03-14 12:20:06 Block 8421 with     4 tx used  66.77% of its weight (   483,358,720 of    723,962,236 ns)    
2023-03-14 12:20:17 Block 8422 with     2 tx used 159.79% of its weight ( 3,380,703,333 of  2,115,706,542 ns) - OVER WEIGHT!    
2023-03-14 12:20:27 Block 8423 with     2 tx used 163.95% of its weight ( 3,474,325,357 of  2,119,183,882 ns) - OVER WEIGHT!    
2023-03-14 12:20:28 Block 8424 with    25 tx used  85.25% of its weight (   462,118,567 of    542,050,010 ns)    
2023-03-14 12:20:30 Block 8425 with     6 tx used 123.41% of its weight (   448,888,929 of    363,741,418 ns) - OVER WEIGHT!    
2023-03-14 12:20:31 Block 8426 with    11 tx used  71.55% of its weight (   484,063,270 of    676,579,972 ns)    
2023-03-14 12:20:32 Block 8427 with    13 tx used  84.62% of its weight (   441,319,387 of    521,532,364 ns)    
2023-03-14 12:20:43 Block 8428 with     2 tx used 165.23% of its weight ( 3,509,763,887 of  2,124,232,762 ns) - OVER WEIGHT!    
2023-03-14 12:20:54 Block 8429 with     2 tx used 173.14% of its weight ( 3,681,854,294 of  2,126,555,888 ns) - OVER WEIGHT!    
2023-03-14 12:20:56 Block 8430 with    21 tx used  82.18% of its weight (   536,352,058 of    652,686,508 ns)    
2023-03-14 12:20:57 Block 8431 with    15 tx used  73.91% of its weight (   571,469,565 of    773,157,660 ns)    
2023-03-14 12:21:08 Block 8432 with     2 tx used 162.42% of its weight ( 3,460,068,835 of  2,130,279,730 ns) - OVER WEIGHT!    
2023-03-14 12:21:09 Block 8433 with    32 tx used 273.82% of its weight (   404,748,840 of    147,816,954 ns) - OVER WEIGHT!    
2023-03-14 12:21:10 Block 8434 with     9 tx used  77.72% of its weight (   498,307,455 of    641,173,068 ns)    
2023-03-14 12:21:12 Block 8435 with     6 tx used  69.90% of its weight (   532,214,922 of    761,418,291 ns)    
2023-03-14 12:21:22 Block 8436 with     2 tx used 156.85% of its weight ( 3,337,463,364 of  2,127,844,858 ns) - OVER WEIGHT!    
2023-03-14 12:21:24 Block 8437 with    17 tx used  81.47% of its weight (   534,775,081 of    656,401,253 ns)    
2023-03-14 12:21:25 Block 8438 with    25 tx used  72.32% of its weight (   568,230,380 of    785,700,333 ns)    
2023-03-14 12:21:27 Block 8439 with     2 tx used  79.35% of its weight (   510,105,347 of    642,881,458 ns)    
2023-03-14 12:21:29 Block 8440 with     2 tx used  68.72% of its weight (   527,634,935 of    767,847,786 ns)    
2023-03-14 12:21:30 Block 8441 with     2 tx used  80.53% of its weight (   516,181,041 of    640,976,358 ns)    
2023-03-14 12:21:32 Block 8442 with     2 tx used  68.51% of its weight (   527,627,197 of    770,099,902 ns)    
2023-03-14 12:21:33 Block 8443 with     3 tx used  81.40% of its weight (   526,896,182 of    647,292,310 ns)    
2023-03-14 12:21:44 Block 8444 with     2 tx used 164.02% of its weight ( 3,509,827,564 of  2,139,837,572 ns) - OVER WEIGHT!    
2023-03-14 12:21:45 Block 8445 with     2 tx used  79.00% of its weight (   564,723,849 of    714,869,606 ns)    
2023-03-14 12:21:47 Block 8446 with     2 tx used  76.17% of its weight (   503,585,434 of    661,157,254 ns)    
2023-03-14 12:21:49 Block 8447 with     2 tx used  71.57% of its weight (   515,453,873 of    720,211,002 ns)    
2023-03-14 12:21:50 Block 8448 with     2 tx used  73.81% of its weight (   497,716,461 of    674,305,270 ns)    
2023-03-14 12:21:52 Block 8449 with     3 tx used  70.00% of its weight (   516,546,002 of    737,907,582 ns)    
2023-03-14 12:21:53 Block 8450 with     3 tx used 263.76% of its weight (   383,105,080 of    145,250,257 ns) - OVER WEIGHT!    
2023-03-14 12:21:54 Block 8451 with     2 tx used  20.75% of its weight (   444,972,686 of  2,144,671,406 ns)    
2023-03-14 12:21:56 Block 8452 with     2 tx used  72.12% of its weight (   487,177,171 of    675,466,606 ns)    
2023-03-14 12:21:57 Block 8453 with     2 tx used  70.44% of its weight (   475,143,470 of    674,513,314 ns)    
2023-03-14 12:22:06 Block 8454 with     2 tx used 148.63% of its weight ( 3,168,121,933 of  2,131,563,804 ns) - OVER WEIGHT!    
2023-03-14 12:22:08 Block 8455 with     2 tx used  69.64% of its weight (   494,683,203 of    710,321,808 ns)    

Edit: bench machine CPU:

vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7601 32-Core Processor
stepping        : 2
microcode       : 0x8001230
cpu MHz         : 1200.000
cache size      : 512 KB

@ggwpez
Member

ggwpez commented Mar 14, 2023

We should profile our runtime code and/or fix our benchmarks to ensure this doesn't happen.

Related paritytech/polkadot-sdk#849
Looks like that benchmark is not so great in either direction; normally over-estimating and in the worst case under-estimating.

@bkchr
Member Author

bkchr commented Mar 14, 2023

@koute we will need your help here :D @sandreim please make the db available.

@koute
Contributor

koute commented Mar 14, 2023

@koute we will need your help here :D @sandreim please make the db available.

I'm on it!

@sandreim
Contributor

@koute we will need your help here :D @sandreim please make the db available.

gsutil -m cp -r "gs://parity-versi-blockstore-backups/versi-rpc-node-1/20230314-095606" .

@koute
Contributor

koute commented Mar 20, 2023

A quick update.

Here's the problematic backtrace from @sandreim's reproduction:

Core_execute_block
...
polkadot_runtime_parachains::paras_inherent::<impl polkadot_runtime_parachains::paras_inherent::pallet::Pallet<T>>::enter_inner
polkadot_runtime_parachains::paras_inherent::limit_and_sanitize_disputes
<polkadot_runtime_parachains::disputes::pallet::Pallet<T> as polkadot_runtime_parachains::disputes::DisputesHandler<<T as frame_system::pallet::Config>::BlockNumber>>::filter_dispute_data
polkadot_runtime_parachains::disputes::check_signature
sp_io::crypto::extern_host_function_impls::sr25519_verify
...
sp_io::crypto::sr25519_verify_version_2

This takes up 88% of the time. 76% of the time is spent inside schnorrkel::sign::<impl schnorrkel::keys::PublicKey>::verify_simple. 67% of the time is spent inside curve25519_dalek::ristretto::RistrettoPoint::vartime_double_scalar_mul_basepoint.

So some possible actions we could take to make this better:

  1. Optimize curve25519_dalek. Not entirely sure how well it is optimized right now, but any improvements there would have a big impact considering the code spends majority of its time there.
  2. We're verifying the signatures of the disputes one by one; it seems to me that we could just switch to the batch verifier here? (See the sketch after this list.)
  3. We're letting the disputes just consume the whole block; perhaps that's not the best idea when there are so many disputes?
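
A hedged sketch of what point 2 could look like, assuming the sp_io batch-verification host functions (start_batch_verify / sr25519_batch_verify / finish_batch_verify); as bkchr notes further down, batch verification is currently not enabled for the runtime, and the vote tuple shape below is made up for illustration:

use sp_core::sr25519;

// (validator key, signature, signing payload) - hypothetical shape, for illustration only.
type Vote = (sr25519::Public, sr25519::Signature, Vec<u8>);

fn batch_check_dispute_signatures(votes: &[Vote]) -> bool {
    sp_io::crypto::start_batch_verify();
    for (pub_key, signature, payload) in votes {
        // Queues the check instead of verifying it immediately.
        let _ = sp_io::crypto::sr25519_batch_verify(signature, payload, pub_key);
    }
    // Only reports whether *all* queued signatures verified - see the API discussion below.
    sp_io::crypto::finish_batch_verify()
}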

@TomaszWaszczyk

TomaszWaszczyk commented Mar 20, 2023

@sandreim could you please say how you measure execution time? How can I get those numbers on a local machine? Is it possible at all?

This takes up 88% of the time. 76% of the time is spent inside schnorrkel::sign::<impl schnorrkel::keys::PublicKey>::verify_simple. 67% of the time is spent inside curve25519_dalek::ristretto::RistrettoPoint::vartime_double_scalar_mul_basepoint.

@ordian
Member

ordian commented Mar 20, 2023

  2. We're verifying the signatures of the disputes one by one; it seems to me that we could just switch to the batch verifier here?

Makes sense to me. If one signature is invalid, we could discard the whole dispute set to avoid duplicate checks. Currently, we're just discarding one vote:

if let Err(()) = check_signature(
    &validator_public,
    set.candidate_hash,
    set.session,
    statement,
    signature,
) {
    importer.undo(undo);
    filter.remove_index(i);
    continue
}

Otherwise, we'd have to fall back to individual signature checks in case the batch verifier fails, IIUC.

  3. We're letting the disputes just consume the whole block; perhaps that's not the best idea when there are so many disputes?

There shouldn't be any disputes in the first place. They are here due to either some bugs or slow hardware. This would also be fixed with slashing and validator disabling. However, it does make sense to allocate not the whole block weight, but maybe half of that at most to disputes. @eskimor WDYT?
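
A rough sketch of that idea (names are illustrative, not the actual runtime code): derive a dispute weight cap as a fraction of the maximum block weight and stop including dispute statements once it is reached.

use frame_support::weights::Weight;

// Cap dispute statements at ~50% of the block, leaving room for backed candidates,
// bitfields and other inherents. `max_block` would come from frame_system's
// configured BlockWeights.
fn dispute_weight_cap(max_block: Weight) -> Weight {
    Weight::from_parts(max_block.ref_time() / 2, max_block.proof_size() / 2)
}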

@ordian
Member

ordian commented Mar 20, 2023

  1. Optimize curve25519_dalek. Not entirely sure how well it is optimized right now, but any improvements there would have a big impact considering the code spends majority of its time there.

I wonder if the weight of a single signature verification check of 48 microsecs varies greatly under load:

.saturating_add(Weight::from_parts(48_796_195, 0).saturating_mul(v.into()))
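
A back-of-the-envelope check based on that figure, assuming ref_time is measured in picoseconds (so 48_796_195 ≈ 48.8 µs per vote) and that signature checks dominate the dispute inherent:

// ~40,000 dispute votes account for roughly 2 seconds of weight, the same ballpark as
// the ~2.1 s dispute-only blocks in the benchmark output above.
const PER_VOTE_REF_TIME_PS: u64 = 48_796_195; // ≈ 48.8 µs
const VOTES: u64 = 40_000;
const TOTAL_MS: u64 = PER_VOTE_REF_TIME_PS * VOTES / 1_000_000_000; // ≈ 1_951 ms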

@sandreim
Contributor

Otherwise, we'd have to fall back to individual signature checks in case the batch verifier fails, IIUC.

  3. We're letting the disputes just consume the whole block; perhaps that's not the best idea when there are so many disputes?

There shouldn't be any disputes in the first place. They are here due to either some bugs or slow hardware. This would also be fixed with slashing and validator disabling. However, it does make sense to allocate not the whole block weight, but maybe half of that at most to disputes. @eskimor WDYT?

This makes sense to me. AFAIU the only downside would be additional finality lag. We need to continue load testing on Versi to figure out how much we can fill the block with votes. 50 to 75% sounds like a good interval to me.

@sandreim
Contributor

  1. Optimize curve25519_dalek. Not entirely sure how well it is optimized right now, but any improvements there would have a big impact considering the code spends majority of its time there.

I wonder if the weight of a single signature verification check of 48 microsecs varies greatly under load:

.saturating_add(Weight::from_parts(48_796_195, 0).saturating_mul(v.into()))

That would likely be the case if the host is loaded and the thread running the runtime is not getting a full CPU core. We have some support for runtime metrics (requires a compile-time feature). I think we can create a histogram of the time spent there to properly visualise the variance.

@koute
Contributor

koute commented Mar 20, 2023

Otherwise, we'd have to fall back to individual signature checks in case the batch verifier fails, IIUC.

Hmm.... this is somewhat of an unfortunate API omission that when batch verifying we can only fetch an aggregate of "did all of the checks succeed?" instead of being able to tell which exact signatures are okay and which ones are not. Maybe it'd make sense to add an alternative host function to finish_batch_verify that would return e.g. a Result<(), Vec<bool>> indicating that either all succeeded (the Ok case) or that not all succeeded (the Err case)? (cc @bkchr)
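
A sketch of the proposed API shape (hypothetical; no such host function exists today) and how a caller might use the per-item results:

// Hypothetical alternative to `finish_batch_verify`: report per-signature results
// instead of a single aggregate bool.
fn finish_batch_verify_detailed() -> Result<(), Vec<bool>> {
    // Ok(()) => every queued signature verified.
    // Err(results) => at least one failed; results[i] says whether item i was valid.
    unimplemented!("illustrative only")
}

fn handle_batch_result() {
    match finish_batch_verify_detailed() {
        Ok(()) => { /* import all dispute votes */ }
        Err(per_item) => {
            for (i, ok) in per_item.iter().copied().enumerate() {
                if !ok {
                    // Drop vote `i` (or, as suggested below, reject the whole set / inherent).
                    let _ = i;
                }
            }
        }
    }
}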

@bkchr
Member Author

bkchr commented Mar 20, 2023

We don't have batch verification currently available in the runtime. There was some work, but it is currently disabled. We could support batch verification, but only for the relay chain itself. For Parachains we had some discussion here: https://forum.polkadot.network/t/background-signature-verification/132.

This makes sense to me. AFAIU the only downside would be additional finality lag.

I think some finality lag, with the assumption that we will have slashing, makes sense. Nevertheless, we need to ensure that this doesn't happen again. Even if validators are slashed etc., bugs can still happen and we may see a huge amount of disputes again.

@sandreim
Contributor

To recap, we currently have identified 2 issues that occurred during the incident:

  1. high block import time - the root cause seems to be that, in the worst case (when there are many disputes happening), we fill the blocks with only disputes.
  2. dispute votes get pushed on chain hours after the last dispute concluded. This happens because we always put on chain the votes for at most 1000 disputes, which explains why we keep seeing high block import times even after disputes end. This is a known issue which was fixed in Improved dispute votes import in provisioner #5567, but it is not yet enabled on Kusama/Polkadot as it depends on the Parachain Host version 3 API.

The second issue can be seen in this screenshot. The top panel shows disputes being raised and voted on. The bottom one shows votes being passed to create_inherent long after disputes ended.

Screenshot 2023-03-20 at 11 20 19

That being said, we need to prioritize releasing the version 3 parachain host API to have issue 2 fixed soon, but to solve issue 1 we need to optimize and/or reduce the amount of dispute votes we include.

@tdimitrov is investigating the variance for signature checking during high load dispute tests on Versi (using runtime-metrics feature). That information should be useful to determine the percentage of the block we fill with votes.

@bkchr
Member Author

bkchr commented Mar 20, 2023

Sounds like we have explanations for everything we have seen, @sandreim? So, we can start writing a postmortem. Would you be open to tackling this?

@sandreim
Contributor

Yeah, I think we can explain everything that has been seen except the err=Generated(SubsystemStalled("availability-store-subsystem")) in the validator logs you posted initially. I did not encounter this one in any of the Versi testing.

Regarding the postmortem: yeah, I could start a writeup as soon as we have some more data about the variance of the signature checks.

@koute
Contributor

koute commented Mar 20, 2023

Also, one more thing regarding speeding this up - looks like there's an easy performance win we could have here by enabling SIMD in the curve25519_dalek crate.

Before:

Block 8429 with     2 tx used  98.15% of its weight ( 2,087,189,806 of  2,126,555,888 ns)    

After:

Block 8429 with     2 tx used  72.41% of its weight ( 1,539,741,678 of  2,126,555,888 ns)    

That's almost a free ~30% speedup right there.

There are two problems with this though:

  1. This requires AVX2. Officially this should not be a problem because our reference hardware to run a node is a 7700K from 2016, and AVX2 has been present in Intel CPUs since the 4xxx series from 2013 (and on AMD it's available since 2015). So are we fine with not being able to run on CPUs which are at this point ~10 years old?
  2. This currently requires Rust nightly as curve25519_dalek uses the packed-simd crate. But this could be rectified by e.g. removing the dependency on the packed-simd crate from curve25519_dalek to make it work on stable, and then using that version in schnorrkel.

@sandreim
Contributor

Thanks @koute. Point 2 might be workable, but point 1 is something that falls outside of our control. @wpank do we have any data to support raising the requirements?

@ordian
Member

ordian commented Mar 20, 2023

Can we do dynamic feature detection with an ifunc trick?

@koute
Contributor

koute commented Mar 20, 2023

but point 1 is something that falls outside of our control.

For what it's worth, no validator should be running on such old hardware, and if they do then they'll be too slow anyway.

Another option would be to just detect this dynamically at runtime and only use the AVX2 codepath when it's supported I guess. (This means even more changes to curve25519_dalek, which is fine I suppose as long as upstream will accept it.)
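
A minimal sketch of that runtime-detection approach, assuming the change lives inside curve25519_dalek/schnorrkel rather than in polkadot itself (std's is_x86_feature_detected! does the probing):

#[cfg(target_arch = "x86_64")]
fn pick_verification_backend() -> &'static str {
    if is_x86_feature_detected!("avx2") {
        "avx2" // dispatch into the SIMD code path
    } else {
        "serial" // portable fallback for pre-AVX2 CPUs (e.g. old non-validator nodes)
    }
}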

@bkchr
Member Author

bkchr commented Mar 20, 2023

We should not forget that there are also other nodes in the network, not just validators. So, having it dynamic would be a requirement. I hope no one runs a validator on these machines, but an ordinary node should be fine!

@koute
Contributor

koute commented Mar 20, 2023

I'll see what I can do and try to make the curve25519_dalek's SIMD support work on stable Rust and autodetect the features at runtime.

@burdges
Contributor

burdges commented Mar 21, 2023

We're not already using AVX2? Yikes. curve25519_dalek now has new maintainers, so they'll at least respond.

when batch verifying we can only fetch an aggregate of "did all of the checks succeed?" instead of being able to tell which exact signatures are okay and which ones are not.

This is what cryptographers mean by batch verification. We merge the verification equations, so they all succeed or fail together.

It's quite useful if you've many nodes verifying blocks, but annoying if you're processing individual messages which may contain spam, hence why afaik web2 never deployed it. In principle, you could bisect batches until you find the baddies, but only if you've some node reputation system, or batch by source, or similar.
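
A self-contained sketch of the bisection idea with stand-in types (`batch_ok` represents a batch-verification call over a set of items): each level halves the search space, so a few bad items cost on the order of bad * log n batch checks instead of n individual ones.

fn find_invalid(items: &[u32], batch_ok: &dyn Fn(&[u32]) -> bool, bad: &mut Vec<u32>) {
    if items.is_empty() || batch_ok(items) {
        return; // the whole (sub-)batch verifies, nothing to report
    }
    if let [single] = items {
        bad.push(*single); // a failing batch of one is a culprit
        return;
    }
    let (left, right) = items.split_at(items.len() / 2);
    find_invalid(left, batch_ok, bad);
    find_invalid(right, batch_ok, bad);
}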

@bkchr
Member Author

bkchr commented Mar 21, 2023

Makes sense to me. If one signature is invalid, we could discard the whole dispute set to avoid duplicate checks. Currently, we're just discarding one vote:

The block author should not push any invalid votes! If they do this, we should reject the entire inherent, not single votes!
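
A simplified sketch of that policy (all names hypothetical): one invalid signature invalidates the whole inherent instead of being filtered out vote by vote.

// Stand-ins for the real primitives type and the per-vote checks quoted earlier.
struct DisputeStatementSet;
fn all_signatures_valid(_set: &DisputeStatementSet) -> bool { true }

fn check_dispute_inherent(sets: &[DisputeStatementSet]) -> Result<(), &'static str> {
    if sets.iter().any(|set| !all_signatures_valid(set)) {
        // An honest block author never includes invalid votes, so reject everything.
        return Err("invalid dispute signature: rejecting the entire inherent");
    }
    Ok(())
}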

@sandreim
Contributor

I just started to write the postmortem on this and I arrived at the conclusion that we are still missing some information. I cannot explain, for example, why blocks such as this one have no dispute statements in them and still take a lot of time to be imported (as observed in the metrics and in the relay chain block times).

My reasoning here was that dispute votes get pushed on chain hours after the last dispute concluded. But it seems that this is not true ...

@sandreim
Contributor

sandreim commented Mar 21, 2023

The only thing I can find that could slow things down when there are no disputes happening is https://github.com/paritytech/polkadot/blob/master/runtime/parachains/src/disputes.rs#L918. This iterates all disputes from the past 6 sessions.

@bkchr
Member Author

bkchr commented Mar 22, 2023

I cannot explain, for example, why blocks such as this one have no dispute statements in them and still take a lot of time to be imported (as observed in the metrics and in the relay chain block times).

Do you know when we imported the last dispute votes?

@sandreim
Contributor

sandreim commented Mar 22, 2023

Do you know when we imported the last dispute votes?

Unfortunately I cannot find the block number anymore :(

However, I have isolated the issue and a proposed fix is in #6937.

@alam807

alam807 commented Apr 30, 2023

Logs from the last 24 hours: 2023-03-13-last-24-hours-JOE-V01.log

Did you remove any log lines? I see that the node is taking too much time to produce a block, but it also doesn't print the "Pre-sealed" message or that it took too long.

This is also interesting:

Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))
Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))
Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))
Mar 12 08:29:27 pantani-joe-v01-nyc3 polkadot[913301]: 2023-03-12 08:29:27 error=AuthorityFlooding(Public(9e193b2d7bd2a55e7e715c64e67d0be4b76086d9cc31c11ec35caeb68634eb55 (14aJ4YDf...)))

An honest node should not trigger this?

I've gotten the same error when the node was active and selected as a paravalidator. Please advise what could have caused this. The node is running binary v0.9.39-1. Looking back, I see that the error appeared in the logs in the past as well. The machine passes all benchmarks, and I have a high-bandwidth connection. Cheers!

@mrcnski
Contributor

mrcnski commented May 1, 2023

@alam807 That seems to be expected since 0.9.39-1 is from Mar 9, which I think is before this was fixed?
