Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finalization stalling issues 2024-07 #1104

Closed
MOZGIII opened this issue Jul 26, 2024 · 33 comments
Closed

Finalization stalling issues 2024-07 #1104

MOZGIII opened this issue Jul 26, 2024 · 33 comments
Assignees
Labels
tracking A tracking issue for a big multi-PR initiative

Comments

@MOZGIII
Copy link
Contributor

MOZGIII commented Jul 26, 2024

This is a tracking issue for the recurring finalization stalling incidents.

So far, here are the pointers we have.

Commits that are affecting the finalization directly:

Notable historical commits that might've affected this incident:

Probably irrelevant:

Related things from the future:


The thoughts we have so far is that the issue is caused by the broken block pruning (which is meaningful when an unfortunate best block selection happens). (UPD: the blocks are not really pruned, we can still request them from the API; this means this is a selection issue) Grandpa tries to finalize the block that is pruned and not the part of the best chain anymore; upon receiving the precommits it fails to recognize them because the block can't be resolved (it's pruned). This is quite odd, since the nodes operate with --blocks-pruning archive, so it must mean they haven't seen the block in question as is it no never to be pruned if seen - but in reality we see a the reorg happening from the block in question to a new one, so the old block is definitely a known one.


We are currently on substrate 0.9.40.


Mainnet encountered this at least three times lately:

image
@MOZGIII MOZGIII added the tracking A tracking issue for a big multi-PR initiative label Jul 26, 2024
@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

2024-07-26 02:31:13.467  WARN tokio-runtime-worker grandpa: Encountered error finding best chain containing 0xab3e065e87c2c2e3ac34cb07a0e05c1fa7dd7df2d782fee66a8c284bbd430fd0: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0xab3e065e87c2c2e3ac34cb07a0e05c1fa7dd7df2d782fee66a8c284bbd430fd0 not in the best chain 0x325ab807d115074518d0c5bd764de875b0945d0166d0937753968f98b4c2805d    

Explore-logs-2024-07-26 10_41_41.txt
Explore-logs-2024-07-26 10_53_12.txt
The logs from the node to hint what's going on

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

Current plan for a workaround is to delay the finality for more than 2 blocks. Currently we are planning to delay the finality after by 5 7 blocks beyond the best block. This should help with reducing the chance of the reorg-caused race condition we are facing.

@dmitrylavrenov
Copy link
Contributor

Added the test to reproduce the issue logic - humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

It shows the root cause that we face.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

We have analyzed the code, and it looks like the issue is with those two PRs:

We think that the fix that is added at paritytech/substrate#13364 does not apply after the code added at paritytech/substrate#13289 returns an error. We'll investigate further.

We are still on course to apply the workaround mentioned above as a hotfix, but the long term solution is also in sight.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

To elaborate more on paritytech/substrate#13364:

  1. LongestChain::finality_target will error out when the passed target block is not in the best chain (https://github.com/paritytech/substrate/pull/13289/files#diff-97e7e3622ec6f0b455d18c1bf8055ccf5cfd210efd50348df6b07fce4531baeeR115-R119). This also applies to the case where the block number of the target and best-chain block are the same.
  2. The fixup code at https://github.com/paritytech/substrate/pull/13364/files#diff-5ecb4cf31ed389a0d12482137c8b234485b199c200b09446cf8876321527446aR1242 tries to use the best block as target block (best block to target block) but in practice that part of the fn is not reached due to how an error from LongestChain::finality_target above is handled.
  3. As a result, the intended fix is not applied, and we see the error logged instead - but also the consensus stalls.

@bkchr
Copy link

bkchr commented Jul 26, 2024

Added the test to reproduce the issue logic - humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

The test is succeeding?

@dmitrylavrenov
Copy link
Contributor

Added the test to reproduce the issue logic - humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

The test is succeeding?

Yep, we've implemented it in the way to verify our thoughts about the happened case.

@dmitrylavrenov
Copy link
Contributor

We met the same error

	// `hashof8_a` should be finalized next based on last completed round data,
	// but it's not an valid finality target
	assert_matches!(
		env.select_chain.finality_target(hashof8_a, None).await.unwrap_err(),
		ConsensusError::ChainLookup(_)
	);

@andresilva
Copy link

I have only looked into this superficially, but I think @MOZGIII comment on the LongestChain::finality_target issue is correct. The root cause is that we shouldn't error when the given vote target is not in the best chain, otherwise we make it impossible for finality to proceed in case we have a round estimate that gets re-orged out. In principle what should happen is grandpa would finalize the fork that got re-orged out, and then the node would re-org again (after grandpa finalizes). I didn't think through all the consequences of such a change.

@dmitrylavrenov
Copy link
Contributor

I have only looked into this superficially, but I think @MOZGIII comment on the LongestChain::finality_target issue is correct. The root cause is that we shouldn't error when the given vote target is not in the best chain, otherwise we make it impossible for finality to proceed in case we have a round estimate that gets re-orged out. In principle what should happen is grandpa would finalize the fork that got re-orged out, and then the node would re-org again (after grandpa finalizes). I didn't think through all the consequences of such a change.

The same thoughts we discussed with @MOZGIII

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

@bkchr @andresilva should we expect any hotfix for this implemented at the parity side or rather work on this ourselves?

@dmitrylavrenov
Copy link
Contributor

dmitrylavrenov commented Jul 26, 2024

@bkchr @andresilva let me create PR of this test into our fork and describe it in more details

@andresilva
Copy link

paritytech/polkadot-sdk#5153

I think this should fix it.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

paritytech/polkadot-sdk#5153 looks good, thx!

This would still be risky to ship for us as it means the whole layer of logic dedicated to the grandpa-issued reorg was not used for a while now and may contain bugs or just be broken cause of being stale...

We'll think about how to test this. So far we have thought about replicating the scenario of having 1k+ validators on our load-test facilities - but in our prior load tests we haven't seen the finality stalls like we see in the live chain; that might be due to not running for enough time, or due to other conditions - so it unlikely we'll have an accurate test for this in the load testing environment.

Any thoughts on how to trigger this bug in the live-like environment?

@andresilva
Copy link

andresilva commented Jul 26, 2024

I think besides any significant changes to make sure the issue happens (e.g. changing vote logic to make sure validators vote on abandoned forks explicitly), the easiest way would be to make forks more likely to happen and hence make it more likely for the issue to surface. I think changing https://github.com/humanode-network/humanode/blob/master/crates/humanode-runtime/src/constants.rs#L49 to (1, 1) should make forks more prevalent, reducing block time might also help.

This shouldn't be changed on any live chain though, since these parameters can't be changed dynamically. You'd need to bootstrap a new test chain.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

Great idea! We will also change the block time from 6s to 1s for more stress...


Pretty sure it is fine, but just to double check - is it safe to ship f5bccf0 ?
Should not cause anything problematic afaik.

@andresilva
Copy link

andresilva commented Jul 26, 2024

Yes, don't see any problem with that (and should make the issue less likely to happen).

For the test above you can also remove the BeforeBestBlockBy rule to make sure the validators always vote for the latest block, which should make issues with reorgs even more likely!

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 26, 2024

Great!

To summarize, we'll test your fix with the following:

  • (1, 1) primary at babe
  • 1s block production time
  • removed BeforeBestBlockBy at voting rules

UPD: implemented at #1106

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 27, 2024

Some new input, as the chain stalled again just now:

  • last finalized block 8745707 0x3c97148e16cc04c5df90b96f64f37d9dddf083468a6c4df2dd447f5396df58fd
  • epoch start 8745697 0xce0fac61989be9f69a27c5671e71b40f2adc579c231b933a613c5ab103fd5f7a (10 blocks away)
  • Reorg on #8745708,0x191a…c58d to #8745711,0xfb9c…21bc, common ancestor #8745707,0x3c97…58fd - to the future
image image

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 27, 2024

We did not push the fix just yet, only the bigger finalization delay

But we see this:

image

The issue seems to be that once the prevote for a round goes out it then can't progress and the grandpa stalls.

@andresilva @bkchr let me know if you need access to our infra

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 27, 2024

Maybe some peers are not voting in the round because of the bug (and some do - like the one we have above), and thus the whole consensus stalls, as if the amount of online nodes is less that quorum...

Only 39% of the nodes have applied the finality delay update according to https://telemetry.humanode.io

image

@andresilva
Copy link

andresilva commented Jul 28, 2024

Yes, you'd need >66% of the nodes on the new version for it to effectively take effect.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Jul 30, 2024

We are backporting the fix to our substrate fork; we are still working on reproducing the issue. @dmitrylavrenov will send an update when we have news on that front.

@dmitrylavrenov
Copy link
Contributor

dmitrylavrenov commented Aug 12, 2024

Status update

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Current plan is to run different load tests this week trying to reproduce faced issue.

@bkchr
Copy link

bkchr commented Aug 12, 2024

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Why not zombienet?

@dmitrylavrenov
Copy link
Contributor

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Why not zombienet?

Will prepare the detailed answer a little bit later.

@dmitrylavrenov
Copy link
Contributor

Status update

We were able to prepare a such config of our chain-load-test k8s-based tool that reproduces the exact issue we faced even with just 9 nodes.
image
image
image

Usually it requires less than 1 hour.

reproduced-grandpa-issue-logs-1.txt
reproduced-grandpa-issue-logs-2.txt
reproduced-grandpa-issue-logs.txt

Next step is preparing an image containing the fix and run tests again.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Aug 17, 2024

On zombienet: in short, we tired it, and had issues with it.

The practical issues that we had was it rolled out too slow in our k8s testing environment, and took forever to deploy 1000 nodes (which we ran for testing some time ago).
Also, we dig deeper and found that it does not provide features besides reading the logs and running assertions on those logs, and even that was heavily unoptimized for the scale of testing we needed.

We also did look under the hood of zombienet, and found it substantially overengineering to our taste: supporting multiple execution environments (besides k8s) is just not necessary for a tool like this, in our collective opinion.

Ultimately, we decided to go agains fixing zombienet issues and make our own thing. It is very basic so far, but much more convenient for us to use, and with it we were actually able to reproduce the finality stalls. Not from the get go, and we'd probably be able alter a zombienet setup in a similar way - but with zombienet it would be unlikely that we'd be able to integrate our k8s-specific setup into the upstream.
We are quite satisfied with the results so far, as we can proceed with the work on fixing this bug knowing we have means to reproduce the issue.

Compared to zombienet, our solution does not have the evens model just yet, and doesn't parse the logs - but we don't need it, although it will be not too difficult to add if we decide to.

We'll can do a showcase if you're interested, as it is in a private repo for now. We'll make it public soon (the soon tm kind of soon - no eta).


UPD: our initial motivation for trying zombienet was to run a test cluster with a lot of nodes (1000, 2000, 5000, ...) as we are preparing to grow our validators headcount.

@dmitrylavrenov
Copy link
Contributor

On zombienet: in short, we tired it, and had issues with it.

The practical issues that we had was it rolled out too slow in our k8s testing environment, and took forever to deploy 1000 nodes (which we ran for testing some time ago). Also, we dig deeper and found that it does not provide features besides reading the logs and running assertions on those logs, and even that was heavily unoptimized for the scale of testing we needed.

We also did look under the hood of zombienet, and found it substantially overengineering to our taste: supporting multiple execution environments (besides k8s) is just not necessary for a tool like this, in our collective opinion.

Ultimately, we decided to go agains fixing zombienet issues and make our own thing. It is very basic so far, but much more convenient for us to use, and with it we were actually able to reproduce the finality stalls. Not from the get go, and we'd probably be able alter a zombienet setup in a similar way - but with zombienet it would be unlikely that we'd be able to integrate our k8s-specific setup into the upstream. We are quite satisfied with the results so far, as we can proceed with the work on fixing this bug knowing we have means to reproduce the issue.

Compared to zombienet, our solution does not have the evens model just yet, and doesn't parse the logs - but we don't need it, although it will be not too difficult to add if we decide to.

We'll can do a showcase if you're interested, as it is in a private repo for now. We'll make it public soon (the soon tm kind of soon - no eta).

UPD: our initial motivation for trying zombienet was to run a test cluster with a lot of nodes (1000, 2000, 5000, ...) as we are preparing to grow our validators headcount.

I can add some important items from practical usage:

  • it was difficult to customize settings to make it working for networks with different SS58-prefix. Maybe it's already improved, but was needed to do some changes (the version I used in April-May) in the source code and rebuild by myself the tool.
  • It's not possible to rerun failed pods keeping running good ones. All time you need to rerun the network from clean. As a result, sometimes the test with 1000 nodes take almost forever as setup is terminated if an image for some pod were not able to be pulled for some networks reason.

@dmitrylavrenov
Copy link
Contributor

Status update

@bkchr @andresilva we have great news!
We were able to test fixed version where we faced the same issue. But the consensus wasn't stucked and continued working as usual! Tested twice. You can check the logs :)

grandpa-fix-logs.txt
grandpa-fix-logs-1.txt

@dmitrylavrenov
Copy link
Contributor

dmitrylavrenov commented Aug 19, 2024

@MOZGIII I think we can close the issue and open the new one to backport the fix and ship to our mainnet?

@andresilva
Copy link

Thanks for the thorough testing 👌

@MOZGIII
Copy link
Contributor Author

MOZGIII commented Aug 19, 2024

@MOZGIII I think we can close the issue and open the new one to backport the fix and ship to our mainnet?

Yes.

The investigation is now finished, and we were able to identify the bug, theorize and implement a fix, reproduce the bug reliably in a controlled environment, apply the fix and ensure the bug does not reproduce under the same conditions anymore, and that there are no other surprising effects from applying the fix.

We are now working on releasing the fix on the Humanode network.

Thanks everyone for taking part in this work!

@MOZGIII MOZGIII closed this as completed Aug 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tracking A tracking issue for a big multi-PR initiative
Development

No branches or pull requests

4 participants