This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

How to prevent and deal with unexpected epoch changes #4464

Closed
philipstanislaus opened this issue Dec 19, 2019 · 23 comments
Labels
Z7-question Issue is a question. Closer should answer.

Comments

@philipstanislaus
Contributor

Issue

In our testnet at Centrifuge, we have observed 2 chain halts so far due to unexpected epoch changes. This error seems to happen whenever blocks are not produced for a whole epoch.

Context

We are using relatively short epochs for testing purposes (30 slots à 6 seconds => 3 minutes), and had only 2 validators, both of which were down before the errors showed up.

Questions

  1. What is the exact reason for the unexpected epoch change errors?
  2. How to prevent these errors?
  3. How to recover from an unexpected epoch change error in case it happens on mainnet?

Reproducible example

  1. Compile the latest substrate node at git commit 36c7fcfa24f54b06bb4c7a32e7627be35bdd80ef: cargo build --release
  2. Run a dev testnet: ./target/release/substrate --dev
  3. Stop the testnet, wait for > 10 minutes
  4. Continue to run the dev testnet: ./target/release/substrate --dev

Full output when continuing the dev testnet:

~/paritytech/substrate$ ./target/release/substrate --dev
2019-12-19 20:10:46 Running in --dev mode, RPC CORS has been disabled.
2019-12-19 20:10:46 Substrate Node
2019-12-19 20:10:46   version 2.0.0-36c7fcfa2-x86_64-linux-gnu
2019-12-19 20:10:46   by Parity Technologies, 2017-2019
2019-12-19 20:10:46 Chain specification: Development
2019-12-19 20:10:46 Node name: tricky-rat-5376
2019-12-19 20:10:46 Roles: AUTHORITY
2019-12-19 20:10:47 Highest known block at #3
2019-12-19 20:10:47 Using default protocol ID "sup" because none is configured in the chain specs
2019-12-19 20:10:47 Local node identity is: QmbbTjqN3asDnJedg7fvchpgnaDuVXuogpXnJXy1qcJkdH
2019-12-19 20:10:47 Starting BABE Authorship worker
2019-12-19 20:10:47 Grafana data source server started at 127.0.0.1:9955
2019-12-19 20:10:48 Starting consensus session on top of parent 0x0abbae06c4e7569d1a05288d4d6a0a0664e91a2f522aeb1188891b52ead14d75
2019-12-19 20:10:48 Prepared block for proposing at 4 [hash: 0x207cf4644d77ddc2a1121c1914d2c0d034d521161b06794283eedba283ff967a; parent_hash: 0x0abb…4d75; extrinsics: [0x8264…a8a5]]
2019-12-19 20:10:48 Pre-sealed block for proposal at 4. Hash now 0xf583b51f85ab56f7ef6ff5e8cd8970e221e2492f5d57171d578c8a4c5cf993a9, previously 0x207cf4644d77ddc2a1121c1914d2c0d034d521161b06794283eedba283ff967a.
2019-12-19 20:10:48 New epoch 1 launching at block 0xf583…93a9 (block slot 525594216 >= start slot 525591147).
2019-12-19 20:10:48 Next epoch starts at slot 525591347
2019-12-19 20:10:48 Imported #4 (0xf583…93a9)
2019-12-19 20:10:51 Starting consensus session on top of parent 0xf583b51f85ab56f7ef6ff5e8cd8970e221e2492f5d57171d578c8a4c5cf993a9
2019-12-19 20:10:51 Prepared block for proposing at 5 [hash: 0xa5cbf81cca32bbd8fffcac789abab930dae721c7ee5b43b615f9a13a6110ba28; parent_hash: 0xf583…93a9; extrinsics: [0xdebf…b60a, 0xffb8…6c71]]
2019-12-19 20:10:51 Pre-sealed block for proposal at 5. Hash now 0x92753b8be742ab4f33a6804e1cd6eb06c629e0de8e24871847a21dcba26b82b2, previously 0xa5cbf81cca32bbd8fffcac789abab930dae721c7ee5b43b615f9a13a6110ba28.
2019-12-19 20:10:51 Error with block built on 0xf583b51f85ab56f7ef6ff5e8cd8970e221e2492f5d57171d578c8a4c5cf993a9: ClientImport("Unexpected epoch change")
2019-12-19 20:10:52 Idle (0 peers), best: #4 (0xf583…93a9), finalized #3 (0x0abb…4d75), ⬇ 0 ⬆ 0
@rphmeier rphmeier added the Z7-question Issue is a question. Closer should answer. label Dec 19, 2019
@rphmeier
Contributor

One invariant of BABE is that at least one block must be produced every epoch_duration slots; going longer violates the synchrony assumption of the algorithm. It is possible to work around, but only with more research, design, and implementation work.
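
For intuition, the failing condition can be pictured roughly like this. This is a hypothetical sketch of the invariant, not the actual BABE import code, and all names are illustrative:

```rust
// Hypothetical sketch of the invariant described above, not the actual
// BABE import code. A block may open at most the *next* epoch: its
// digest announces one epoch change at a time. A block whose slot lands
// two or more epochs past the last announced epoch start has skipped an
// epoch entirely, and the client rejects it with "Unexpected epoch change".
fn violates_epoch_invariant(block_slot: u64, epoch_start_slot: u64, epoch_len: u64) -> bool {
    // Legitimate: the current epoch, or the first slots of the next one.
    // Violation: anything two full epochs or more ahead.
    block_slot >= epoch_start_slot + 2 * epoch_len
}
```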

@philipstanislaus
Contributor Author

Thanks for the quick reply Rob.

Is a hard spoon the only current way to recover from a situation like that?

@philipstanislaus
Contributor Author

Short follow up @rphmeier – is there any way to recover from a bricked chain as described above? Or is a hard spoon the only option?

Thanks!

@rphmeier
Contributor

rphmeier commented Jan 9, 2020

Yeah, hard spooning is the best way to solve it. You could also hack something together where you pick up the chain in a sandbox where all the nodes think they're in the past, and time is running faster than normal.

There's probably a way to do that that involves manually issuing 1 block every EPOCH_LEN - 1 slots until you're caught up to the current epoch.
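
As a rough illustration of that arithmetic (a back-of-the-envelope sketch; EPOCH_LEN and both slot parameters are illustrative names, not Substrate APIs):

```rust
// Back-of-the-envelope sketch of the manual-revival arithmetic above;
// none of these names come from Substrate itself.
const EPOCH_LEN: u64 = 30; // slots per epoch, as in the testnet above

/// Number of hand-authored blocks needed to close a slot gap if each
/// block may advance the chain by at most EPOCH_LEN - 1 slots.
fn blocks_needed(last_slot: u64, current_slot: u64) -> u64 {
    let gap = current_slot.saturating_sub(last_slot);
    // Ceiling division over steps of EPOCH_LEN - 1 slots.
    (gap + EPOCH_LEN - 2) / (EPOCH_LEN - 1)
}
```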

@philipstanislaus
Contributor Author

Cool, thanks for your reply. Is the code that did time warping on Kusama visible somewhere?

@rphmeier
Contributor

cc @andresilva - I'd imagine there are some ugly hacks in there we might not want to propagate

@andresilva
Contributor

https://github.com/paritytech/substrate/compare/andre/polkadot-master

Relevant changes for the time warp are in files:

  • client/consensus/slots/src/lib.rs
  • client/consensus/slots/src/slots.rs
  • primitives/timestamp/src/lib.rs

@andresilva
Contributor

BTW, in the code above we use a warp factor of 6x with a block time of 6s, which means that during the warp you'll end up with 1s block times. For any chain deployed on the public internet this is probably too fast and will trigger other issues (nodes will start forking). I'd recommend warping with a factor that keeps block times no lower than 2s.
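
To make the mechanics concrete, the warp can be pictured roughly like this. This is a reconstruction of the idea in the linked branch, not the actual patch; FORK_TIMESTAMP marks where the chain stalled, and both timestamp values are made up:

```rust
// Reconstruction of the time-warp idea, not the actual patch. The
// constant values are made up; FORK_TIMESTAMP marks where the chain
// stalled and REVIVE_TIMESTAMP is the agreed restart instant (both in ms).
const FORK_TIMESTAMP: u64 = 1_576_800_000_000;
const REVIVE_TIMESTAMP: u64 = 1_579_000_000_000;
const WARP_FACTOR: u64 = 3; // 3x keeps 6s slots at 2s during the warp

/// Map wall-clock milliseconds to the "warped" time the node pretends it is.
fn warped_now(now: u64) -> u64 {
    if now < REVIVE_TIMESTAMP {
        // Pinned: every node reports the stall point until the agreed
        // revival instant, so they all begin warping together.
        FORK_TIMESTAMP
    } else {
        // Warping: time advances WARP_FACTOR times faster than the wall
        // clock, replaying the missed epochs until it catches up.
        FORK_TIMESTAMP + (now - REVIVE_TIMESTAMP) * WARP_FACTOR
    }
}
```

Once the warped clock catches up to real time, the warp code would presumably be removed and the nodes restarted on the normal clock.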

@philipstanislaus
Contributor Author

That’s very useful, thanks @andresilva 🙏

@AurevoirXavier
Contributor

AurevoirXavier commented Jan 17, 2020

Hi, @andresilva thank you!

I made a patch like this, but I got "block production took too long". Any ideas?

I'm sure it was built in release mode, and I tried --execution=Native, but I still got this.

@andresilva
Contributor

@AurevoirXavier
Contributor

AurevoirXavier commented Jan 17, 2020

darwinia-network/darwinia@ce4ce7a#diff-d2bd7cb542d4338238e9e420d754e1cbR72

Maybe use 3x here?

Still got this.

Btw, how do I roll back to a specific block? I synced from the stuck node and hit ctrl-c at a previous block to get the db.

@andresilva
Contributor

https://github.com/paritytech/substrate/blob/master/client/consensus/slots/src/lib.rs#L247

Print this to figure out how much time is being given to block execution; maybe there's some calculation that is making it too small. You can revert the chain using the revert command, although it will only revert unfinalized blocks (substrate revert 100 would revert the last 100 blocks). It is possible to revert finalized blocks programmatically using some of the code from the branch I linked above, but then you also need to blacklist the previous chain (otherwise nodes will see the old justifications and finalize it again).

None of these operations is simple, and they require in-depth knowledge of how Substrate works. If you don't have much luck with my instructions, I'd suggest going for a "hard spoon" instead.

@AurevoirXavier
Contributor

Thanks a million.

@AurevoirXavier
Contributor

AurevoirXavier commented Jan 18, 2020

Hi, @andresilva!

The remaining_duration is 1 second, so that's OK.

I was on an old Substrate version.

I found this:
https://github.com/darwinia-network/substrate/blob/darwinia-develop/core/consensus/slots/src/lib.rs#L210

Your hack:
https://github.com/paritytech/substrate/compare/andre/polkadot-master#diff-6bba625803932c183283584500f0c1faR145

So I just ignored the slot_after_building != slot_number check.
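
For context, the guard being bypassed looks roughly like this (a paraphrase of the linked lines, not the exact source; the warping flag is an illustrative addition):

```rust
// Paraphrase of the slot-overrun guard in the linked slots/src/lib.rs
// (details vary by version); `warping` is an illustrative flag, not a
// real Substrate parameter. After authoring a block, the worker re-reads
// the clock: if the slot advanced while the block was being built, the
// proposal is discarded with "block production took too long".
fn should_discard(slot_number: u64, slot_after_building: u64, warping: bool) -> bool {
    // Warped 1-2s slots overrun constantly, hence skipping the check
    // while the warp is active.
    !warping && slot_after_building != slot_number
}
```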

@andresilva
Contributor

https://github.com/paritytech/substrate/compare/andre/polkadot-master#diff-2aed294ee7caca5bdf3e288c93b29422R107

This code will make sure the slot is always "stuck" until your local clock reaches the REVIVE_TIMESTAMP. So what you're seeing is expected if the REVIVE_TIMESTAMP that you set is in the future. Is this the case?

@AurevoirXavier
Contributor

Yes, I set it to 9:50 and booted the nodes at 9:20. Everything went well. Thanks again!

@andresilva
Contributor

Great! 👍😃

@Muhammad-Altabba

Muhammad-Altabba commented Feb 22, 2020

I get this error every time I start my virtual machine after it has been off for a long time.


2020-02-22 16:56:45 Error with block built on 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028: ClientImport("Unexpected epoch change")
2020-02-22 16:56:48 Starting consensus session on top of parent 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028
2020-02-22 16:56:48 Prepared block for proposing at 4739 [hash: 0xeb46ae5e3e05d30e901ec3e93e1a6dfbf1fe05eab4b66db768a3b667fbb0d7a0; parent_hash: 0xa5aa…8028; extrinsics: [0x3c48…9440, 0xa3ae…8492]]
2020-02-22 16:56:48 Pre-sealed block for proposal at 4739. Hash now 0x8ec0f250cd530b040f0957e74df1dedc46772253e644e38583cfee36f79be0b8, previously 0xeb46ae5e3e05d30e901ec3e93e1a6dfbf1fe05eab4b66db768a3b667fbb0d7a0.
2020-02-22 16:56:48 Error with block built on 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028: ClientImport("Unexpected epoch change")
2020-02-22 16:56:48 Idle (0 peers), best: #4738 (0xa5aa…8028), finalized #4736 (0xaaac…cdd7), ⬇ 0 ⬆ 0
2020-02-22 16:56:51 Starting consensus session on top of parent 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028
2020-02-22 16:56:51 Prepared block for proposing at 4739 [hash: 0x71fa7ae35fdc22344a3db836af2b6793351880057c21b3423c296fc8e5053b9c; parent_hash: 0xa5aa…8028; extrinsics: [0x82d2…e0d6, 0xa3ae…8492]]
2020-02-22 16:56:51 Pre-sealed block for proposal at 4739. Hash now 0x9246182d6041421eccb17e6b3f2477239169457a37340bef327c2b29aa61a85c, previously 0x71fa7ae35fdc22344a3db836af2b6793351880057c21b3423c296fc8e5053b9c.

Nothing resolves this error except purging with substrate purge-chain --dev! It is a pain to redeploy the contracts and redo the transfers every time. And it is a real issue if I want to run a real node!

@andresilva
Contributor

@Muhammad-Altabba Use AuRa instead of BABE for your development chain, it's what's used in node-template for this reason. You can easily switch back to BABE when you want to go into production.

@Muhammad-Altabba

Thanks @andresilva,
I checked the wiki (https://wiki.parity.io/Aura) but the usage is not clear. Could you add usage instructions and a sample to the wiki (where exactly to configure it and how to pass the parameters)?

@andresilva
Contributor

That is documentation for the parity-ethereum (https://github.com/paritytech/parity-ethereum) client, not Substrate. My suggestion is to look at the node-template code since it is set up to use Aura (https://github.com/paritytech/substrate/tree/master/bin/node-template), in particular the service (https://github.com/paritytech/substrate/blob/master/bin/node-template/node/src/service.rs) and runtime (https://github.com/paritytech/substrate/blob/master/bin/node-template/runtime/src/lib.rs) definitions.
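
For orientation, the runtime side of that wiring is a small fragment. This is a minimal sketch modelled on the node-template of this era; exact trait and type names vary across Substrate versions, so defer to the linked runtime file for the version you build against:

```rust
// Minimal runtime fragment modelled on the node-template of this era;
// trait and type names vary across Substrate versions, so treat this as
// a sketch and defer to the linked runtime file. `Runtime` is the type
// declared by the surrounding construct_runtime! macro.
use sp_consensus_aura::sr25519::AuthorityId as AuraId;

impl aura::Trait for Runtime {
    // Aura rotates authorities round-robin by slot with no epoch
    // randomness, which is why a stalled dev chain can simply resume.
    type AuthorityId = AuraId;
}
```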

@techwizard210

Everyone. Please help me.

When I run my Substrate node server with "./target/release/node-template --dev --ws-external --rpc-external", the server works well and starts generating blocks at first. But after a few hours, the server stops and no longer generates blocks on its own. How can I solve this problem? Here is the terminal error.
Best Regards.
[terminal error screenshot]
