This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

How to prevent and deal with unexpected epoch changes #4464

Closed
philipstanislaus opened this issue Dec 19, 2019 · 23 comments
Labels
Z7-question Issue is a question. Closer should answer.

Comments

@philipstanislaus
Contributor

Issue

In our testnet at Centrifuge, we have observed 2 chain halts so far due to unexpected epoch changes. This error seems to happen whenever blocks are not produced for a whole epoch.

Context

We are using relatively short epochs for testing purposes (30 slots à 6 seconds => 3 minutes), and had only 2 validators, both of which were down before the errors showed up.

Questions

  1. What is the exact reason for the unexpected epoch change errors?
  2. How to prevent these errors?
  3. How to recover from an unexpected epoch change error in case it happens on mainnet?

Reproducible example

  1. Compile the latest substrate node at git commit 36c7fcfa24f54b06bb4c7a32e7627be35bdd80ef: cargo build --release
  2. Run a dev testnet: ./target/release/substrate --dev
  3. Stop the testnet, wait for > 10 minutes
  4. Continue to run the dev testnet: ./target/release/substrate --dev

Full output when continuing the dev testnet:

~/paritytech/substrate$ ./target/release/substrate --dev
2019-12-19 20:10:46 Running in --dev mode, RPC CORS has been disabled.
2019-12-19 20:10:46 Substrate Node
2019-12-19 20:10:46   version 2.0.0-36c7fcfa2-x86_64-linux-gnu
2019-12-19 20:10:46   by Parity Technologies, 2017-2019
2019-12-19 20:10:46 Chain specification: Development
2019-12-19 20:10:46 Node name: tricky-rat-5376
2019-12-19 20:10:46 Roles: AUTHORITY
2019-12-19 20:10:47 Highest known block at #3
2019-12-19 20:10:47 Using default protocol ID "sup" because none is configured in the chain specs
2019-12-19 20:10:47 Local node identity is: QmbbTjqN3asDnJedg7fvchpgnaDuVXuogpXnJXy1qcJkdH
2019-12-19 20:10:47 Starting BABE Authorship worker
2019-12-19 20:10:47 Grafana data source server started at 127.0.0.1:9955
2019-12-19 20:10:48 Starting consensus session on top of parent 0x0abbae06c4e7569d1a05288d4d6a0a0664e91a2f522aeb1188891b52ead14d75
2019-12-19 20:10:48 Prepared block for proposing at 4 [hash: 0x207cf4644d77ddc2a1121c1914d2c0d034d521161b06794283eedba283ff967a; parent_hash: 0x0abb…4d75; extrinsics: [0x8264…a8a5]]
2019-12-19 20:10:48 Pre-sealed block for proposal at 4. Hash now 0xf583b51f85ab56f7ef6ff5e8cd8970e221e2492f5d57171d578c8a4c5cf993a9, previously 0x207cf4644d77ddc2a1121c1914d2c0d034d521161b06794283eedba283ff967a.
2019-12-19 20:10:48 New epoch 1 launching at block 0xf583…93a9 (block slot 525594216 >= start slot 525591147).
2019-12-19 20:10:48 Next epoch starts at slot 525591347
2019-12-19 20:10:48 Imported #4 (0xf583…93a9)
2019-12-19 20:10:51 Starting consensus session on top of parent 0xf583b51f85ab56f7ef6ff5e8cd8970e221e2492f5d57171d578c8a4c5cf993a9
2019-12-19 20:10:51 Prepared block for proposing at 5 [hash: 0xa5cbf81cca32bbd8fffcac789abab930dae721c7ee5b43b615f9a13a6110ba28; parent_hash: 0xf583…93a9; extrinsics: [0xdebf…b60a, 0xffb8…6c71]]
2019-12-19 20:10:51 Pre-sealed block for proposal at 5. Hash now 0x92753b8be742ab4f33a6804e1cd6eb06c629e0de8e24871847a21dcba26b82b2, previously 0xa5cbf81cca32bbd8fffcac789abab930dae721c7ee5b43b615f9a13a6110ba28.
2019-12-19 20:10:51 Error with block built on 0xf583b51f85ab56f7ef6ff5e8cd8970e221e2492f5d57171d578c8a4c5cf993a9: ClientImport("Unexpected epoch change")
2019-12-19 20:10:52 Idle (0 peers), best: #4 (0xf583…93a9), finalized #3 (0x0abb…4d75), ⬇ 0 ⬆ 0
@rphmeier rphmeier added the Z7-question Issue is a question. Closer should answer. label Dec 19, 2019
@rphmeier
Contributor

One invariant of BABE is that at least one block must be produced every epoch_duration slots; going longer violates the synchrony assumption of the algorithm. It is possible to work around, but only with more research, design, and implementation work.
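
For intuition, the failing condition can be pictured roughly like this. This is a hypothetical sketch of the invariant, not the actual BABE import code, and all names are illustrative:

```rust
// Hypothetical sketch of the invariant described above, not the actual
// BABE import code. A block may open at most the *next* epoch: its
// digest announces one epoch change at a time. A block whose slot lands
// two or more epochs past the last announced epoch start has skipped an
// epoch entirely, and the client rejects it with "Unexpected epoch change".
fn violates_epoch_invariant(block_slot: u64, epoch_start_slot: u64, epoch_len: u64) -> bool {
    // Legitimate: the current epoch, or the first slots of the next one.
    // Violation: anything two full epochs or more ahead.
    block_slot >= epoch_start_slot + 2 * epoch_len
}
```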

@philipstanislaus
Contributor Author

Thanks for the quick reply Rob.

Is a hard spoon the only current way to recover from a situation like that?

@philipstanislaus
Contributor Author

Short follow up @rphmeier – is there any way to recover from a bricked chain as described above? Or is a hard spoon the only option?

Thanks!

@rphmeier
Contributor

rphmeier commented Jan 9, 2020

Yeah, hard spooning is the best way to solve it. You could also hack something together where you pick up the chain in a sandbox where all the nodes think they're in the past, and time is running faster than normal.

There's probably a way to do that that involves manually issuing 1 block every EPOCH_LEN - 1 slots until you're caught up to the current epoch.
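
As a rough illustration of that arithmetic (a back-of-the-envelope sketch; EPOCH_LEN and both slot parameters are illustrative names, not Substrate APIs):

```rust
// Back-of-the-envelope sketch of the manual-revival arithmetic above;
// none of these names come from Substrate itself.
const EPOCH_LEN: u64 = 30; // slots per epoch, as in the testnet above

/// Number of hand-authored blocks needed to close a slot gap if each
/// block may advance the chain by at most EPOCH_LEN - 1 slots.
fn blocks_needed(last_slot: u64, current_slot: u64) -> u64 {
    let gap = current_slot.saturating_sub(last_slot);
    // Ceiling division over steps of EPOCH_LEN - 1 slots.
    (gap + EPOCH_LEN - 2) / (EPOCH_LEN - 1)
}
```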

@philipstanislaus
Contributor Author

Cool, thanks for your reply. Is the code that did time warping on Kusama visible somewhere?

@rphmeier
Contributor

cc @andresilva - I'd imagine there are some ugly hacks in there we might not want to propagate

@andresilva
Contributor

https://github.com/paritytech/substrate/compare/andre/polkadot-master

Relevant changes for the time warp are in files:

  • client/consensus/slots/src/lib.rs
  • client/consensus/slots/src/slots.rs
  • primitives/timestamp/src/lib.rs

@andresilva
Contributor

BTW, in the code above we use a warp factor of 6x with a block time of 6s, which means that during the warp you'll end up with 1s block times. For any chain deployed on the public internet this is probably too fast and will trigger other issues (nodes will start forking). I'd recommend warping with a factor that keeps block times no lower than 2s.
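
To make the mechanics concrete, the warp can be pictured roughly like this. This is a reconstruction of the idea in the linked branch, not the actual patch; FORK_TIMESTAMP marks where the chain stalled, and both timestamp values are made up:

```rust
// Reconstruction of the time-warp idea, not the actual patch. The
// constant values are made up; FORK_TIMESTAMP marks where the chain
// stalled and REVIVE_TIMESTAMP is the agreed restart instant (both in ms).
const FORK_TIMESTAMP: u64 = 1_576_800_000_000;
const REVIVE_TIMESTAMP: u64 = 1_579_000_000_000;
const WARP_FACTOR: u64 = 3; // 3x keeps 6s slots at 2s during the warp

/// Map wall-clock milliseconds to the "warped" time the node pretends it is.
fn warped_now(now: u64) -> u64 {
    if now < REVIVE_TIMESTAMP {
        // Pinned: every node reports the stall point until the agreed
        // revival instant, so they all begin warping together.
        FORK_TIMESTAMP
    } else {
        // Warping: time advances WARP_FACTOR times faster than the wall
        // clock, replaying the missed epochs until it catches up.
        FORK_TIMESTAMP + (now - REVIVE_TIMESTAMP) * WARP_FACTOR
    }
}
```

Once the warped clock catches up to real time, the warp code would presumably be removed and the nodes restarted on the normal clock.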

@philipstanislaus
Contributor Author

That’s very useful, thanks @andresilva 🙏

@AurevoirXavier
Contributor

AurevoirXavier commented Jan 17, 2020

Hi, @andresilva thank you!

I made a patch like this, but I got "block production took too long". Any ideas?

I'm sure it was built in release mode, and I tried --execution=Native, but I still got this.

@andresilva
Contributor

@AurevoirXavier
Contributor

AurevoirXavier commented Jan 17, 2020

darwinia-network/darwinia@ce4ce7a#diff-d2bd7cb542d4338238e9e420d754e1cbR72

Maybe use 3x here?

Still got this.

Btw, how do I roll back to a specific block? I synced from the stuck node and hit ctrl-c at a previous block to get the db.

@andresilva
Contributor

https://github.com/paritytech/substrate/blob/master/client/consensus/slots/src/lib.rs#L247

Print this to figure out how much time is being given to block execution; maybe there's some calculation that is making it too small. You can revert the chain using the revert command, although it will only revert unfinalized blocks (substrate revert 100 would revert the last 100 blocks). It is possible to revert finalized blocks programmatically using some of the code from the branch I linked above, but then you also need to blacklist the previous chain (otherwise nodes will see the old justifications and finalize it again).

None of these operations is simple, and they require in-depth knowledge of how Substrate works. If you don't have much luck with my instructions, I'd suggest going for a "hard spoon" instead.

@AurevoirXavier
Contributor

Thanks a million.

@AurevoirXavier
Contributor

AurevoirXavier commented Jan 18, 2020

Hi, @andresilva!

The remaining_duration is 1 second, so that's OK.

I was on an old Substrate version.

I found this:
https://github.com/darwinia-network/substrate/blob/darwinia-develop/core/consensus/slots/src/lib.rs#L210

Your hack:
https://github.com/paritytech/substrate/compare/andre/polkadot-master#diff-6bba625803932c183283584500f0c1faR145

So I just ignored the slot_after_building != slot_number check.
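
For context, the guard being bypassed looks roughly like this (a paraphrase of the linked lines, not the exact source; the warping flag is an illustrative addition):

```rust
// Paraphrase of the slot-overrun guard in the linked slots/src/lib.rs
// (details vary by version); `warping` is an illustrative flag, not a
// real Substrate parameter. After authoring a block, the worker re-reads
// the clock: if the slot advanced while the block was being built, the
// proposal is discarded with "block production took too long".
fn should_discard(slot_number: u64, slot_after_building: u64, warping: bool) -> bool {
    // Warped 1-2s slots overrun constantly, hence skipping the check
    // while the warp is active.
    !warping && slot_after_building != slot_number
}
```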

@andresilva
Contributor

https://github.com/paritytech/substrate/compare/andre/polkadot-master#diff-2aed294ee7caca5bdf3e288c93b29422R107

This code will make sure the slot is always "stuck" until your local clock reaches the REVIVE_TIMESTAMP. So what you're seeing is expected if the REVIVE_TIMESTAMP that you set is in the future. Is this the case?

@AurevoirXavier
Contributor

Yes, I set it to 9:50 and booted the nodes at 9:20. Everything went well. Thanks again!

@andresilva
Contributor

Great! 👍😃

@Muhammad-Altabba

Muhammad-Altabba commented Feb 22, 2020

I get this error every time I start my virtual machine after it has been off for a long time.


2020-02-22 16:56:45 Error with block built on 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028: ClientImport("Unexpected epoch change")
2020-02-22 16:56:48 Starting consensus session on top of parent 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028
2020-02-22 16:56:48 Prepared block for proposing at 4739 [hash: 0xeb46ae5e3e05d30e901ec3e93e1a6dfbf1fe05eab4b66db768a3b667fbb0d7a0; parent_hash: 0xa5aa…8028; extrinsics: [0x3c48…9440, 0xa3ae…8492]]
2020-02-22 16:56:48 Pre-sealed block for proposal at 4739. Hash now 0x8ec0f250cd530b040f0957e74df1dedc46772253e644e38583cfee36f79be0b8, previously 0xeb46ae5e3e05d30e901ec3e93e1a6dfbf1fe05eab4b66db768a3b667fbb0d7a0.
2020-02-22 16:56:48 Error with block built on 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028: ClientImport("Unexpected epoch change")
2020-02-22 16:56:48 Idle (0 peers), best: #4738 (0xa5aa…8028), finalized #4736 (0xaaac…cdd7), ⬇ 0 ⬆ 0
2020-02-22 16:56:51 Starting consensus session on top of parent 0xa5aa39ce955b3492d46c9243b0c5f7c799681acf93bf4c610c6dff6662028028
2020-02-22 16:56:51 Prepared block for proposing at 4739 [hash: 0x71fa7ae35fdc22344a3db836af2b6793351880057c21b3423c296fc8e5053b9c; parent_hash: 0xa5aa…8028; extrinsics: [0x82d2…e0d6, 0xa3ae…8492]]
2020-02-22 16:56:51 Pre-sealed block for proposal at 4739. Hash now 0x9246182d6041421eccb17e6b3f2477239169457a37340bef327c2b29aa61a85c, previously 0x71fa7ae35fdc22344a3db836af2b6793351880057c21b3423c296fc8e5053b9c.

Nothing resolves this error except purging with substrate purge-chain --dev! It is a pain to redeploy the contracts and redo the transfers every time. And it is a real issue if I want to run a real node!

@andresilva
Contributor

@Muhammad-Altabba Use AuRa instead of BABE for your development chain, it's what's used in node-template for this reason. You can easily switch back to BABE when you want to go into production.

@Muhammad-Altabba

Thanks @andresilva,
I checked the wiki (https://wiki.parity.io/Aura) but the usage is not clear. Could you add usage instructions and a sample to the wiki (where exactly to configure it and how to pass the parameters)?

@andresilva
Contributor

That is documentation for the parity-ethereum (https://github.com/paritytech/parity-ethereum) client, not Substrate. My suggestion is to look at the node-template code since it is set up to use Aura (https://github.com/paritytech/substrate/tree/master/bin/node-template), in particular the service (https://github.com/paritytech/substrate/blob/master/bin/node-template/node/src/service.rs) and runtime (https://github.com/paritytech/substrate/blob/master/bin/node-template/runtime/src/lib.rs) definitions.
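
For orientation, the runtime side of that wiring is a small fragment. This is a minimal sketch modelled on the node-template of this era; exact trait and type names vary across Substrate versions, so defer to the linked runtime file for the version you build against:

```rust
// Minimal runtime fragment modelled on the node-template of this era;
// trait and type names vary across Substrate versions, so treat this as
// a sketch and defer to the linked runtime file. `Runtime` is the type
// declared by the surrounding construct_runtime! macro.
use sp_consensus_aura::sr25519::AuthorityId as AuraId;

impl aura::Trait for Runtime {
    // Aura rotates authorities round-robin by slot with no epoch
    // randomness, which is why a stalled dev chain can simply resume.
    type AuthorityId = AuraId;
}
```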

@techwizard210

Everyone. Please help me.

When I run my Substrate node server with "./target/release/node-template --dev --ws-external --rpc-external", the server works well and starts generating blocks at first. But after a few hours, the server stops and no longer generates blocks on its own. How can I solve this problem? Here is the terminal error.
Best Regards.
[terminal error screenshot]
