
Deadlock conditions in the geth clique consensus implementation #9

Closed · abramsymons opened this issue Aug 27, 2020 · 8 comments

@abramsymons
Collaborator

abramsymons commented Aug 27, 2020

As reported in go-ethereum issue #18402, there is a random chance of deadlock, especially when the number of active sealers in the network is even.
We experienced this deadlock once when we had 4 sealers and once when we had 5 sealers and one of them was down.

@abramsymons
Collaborator Author

I think the best approach for us (until this issue gets solved by the geth team or we have enough resources to contribute a fix) will be to run a script beside the geth process on every node that checks every 5 seconds whether eth.blockNumber is still increasing; if it has stopped for, say, 15 seconds, the script uses debug.setHead to return the node state to n/2+1 blocks ago, where n is the number of sealers. A rough sketch of such a watchdog is included below.
We should try to reproduce the deadlock on a test chain and see whether having such a script beside the nodes can always resolve the deadlock automatically.
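
For illustration, here is a minimal sketch of that watchdog in Python, assuming the node exposes the eth, clique, debug and miner APIs over HTTP JSON-RPC; the endpoint URL, poll interval and stall threshold are placeholders rather than values from this thread:

```python
import time
import requests

RPC_URL = "http://127.0.0.1:8545"  # assumed local geth RPC endpoint

def rpc(method, params=None):
    """Minimal JSON-RPC helper."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(RPC_URL, json=payload, timeout=10).json()["result"]

def watchdog(poll=5, stall=15):
    """Rewind n/2+1 blocks and restart the miner if the chain stops growing."""
    last_block = int(rpc("eth_blockNumber"), 16)
    stalled_for = 0
    while True:
        time.sleep(poll)
        current = int(rpc("eth_blockNumber"), 16)
        if current > last_block:
            last_block, stalled_for = current, 0
            continue
        stalled_for += poll
        if stalled_for >= stall:
            n = len(rpc("clique_getSigners"))  # number of sealers
            new_head = max(current - (n // 2 + 1), 0)
            rpc("miner_stop")
            rpc("debug_setHead", [hex(new_head)])  # rewind local chain state
            rpc("miner_start")
            stalled_for = 0

if __name__ == "__main__":
    watchdog()
```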

@siftal
Collaborator

siftal commented Aug 28, 2020

I tried to reproduce the deadlock condition IDChain faced a few days ago.
We started a new chain with the following configuration (a sketch of the corresponding genesis settings follows the list):

  • 5 sealers, 1 of which was down and 4 of which were active
  • block time set to 1 second
  • wiggleTime set to 5 milliseconds to make the deadlock happen faster
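
For reference, the 1-second block time corresponds to the clique period in the genesis file, roughly as below (the chainId is a placeholder); wiggleTime is not a genesis setting but a constant in geth's clique implementation, so lowering it to 5 ms presumably required a patched build:

```json
{
  "config": {
    "chainId": 1337,
    "clique": {
      "period": 1,
      "epoch": 30000
    }
  }
}
```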

The chain stopped mining new blocks at #14031 and forked, with 2 sealers on one side and the other 2 on the other side.

signers:
0x1543c0ec0e16eb889c79bdebda269ebca55a9e2c (Node0)
0x432624c5c117e08741548287b61138ddb30be090 (Node1)
0x442c23ad2a6c06ab7b57b3dea6453e0ace388149 (Node2)
0x50ff0095e11aa5b30781a10da6171351423e3a9f (Node3)
0xdad03f2d609a9e6e33b45a4b34f55bffdb13aae7 (Node4)

| block | sealer on Node0/Node4 fork | sealer on Node1/Node2 fork |
|-------|----------------------------|----------------------------|
| 14027 | node2 | node2 |
| 14028 | node0 | node4 |
| 14029 | node1 | node0 |
| 14030 | node4 | node1 |
| 14031 | node0 | node2 |

Node0 and Node4 are waiting for Node1 or Node2 (or Node3) to seal the next block, while on the other hand Node1 and Node2 are waiting for Node0 or Node4 (or Node3) to seal the next block, so the chain is deadlocked.
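
For context, Clique only allows a sealer to sign a block if it has not signed any of the previous floor(n/2) blocks. A simplified sketch of that eligibility rule (an illustration, not geth's actual code) shows why both sides of the fork are stuck once Node3 is down:

```python
def eligible_sealers(signers, recent_sealers):
    """Sealers allowed to sign the next block under Clique's recent-signer rule.

    A sealer may only sign if it did not sign any of the previous
    floor(n/2) blocks, where n is the total number of signers.
    (Simplified illustration, not geth's implementation.)
    """
    limit = len(signers) // 2
    recents = set(recent_sealers[-limit:])
    return [s for s in signers if s not in recents]

signers = ["node0", "node1", "node2", "node3", "node4"]

# Fork seen by Node0/Node4: blocks 14030 and 14031 were sealed by node4, node0.
print(eligible_sealers(signers, ["node4", "node0"]))  # ['node1', 'node2', 'node3']

# Fork seen by Node1/Node2: blocks 14030 and 14031 were sealed by node1, node2.
print(eligible_sealers(signers, ["node1", "node2"]))  # ['node0', 'node3', 'node4']
```

On each fork the only eligible sealers are the ones on the other fork plus Node3, which is offline, so neither side can produce block #14032.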

@abramsymons
Collaborator Author

abramsymons commented Jan 3, 2021

deadlock_resolver.py seems to be able to resolve all sorts of deadlocks.
The script uses the eth RPC API to get the latest blocks, the clique API to get the number of signers and calculate n/2+1, the debug API to rewind the node state with debug.setHead, and the miner API to restart the miner after the state is rewound. The script can be tested by creating a file named reset.it next to it. If the script is working correctly, it should call debug.setHead, restart mining by calling miner.stop and miner.start, and then delete the reset.it file.
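
A minimal sketch of that trigger-file test path (the RPC endpoint and helper function are assumptions; the reset.it file name and the n/2+1 rewind depth follow the description above):

```python
import os
import requests

RPC_URL = "http://127.0.0.1:8545"   # assumed local geth RPC endpoint
RESET_FILE = "reset.it"             # trigger file described above

def rpc(method, params=None):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(RPC_URL, json=payload, timeout=10).json()["result"]

def maybe_reset():
    """If the trigger file exists, rewind n/2+1 blocks and restart the miner."""
    if not os.path.exists(RESET_FILE):
        return
    head = int(rpc("eth_blockNumber"), 16)
    n = len(rpc("clique_getSigners"))            # number of signers
    rpc("miner_stop")
    rpc("debug_setHead", [hex(max(head - (n // 2 + 1), 0))])
    rpc("miner_start")
    os.remove(RESET_FILE)                        # signals the reset was performed
```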

@AT19947

AT19947 commented Jul 25, 2023

I'm in the same deadlock situation. I tried using deadlock_resolver.py and it solved the issue but the transactions in the reverted block were deleted. Is there any workaround to this issue? @abramsymons

@abramsymons
Collaborator Author

> I'm in the same deadlock situation. I tried using deadlock_resolver.py and it solved the issue but the transactions in the reverted block were deleted. Is there any workaround to this issue? @abramsymons

It's not an issue.
In the Clique PoA protocol, a block can only be considered finalised after n/2 + 1 blocks have been mined on top of it, and reorgs can remove such newer blocks even if no manual revert happens.
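
As a small illustration of the arithmetic, using the 5-sealer chain above (this just spells out the rule stated here, nothing from geth):

```python
def finalized_height(head, num_signers):
    """Highest block with at least n/2 + 1 blocks mined on top of it."""
    return head - (num_signers // 2 + 1)

# With 5 sealers and the chain stuck at #14031, blocks up to #14028 are final;
# rewinding n/2 + 1 = 3 blocks only removes blocks that were never final.
print(finalized_height(14031, 5))  # 14028
```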

@ckartik

ckartik commented Aug 8, 2024

> It's not an issue.

I'd push back on that not being an issue. We shouldn't delete the transactions in a block that has been reverted; they should be pushed back into the mempool. I think one approach is to have a data structure that caches some of the transactions in non-finalized blocks and releases them in the event of a deadlock-caused reorg, similar to how geth handles added/deleted txs in the event of a reorg (a rough sketch follows the snippet):

		deletedTxs []common.Hash
		addedTxs   []common.Hash
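
A rough sketch of that idea at the RPC level (not geth internals; the NonFinalizedTxCache class is hypothetical, and eth_getRawTransactionByHash / eth_sendRawTransaction are the standard geth methods assumed for fetching and re-broadcasting the cached transactions):

```python
import requests

RPC_URL = "http://127.0.0.1:8545"  # assumed local geth RPC endpoint

def rpc(method, params=None):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(RPC_URL, json=payload, timeout=10).json()["result"]

class NonFinalizedTxCache:
    """Cache raw transactions of non-finalized blocks so they can be
    re-broadcast after a deadlock-caused rewind (illustrative only)."""

    def __init__(self):
        self.txs_by_block = {}  # block number -> list of raw tx hex strings

    def record_block(self, number):
        block = rpc("eth_getBlockByNumber", [hex(number), True])
        self.txs_by_block[number] = [
            rpc("eth_getRawTransactionByHash", [tx["hash"]])
            for tx in block["transactions"]
        ]

    def prune_finalized(self, head, num_signers):
        # Drop blocks that now have n/2 + 1 blocks mined on top of them.
        final = head - (num_signers // 2 + 1)
        for number in [b for b in self.txs_by_block if b <= final]:
            del self.txs_by_block[number]

    def reinject_after_rewind(self, new_head):
        # Re-broadcast transactions from the blocks the rewind removed.
        for number in sorted(b for b in self.txs_by_block if b > new_head):
            for raw_tx in self.txs_by_block.pop(number):
                rpc("eth_sendRawTransaction", [raw_tx])
```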

@AT19947

AT19947 commented Aug 8, 2024

> I'm in the same deadlock situation. I tried using deadlock_resolver.py and it solved the issue but the transactions in the reverted block were deleted. Is there any workaround to this issue? @abramsymons

> It's not an issue. In the Clique PoA protocol, a block can only be considered finalised after n/2 + 1 blocks have been mined on top of it, and reorgs can remove such newer blocks even if no manual revert happens.

But Geth doesn't delete the transactions from the removed blocks.

@AT19947

AT19947 commented Aug 8, 2024

> > It's not an issue.
>
> I'd push back on that not being an issue. We shouldn't delete the transactions in a block that has been reverted; they should be pushed back into the mempool. I think one approach is to have a data structure that caches some of the transactions in non-finalized blocks and releases them in the event of a deadlock-caused reorg, similar to how geth handles added/deleted txs in the event of a reorg:
>
> 		deletedTxs []common.Hash
> 		addedTxs   []common.Hash

I think this is a good idea.
