
Deadlock conditions in the geth clique consensus implementation #9

Closed · abramsymons opened this issue Aug 27, 2020 · 8 comments

@abramsymons
Collaborator

abramsymons commented Aug 27, 2020

As reported in go-ethereum issue #18402, there is a random chance of deadlock, especially when the number of active sealers in the network is even.
We experienced this deadlock once when we had 4 sealers and once when we had 5 sealers and one of them was down.

@abramsymons
Collaborator Author

I think the best approach for us (until this issue gets solved by the geth team or we have enough resources to contribute a fix) will be to run a script beside the geth process on every node that checks every 5 seconds whether eth.blockNumber is still increasing; if it has stopped for, say, 15 seconds, the script uses debug.setHead to return the node state to n/2+1 blocks ago, where n is the number of sealers. A rough sketch of such a watchdog is included below.
We should try to reproduce the deadlock on a test chain and see whether having such a script beside the nodes can always resolve the deadlock automatically.
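
For illustration, here is a minimal sketch of that watchdog in Python, assuming the node exposes the eth, clique, debug and miner APIs over HTTP JSON-RPC; the endpoint URL, poll interval and stall threshold are placeholders rather than values from this thread:

```python
import time
import requests

RPC_URL = "http://127.0.0.1:8545"  # assumed local geth RPC endpoint

def rpc(method, params=None):
    """Minimal JSON-RPC helper."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(RPC_URL, json=payload, timeout=10).json()["result"]

def watchdog(poll=5, stall=15):
    """Rewind n/2+1 blocks and restart the miner if the chain stops growing."""
    last_block = int(rpc("eth_blockNumber"), 16)
    stalled_for = 0
    while True:
        time.sleep(poll)
        current = int(rpc("eth_blockNumber"), 16)
        if current > last_block:
            last_block, stalled_for = current, 0
            continue
        stalled_for += poll
        if stalled_for >= stall:
            n = len(rpc("clique_getSigners"))  # number of sealers
            new_head = max(current - (n // 2 + 1), 0)
            rpc("miner_stop")
            rpc("debug_setHead", [hex(new_head)])  # rewind local chain state
            rpc("miner_start")
            stalled_for = 0

if __name__ == "__main__":
    watchdog()
```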

@siftal
Collaborator

siftal commented Aug 28, 2020

I tried to reproduce the deadlock condition IDChain faced a few days ago.
We started a new chain with the following configuration (a sketch of the corresponding genesis settings follows the list):

  • 5 sealers, 1 of which was down and 4 of which were active
  • block time set to 1 second
  • wiggleTime set to 5 milliseconds to make the deadlock happen faster
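
For reference, the 1-second block time corresponds to the clique period in the genesis file, roughly as below (the chainId is a placeholder); wiggleTime is not a genesis setting but a constant in geth's clique implementation, so lowering it to 5 ms presumably required a patched build:

```json
{
  "config": {
    "chainId": 1337,
    "clique": {
      "period": 1,
      "epoch": 30000
    }
  }
}
```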

The chain stopped mining new blocks at #14031 and forked, with 2 sealers on one side and the other 2 on the other side.

signers:
0x1543c0ec0e16eb889c79bdebda269ebca55a9e2c (Node0)
0x432624c5c117e08741548287b61138ddb30be090 (Node1)
0x442c23ad2a6c06ab7b57b3dea6453e0ace388149 (Node2)
0x50ff0095e11aa5b30781a10da6171351423e3a9f (Node3)
0xdad03f2d609a9e6e33b45a4b34f55bffdb13aae7 (Node4)

| block | sealer on Node0/Node4 fork | sealer on Node1/Node2 fork |
|-------|----------------------------|----------------------------|
| 14027 | node2 | node2 |
| 14028 | node0 | node4 |
| 14029 | node1 | node0 |
| 14030 | node4 | node1 |
| 14031 | node0 | node2 |

Node0 and Node4 are waiting for Node1 or Node2 (or Node3) to seal the next block, while on the other hand Node1 and Node2 are waiting for Node0 or Node4 (or Node3) to seal the next block, so the chain is deadlocked.
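
For context, Clique only allows a sealer to sign a block if it has not signed any of the previous floor(n/2) blocks. A simplified sketch of that eligibility rule (an illustration, not geth's actual code) shows why both sides of the fork are stuck once Node3 is down:

```python
def eligible_sealers(signers, recent_sealers):
    """Sealers allowed to sign the next block under Clique's recent-signer rule.

    A sealer may only sign if it did not sign any of the previous
    floor(n/2) blocks, where n is the total number of signers.
    (Simplified illustration, not geth's implementation.)
    """
    limit = len(signers) // 2
    recents = set(recent_sealers[-limit:])
    return [s for s in signers if s not in recents]

signers = ["node0", "node1", "node2", "node3", "node4"]

# Fork seen by Node0/Node4: blocks 14030 and 14031 were sealed by node4, node0.
print(eligible_sealers(signers, ["node4", "node0"]))  # ['node1', 'node2', 'node3']

# Fork seen by Node1/Node2: blocks 14030 and 14031 were sealed by node1, node2.
print(eligible_sealers(signers, ["node1", "node2"]))  # ['node0', 'node3', 'node4']
```

On each fork the only eligible sealers are the ones on the other fork plus Node3, which is offline, so neither side can produce block #14032.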

@abramsymons
Collaborator Author

abramsymons commented Jan 3, 2021

deadlock_resolver.py seems to be able to resolve all sorts of deadlocks.
The script uses the eth RPC API to get the latest blocks, the clique API to get the number of signers and calculate n/2+1, the debug API to rewind the node state with debug.setHead, and the miner API to restart the miner after the state is rewound. The script can be tested by creating a file named reset.it next to it. If the script is working correctly, it should call debug.setHead, restart mining by calling miner.stop and miner.start, and then delete the reset.it file.
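
A minimal sketch of that trigger-file test path (the RPC endpoint and helper function are assumptions; the reset.it file name and the n/2+1 rewind depth follow the description above):

```python
import os
import requests

RPC_URL = "http://127.0.0.1:8545"   # assumed local geth RPC endpoint
RESET_FILE = "reset.it"             # trigger file described above

def rpc(method, params=None):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(RPC_URL, json=payload, timeout=10).json()["result"]

def maybe_reset():
    """If the trigger file exists, rewind n/2+1 blocks and restart the miner."""
    if not os.path.exists(RESET_FILE):
        return
    head = int(rpc("eth_blockNumber"), 16)
    n = len(rpc("clique_getSigners"))            # number of signers
    rpc("miner_stop")
    rpc("debug_setHead", [hex(max(head - (n // 2 + 1), 0))])
    rpc("miner_start")
    os.remove(RESET_FILE)                        # signals the reset was performed
```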

@AT19947

AT19947 commented Jul 25, 2023

I'm in the same deadlock situation. I tried using deadlock_resolver.py and it solved the issue but the transactions in the reverted block were deleted. Is there any workaround to this issue? @abramsymons

@abramsymons
Collaborator Author

> I'm in the same deadlock situation. I tried using deadlock_resolver.py and it solved the issue but the transactions in the reverted block were deleted. Is there any workaround to this issue? @abramsymons

It's not an issue.
In the Clique PoA protocol, a block can only be considered finalised after n/2 + 1 blocks have been mined on top of it, and reorgs can remove such newer blocks even if no manual revert happens.
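
As a small illustration of the arithmetic, using the 5-sealer chain above (this just spells out the rule stated here, nothing from geth):

```python
def finalized_height(head, num_signers):
    """Highest block with at least n/2 + 1 blocks mined on top of it."""
    return head - (num_signers // 2 + 1)

# With 5 sealers and the chain stuck at #14031, blocks up to #14028 are final;
# rewinding n/2 + 1 = 3 blocks only removes blocks that were never final.
print(finalized_height(14031, 5))  # 14028
```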

@ckartik

ckartik commented Aug 8, 2024

> It's not an issue.

I'd push back on that not being an issue. We shouldn't delete the transactions in a block that has been reverted; they should be pushed back into the mempool. I think one approach is to have a data structure that caches some of the transactions in non-finalized blocks and releases them in the event of a deadlock-caused reorg, similar to how geth handles added/deleted txs in the event of a reorg (a rough sketch follows the snippet):

		deletedTxs []common.Hash
		addedTxs   []common.Hash
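
A rough sketch of that idea at the RPC level (not geth internals; the NonFinalizedTxCache class is hypothetical, and eth_getRawTransactionByHash / eth_sendRawTransaction are the standard geth methods assumed for fetching and re-broadcasting the cached transactions):

```python
import requests

RPC_URL = "http://127.0.0.1:8545"  # assumed local geth RPC endpoint

def rpc(method, params=None):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(RPC_URL, json=payload, timeout=10).json()["result"]

class NonFinalizedTxCache:
    """Cache raw transactions of non-finalized blocks so they can be
    re-broadcast after a deadlock-caused rewind (illustrative only)."""

    def __init__(self):
        self.txs_by_block = {}  # block number -> list of raw tx hex strings

    def record_block(self, number):
        block = rpc("eth_getBlockByNumber", [hex(number), True])
        self.txs_by_block[number] = [
            rpc("eth_getRawTransactionByHash", [tx["hash"]])
            for tx in block["transactions"]
        ]

    def prune_finalized(self, head, num_signers):
        # Drop blocks that now have n/2 + 1 blocks mined on top of them.
        final = head - (num_signers // 2 + 1)
        for number in [b for b in self.txs_by_block if b <= final]:
            del self.txs_by_block[number]

    def reinject_after_rewind(self, new_head):
        # Re-broadcast transactions from the blocks the rewind removed.
        for number in sorted(b for b in self.txs_by_block if b > new_head):
            for raw_tx in self.txs_by_block.pop(number):
                rpc("eth_sendRawTransaction", [raw_tx])
```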

@AT19947

AT19947 commented Aug 8, 2024

> I'm in the same deadlock situation. I tried using deadlock_resolver.py and it solved the issue but the transactions in the reverted block were deleted. Is there any workaround to this issue? @abramsymons

> It's not an issue. In the Clique PoA protocol, a block can only be considered finalised after n/2 + 1 blocks have been mined on top of it, and reorgs can remove such newer blocks even if no manual revert happens.

But Geth doesn't delete the transactions from the removed blocks.

@AT19947

AT19947 commented Aug 8, 2024

> > It's not an issue.
>
> I'd push back on that not being an issue. We shouldn't delete the transactions in a block that has been reverted; they should be pushed back into the mempool. I think one approach is to have a data structure that caches some of the transactions in non-finalized blocks and releases them in the event of a deadlock-caused reorg, similar to how geth handles added/deleted txs in the event of a reorg:
>
> 		deletedTxs []common.Hash
> 		addedTxs   []common.Hash

I think this is a good idea.
