
Geth Clique Block Period 1, Fork of Chain inconsistency #21191

Closed
cyrilnavessamuel opened this issue Jun 8, 2020 · 9 comments

Comments

@cyrilnavessamuel

Hi there,

I am posting this bug, which was already discussed in the now-inactive #18402.

The bug I report here is a slight modification of the one posted in #18402.

I have also posted it on Stack Exchange, where it got little response: https://ethereum.stackexchange.com/questions/83357/geth-clique-block-period-1-fork-of-chain-inconsistency

System information

Geth version: 1.9.12
OS & Version: Linux
Clique PoA
Processor: Intel i7 - 3770
CPU: 8
RAM : 16 GB
TXPOOL.GlobalSlots: 100000000
Block Period 1 or 2 (Faster block times)
No of Sealers/ Nodes : 4
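
For context, a Clique genesis matching these settings might look roughly like this. This is a sketch, not the reporter's actual file: only the 1-second period and the ~250M gas limit come from this thread; the chainId, fork-block fields, and the extradata placeholder are assumptions.

```json
{
  "config": {
    "chainId": 1337,
    "homesteadBlock": 0,
    "eip150Block": 0,
    "eip155Block": 0,
    "eip158Block": 0,
    "byzantiumBlock": 0,
    "constantinopleBlock": 0,
    "petersburgBlock": 0,
    "clique": { "period": 1, "epoch": 30000 }
  },
  "difficulty": "1",
  "gasLimit": "0xEE6B280",
  "extradata": "0x<32 vanity bytes, 4 signer addresses, 65 seal bytes>",
  "alloc": {}
}
```

Here `0xEE6B280` is 250,000,000, i.e. the 250M gas limit discussed below.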

Expected behaviour

With block period 1 / 2 / 3 (shorter block times), a fork is created but it should be resolved, or at least reorged away.

Actual behaviour

In my private Clique network with 4 nodes (A, B, C, D) I noticed a fork in the chain with block period 1.

I noticed that it sometimes happens with block period 1 & 2.

I noticed that the fork happens at block 1678. Nodes A & D share one chain, while Nodes B & C share the other.

At block 1677, I noticed the difference in data between the 2 chains:

  1. The block hashes are different
  2. The block on one chain is an uncle block, while on the other chain it has 5000 txs included
  3. Both blocks have the same difficulty, 2, which means they were mined in turn
  4. A further complication: I noticed it was the same sealer who sealed both blocks.

This results in a fork of the network and, eventually, a stall; no reorg can resolve this deadlock.

Steps to reproduce the behaviour

  1. Set up a PoA Clique network with block period 1 / 2
  2. No of Nodes/Sealers: 4
  3. Send transactions rapidly to invoke a smart contract method, e.g. with the client available at: https://github.com/cyrilnavessamuel/ethereumissue
  4. You will notice the fork when the nodes stop mining; debug the issue by looking at the latest block header.

Backtrace

Latest block 1678 details in Node A:
[screenshot 1]
At block 1677 the fork occurred, with Node A & Node D on one fork and Node B & Node C on the other.
Notice:
  1. the difference in block hash,
  2. the different no of txs per block (one fork has 0 txs, the other 4791 txs),
  3. the same difficulty on both forks.
Fork 1 details: Node A & Node D
[screenshot 2]
Fork 2 details: Node B & Node C
[screenshot 3]
Note: the Fork 2 image differs because printing all 4791 txs made it too large.

@holiman
Contributor

holiman commented Jun 18, 2020

The curious thing here is that one miner mined two different variants of block 1676, one empty and the other with 4791 transactions. It could be that there's a race here somewhere, since the default behaviour of the miner is to start mining an empty block, and then update the work as transactions come in.

@holiman
Contributor

holiman commented Jun 18, 2020

A block time of 1 second, combined with a gas limit of 250M filled with thousands of transactions, probably hits some corner case in the miner.

@karalabe
Member

Your gas limit seems insanely large (250M), and your block time quite small (1-3s). How much time do nodes need to actually process one such block? If the block processing time exceeds the period, you can end up in very strange scenarios where signers start to race with themselves between importing and mining.
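
For anyone gathering the logs being asked for here, one way to gauge per-block processing time is to grep geth's import lines. The log format below is an assumption based on geth's default `Imported new chain segment` output, and `node_a.log` is a hypothetical filename; on a real node you would grep your actual log file.

```shell
# Write a hypothetical log excerpt to a file so the example is self-contained;
# on a real node, skip this and point grep at your node's log file.
cat > node_a.log <<'EOF'
INFO [06-08|12:00:01.120] Imported new chain segment  blocks=1 txs=4791 mgas=241.613 elapsed=1.208s
INFO [06-08|12:00:02.030] Imported new chain segment  blocks=1 txs=0    mgas=0.000   elapsed=2.113ms
EOF

# Pull out just the tx count and elapsed processing time per imported block.
grep 'Imported new chain segment' node_a.log | grep -o 'txs=[^ ]*\|elapsed=[^ ]*'
```

If `elapsed` regularly approaches or exceeds the 1-second period, that supports the race scenario described above.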

@karalabe
Member

Perhaps you could share normal operational logs, so we can see how heavy subsequent blocks are, how much time they take, etc.

@cyrilnavessamuel
Author

Thanks Martin and Peter for looking into the issue. I'll try to recreate the scenario and share the operational logs of it.
Cheers

@cyrilnavessamuel
Author

Hello All,

Sorry for the time it took to reproduce the issue.

This time I reproduced it with the same configuration (block period 1), with three scripts sending concurrently to 3 nodes of a 4-node blockchain network. The fork happened at block 109. Although the fork happened and the chain got stuck there, the same node sealing two variants of a block was not reproduced.

Node A & Node B on one Fork
[screenshot: Node A & Node B]

Node C & Node D on the other fork:
[screenshot: Node C & Node D]

Please find attached the log files of the 4 nodes in a zip file on Google Drive, since the file is somewhat big.

https://drive.google.com/file/d/18E-0DHueFt3OOYb51Y7s0M5tYEGzPZvV/view?usp=sharing

Thanks for looking,
Cheers

@fnaticwang

This problem has been around for almost a year and we still haven't solved it.

@sambacha

sambacha commented Jul 27, 2020

Try the following changes:

Genesis file changes

change gasLimit to: 1DCD6000
delete: "petersburgBlock": 0,

period time

*where* `OUT_OF_TURN_DELAY_MULTIPLIER` == 500ms
*where* `MIN_OUT_OF_TURN_DELAY` == your_desired_period (in this case `2000ms`)

`${G_PERIOD} == MIN_OUT_OF_TURN_DELAY + rand(SIGNER_COUNT * OUT_OF_TURN_DELAY_MULTIPLIER)`

so that you set `period: ${G_PERIOD}` in your genesis file.
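
Worked through with the numbers from this thread, the formula above gives a period in the 2-4 second range. This is a sketch of the suggested calculation only: the 4-sealer count comes from this issue, and note that Clique's genesis `period` field is specified in whole seconds, so a millisecond result still needs rounding.

```python
import random

OUT_OF_TURN_DELAY_MULTIPLIER = 500  # ms, per the suggestion above
MIN_OUT_OF_TURN_DELAY = 2000        # ms, the desired base period
SIGNER_COUNT = 4                    # sealers in this thread's network

random.seed(0)  # deterministic for illustration only

# G_PERIOD = MIN_OUT_OF_TURN_DELAY + rand(SIGNER_COUNT * OUT_OF_TURN_DELAY_MULTIPLIER)
g_period_ms = MIN_OUT_OF_TURN_DELAY + random.randrange(
    SIGNER_COUNT * OUT_OF_TURN_DELAY_MULTIPLIER
)
print(g_period_ms)               # somewhere in [2000, 4000) ms
print(round(g_period_ms / 1000)) # nearest whole second for the genesis "period"
```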

Make sure you're closing connections as well:
netstat -a | grep 8545 | wc -l

additional

Clique out-of-turn sealing is a known issue; see the Goerli testnet.

@holiman
Contributor

holiman commented Aug 20, 2020

I think the root problem here is

No of Nodes/ Sealers : 4

If you'd had 5 instead, the situation would be less prone to chain ties (2 sealers vs 2 sealers), and you'd have a natural tie-breaker.
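
This point can be sanity-checked against Clique's difficulty rules (DIFF_INTURN = 2, DIFF_NOTURN = 1, per EIP-225). The sketch below is my own illustration, not code from this thread: with 4 sealers split 2-vs-2, both forks keep sealing in-turn blocks, so their total difficulties advance in lockstep and neither side ever wins the fork choice.

```python
DIFF_INTURN, DIFF_NOTURN = 2, 1  # block difficulties from EIP-225 (Clique)

def total_difficulty(sealed_in_turn):
    """Total difficulty of a run of blocks, given which were sealed in-turn."""
    return sum(DIFF_INTURN if in_turn else DIFF_NOTURN for in_turn in sealed_in_turn)

# 4 sealers, 2-vs-2 split: each fork's two sealers alternate in-turn slots,
# so every block on both forks carries difficulty 2 and the totals stay tied.
fork_a = [True] * 10
fork_b = [True] * 10
print(total_difficulty(fork_a) == total_difficulty(fork_b))  # True: permanent tie

# With 5 sealers, no 2-way split is even, so one side accumulates more
# in-turn (difficulty-2) blocks over time and the minority fork reorgs away.
```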
