Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geth signing stops after a period of time #16406

Closed
REPTILEHAUS opened this issue Mar 28, 2018 · 9 comments
Closed

Geth signing stops after a period of time #16406

REPTILEHAUS opened this issue Mar 28, 2018 · 9 comments

Comments

@REPTILEHAUS
Copy link

REPTILEHAUS commented Mar 28, 2018

I have 5 clients setup (1 boot, 1 geth, 3 signers using Clique) - Most of the time they work no problem and mining starts, sometimes mining just stops with no warning nor issue/error message.

Another time I noticed that after coming to consensus on an additional external signer that mining will also stop - I have tried to find information on the order with which signers take turns and what can stop mining from happening, from looking at the code it seems that signers are chosen more so by timing than randomly but can a signer who misses their "turn" for whatever reason cripple or stop the other nodes from signing/validating blocks ? I would hazard a guess that this is not the case but I cant seem to tract down the issue and its is rendering my dev network useless. I know I'm not giving much to go on but there is no errors like usual such "as bad propagated block" all that happens is siging stops and everyone is on "signed recently, waiting to sign again"

for what its worth im running my signers with this

 geth --datadir ~/test-net/data --mine --cache=1024 --syncmode 'full' --bootnodes="enode://enodeurlgoeshere@ipgoeshere:port" --networkid 47592 --rpc --rpcapi admin,db,eth,debug,miner,net,txpool,personal,web3 --rpcport "8545" --rpcaddr "0.0.0.0" --rpccorsdomain "*" --unlock 914be90b2a0dd6a5b0789c2cee5836dd1f1c030 --password <(echo "password") console --nat "extip:pubIPgoeshere"
INFO [03-28|18:15:34] Commit new mining work                   number=21271 txs=0 uncles=2 elapsed=315.821µs
INFO [03-28|18:15:34] Successfully sealed new block            number=21271 hash=0e4d19…b0854b
INFO [03-28|18:15:34] 🔨 mined potential block                  number=21271 hash=0e4d19…b0854b
INFO [03-28|18:15:34] Commit new mining work                   number=21272 txs=0 uncles=2 elapsed=434.638µs
INFO [03-28|18:15:34] Signed recently, must wait for others 
INFO [03-28|18:23:07] Regenerated local transaction journal    transactions=0 accounts=0
INFO [03-28|19:23:07] Regenerated local transaction journal    transactions=0 accounts=0
INFO [03-28|20:23:07] Regenerated local transaction journal    transactions=0 accounts=0
INFO [03-28|21:23:07] Regenerated local transaction journal    transactions=0 accounts=0
@karalabe
Copy link
Member

Quick recap on sealing in Clique:

  • To seal a block, you have to be authorized to do so (doh)
  • To ensure there is ever only one majority chain (50% + 1), a signer is only allowed to seal a new block if they had not signed one for the last (len(signers)+1) / 2 blocks. So if there are 3 signers, each signer can only sign every 2nd block. If there are 4 signers, the majority consensus is 3 (50% + 1), so each signer can only sign every third block. This may be one error cause, because if 2 signers disconnect from the authorized 4, the chain cannot progress.
  • Since every authorized signer (who hasn't signed in a while) can create new blocks, we need to weigh them differently somehow, otherwise they all would just try to create new blocks and chaos would ensue. We have two mechanisms for that:
    • One one hand, signers take their turn as to who should sign next (e.g. if we're at block N and we have K signers, then N % K is in-turn); if you're in-turn, your block is weighed 2, otherwise 1. This ensures that the network always prefers in-turn blocks over random signings.
    • The second heuristic is that out-of-turn signers will delay signing by a bit to ensure the in-turn signer has a change to make and propagate its block.

There's a slight race condition possible if your in-turn signer is offline. E.g. you have 3 signers, and the in-turn one goes offline. In that case you need both the other signers to work in lockstep to progress the chain. But since both are out-of-turn, they will both mine blocks with 1 diff. Given a good timing, those blocks will be mined simultaneously, and both will be stuck on their own little world.

Could you provide some more infos about your setup, maybe th eligs from each individual signer when they get stuck? That might shed some light on the issue.

@REPTILEHAUS
Copy link
Author

Thanks for getting back and sure no problem I can provide logs output should I set verbosity to 4 and output to file ? My setup is basically 5 AWS EC2 instnaces, each running with Clique, genesis created with puppeth and a dedicated bootnode server also, 1 standard geth install with nothing fancy just the usual rpc commands much the same as the example signer command above. just running with verbosity 4 and i see this

DEBUG[03-29|15:11:36] HTTP registered                          service="&{backend:0xc420448ca0 mux:0xc420438780 quit:<nil> chainDb:0xc42006c9c0 events:0xc421366340 filtersMu:{state:0 sema:0} filters:map[]}" namespace=eth
DEBUG[03-29|15:11:36] HTTP registered                          service=&{eth:0xc420272000}                                                                                                                     namespace=admin
DEBUG[03-29|15:11:36] HTTP registered                          service=&{eth:0xc420272000}                                                                                                                     namespace=debug
DEBUG[03-29|15:11:36] HTTP registered                          service="&{config:0xc42007e380 eth:0xc420272000}"                                                                                               namespace=debug
DEBUG[03-29|15:11:36] HTTP registered                          service="&{net:0xc420432f00 networkVersion:47891}"                                                                                              namespace=net
INFO [03-29|15:11:36] HTTP endpoint opened                     url=http://0.0.0.0:8545 cors=* vhosts=localhost
INFO [03-29|15:11:36] RLPx listener up                         self=enode://c07a353261a8006acf8fdd07cc2b8c7a6f0f7ca96dd8acfec6c3e4f4454723979f4b2d19e46ebf2ce1fa18e2a87f91cf17c30f4c9c0517283336acef76e6c7b4@[::]:30303
DEBUG[03-29|15:11:36] Couldn't add port mapping                proto=tcp extport=30303 intport=30303 interface="UPnP or NAT-PMP" err="no UPnP or NAT-PMP router discovered"
INFO [03-29|15:11:36] IPC endpoint opened                      url=/root/test-net/data/geth.ipc
DEBUG[03-29|15:11:37] Found seed node in database              id=20ade6a5fbb9d043 addr=ipaddress1here:30303  age=3m14.025063702s
DEBUG[03-29|15:11:37] Found seed node in database              id=95ccc5e1a0e8b2eb addr=ipaddress2here:30303  age=7h59m12.025205753s
DEBUG[03-29|15:11:37] Found seed node in database              id=51c7fdf37abd52cf addr=ipaddress3here:30303 age=11m31.025278095s
DEBUG[03-29|15:11:37] Found seed node in database              id=48f6d73541396876 addr=ipaddress4here:30303 age=1.025338561s
INFO [03-29|15:11:37] Unlocked account                         address=0x3F82F2F04AF76178d2E4414a5CE92A71d1494EA9
INFO [03-29|15:11:37] Transaction pool price threshold updated price=18000000000
INFO [03-29|15:11:37] Etherbase automatically configured       address=0x3F82F2F04AF76178d2E4414a5CE92A71d1494EA9
INFO [03-29|15:11:37] Starting mining operation 
INFO [03-29|15:11:37] Commit new mining work                   number=21272 txs=0 uncles=0 elapsed=107.032µs
INFO [03-29|15:11:37] Signed recently, must wait for others 
DEBUG[03-29|15:11:37] Adding p2p peer                          name=Geth/v1.8.1-stable/l... addr=ipaddress4here:30303 peers=1
DEBUG[03-29|15:11:37] Ethereum peer connected                  id=48f6d73541396876 conn=dyndial name=Geth/v1.8.1-stable/linux-amd64/go1.9.2
DEBUG[03-29|15:11:37] Adding p2p peer                          name=Geth/v1.8.1-stable/l...                addr=ipaddress2here:30303  peers=2
DEBUG[03-29|15:11:37] Ethereum peer connected                  id=95ccc5e1a0e8b2eb conn=dyndial name=Geth/v1.8.1-stable/linux-amd64/go1.9.2
DEBUG[03-29|15:11:37] Adding p2p peer                          name=Geth/v1.8.1-stable/l...                addr=ipaddress1here:30303  peers=3
DEBUG[03-29|15:11:37] Ethereum peer connected                  id=20ade6a5fbb9d043 conn=dyndial name=Geth/v1.8.1-stable/linux-amd64/go1.9.2
DEBUG[03-29|15:11:37] Relaxed downloader QoS values            rtt=20s confidence=0.500 ttl=1m0s
DEBUG[03-29|15:11:37] Relaxed downloader QoS values            rtt=20s confidence=0.333 ttl=1m0s
DEBUG[03-29|15:11:42] Revalidated node                         b=14 id=20ade6a5fbb9d043
DEBUG[03-29|15:11:44] Revalidated node                         b=16 id=20ade6a5fbb9d043
DEBUG[03-29|15:11:49] Removed dead node                        b=11 id=95ccc5e1a0e8b2eb ip=ipaddress2here
DEBUG[03-29|15:11:50] Revalidated node                         b=14 id=20ade6a5fbb9d043
DEBUG[03-29|15:11:54] Recalculated downloader QoS values       rtt=20s confidence=0.667 ttl=1m0s
DEBUG[03-29|15:11:56] Revalidated node                         b=14 id=20ade6a5fbb9d043
DEBUG[03-29|15:12:03] Revalidated node                         b=14 id=20ade6a5fbb9d043
DEBUG[03-29|15:12:05] Removed dead node                        b=16 id=95ccc5e1a0e8b2eb ip=ipaddress2here
DEBUG[03-29|15:12:13] Revalidated node                         b=14 id=20ade6a5fbb9d043
DEBUG[03-29|15:12:14] Recalculated downloader QoS values       rtt=20s confidence=0.833 ttl=1m0s
DEBUG[03-29|15:12:21] Removed dead node                        b=11 id=95ccc5e1a0e8b2eb ip=ipaddress2here
DEBUG[03-29|15:12:31] Removed dead node                        b=11 id=95ccc5e1a0e8b2eb ip=ipaddress2here

at the moment all 3 signers have stopped, is there anywhere in particular where i can check for why signers fall over or a specific kind of error when nothing is coming up on the screen. I can send the 4 logs of output to you somehow if you need to see them

@REPTILEHAUS
Copy link
Author

REPTILEHAUS commented Mar 29, 2018

just stopped and started all 3 AWS signers and the 1 signer i have locally.. i started them back up again and see the block numbers in the output but they are all frozen like this, seems like 2 are stuck on block 2172 and 2 are stuck on block 2171

signer 1 (AWS)

INFO [03-29|15:32:09] Commit new mining work                   number=21272 txs=0 uncles=0 elapsed=117.728µs

signer 2 (local instance)

INFO [03-29|15:32:09] Commit new mining work                   number=21272 txs=0 uncles=0 elapsed=117.728µs

node 3 (AWS)

INFO [03-29|15:32:10] Commit new mining work                   number=21271 txs=0 uncles=0 elapsed=99.351µs

node 4 (AWS)

INFO [03-29|15:31:57] Commit new mining work                   number=21271 txs=0 uncles=0 elapsed=100.583µs

@tudyzhb
Copy link

tudyzhb commented Jun 16, 2018

Ref clique-seal of v1.8.11, I think there is no an effective mechanism to retry seal, when an in-turn/out-of-turn seal fail occur. So our dev network useless easily.

@lyhbarry
Copy link

Quick recap on sealing in Clique:

  • To seal a block, you have to be authorized to do so (doh)

  • To ensure there is ever only one majority chain (50% + 1), a signer is only allowed to seal a new block if they had not signed one for the last (len(signers)+1) / 2 blocks. So if there are 3 signers, each signer can only sign every 2nd block. If there are 4 signers, the majority consensus is 3 (50% + 1), so each signer can only sign every third block. This may be one error cause, because if 2 signers disconnect from the authorized 4, the chain cannot progress.

  • Since every authorized signer (who hasn't signed in a while) can create new blocks, we need to weigh them differently somehow, otherwise they all would just try to create new blocks and chaos would ensue. We have two mechanisms for that:

    • One one hand, signers take their turn as to who should sign next (e.g. if we're at block N and we have K signers, then N % K is in-turn); if you're in-turn, your block is weighed 2, otherwise 1. This ensures that the network always prefers in-turn blocks over random signings.
    • The second heuristic is that out-of-turn signers will delay signing by a bit to ensure the in-turn signer has a change to make and propagate its block.

There's a slight race condition possible if your in-turn signer is offline. E.g. you have 3 signers, and the in-turn one goes offline. In that case you need both the other signers to work in lockstep to progress the chain. But since both are out-of-turn, they will both mine blocks with 1 diff. Given a good timing, those blocks will be mined simultaneously, and both will be stuck on their own little world.

Could you provide some more infos about your setup, maybe th eligs from each individual signer when they get stuck? That might shed some light on the issue.

I'm also facing the same issue with 3 signers. Will the possibility of the occurrence of the race condition mentioned be reduced with more signers?

@marcosmartinez7
Copy link

marcosmartinez7 commented Jan 7, 2019

Experimenting the same issue with 6 sealers, i restarted each node but now im get stucked in

INFO [01-07|18:17:30.645] Etherbase automatically configured address=0x5Bc69DC4dba04b6955aC94BbdF129C3ce2d20D34
INFO [01-07|18:17:30.645] Commit new mining work number=488677 sealhash=a506ec…8cb403 uncles=0 txs=0 gas=0 fees=0 elapsed=133.76µs
INFO [01-07|18:17:30.645] Signed recently, must wait for others

Btw, some nodes are stucked on the 488677 and others are on 488676, same issue of @lyhbarry

Also, after 1 hour, the issue of local transactions is showed again

INFO [01-07|18:41:56.134] Signed recently, must wait for others 
INFO [01-07|19:41:42.125] Regenerated local transaction journal    transactions=0 accounts=0
INFO [01-07|18:41:56.134] Signed recently, must wait for others 

Any ideas?

My current version is:

Geth
Version: 1.8.17-stable
Git Commit: 8bbe72075e4e16442c4e28d999edee12e294329e
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10

I dont know why the

Regenerated local transaction journal transactions=0 accounts=0

error happens.. but since is the same issue as @lyhbarry is pretty sure that error makes the chain out of sync and now the signers can continue

Is there any information i can provide? The blockchain was running for 2 months without any issues..

Also, i see this message in one signer:

Synchronisation failed, dropping peer peer=7875a002affc775b err="retrieved hash chain is invalid"

And this information

WARN [12-24|00:43:33.989] Header broke chain ancestry number=447760 hash=d2c11f…10fd6a
WARN [12-24|00:43:33.990] Synchronisation failed, dropping peer peer=7875a002affc775b err="action from bad peer ignored"

WARN [12-26|18:33:19.540] Synchronisation failed, dropping peer    peer=7875a002affc775b err="action from bad peer ignored"
WARN [12-26|22:37:49.572] Synchronisation failed, dropping peer    peer=7875a002affc775b err="action from bad peer ignored"
WARN [12-26|23:57:10.541] Synchronisation failed, dropping peer    peer=7875a002affc775b err="action from bad peer ignored"
WARN [12-27|08:03:25.895] Synchronisation failed, dropping peer    peer=7875a002affc775b err="action from bad peer ignored"
WARN [12-31|06:41:24.760] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
WARN [12-31|06:42:04.810] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
WARN [12-31|06:42:14.335] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
WARN [12-31|06:42:43.001] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
WARN [12-31|06:43:24.631] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"

@holiman
Copy link
Contributor

holiman commented Oct 1, 2019

@REPTILEHAUS it appears that you have some sort of network partition, since two sealers are on ..71 and two are on ..72. Four signers split into two pairs will halt progress.
@marcosmartinez7 you seem to have a consensus failure between your sealers, in that one rejects that hash of the other. Perhaps configured erroneously, - 1.8.17 is rather old, and perhaps it doesn't support the same forks as the other node does?

@marcosmartinez7
Copy link

@holiman My issue was described better here #18402

It was a bug on the clique algorithm.

@no-response
Copy link

no-response bot commented Oct 31, 2019

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have more relevant information or answers to our questions so that we can investigate further.

@no-response no-response bot closed this as completed Oct 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants