core: improve shutdown synchronization in BlockChain #22853

holiman · 2021-05-10T18:46:28Z

This change removes misuses of sync.WaitGroup in BlockChain. Before this change,
block insertion modified the WaitGroup counter in order to ensure that Stop would wait
for pending operations to complete. This was racy and could even lead to crashes
if Stop was called at an unfortunate time. The issue is resolved by adding a specialized
'closable' mutex, which prevents chain modifications after stopping while also
synchronizing writers with each other.

holiman · 2021-05-11T06:53:58Z

core/blockchain.go

+	if bc.insertStopped() {
+		return errInsertionInterrupted
+	}


This was placed on the wrong method, though :)

fjl · 2021-05-20T09:28:57Z

We should consider taking the chain mutex in BlockChain.Stop.

holiman · 2021-05-20T11:06:39Z

On master, with an additional miner reward added to the clique engine just to force some state progression, I ran it maybe 10-20 times and then got this:

[user@work go-ethereum]$ ./build/bin/geth --dev --dev.period=-1
...

INFO [05-20|12:58:33.102] Commit new mining work                   number=2746 sealhash=b605c4..727369 uncles=0 txs=0 gas=0 fees=0 elapsed="188.744µs"
INFO [05-20|12:58:33.103] Writing cached state to disk             block=2745 hash=017219..dc5913 root=a88032..762a0e
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=4  size=732.00B   time="118.176µs" gcnodes=7851 gcsize=1.48MiB gctime=14.477017ms livenodes=385 livesize=74.25KiB
INFO [05-20|12:58:33.103] Writing cached state to disk             block=2744 hash=f2d8cc..f6d171 root=3215e0..f006c8
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=3  size=594.00B   time="29.044µs"  gcnodes=0    gcsize=0.00B   gctime=0s          livenodes=382 livesize=73.67KiB
INFO [05-20|12:58:33.103] Writing cached state to disk             block=2618 hash=37f688..df1c4e root=3ef902..c58cec
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=3  size=594.00B   time="29.954µs"  gcnodes=0    gcsize=0.00B   gctime=0s          livenodes=379 livesize=73.09KiB
INFO [05-20|12:58:33.103] Writing snapshot state to disk           root=247c30..745f18
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=0  size=0.00B     time="2.164µs"   gcnodes=0    gcsize=0.00B   gctime=0s          livenodes=379 livesize=73.09KiB
ERROR[05-20|12:58:33.105] Dangling trie nodes after full cleanup 
INFO [05-20|12:58:33.105] Blockchain stopped 
panic: assignment to entry in nil map

goroutine 43 [running]:
github.com/ethereum/go-ethereum/ethdb/memorydb.(*batch).Write(0xc00627b6b0, 0x0, 0x0)
	github.com/ethereum/go-ethereum/ethdb/memorydb/memorydb.go:233 +0x285
github.com/ethereum/go-ethereum/core.(*BlockChain).writeHeadBlock(0xc0000b7200, 0xc0046e1ef0)
	github.com/ethereum/go-ethereum/core/blockchain.go:788 +0x2df
github.com/ethereum/go-ethereum/core.(*BlockChain).writeBlockWithState(0xc0000b7200, 0xc0046e1ef0, 0x1f477e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0047af380, 0xc00055fc01, ...)
	github.com/ethereum/go-ethereum/core/blockchain.go:1564 +0xf58
github.com/ethereum/go-ethereum/core.(*BlockChain).WriteBlockWithState(0xc0000b7200, 0xc0046e1ef0, 0x1f477e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0047af380, 0x1, ...)
	github.com/ethereum/go-ethereum/core/blockchain.go:1443 +0x111
github.com/ethereum/go-ethereum/miner.(*worker).resultLoop(0xc000112900)
	github.com/ethereum/go-ethereum/miner/worker.go:626 +0x83b
created by github.com/ethereum/go-ethereum/miner.newWorker
	github.com/ethereum/go-ethereum/miner/worker.go:229 +0x57c

With this PR (plus the same miner reward on clique), I never got an error, but instead it detected that it's already shutting down:

[user@work go-ethereum]$ ./build/bin/geth --dev --dev.period=-1
...
INFO [05-20|12:54:44.628] Successfully sealed new block            number=2437 sealhash=74c6aa..61baa1 hash=de3c63..0ff803 elapsed="161.166µs"
INFO [05-20|12:54:44.628] 🔗 block reached canonical chain          number=2430 hash=18605d..e93a1b
INFO [05-20|12:54:44.628] 🔨 mined potential block                  number=2437 hash=de3c63..0ff803
ERROR[05-20|12:54:44.629] Failed writing block to chain            err="insertion is interrupted"
INFO [05-20|12:54:44.629] Commit new mining work                   number=2438 sealhash=676667..dc97d7 uncles=0 txs=0 gas=0 fees=0 elapsed=1.139ms

holiman · 2021-05-21T09:25:46Z

Tested this with the clique stress-test, where I added a signal to shut down all the stacks on ctrl-c. Without this PR, I consistently got:

INFO [05-21|11:15:23.655] Blockchain stopped
INFO [05-21|11:15:23.655] Writing cached state to disk             block=2 hash=e27baf..c04f83 root=26e0e4..73e6ed
INFO [05-21|11:15:23.656] Persisted trie from memory database      nodes=0   size=0.00B    time="3.48µs"    gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [05-21|11:15:23.656] Blockchain stopped
ERROR[05-21|11:15:24.000] Failed writing block to chain            err="unknown ancestor"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2


INFO [05-21|11:16:35.207] Persisted trie from memory database      nodes=0   size=0.00B    time="1.416µs"   gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [05-21|11:16:35.207] Blockchain stopped
ERROR[05-21|11:16:36.000] Failed writing block to chain            err="unknown ancestor"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2


INFO [05-21|11:18:09.649] Persisted trie from memory database      nodes=0   size=0.00B    time="1.488µs"   gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [05-21|11:18:09.649] Blockchain stopped
ERROR[05-21|11:18:10.000] Failed writing block to chain            err="unknown ancestor"
ERROR[05-21|11:18:11.325] Failed writing block to chain            err="unknown ancestor"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2

With this PR:

INFO [05-21|11:19:16.823] Blockchain stopped
ERROR[05-21|11:19:17.000] Failed writing block to chain            err="insertion is interrupted"
ERROR[05-21|11:19:18.325] Failed writing block to chain            err="insertion is interrupted"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2

(the panic is part of the stress-test, so that's fine, but the error is different)

holiman · 2021-08-12T10:54:55Z

rebased

holiman · 2021-08-12T10:55:55Z

@fjl now would be a good time to merge this, to give it a lot of time on master before the next release.

holiman · 2021-08-27T08:56:49Z

Rebased again, cc @fjl

piersy · 2021-10-04T11:16:46Z

Hi @holiman, I've had a look at this and think it's definitely a neater approach compared to #23673.

I think there might still be some shutdown sync issues with goroutines started from within the blockchain code.

This line starts a goroutine that tries to access the rawdb, and there doesn't appear to be any sync control around it, so I think this code could again end up accessing the db after it has been closed.

https://github.com/ethereum/go-ethereum/blob/c480e29c6fcc08d9af8bee10def07e8f01b595da/core/blockchain.go#L2334

Then there are also these two lines. I don't think they access the db in an unsafe way, but it's not easy to verify that, and it could change as the code evolves, so it would seem safer to me to also wait for their completion with the WaitGroup.
https://github.com/ethereum/go-ethereum/blob/c480e29c6fcc08d9af8bee10def07e8f01b595da/core/blockchain.go#L377
https://github.com/ethereum/go-ethereum/blob/c480e29c6fcc08d9af8bee10def07e8f01b595da/core/blockchain.go#L1848

fjl · 2021-10-06T11:09:08Z

@piersy Thanks for your additional review! I have checked the goroutines you mentioned. The indexBlocks goroutine is tracked by the tx indexer, which is tracked by the WaitGroup.

The go bc.update() is for the future blocks processor, and I have added it to the WaitGroup now.

The last one you mentioned is a goroutine created for the state prefetcher. It's not possible to track it properly at this time. We should probably move the goroutine creation/tracking into the prefetcher itself. For now, I think we can live with this one not being tracked, since the prefetcher does not write to the database.
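
For readers unfamiliar with the pattern, tracking a background loop in a WaitGroup can be sketched roughly as below. This is only an illustration: `service`, `newService`, `loop` and `ticks` are hypothetical names, not the actual BlockChain code. The point it shows is that `wg.Add` is called before the goroutine starts, and `Stop` only returns after the loop has exited.

```go
package main

import (
	"sync"
	"time"
)

// service is a hypothetical stand-in for a type that owns a background loop.
// It is not taken from go-ethereum.
type service struct {
	quit chan struct{}  // closed by Stop to ask the loop to exit
	wg   sync.WaitGroup // tracks the background loop

	mu    sync.Mutex
	ticks int // work counter, only here to make the loop observable
}

// newService starts the loop and registers it with the WaitGroup.
func newService(interval time.Duration) *service {
	s := &service{quit: make(chan struct{})}
	s.wg.Add(1) // Add runs before the goroutine starts, never concurrently with Wait.
	go s.loop(interval)
	return s
}

// loop does periodic work until quit is closed.
func (s *service) loop(interval time.Duration) {
	defer s.wg.Done()

	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s.mu.Lock()
			s.ticks++
			s.mu.Unlock()
		case <-s.quit:
			return
		}
	}
}

// Stop signals the loop to exit and blocks until it has returned.
// It must be called at most once, since closing quit twice would panic.
func (s *service) Stop() {
	close(s.quit)
	s.wg.Wait()
}
```

Once `Stop` has returned, nothing in the loop can touch shared resources any more, so it is safe to close them afterwards.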

fjl · 2021-10-06T13:31:23Z

Hmm, so this won't work just yet. Adding futureBlocksLoop into the WaitGroup creates a potential deadlock because procFutureBlocks calls InsertChain, which may take the chain mutex.

Here's the challenge with shutdown sync: within Stop, we want to ensure that all calls to InsertChain and related methods have left the critical section, and new calls cannot enter it. When we discussed removing the wg.Add calls, we were hoping this could be achieved by simply taking the chain mutex in Stop. Since this mutex is held during all chain modifications, it would give us the exclusion we need. However, simple exclusion is not all we need here. While Stop is running, we also need to deflect all attempts to insert new chain data immediately.

I think what we need is some kind of closable mutex. All the chain mutations would attempt to take this mutex, and return an error if it is closed. We'd close the mutex in Stop.
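
To make the idea concrete, here is a minimal sketch of such a closable mutex built on a one-slot channel. It is an assumption about how this could look, not necessarily what ends up in the PR: `closableMutex`, `Lock`, `Close`, `errStopped` and `insert` are hypothetical names chosen for illustration.

```go
package main

import "errors"

// errStopped is a hypothetical error returned once the mutex is closed.
var errStopped = errors.New("chain is stopped")

// closableMutex is a mutex that can also be closed.
// After Close, every Lock attempt fails instead of blocking forever.
type closableMutex struct {
	ch chan struct{} // holds one token while the mutex is unlocked
}

func newClosableMutex() *closableMutex {
	ch := make(chan struct{}, 1)
	ch <- struct{}{} // token present means "unlocked"
	return &closableMutex{ch: ch}
}

// Lock blocks until the mutex is acquired or closed.
// It returns false if the mutex has been closed.
func (m *closableMutex) Lock() bool {
	_, ok := <-m.ch
	return ok
}

// Unlock returns the token. Calling it without holding the mutex panics.
func (m *closableMutex) Unlock() {
	select {
	case m.ch <- struct{}{}:
	default:
		panic("unlock of unlocked closableMutex")
	}
}

// Close waits for the current holder to unlock, then closes the mutex.
// Goroutines blocked in Lock wake up and get false.
func (m *closableMutex) Close() {
	if _, ok := <-m.ch; !ok {
		panic("close of closed closableMutex")
	}
	close(m.ch)
}

// insert is a hypothetical chain mutation guarded by the mutex.
func insert(m *closableMutex, chain *[]int, block int) error {
	if !m.Lock() {
		return errStopped // deflect the write immediately during or after stop
	}
	defer m.Unlock()
	*chain = append(*chain, block)
	return nil
}
```

In this sketch, Stop would call `Close`, which gives both properties at once: it waits for the writer currently inside the critical section, and every later `insert` returns `errStopped` without blocking.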

holiman · 2021-10-06T14:58:25Z

You missed this:

func (bc *BlockChain) ResetWithGenesisBlock(genesis *types.Block) error {
	// Dump the entire block chain and purge the caches
	if err := bc.SetHead(0); err != nil {
		return err
	}
	bc.chainmu.Lock()

and

func testHeaderChainImport(chain []*types.Header, blockchain *BlockChain) error {
	for _, header := range chain {
		// Try and validate the header
		if err := blockchain.engine.VerifyHeader(blockchain, header, false); err != nil {
			return err
		}
		// Manually insert the header into the database, but don't reorganise (allows subsequent testing)
		blockchain.chainmu.Lock()

and inside


// testBlockChainImport tries to process a chain of blocks, writing them into
// the database if successful.
func testBlockChainImport(chain types.Blocks, blockchain *BlockChain) error {

This change removes misuses of sync.WaitGroup in BlockChain. Before this change, block insertion modified the WaitGroup counter in order to ensure that Stop would wait for pending operations to complete. This was racy and could even lead to crashes if Stop was called at an unfortunate time. The issue is resolved by adding a specialized 'closable' mutex, which prevents chain modifications after stopping while also synchronizing writers with each other. Co-authored-by: Felix Lange <fjl@twurst.com>

holiman requested review from karalabe and rjl493456442 as code owners May 10, 2021 18:46

rjl493456442 reviewed May 11, 2021

holiman commented May 11, 2021

holiman mentioned this pull request May 20, 2021

concurrent map read and map write #22892

Closed

holiman force-pushed the graceful branch from 1cfdd8e to aa54a53 Compare May 20, 2021 09:38

fjl reviewed May 26, 2021

holiman assigned fjl Jun 30, 2021

holiman force-pushed the graceful branch from aa54a53 to ed92597 Compare June 30, 2021 09:13

fjl changed the title ~~core: don't write blocks after insertion is stopped~~ core: remove misuse of WaitGroup in BlockChain and improve shutdown sync Jun 30, 2021

holiman force-pushed the graceful branch from ed92597 to 1b0fd29 Compare August 12, 2021 10:54

holiman force-pushed the graceful branch from 1b0fd29 to c480e29 Compare August 27, 2021 08:56

holiman mentioned this pull request Oct 1, 2021

core: Synchronize wait group access in blockchain #23673

Closed

fjl added this to the 1.10.10 milestone Oct 1, 2021

holiman and others added 6 commits October 6, 2021 13:08

core: don't write blocks after insertion is stopped

302d733

core: fix check at wrong place

f74cb45

core: remove bad use of waitgroup in blockchain

2f19b6e

core: track future blocks loop in BlockChain.wg

c9e2f0c

core: add some blank lines in maintainTxIndex

fc984b4

core: add some blank lines in insertChain

15823b6

fjl force-pushed the graceful branch from 5e20681 to 15823b6 Compare October 6, 2021 11:09

fjl changed the title ~~core: remove misuse of WaitGroup in BlockChain and improve shutdown sync~~ core: improve shutdown synchronization in BlockChain Oct 6, 2021

fjl added 4 commits October 6, 2021 16:06

core: implement and use closable mutex for chain stop

8530bc7

internal/syncx: update copyright year

f171df5

core: update comment in Stop

182d106

core: use errChainStopped in ExportN

c02fea5

fjl added 4 commits October 7, 2021 13:51

core: add close handling in ResetWithGenesisBlock

aed6b62

internal/syncx: improve docs/API

fe0c759

core: update for new ClosableMutex API

8aad281

core: update BlockChain field comments

c00e3ca

fjl merged commit edb1937 into ethereum:master Oct 7, 2021

sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Oct 7, 2021

core: improve shutdown synchronization in BlockChain (ethereum#22853)

a0e38af

piersy mentioned this pull request Oct 13, 2021

trie: Fix concurrent map access on trie.dirties #23674

Closed

piersy mentioned this pull request Oct 20, 2021

Add start stop e2e test celo-org/celo-blockchain#1705

Merged

kyrie-yl mentioned this pull request Dec 16, 2021

Fatal error: concurrent map read and map write bnb-chain/bsc#463

Closed

yoomee1313 mentioned this pull request Jan 20, 2022

Get rid of dual mutex in blockchain.go klaytn/klaytn#1099

Merged

This was referenced Sep 23, 2022

Metadium to master METADIUM/go-metadium#24

Closed

Metadium to master METADIUM/go-metadium#25

Merged
