Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: improve shutdown synchronization in BlockChain #22853

Merged
merged 14 commits into from
Oct 7, 2021

Conversation

holiman
Copy link
Contributor

@holiman holiman commented May 10, 2021

This change removes misuses of sync.WaitGroup in BlockChain. Before this change,
block insertion modified the WaitGroup counter in order to ensure that Stop would wait
for pending operations to complete. This was racy and could even lead to crashes
if Stop was called at an unfortunate time. The issue is resolved by adding a specialized
'closable' mutex, which prevents chain modifications after stopping while also
synchronizing writers.

core/blockchain.go Show resolved Hide resolved
Comment on lines 1428 to 1430
if bc.insertStopped() {
return errInsertionInterrupted
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was placed on the wrong method, though :)

@fjl
Copy link
Contributor

fjl commented May 20, 2021

We should consider taking the chain mutex in Blockchain.Stop

@holiman
Copy link
Contributor Author

holiman commented May 20, 2021

on master, with also an addition miner reward on the clique engine, just to force some state progression. I ran it maybe 10-20 times, then got something:

[user@work go-ethereum]$ ./build/bin/geth --dev --dev.period=-1
...

INFO [05-20|12:58:33.102] Commit new mining work                   number=2746 sealhash=b605c4..727369 uncles=0 txs=0 gas=0 fees=0 elapsed="188.744µs"
INFO [05-20|12:58:33.103] Writing cached state to disk             block=2745 hash=017219..dc5913 root=a88032..762a0e
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=4  size=732.00B   time="118.176µs" gcnodes=7851 gcsize=1.48MiB gctime=14.477017ms livenodes=385 livesize=74.25KiB
INFO [05-20|12:58:33.103] Writing cached state to disk             block=2744 hash=f2d8cc..f6d171 root=3215e0..f006c8
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=3  size=594.00B   time="29.044µs"  gcnodes=0    gcsize=0.00B   gctime=0s          livenodes=382 livesize=73.67KiB
INFO [05-20|12:58:33.103] Writing cached state to disk             block=2618 hash=37f688..df1c4e root=3ef902..c58cec
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=3  size=594.00B   time="29.954µs"  gcnodes=0    gcsize=0.00B   gctime=0s          livenodes=379 livesize=73.09KiB
INFO [05-20|12:58:33.103] Writing snapshot state to disk           root=247c30..745f18
INFO [05-20|12:58:33.103] Persisted trie from memory database      nodes=0  size=0.00B     time="2.164µs"   gcnodes=0    gcsize=0.00B   gctime=0s          livenodes=379 livesize=73.09KiB
ERROR[05-20|12:58:33.105] Dangling trie nodes after full cleanup 
INFO [05-20|12:58:33.105] Blockchain stopped 
panic: assignment to entry in nil map

goroutine 43 [running]:
github.com/ethereum/go-ethereum/ethdb/memorydb.(*batch).Write(0xc00627b6b0, 0x0, 0x0)
	github.com/ethereum/go-ethereum/ethdb/memorydb/memorydb.go:233 +0x285
github.com/ethereum/go-ethereum/core.(*BlockChain).writeHeadBlock(0xc0000b7200, 0xc0046e1ef0)
	github.com/ethereum/go-ethereum/core/blockchain.go:788 +0x2df
github.com/ethereum/go-ethereum/core.(*BlockChain).writeBlockWithState(0xc0000b7200, 0xc0046e1ef0, 0x1f477e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0047af380, 0xc00055fc01, ...)
	github.com/ethereum/go-ethereum/core/blockchain.go:1564 +0xf58
github.com/ethereum/go-ethereum/core.(*BlockChain).WriteBlockWithState(0xc0000b7200, 0xc0046e1ef0, 0x1f477e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0047af380, 0x1, ...)
	github.com/ethereum/go-ethereum/core/blockchain.go:1443 +0x111
github.com/ethereum/go-ethereum/miner.(*worker).resultLoop(0xc000112900)
	github.com/ethereum/go-ethereum/miner/worker.go:626 +0x83b
created by github.com/ethereum/go-ethereum/miner.newWorker
	github.com/ethereum/go-ethereum/miner/worker.go:229 +0x57c

With this PR (plus the same miner reward on clique), I never got an error, but instead it detected that it's already shutting down:

[user@work go-ethereum]$ ./build/bin/geth --dev --dev.period=-1
...
INFO [05-20|12:54:44.628] Successfully sealed new block            number=2437 sealhash=74c6aa..61baa1 hash=de3c63..0ff803 elapsed="161.166µs"
INFO [05-20|12:54:44.628] 🔗 block reached canonical chain          number=2430 hash=18605d..e93a1b
INFO [05-20|12:54:44.628] 🔨 mined potential block                  number=2437 hash=de3c63..0ff803
ERROR[05-20|12:54:44.629] Failed writing block to chain            err="insertion is interrupted"
INFO [05-20|12:54:44.629] Commit new mining work                   number=2438 sealhash=676667..dc97d7 uncles=0 txs=0 gas=0 fees=0 elapsed=1.139ms


@holiman
Copy link
Contributor Author

holiman commented May 21, 2021

Tested this with the clique stress-test, where I added a signal to shut down all the stacks on ctrl-c. Without this PR, I consistently got:

INFO [05-21|11:15:23.655] Blockchain stopped
INFO [05-21|11:15:23.655] Writing cached state to disk             block=2 hash=e27baf..c04f83 root=26e0e4..73e6ed
INFO [05-21|11:15:23.656] Persisted trie from memory database      nodes=0   size=0.00B    time="3.48µs"    gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [05-21|11:15:23.656] Blockchain stopped
ERROR[05-21|11:15:24.000] Failed writing block to chain            err="unknown ancestor"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2


INFO [05-21|11:16:35.207] Persisted trie from memory database      nodes=0   size=0.00B    time="1.416µs"   gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [05-21|11:16:35.207] Blockchain stopped
ERROR[05-21|11:16:36.000] Failed writing block to chain            err="unknown ancestor"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2


INFO [05-21|11:18:09.649] Persisted trie from memory database      nodes=0   size=0.00B    time="1.488µs"   gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [05-21|11:18:09.649] Blockchain stopped
ERROR[05-21|11:18:10.000] Failed writing block to chain            err="unknown ancestor"
ERROR[05-21|11:18:11.325] Failed writing block to chain            err="unknown ancestor"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2

With this PR:

INFO [05-21|11:19:16.823] Blockchain stopped
ERROR[05-21|11:19:17.000] Failed writing block to chain            err="insertion is interrupted"
ERROR[05-21|11:19:18.325] Failed writing block to chain            err="insertion is interrupted"
panic: insufficient funds for gas * price + value

goroutine 1 [running]:
main.main()
        /home/user/go/src/github.com/ethereum/go-ethereum/miner/stress/clique/clique.go:136 +0xc14
exit status 2

(the panic is part of the stress-test, so that's fine, but the error is different)

core/blockchain.go Outdated Show resolved Hide resolved
@fjl fjl changed the title core: don't write blocks after insertion is stopped core: remove misuse of WaitGroup in BlockChain and improve shutdown sync Jun 30, 2021
@holiman
Copy link
Contributor Author

holiman commented Aug 12, 2021

rebased

@holiman
Copy link
Contributor Author

holiman commented Aug 12, 2021

@fjl now would be a good time to merge this, to give it lot of time on master before next release

@holiman
Copy link
Contributor Author

holiman commented Aug 27, 2021

Rebased again, cc @fjl

@fjl fjl added this to the 1.10.10 milestone Oct 1, 2021
@piersy
Copy link
Contributor

piersy commented Oct 4, 2021

Hi @holiman, I've had a look at this and think its definitely a neater approach compared to #23673.

I think there might still be some shutdown sync issues with go-routines started from within the blockchain code.

This line starts a go routine that tries to access the rawdb, and there doesn't appear to be any sync control around it, so I think this code could again result in trying to access the db after it has been closed.

https://github.com/ethereum/go-ethereum/blob/c480e29c6fcc08d9af8bee10def07e8f01b595da/core/blockchain.go#L2334

Then there's also these 2 lines, which I don't think would access the db in an unsafe way, but its not easy to really verify that, and that could change when code changes are made, so it would seem safer to me to also wait for completion of these with the waitgroup.
https://github.com/ethereum/go-ethereum/blob/c480e29c6fcc08d9af8bee10def07e8f01b595da/core/blockchain.go#L377
https://github.com/ethereum/go-ethereum/blob/c480e29c6fcc08d9af8bee10def07e8f01b595da/core/blockchain.go#L1848

@fjl
Copy link
Contributor

fjl commented Oct 6, 2021

@piersy Thanks for your additional review! I have checked the goroutines you mentioned. The indexBlocks goroutine is tracked by the tx indexer, which is tracked by the WaitGroup.

The go bc.update() is for the future blocks processor, and I have added it to the WaitGroup now.

The last one you mentioned is a goroutine created for the state prefetcher. It's not possible to track it properly at this time. We should probably move the goroutine creation/tracking into the prefetcher itself. For now, I think we can live with this one not being tracked, the prefetcher does not write to the database.

@fjl fjl changed the title core: remove misuse of WaitGroup in BlockChain and improve shutdown sync core: improve shutdown synchronization in BlockChain Oct 6, 2021
@fjl
Copy link
Contributor

fjl commented Oct 6, 2021

Hmm, so this won't work just yet. Adding futureBlocksLoop into the WaitGroup creates a potential deadlock because procFutureBlocks calls InsertChain, which may take the chain mutex.

Here's the challenge with shutdown sync: within Stop, we want to ensure that all calls to InsertChain and related methods have left the critical section, and new calls cannot enter it. When we discussed removing the wg.Add calls, we were hoping this could be achieved by simply taking the chain mutex in Stop. Since this mutex is held during all chain modifications, it would give us the exclusion we need. However, simple exclusion is not all we need here. While Stop is running, we also need to deflect all attempts to insert new chain data immediately.

I think what we need is some kind of closable mutex. All the chain mutations would attempt to take this mutex, and return an error if it is closed. We'd close the mutex in Stop.

@holiman
Copy link
Contributor Author

holiman commented Oct 6, 2021

You missed this:

func (bc *BlockChain) ResetWithGenesisBlock(genesis *types.Block) error {
	// Dump the entire block chain and purge the caches
	if err := bc.SetHead(0); err != nil {
		return err
	}
	bc.chainmu.Lock()

and

func testHeaderChainImport(chain []*types.Header, blockchain *BlockChain) error {
	for _, header := range chain {
		// Try and validate the header
		if err := blockchain.engine.VerifyHeader(blockchain, header, false); err != nil {
			return err
		}
		// Manually insert the header into the database, but don't reorganise (allows subsequent testing)
		blockchain.chainmu.Lock()

and inside


// testBlockChainImport tries to process a chain of blocks, writing them into
// the database if successful.
func testBlockChainImport(chain types.Blocks, blockchain *BlockChain) error {

@fjl fjl merged commit edb1937 into ethereum:master Oct 7, 2021
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Oct 7, 2021
tony-ricciardi pushed a commit to tony-ricciardi/go-ethereum that referenced this pull request Jan 20, 2022
…hereum#22853)

This change removes misuses of sync.WaitGroup in BlockChain. Before this change,
block insertion modified the WaitGroup counter in order to ensure that Stop would wait
for pending operations to complete. This was racy and could even lead to crashes
if Stop was called at an unfortunate time. The issue is resolved by adding a specialized
'closable' mutex, which prevents chain modifications after stopping while also
synchronizing writers with each other.

Co-authored-by: Felix Lange <fjl@twurst.com>
yongjun925 pushed a commit to DODOEX/go-ethereum that referenced this pull request Dec 3, 2022
This change removes misuses of sync.WaitGroup in BlockChain. Before this change,
block insertion modified the WaitGroup counter in order to ensure that Stop would wait
for pending operations to complete. This was racy and could even lead to crashes
if Stop was called at an unfortunate time. The issue is resolved by adding a specialized
'closable' mutex, which prevents chain modifications after stopping while also
synchronizing writers with each other.

Co-authored-by: Felix Lange <fjl@twurst.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants