A potential deadlock of the client which results in block syncing halt #22131

Closed
unclezoro opened this issue Jan 7, 2021 · 7 comments

@unclezoro

System information

Geth version: 1.9.13
OS & Version: Linux

Expected behaviour

Client syncs blocks normally.

Actual behaviour

Client fails to sync blocks.

Steps to reproduce the behaviour

First, change the code to make the issue easier to reproduce:

func (sub *Subscription) Unsubscribe() {
	sub.unsubOnce.Do(func() {
	uninstallLoop:
		for {
			select {
			case sub.es.uninstall <- sub.f:
				// newly added line
				time.Sleep(3 * time.Second)
				break uninstallLoop
			case ...
			}
		}
		<-sub.Err()
	})
}

Send the eth_newPendingTransactionFilter request multiple times (assuming the node's HTTP RPC is enabled on the default http://localhost:8545 endpoint):

curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_newPendingTransactionFilter","params":[],"id":73}' http://localhost:8545

Backtrace

Here is the goroutine pprof dump:
goroutine.txt

  1. The lock here is not released:

According to the pprof, the code is blocked on /server/bsc/event/feed.go:170.

1 @ 0x447b10 0x4573db 0x457eda 0x4b5630 0x6e90a4 0xcf0a5f 0xcf3800 0xcf2544 0xf68a60 0xf3aba5 0x475d61
#       0x457ed9        reflect.rselect+0x389                                                           /usr/lib/golang/src/runtime/select.go:542
#       0x4b562f        reflect.Select+0x16f                                                            /usr/lib/golang/src/reflect/value.go:2229
#       0x6e90a3        github.com/ethereum/go-ethereum/event.(*Feed).Send+0x593                        /server/bsc/event/feed.go:170
#       0xcf0a5e        github.com/ethereum/go-ethereum/core.(*BlockChain).writeBlockWithState+0xe2e    /server/bsc/core/blockchain.go:1470
#       0xcf37ff        github.com/ethereum/go-ethereum/core.(*BlockChain).insertChain+0x118f           /server/bsc/core/blockchain.go:1757
  2. The subscriber is supposed to consume from that channel so that the Feed.Send above can proceed.
    As we checked, the goroutines spawned by NewPendingTransactionFilter are instead pending on the filtersMu lock (a sketch of that goroutine follows the trace):
#	0x4580a6	sync.runtime_SemacquireMutex+0x46									/usr/lib/golang/src/runtime/sema.go:71
#	0x49408b	sync.(*Mutex).lockSlow+0xfb										/usr/lib/golang/src/sync/mutex.go:138
#	0xf13256	sync.(*Mutex).Lock+0x296										/usr/lib/golang/src/sync/mutex.go:81
#	0xf130e1	github.com/ethereum/go-ethereum/eth/filters.(*PublicFilterAPI).NewPendingTransactionFilter.func1+0x121	/server/bsc/eth/filters/api.go:119
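For context, this is roughly what the goroutine spawned by NewPendingTransactionFilter does (a paraphrased sketch of eth/filters/api.go, not the exact upstream source): it receives a batch of pending-transaction hashes and then has to take filtersMu before it can store them and loop back for the next batch.

go func() {
	for {
		select {
		case ph := <-pendingTxs:
			// Blocked here in the trace above: the goroutine cannot take
			// filtersMu, so it never returns to the select to receive the
			// next batch that eventLoop wants to deliver.
			api.filtersMu.Lock()
			if f, found := api.filters[pendingTxSub.ID]; found {
				// note: this f.hashes is the API-side slice buffer, not the
				// channel that eventLoop sends on
				f.hashes = append(f.hashes, ph...)
			}
			api.filtersMu.Unlock()
		case <-pendingTxSub.Err():
			// unsubscribe/cleanup handling elided
			return
		}
	}
}()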
  3. The routine that holds the filtersMu lock is timeoutLoop:
#	0xf142b3	github.com/ethereum/go-ethereum/eth/filters.(*Subscription).Unsubscribe.func1+0xf3	/server/bsc/eth/filters/filter_system.go:164
#	0x4944d2	sync.(*Once).doSlow+0xe2								/usr/lib/golang/src/sync/once.go:66
#	0xf0f704	sync.(*Once).Do+0x64									/usr/lib/golang/src/sync/once.go:57
#	0xf0f6bd	github.com/ethereum/go-ethereum/eth/filters.(*Subscription).Unsubscribe+0x1d		/server/bsc/eth/filters/filter_system.go:157
#	0xf09d47	github.com/ethereum/go-ethereum/eth/filters.(*PublicFilterAPI).timeoutLoop+0xf7		/server/bsc/eth/filters/api.go:88

The Unsubscribe code is:

func (sub *Subscription) Unsubscribe() {
	sub.unsubOnce.Do(func() {
	uninstallLoop:
		for {
			// write uninstall request and consume logs/hashes. This prevents
			// the eventLoop broadcast method to deadlock when writing to the
			// filter event channel while the subscription loop is waiting for
			// this method to return (and thus not reading these events).
			select {
			case sub.es.uninstall <- sub.f:
				break uninstallLoop
			case <-sub.f.logs:
			case <-sub.f.hashes:
			case <-sub.f.headers:
			}
		}

		// wait for filter to be uninstalled in work loop before returning
		// this ensures that the manager won't use the event channel which
		// will probably be closed by the client asap after this method returns.
		<-sub.Err()
	})
}

Obviously, it failed to perform sub.es.uninstall <- sub.f.
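The reason filtersMu stays locked while Unsubscribe is stuck here is that the caller, timeoutLoop, takes the lock before the call and only releases it after the whole sweep. Roughly (a paraphrased sketch, not the exact upstream source):

func (api *PublicFilterAPI) timeoutLoop() {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		api.filtersMu.Lock()
		for id, f := range api.filters {
			select {
			case <-f.deadline.C:
				// Unsubscribe runs with filtersMu still held; in the scenario
				// above it blocks on sub.es.uninstall <- sub.f.
				f.s.Unsubscribe()
				delete(api.filters, id)
			default:
			}
		}
		api.filtersMu.Unlock()
	}
}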

  4. eventLoop is responsible for consuming the sub.es.uninstall channel.
    That routine is busy in handleTxsEvent:
#	0xf1128b	github.com/ethereum/go-ethereum/eth/filters.(*EventSystem).handleTxsEvent+0x20b	/server/bsc/eth/filters/filter_system.go:350
#	0xf12f90	github.com/ethereum/go-ethereum/eth/filters.(*EventSystem).eventLoop+0xa60	/server/bsc/eth/filters/filter_system.go:461

It pushes hashes to the f.hashes channel, and it is now blocked on that send.

func (es *EventSystem) handleTxsEvent(filters filterIndex, ev core.NewTxsEvent) {
	hashes := make([]common.Hash, 0, len(ev.Txs))
	for _, tx := range ev.Txs {
		hashes = append(hashes, tx.Hash())
	}
	for _, f := range filters[PendingTransactionsSubscription] {
		// This send blocks while no one is receiving from f.hashes, which is
		// exactly the state the filter goroutine is stuck in above.
		f.hashes <- hashes
	}
}
  5. What happened to the f.hashes channel?
    In step 3 above, Unsubscribe may break out of uninstallLoop without consuming all the pending hashes, so the send in step 4 gets stuck.

The deadlock dependency is:

Routine A: a NewPendingTransactionFilter goroutine wants to lock filtersMu so that it can consume hashes;
Routine B: eventLoop is waiting for Routine A to consume hashes so that it can push new hashes to the channel;
Routine C: Unsubscribe is holding the filtersMu lock, but it is waiting for Routine B to consume from the uninstall channel.
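To make the cycle concrete, here is a minimal, self-contained Go sketch of the same three-way wait. None of it is geth code; mu, uninstall and hashes are made-up stand-ins for filtersMu, es.uninstall and f.hashes.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex                // stands in for filtersMu
	uninstall := make(chan struct{}) // stands in for es.uninstall
	hashes := make(chan string)      // stands in for f.hashes

	// Routine C: Unsubscribe, entered while the timeout loop holds the lock.
	go func() {
		mu.Lock()
		defer mu.Unlock()
		uninstall <- struct{}{} // blocks: routine B is stuck on its own send
	}()

	// Routine B: eventLoop, busy delivering a hash batch.
	go func() {
		hashes <- "new tx hash" // blocks: routine A never reaches the receive
		<-uninstall             // would unblock routine C, but is never reached
	}()

	// Routine A: another filter's goroutine; it needs the lock before it can
	// come back and receive from hashes.
	go func() {
		time.Sleep(100 * time.Millisecond) // let routine C take the lock first
		mu.Lock()                          // blocks: the lock is held by routine C
		fmt.Println(<-hashes)
		mu.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("A waits for the lock, C waits for B, B waits for A: deadlock")
	// With every goroutine blocked, the runtime aborts:
	// "fatal error: all goroutines are asleep - deadlock!"
	select {}
}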

@rjl493456442
Member

In step 3 above, Unsubscribe may break out of uninstallLoop without consuming all the pending hashes, so the send in step 4 gets stuck.

I still don't get it; if the subscription is uninstalled successfully, no further events will be sent to this filter.

@rjl493456442
Member

And regarding the deadlock: before the uninstall request gets processed, all the queued events can be consumed by the Unsubscribe function too, so I don't think the eventLoop will be blocked.

Maybe I'm missing something here?

@holiman
Contributor

holiman commented Jan 7, 2021

@guagualvcha did you also send this as a bounty? I tried the repro earlier and was never able to reproduce it. I'll give it another go later.

@ligi assigned s1na on Jan 7, 2021
@ligi
Member

ligi commented Jan 7, 2021

@s1na will try to reproduce the issue

@unclezoro
Author

unclezoro commented Jan 7, 2021

Hi, team. Thanks for all the timely responses. @s1na @holiman To reproduce the issue more easily, you will need to: 1. run on a chain that has some new transactions (empty blocks cannot reproduce the issue); 2. change the code a bit: in the case above I slept for 3 seconds before breaking out of uninstallLoop, and you can increase the sleep to 15 seconds or longer.
After that, you can reproduce this issue reliably.

@s1na
Contributor

s1na commented Jan 14, 2021

I could reproduce the issue (thanks for the pointers).

I still don't get it; if the subscription is uninstalled successfully, no further events will be sent to this filter.
And regarding the deadlock: before the uninstall request gets processed, all the queued events can be consumed by the Unsubscribe function too, so I don't think the eventLoop will be blocked.

@rjl493456442 I think the problem is when there are multiple filters. Filter A times out, locks filtersMu and goes to Unsubscribe. At the same time a new txhash arrives. Now broadcasting the hash to filter A should work (because Unsubscribe would consume it), but broadcasting it to any other active filter would block because it can't acquire filtersMu.

I'm now going to think about a fix.

@rjl493456442
Member

rjl493456442 commented Jan 15, 2021

Yeah, I get it now. Multiple filters explain it!
