Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/bloombits: fix deadlock when matcher session hits an error #28184

Merged
merged 1 commit into from
Sep 25, 2023
Merged

core/bloombits: fix deadlock when matcher session hits an error #28184

merged 1 commit into from
Sep 25, 2023

Conversation

msmania
Copy link
Contributor

@msmania msmania commented Sep 24, 2023

core/bloombits: fix a deadlock when a matcher session hits an error

Problem description

A deadlock occurs when a pruned node receives eth_getLogs for pruned blocks. We hit this deadlock in bsc and avalanche's subnet-evm.

This deadlock didn't happen on Ethereum with this repo's geth. Probably it's because the pruning mechanism is different.

As discussed below, however, the root cause is in core/bloombits and its code is the same as bsc. Any component consuming core/bloombits may hit this deadlock.

Root cause and Fix

Here are the goroutines causing the deadlock. Please note that this info is from bsc. The functions names are slightly different from the current geth.

  1. Goroutine running GetLogs is waiting on s.closer
  Goroutine 325 - User: /data/bin/go/src/runtime/sema.go:77 sync.runtime_SemacquireMutex (0x497766) [timer goroutine (idle) 13036920152269]
...snip...
 2  0x0000000000477c11 in runtime.semacquire1
    at /data/bin/go/src/runtime/sema.go:160
 3  0x0000000000497766 in sync.runtime_SemacquireMutex
    at /data/bin/go/src/runtime/sema.go:77
 4  0x00000000004d190d in sync.(*Mutex).lockSlow
    at /data/bin/go/src/sync/mutex.go:171
 5  0x00000000004d15ef in sync.(*Mutex).Lock
    at /data/bin/go/src/sync/mutex.go:90
 6  0x00000000004d1ce9 in sync.(*Once).doSlow
    at /data/bin/go/src/sync/once.go:70
 7  0x00000000004d1c65 in sync.(*Once).Do
    at /data/bin/go/src/sync/once.go:65
 8  0x00000000016f033d in github.com/ethereum/go-ethereum/core/bloombits.(*MatcherSession).Close
    at /data/src/bsc/core/bloombits/matcher.go:527
 9  0x0000000001f4f74b in github.com/ethereum/go-ethereum/eth/filters.(*Filter).indexedLogs.func1
    at /data/src/bsc/eth/filters/filter.go:224
10  0x0000000000461a73 in runtime.deferreturn
    at /data/bin/go/src/runtime/panic.go:476
11  0x0000000001f4f6ae in github.com/ethereum/go-ethereum/eth/filters.(*Filter).indexedLogs
    at /data/src/bsc/eth/filters/filter.go:240
12  0x0000000001f4e0f1 in github.com/ethereum/go-ethereum/eth/filters.(*Filter).Logs
    at /data/src/bsc/eth/filters/filter.go:194
13  0x0000000001f49daa in github.com/ethereum/go-ethereum/eth/filters.(*PublicFilterAPI).GetLogs
    at /data/src/bsc/eth/filters/api.go:478
  1. s.closer is owned by Goroutine running MatcherSession.Multiplex, which is waiting on s.pend
  Goroutine 1512 - User: /data/bin/go/src/runtime/sema.go:62 sync.runtime_Semacquire (0x497647) [semacquire 13036920152269]
  ...snip...
 2  0x0000000000477c11 in runtime.semacquire1
    at /data/bin/go/src/runtime/sema.go:160
 3  0x0000000000497647 in sync.runtime_Semacquire
    at /data/bin/go/src/runtime/sema.go:62
 4  0x00000000004d44f7 in sync.(*WaitGroup).Wait
    at /data/bin/go/src/sync/waitgroup.go:116
 5  0x00000000016f03a5 in github.com/ethereum/go-ethereum/core/bloombits.(*MatcherSession).Close.func1
    at /data/src/bsc/core/bloombits/matcher.go:530
 6  0x00000000004d1dd8 in sync.(*Once).doSlow
    at /data/bin/go/src/sync/once.go:74
 7  0x00000000004d1c65 in sync.(*Once).Do
    at /data/bin/go/src/sync/once.go:65
 8  0x00000000016f033d in github.com/ethereum/go-ethereum/core/bloombits.(*MatcherSession).Close
    at /data/src/bsc/core/bloombits/matcher.go:527
 9  0x00000000016f1272 in github.com/ethereum/go-ethereum/core/bloombits.(*MatcherSession).Multiplex
    at /data/src/bsc/core/bloombits/matcher.go:640
10  0x0000000001f94605 in github.com/ethereum/go-ethereum/eth.(*EthAPIBackend).ServiceFilter.func1
    at /data/src/bsc/eth/api_backend.go:384
11  0x000000000049bd01 in runtime.goexit
    at /data/bin/go/src/runtime/asm_amd64.s:1598
  1. s.pend needs one more "Done" in Matcher.run in the following Goroutine, which is waiting on channels.
  Goroutine 460 - User: /data/src/bsc/core/bloombits/matcher.go:410 github.com/ethereum/go-ethereum/core/bloombits.(*Matcher).distributor (0x16ee95f) [select 13036920152269]
0  0x0000000000465bdd in runtime.gopark
   at /data/bin/go/src/runtime/proc.go:382
1  0x0000000000476eea in runtime.selectgo
   at /data/bin/go/src/runtime/select.go:327
2  0x00000000016ee95f in github.com/ethereum/go-ethereum/core/bloombits.(*Matcher).distributor
   at /data/src/bsc/core/bloombits/matcher.go:410
3  0x00000000016ec5be in github.com/ethereum/go-ethereum/core/bloombits.(*Matcher).run.func2
   at /data/src/bsc/core/bloombits/matcher.go:252
4  0x00000000009318d8 in github.com/panjf2000/ants/v2.(*goWorker).run.func1
   at /home/john/go/pkg/mod/github.com/panjf2000/ants/v2@v2.4.5/worker.go:70
5  0x000000000049bd01 in runtime.goexit
   at /data/bin/go/src/runtime/asm_amd64.s:1598

The 2nd goroutine did close the s.quit channel before waiting on s.pend. When the 3rd goroutine received that signal, it set the shutdown to nil, but at that time allocs was 1, so the loop in distributor still continued.

What the 3rd goroutine is waiting is data on m.deliveries. If that happens, the distributor returns, s.pending is marked done, and all goroutines resume.

The reason why m.deliveries is receiving no data is that the 2nd goroutine is stuck on s.Close() before s.deliverSections that sends data to m.deliveries.

Therefore, the proposed fix is to call s.deliverSections in MatcherSession.Multiplex before calling s.Close().

@jsvisa
Copy link
Contributor

jsvisa commented Sep 24, 2023

A deadlock occurs when a pruned node receives eth_getLogs for pruned blocks.

Do you mean this issue occurred when the ancient block data was pruned? If so, go-ethereum didn't prune any ancient data, so I think this issue can't trigger.

@fjl fjl changed the title core/bloombits: fix a deadlock when a matcher session hits an error core/bloombits: fix deadlock when matcher session hits an error Sep 25, 2023
@fjl fjl merged commit c2cfe35 into ethereum:master Sep 25, 2023
1 check passed
@fjl fjl added this to the 1.13.2 milestone Sep 25, 2023
@MariusVanDerWijden
Copy link
Member

Everyone is merging PRs right before I can review them :(

@msmania
Copy link
Contributor Author

msmania commented Sep 25, 2023

Do you mean this issue occurred when the ancient block data was pruned? If so, go-ethereum didn't prune any ancient data, so I think this issue can't trigger.

Yes, that's how the issue was emerged in BSC. Even if deadlock was not reproducible in Ethereum, the root cause was still in this repo's core/bloombits and this patch would prevent unknown problems in the future.

@fjl Thank you for merging this! I'll port this to other applicable blockchains.

@msmania msmania deleted the bugfix-deadlock-eth_getLogs branch September 25, 2023 19:37
tyler-smith pushed a commit to blocknative/go-ethereum that referenced this pull request Oct 12, 2023
…reum#28184)

When MatcherSession encounters an error, it attempts to close the session.
Closing waits for all goroutines to finish, including the 'distributor'. However, the
distributor will not exit until all requests have returned.

This patch fixes the issue by delivering the (empty) result to the distributor
before calling Close().
tyler-smith pushed a commit to blocknative/go-ethereum that referenced this pull request Oct 12, 2023
…reum#28184)

When MatcherSession encounters an error, it attempts to close the session.
Closing waits for all goroutines to finish, including the 'distributor'. However, the
distributor will not exit until all requests have returned.

This patch fixes the issue by delivering the (empty) result to the distributor
before calling Close().
siosw pushed a commit to fabriqnetwork/go-ethereum that referenced this pull request Oct 16, 2023
…reum#28184)

When MatcherSession encounters an error, it attempts to close the session.
Closing waits for all goroutines to finish, including the 'distributor'. However, the
distributor will not exit until all requests have returned.

This patch fixes the issue by delivering the (empty) result to the distributor
before calling Close().
devopsbo3 pushed a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023
…reum#28184)

When MatcherSession encounters an error, it attempts to close the session.
Closing waits for all goroutines to finish, including the 'distributor'. However, the
distributor will not exit until all requests have returned.

This patch fixes the issue by delivering the (empty) result to the distributor
before calling Close().
devopsbo3 added a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023
devopsbo3 added a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants