Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(bft): potential deadlock in the consensus ticker on close #2264

Merged
merged 1 commit into from
Jun 3, 2024

Conversation

gfanton
Copy link
Member

@gfanton gfanton commented Jun 3, 2024

This PR addresses a potential deadlock issue that can occur if ScheduleTimeout is used after the timeoutTicker service has been stopped. This problem has been observed in some CI jobs when stopping the node: https://github.com/gnolang/gno/actions/runs/9301771431/attempts/1 (thanks @thehowl).

The stacktrace indicates that timeoutTicker has likely been closed before ConsensusState, and ConsensusState is getting stuck while calling ScheduleTimeout because the tickChan will never be read.

The proposed solution is to simply ignore any scheduled timeouts if the timeoutTicker service has been closed.

I've extracted the relevant part of the stacktrace:

  • The first stack shows the ConsensusReactor waiting for ConsensusState on stop.
  • The second stack shows ConsensusState waiting to ScheduleTimeout on the timeoutTicker.
coverage: 53.8% of statements
panic: test timed out after 30m0s
running tests:
	TestTestdata (30m0s)
	TestTestdata/gnoland (29m54s)

---

goroutine 11039 [chan receive, 29 minutes]:
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).Wait(...)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:376
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusReactor).OnStop(0xc000176a00)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/reactor.go:92 +0x75
github.com/gnolang/gno/tm2/pkg/service.(*BaseService).Stop(0xc000176a00)
	/home/runner/work/gno/gno/tm2/pkg/service/service.go:168 +0x22e
github.com/gnolang/gno/tm2/pkg/p2p.(*Switch).OnStop(0xc000396600)

---

goroutine 15586 [chan send, 29 minutes]:
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).ScheduleTimeout(0x1d4e957d?, {0x5ed0bdb, 0x7, 0x0, 0x1})
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:71 +0x30
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).scheduleTimeout(...)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:472
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).scheduleRound0(0xc000ff7508, 0xc000ff7598)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:467 +0x8b
...
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc000ff7508, 0x41bdda?, {0x0?, 0x7f11457181a8?})
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1518 +0x2f
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc000ff7508, {{0x12fc4c0, 0xc000fe59f0}, {0x0, 0x0}})
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:696 +0xf8
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc000ff7508, 0x0)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:655 +0x452
created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 11039
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:349 +0x4ef

---
Contributors' checklist...
  • Added new tests, or not needed, or not feasible
  • Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory
  • Updated the official documentation or not needed
  • No breaking changes were made, or a BREAKING CHANGE: xxx message was included in the description
  • Added references to related issues and PRs
  • Provided any useful hints for running manual tests
  • Added new benchmarks to generated graphs, if any. More info here.

Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
@gfanton gfanton self-assigned this Jun 3, 2024
@github-actions github-actions bot added the 📦 🌐 tendermint v2 Issues or PRs tm2 related label Jun 3, 2024
@gfanton gfanton changed the title fix: potential deadlock in the consensus ticker on close fix(bft): potential deadlock in the consensus ticker on close Jun 3, 2024
Copy link

codecov bot commented Jun 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.63%. Comparing base (edb321f) to head (3f24aaf).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2264      +/-   ##
==========================================
- Coverage   54.64%   54.63%   -0.02%     
==========================================
  Files         578      578              
  Lines       77870    77873       +3     
==========================================
- Hits        42551    42543       -8     
- Misses      32149    32158       +9     
- Partials     3170     3172       +2     
Flag Coverage Δ
tm2 54.41% <100.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gfanton gfanton requested a review from thehowl June 3, 2024 10:22
@thehowl thehowl merged commit f1e46e3 into gnolang:master Jun 3, 2024
41 checks passed
@gfanton gfanton deleted the fix/bft-ticker branch June 3, 2024 18:02
omarsy pushed a commit to TERITORI/gno that referenced this pull request Jun 3, 2024
…g#2264)

This PR addresses a potential deadlock issue that can occur if
`ScheduleTimeout` is used after the `timeoutTicker` service has been
stopped. This problem has been observed in some CI jobs when stopping
the node:
https://github.com/gnolang/gno/actions/runs/9301771431/attempts/1
(thanks @thehowl).

The stacktrace indicates that `timeoutTicker` has likely been closed
before `ConsensusState`, and `ConsensusState` is getting stuck while
calling `ScheduleTimeout` because the `tickChan` will never be read.

The proposed solution is to simply ignore any scheduled timeouts if the
`timeoutTicker` service has been closed.

I've extracted the relevant part of the stacktrace:
- The first stack shows the `ConsensusReactor` waiting for
`ConsensusState` on stop.
- The second stack shows `ConsensusState` waiting to `ScheduleTimeout`
on the `timeoutTicker`.
 
```
coverage: 53.8% of statements
panic: test timed out after 30m0s
running tests:
	TestTestdata (30m0s)
	TestTestdata/gnoland (29m54s)

---

goroutine 11039 [chan receive, 29 minutes]:
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).Wait(...)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:376
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusReactor).OnStop(0xc000176a00)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/reactor.go:92 +0x75
github.com/gnolang/gno/tm2/pkg/service.(*BaseService).Stop(0xc000176a00)
	/home/runner/work/gno/gno/tm2/pkg/service/service.go:168 +0x22e
github.com/gnolang/gno/tm2/pkg/p2p.(*Switch).OnStop(0xc000396600)

---

goroutine 15586 [chan send, 29 minutes]:
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).ScheduleTimeout(0x1d4e957d?, {0x5ed0bdb, 0x7, 0x0, 0x1})
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:71 +0x30
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).scheduleTimeout(...)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:472
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).scheduleRound0(0xc000ff7508, 0xc000ff7598)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:467 +0x8b
...
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc000ff7508, 0x41bdda?, {0x0?, 0x7f11457181a8?})
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1518 +0x2f
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc000ff7508, {{0x12fc4c0, 0xc000fe59f0}, {0x0, 0x0}})
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:696 +0xf8
github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc000ff7508, 0x0)
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:655 +0x452
created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 11039
	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:349 +0x4ef

---
```

<details><summary>Contributors' checklist...</summary>

- [ ] Added new tests, or not needed, or not feasible
- [ ] Provided an example (e.g. screenshot) to aid review or the PR is
self-explanatory
- [ ] Updated the official documentation or not needed
- [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message
was included in the description
- [ ] Added references to related issues and PRs
- [ ] Provided any useful hints for running manual tests
- [ ] Added new benchmarks to [generated
graphs](https://gnoland.github.io/benchmarks), if any. More info
[here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md).
</details>

Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📦 🌐 tendermint v2 Issues or PRs tm2 related
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants