Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(bft): correctly drain channels events #1515

Merged
merged 16 commits into from
May 16, 2024

Conversation

gfanton
Copy link
Member

@gfanton gfanton commented Jan 10, 2024

addresses #1320

This PR does two things:

  • It prevents deadlocks in state_test by triggering a panic if GetRoundState becomes blocked. In the event of any panic, it will attempt to drain all relevant channels to identify which events have not been drained.
  • It fixes tests that have been failing randomly on the CI.

for now, you can checkout this branch, it will fail locally (using @jaekwon trick)

There are still some tests that I was unable to resolve. I have marked these with // XXX(FIXME) in tm2/pkg/bft/consensus/state_test.go:

  • TestStateFullRound1
  • TestWaitingTimeoutProposeOnNewRound
  • TestStateEnterProposeYesPrivValidator
  • TestStartNextHeightCorrectly
Contributors' checklist...
  • Added new tests, or not needed, or not feasible
  • Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory
  • Updated the official documentation or not needed
  • No breaking changes were made, or a BREAKING CHANGE: xxx message was included in the description
  • Added references to related issues and PRs
  • Provided any useful hints for running manual tests
  • Added new benchmarks to generated graphs, if any. More info here.

Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
@github-actions github-actions bot added the 📦 🌐 tendermint v2 Issues or PRs tm2 related label Jan 10, 2024
@gfanton gfanton changed the title fix(bft): add correct ensure event drain EvenSwitch channels fix(bft): correctly drain drain channels events Jan 10, 2024
Copy link

codecov bot commented Jan 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.03%. Comparing base (ccc6d5b) to head (783c214).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1515      +/-   ##
==========================================
- Coverage   49.04%   49.03%   -0.01%     
==========================================
  Files         576      576              
  Lines       77556    77556              
==========================================
- Hits        38035    38033       -2     
  Misses      36439    36439              
- Partials     3082     3084       +2     
Flag Coverage Δ
tm2 54.60% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
@gfanton gfanton changed the title fix(bft): correctly drain drain channels events fix(bft): correctly drain channels events Jan 11, 2024
Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
@jaekwon
Copy link
Contributor

jaekwon commented Jan 18, 2024

Thank you for doing this work.

@moul
Copy link
Member

moul commented Apr 21, 2024

@gfanton what's the status of this PR?

@gfanton
Copy link
Member Author

gfanton commented Apr 24, 2024

@moul

Current state is:

  • The PR causes the BFT test to fail early and shows the drained channel to aid debugging.
  • I've fixed some tests by rearranging the event order.
  • Unfotunatly, rearranging event consumption will not resolve the last 3 failing test. Or at last i wasn't able to do so.
  • I still believe the root issue lies not in the event system (even if there are rooms for improvement), but in how we test event ordering.

What we can do:

  • The core problem, to me, is that in an event bus system, a subscription is meant to be consumed. The test is actually leaving subscriptions on hold, which ultimately leads to the test getting stuck. This is especially true for a system that is meant to have no buffer size for sync reasons.
  • I still believe that using a single routine to queue events inside the tests can be a good idea. Even if the routine is asynchronous, events will be stored in the queue in the correct order.
  • Another potential solution could be to make subscriptions cancelable. By doing so, we can simply cancel previously stuck subscription. However, I'm still not sure about this, as we may still encounter a problem where a necessary subscription is stuck and cannot be canceled.

EDIT: Got another idea but I'm not really liking it as it looks a bit messy: The main deadlock issue come from the method GetRoundState, which uses cs.mutex. If we replace it with getRoundStateUnsafe (which retrieves the round state without locking) for these tests only, it should actually fix all the BFT tests. It can make sense, as I mentioned in my comments, that event bus subscription is meant to be consumed. If we consume events synchronously, it also makes sense to use some tricks to avoid deadlocks in this particular case.
Some tests would probably need to be updated as we need to ensure that RoundState is correctly updated before being retrieved.

My preference is still to have an event queue as it looks way cleaner to me.

@moul
Copy link
Member

moul commented Apr 29, 2024

Let's talk about merging the current state to enhance global efficiency in the upcoming review meeting. We can then collaborate with @jaekwon to resolve the remaining cases.

@github-actions github-actions bot added 📦 🤖 gnovm Issues or PRs gnovm related 📦 ⛰️ gno.land Issues or PRs gno.land package related labels May 1, 2024
Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
@thehowl thehowl marked this pull request as ready for review May 2, 2024 16:20
@thehowl
Copy link
Member

thehowl commented May 2, 2024

Review Meeting: @gfanton to do a cleanup / take another look, @jaekwon to review.

@gfanton
Copy link
Member Author

gfanton commented May 7, 2024

TL;DR: Reverting ensureGetRoundState to GetRoundState by eliminating the delay should resolve the last tests (even if technically they are still racy), boosting CI success rate to nearly 100%.
After merging, we should monitor consensus/state_test.go in future PRs to handle any remaining flaky tests.

Next Review Meeting Note: I need a final check on f17c9b5. If the commit looks good, the PR is ready for me to merge.

Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
move `ensureProposal` before `GetRoundState`

Signed-off-by: gfanton <8671905+gfanton@users.noreply.github.com>
@Kouteki Kouteki added this to the 🏗4️⃣ test4.gno.land milestone May 15, 2024
Copy link
Member

@zivkovicmilos zivkovicmilos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know I'm not a fan of these sorts of fixes, especially when they use the hidden jutsu of reflect :)

I think it's finally time to put the flaky CI to rest (at least most of it), so I'm willing to move on this fix 👏

It is worth noting that we have fixed all of these issues in the libtm testing suite, so even these tests are just here temporarily

tm2/pkg/bft/consensus/common_test.go Show resolved Hide resolved
tm2/pkg/bft/consensus/common_test.go Show resolved Hide resolved
@gfanton gfanton merged commit 0c9849a into gnolang:master May 16, 2024
142 of 143 checks passed
@gfanton gfanton deleted the fix/bft-drain-chan branch May 16, 2024 16:13
leohhhn pushed a commit to leohhhn/gno that referenced this pull request May 21, 2024
Co-authored-by: Manfred Touron <94029+moul@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related 📦 🤖 gnovm Issues or PRs gnovm related
Projects
Status: Done
Status: 🚀 Needed for Launch
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants