Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fix possible deadlock in bank #10466

Merged
merged 1 commit into from
Jun 10, 2020

Conversation

BurtonQin
Copy link
Contributor

@BurtonQin BurtonQin commented Jun 9, 2020

Problem

There are two kinds of possible deadlock bugs:

1. locks in conflicting order:

e.g.

compare_bank freeze
stakes.read()
hash.write()
collect_return_eagerly
collect_rent_in_partition
store_account
stakes.write()//deadlock!
hash.read()//deadlock!

The fix is to swap the order of stakes and hash in compare_bank().

2. double read lock if comparing with itself

When compare_bank() compares itself, there are four cases where the first read lock is not released before the second lock. A deadlock may happen when the two read locks are interleaved by a write lock from another thread. The reason is that the priority policy of std::sync::RwLock is dependent on the underlying operating system's implementation. AFAIK, Windows and macOS use fair policy which leads to this kind of deadlock.

The fix is to add a shortcut by checking the ptr equality of self and dbank first.

Summary of Changes

  1. Swap the order of stakes and hash in compare_bank().
  2. Add a shortcut by checking the ptr equality of self and dbank in compare_bank().

@ryoqun
Copy link
Member

ryoqun commented Jun 9, 2020

This looks nice! Thanks for reporting and creating an actual pr! Did you find this by https://github.com/BurtonQin/rust-lock-bug-detector?

Also, do you mind to write a testcase? Considering the kind of this bug, it might be a bit hard?

@ryoqun ryoqun added the CI Pull Request is ready to enter CI label Jun 9, 2020
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Jun 9, 2020
@BurtonQin
Copy link
Contributor Author

This looks nice! Thanks for reporting and creating an actual pr! Did you find this by https://github.com/BurtonQin/rust-lock-bug-detector?

Also, do you mind to write a testcase? Considering the kind of this bug, it might be a bit hard?

Yes, and I am still improving this detector.
I do not think it is easy to write a test case because of the uncertainty of the execution sequences during run-time checking. So I choose to statically rule out every possible deadlock condition.

@ryoqun ryoqun self-requested a review June 9, 2020 06:21
@codecov
Copy link

codecov bot commented Jun 9, 2020

Codecov Report

Merging #10466 into master will decrease coverage by 0.0%.
The diff coverage is 94.4%.

@@            Coverage Diff            @@
##           master   #10466     +/-   ##
=========================================
- Coverage    81.6%    81.6%   -0.1%     
=========================================
  Files         296      296             
  Lines       69320    69326      +6     
=========================================
- Hits        56634    56614     -20     
- Misses      12686    12712     +26     

let bh = self.hash.read().unwrap();
let dbh = dbank.hash.read().unwrap();
assert_eq!(*bh, *dbh);

let st = self.stakes.read().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BurtonQin nits: reordering is alone suffice to fix the deadlock as you said. Thanks for spotting this!

But for maintainability perspective, the current code is too fragile, relaying on the correct locking order. I think this method's each code blocks should also be surrounded with {} so that each locks aren't held for longer than needed.

ryoqun
ryoqun previously approved these changes Jun 10, 2020
Copy link
Member

@ryoqun ryoqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicely done!

@BurtonQin Could you work on the nit?

@mergify mergify bot dismissed ryoqun’s stale review June 10, 2020 13:12

Pull request has been modified.

@BurtonQin
Copy link
Contributor Author

BurtonQin commented Jun 10, 2020

nicely done!

@BurtonQin Could you work on the nit?

Sure. Pushed.

@ryoqun ryoqun added the CI Pull Request is ready to enter CI label Jun 10, 2020
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Jun 10, 2020
@ryoqun ryoqun added the automerge Merge this Pull Request automatically once CI passes label Jun 10, 2020
@solana-grimes
Copy link
Contributor

💔 Unable to automerge due to CI failure

@solana-grimes solana-grimes removed the automerge Merge this Pull Request automatically once CI passes label Jun 10, 2020
@ryoqun ryoqun merged commit 1e3554b into solana-labs:master Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants