
Deflake DBErrorHandlingFSTest.MultiCFWALWriteError #9496

Closed
wants to merge 3 commits

Conversation

@hx235 (Contributor) commented Feb 3, 2022

Context:
As part of #6949, file deletion is disabled for a faulty database on an IOError of MANIFEST write/sync, and re-enabled again during `DBImpl::Resume()` if all recovery is completed. Before re-enabling file deletion, it calls `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `versions_` from the recovery process.

However, this is not necessarily true, because `s = error_handler_.ClearBGError();`, which happens before that assertion, can unblock some foreground thread via `EventHelpers::NotifyOnErrorRecoveryEnd()` as part of `ClearBGError()`. That foreground thread can do whatever it wants, including closing/reopening the db and cleaning up that same `versions_`.

As a consequence, `assert(versions_->io_status().ok());` will access `io_status()` of a nullptr, and a test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to reopen the db, where `versions_` gets reset to nullptr as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok());` gets executed in the background thread, then we can see an error like

```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

Summary:

  • I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up the foreground, which I think is right before we log `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
    • As context, the original Allow DB resume after background errors #3997, which introduced `DBImpl::Resume()`, calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later First step towards handling MANIFEST write error #6949 distances these two calls a bit.
    • And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);`, at least syntax-wise, since these two functions are orthogonal. It also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which is basically resetting some state variables.
  • In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` clearer, I separated it into its own if-block. A condensed sketch of the resulting ordering follows.
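
Pieced together from the snippets quoted in this PR, here is a condensed sketch of the proposed tail of `DBImpl::ResumeImpl()`; it illustrates the ordering only, not the exact merged diff:

```cpp
// Sketch only (assumes the surrounding ResumeImpl() context): `s` holds the
// overall resume status computed by the earlier recovery steps.
if (file_deletion_disabled) {
  // Re-enable file deletion first; its status is tracked separately so it
  // does not feed into the overall resume status.
  Status enable_file_deletion_s = EnableFileDeletions(/*force=*/true);
  assert(enable_file_deletion_s.ok());
}

if (s.ok()) {
  // Moved to the very end: ClearBGError() may wake a foreground thread via
  // EventHelpers::NotifyOnErrorRecoveryEnd(), and that thread may close or
  // reopen the DB, so no DB state (e.g. versions_) may be touched afterwards.
  s = error_handler_.ClearBGError();
} else {
  error_handler_.GetRecoveryError().PermitUncheckedError();
}

if (s.ok()) {
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");
}
return s;
```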

Test plan:

  • Manually reproduce the assertion failure in `DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting a sleep like below, so that it's more likely for `assert(versions_->io_status().ok());` to execute after reopening the db in the foreground (i.e., testing) thread:

```
sleep(1);
assert(versions_->io_status().ok());
```

`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
#1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
#2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
#3   rocksdb/error_handler_fs_test()
```
  • The assertion failure does not happen with this PR:
    `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
    `[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)`

@facebook-github-bot (Contributor) commented:

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Author) commented Feb 3, 2022

EDIT: see PR summary for more context.

Requested @anand1976 and @riversand963 for review since this fix somehow lies between your #3997 and #6949.

Requested @ajkr because you followed along with my investigation of the test.

@hx235 (Author) commented Feb 3, 2022

TODO: Will update HISTORY.md later once we agree this is a good approach for the fix.

Comment on lines 427 to 436
```cpp
mutex_.Lock();
if (s.ok()) {
  s = error_handler_.ClearBGError();
} else {
  // NOTE: this is needed to pass ASSERT_STATUS_CHECKED
  // in the DBSSTTest.DBWithMaxSpaceAllowedRandomized test.
  // See https://github.com/facebook/rocksdb/pull/7715#issuecomment-754947952
  error_handler_.GetRecoveryError().PermitUncheckedError();
}
mutex_.Unlock();
```
@hx235 (Author) commented:

I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up the foreground, which I think is right before we log `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`

  • As context, the original Allow DB resume after background errors #3997, which introduced `DBImpl::Resume()`, calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later First step towards handling MANIFEST write error #6949 distances these two calls a bit.
  • And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);`, at least syntax-wise, since these two functions are orthogonal. It also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which is basically resetting some state variables.

```diff
@@ -422,16 +414,31 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
   // during previous error handling.
   if (file_deletion_disabled) {
     // Always return ok
-    s = EnableFileDeletions(/*force=*/true);
-    if (!s.ok()) {
+    Status enable_file_deletion_s = EnableFileDeletions(/*force=*/true);
```
@hx235 (Author) commented Feb 5, 2022:

To preserve the previous behavior of #6949, where the status of re-enabling file deletion is not taken into account in the overall status of resuming the db, I separated `enable_file_deletion_s` from the general `s`.

@riversand963 (Contributor) replied:

Since `EnableFileDeletions()` always returns OK here, we can still use `s` and assert that it's OK. This way we don't need the extra variable `enable_file_deletion_s`.
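
A minimal sketch of this suggestion, assuming (as the `// Always return ok` comment states) that `EnableFileDeletions(/*force=*/true)` cannot fail:

```cpp
// Reuse the existing status variable instead of a separate
// enable_file_deletion_s; the assert documents the "always ok" contract.
s = EnableFileDeletions(/*force=*/true);
assert(s.ok());
```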

```cpp
}
mutex_.Unlock();

if (s.ok()) {
```
@hx235 (Author) commented:

To make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` clearer, I separated it into its own if-block.

@ajkr removed their request for review February 6, 2022 23:04
@ajkr (Contributor) commented Feb 6, 2022

Removing myself from reviewers, as I won't have the time/energy to learn auto-recovery anytime soon. @anand1976 probably already knows it, so he can review more efficiently.

@hx235 changed the title from "[draft] Attempt to deflake DBErrorHandlingFSTest.MultiCFWALWriteError" to "Deflake DBErrorHandlingFSTest.MultiCFWALWriteError" Feb 11, 2022
@anand1976 (Contributor) left a comment:

Great catch!

```cpp
}
mutex_.Unlock();

if (s.ok()) {
```
@anand1976 commented:

You're right that after calling ClearBGError(), nothing is guaranteed. That makes this block problematic, since the mutex is unlocked and locked again; in the meantime, the DB may be deleted, and lines 445-456 may access stale memory. We can continue to hold the mutex here to avoid that possibility.
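
A sketch of what holding the mutex across the call could look like (illustrative shape only, not the exact change that landed):

```cpp
// Hold mutex_ across ClearBGError() and the follow-up steps so a foreground
// thread woken by the recovery notification cannot finish closing the DB
// while this block still touches DB state.
mutex_.Lock();
if (s.ok()) {
  s = error_handler_.ClearBGError();
} else {
  error_handler_.GetRecoveryError().PermitUncheckedError();
}
if (s.ok()) {
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");
}
mutex_.Unlock();
```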

@hx235 (Author) replied:

Got it - the unlock/lock-again behavior came from the original PR 52d4c9b#diff-6fdb755f590d9b01ecb89bd8ceb28577e85536d4472f8e4fc3addeb9a65f3645R269, but I will change it. Thanks!

@hx235 (Author) replied:

Fixed

@riversand963 (Contributor) left a comment:

Thanks @hx235 for the fix! Left a few minor comments.

```cpp
}
mutex_.Unlock();

if (s.ok()) {
```
@riversand963 commented:

Also, maybe it's better to log the status anyway even if it's non-ok.
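
For instance, the log could cover both outcomes (hypothetical wording, not the exact message that landed):

```cpp
if (s.ok()) {
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");
} else {
  // Make failed resume attempts visible in the info log as well.
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Failed to resume DB [%s]",
                 s.ToString().c_str());
}
```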

@hx235 (Author) replied:

Fixed

```cpp
  // See https://github.com/facebook/rocksdb/pull/7715#issuecomment-754947952
  error_handler_.GetRecoveryError().PermitUncheckedError();
}
mutex_.Unlock();
```
@riversand963 commented:

Any reason we want to unlock here and later re-lock?


@hx235 (Author) replied:

Fixed


```cpp
mutex_.Lock();
if (s.ok()) {
  s = error_handler_.ClearBGError();
```
@riversand963 commented:

Maybe add some comment here about how the db may be closed?

@hx235 (Author) replied:

Fixed

@hx235 force-pushed the deflake-ubsan-check branch from 9dfc456 to ea4cc03 on February 24, 2022 22:25
@facebook-github-bot (Contributor) commented:

@hx235 has updated the pull request. You must reimport the pull request before landing.

@riversand963 (Contributor) left a comment:

Thanks @hx235 for the fix. Mostly LGTM with two minor comments.

HISTORY.md (outdated)

```diff
@@ -3,6 +3,9 @@
 ### New Features
 * Allow WriteBatchWithIndex to index a WriteBatch that includes keys with user-defined timestamps. The index itself does not have timestamp.

+### Bug Fixes
+* Fixed a data race on `VersionSet` in `DBImpl::ResumeImpl()`
```
@riversand963 commented:

Nit: illegal (stale) memory access on `versions_` caused by a race between db close issued by ... and `DBImpl::ResumeImpl()`.
The '...' can be something like "user-defined listener's OnErrorRecoveryXXX() methods", but I am not sure what is a good succinct way of describing it.

@hx235 (Author) commented Feb 25, 2022:

Good point! And thank you for providing an example.

I also found it hard to write a HISTORY.md entry for a fix like this, since (a) it's not as visible as fixing a bug that directly impacts a user API (so I am not even sure whether to include this in HISTORY.md) and (b) it involves internal implementation logic (so I am not sure how much detail to disclose).

@riversand963 replied:

Yeah, we can keep the entry in HISTORY.md simple and just reference this PR, so that people conveniently know where to look for more context. For example:

* Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and `EventListener::OnErrorRecoveryXX()` (#9496)

@hx235 (Author) replied:

Fixed

```cpp
@@ -423,16 +415,36 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
  if (file_deletion_disabled) {
    // Always return ok
    s = EnableFileDeletions(/*force=*/true);
    assert(s.ok());
```
@riversand963 commented:

Nit:
We can turn this `assert(s.ok())` into an `assert(false)` inside the `if (!s.ok())` branch following line 423.
I understand this (non-ok) should not happen now, but maybe it's better to "try" to log the error and then abort.
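
In other words, something like the following (hypothetical placement and log wording):

```cpp
s = EnableFileDeletions(/*force=*/true);
if (!s.ok()) {
  // "Try" to log the unexpected failure before aborting in debug builds.
  ROCKS_LOG_ERROR(immutable_db_options_.info_log,
                  "Failed to enable file deletions during resume [%s]",
                  s.ToString().c_str());
  assert(false);
}
```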

@hx235 (Author) replied:

Good point!

@hx235 (Author) replied:

Fixed

@facebook-github-bot (Contributor) commented:

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor) commented:

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@anand1976 (Contributor) left a comment:

LGTM

ajkr pushed a commit that referenced this pull request Mar 3, 2022
Summary:
**Context:**
As part of #6949, file deletion is disabled for a faulty database on an IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](e66199d#diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396). Before re-enabling file deletion, it calls `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `versions_` in the recovery process.

However, this is not necessarily true, because `s = error_handler_.ClearBGError();`, which happens before that assertion, can unblock some foreground thread via [`EventHelpers::NotifyOnErrorRecoveryEnd()`](https://github.com/facebook/rocksdb/blob/3122cb435875d720fc3d23a48eb7c0fa89d869aa/db/error_handler.cc#L552-L553) as part of `ClearBGError()`. That foreground thread can do whatever it wants, including closing/reopening the db and cleaning up that same `versions_`.

As a consequence, `assert(versions_->io_status().ok());` will access `io_status()` of a nullptr, and a test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok());` gets executed in the background thread, then we can see an error like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

**Summary:**
- I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up the foreground, which I think is right before we log `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
   - As context, the original #3997, which introduced `DBImpl::Resume()`, calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later #6949 distances these two calls a bit.
   - And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);`, at least syntax-wise, since these two functions are orthogonal. It also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which is basically resetting some state variables.
- In addition, to preserve the previous behavior of #6949, where the status of re-enabling file deletion is not taken into account in the overall status of resuming the db, I separated `enable_file_deletion_s` from the general `s`.
- In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` clearer, I separated it into its own if-block.

Pull Request resolved: #9496

Test Plan:
- Manually reproduce the assertion failure in `DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting a sleep like below, so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e., testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
   `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
   ```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
#1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
#2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
#3   rocksdb/error_handler_fs_test()
   ```
- The assertion failure does not happen with this PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)  `

Reviewed By: riversand963, anand1976

Differential Revision: D33990099

Pulled By: hx235

fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab
ajkr pushed a commit that referenced this pull request Mar 7, 2022
facebook-github-bot pushed a commit that referenced this pull request Dec 20, 2024

Summary:
`DBErrorHandlingFSTest.AtomicFlushNoSpaceError` is flaky due to a seg fault during error recovery:
```
...
frame #5: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>*, std::vector<rocksdb::ObsoleteBlobFileInfo, std::allocator<rocksdb::ObsoleteBlobFileInfo>>*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, unsigned long) [inlined] std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>::begin(this=<unavailable>) at stl_vector.h:812:16
frame #6: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(this=0x0000000000000000, files=size=0, blob_files=size=0, manifest_filenames=size=0, min_pending_output=18446744073709551615) at version_set.cc:7258:18
frame #7: 0x00007f0b3e8ccbc0 librocksdb.so.9.10`rocksdb::DBImpl::FindObsoleteFiles(this=<unavailable>, job_context=<unavailable>, force=<unavailable>, no_full_scan=<unavailable>) at db_impl_files.cc:162:30
frame #8: 0x00007f0b3e85e698 librocksdb.so.9.10`rocksdb::DBImpl::ResumeImpl(this=<unavailable>, context=<unavailable>) at db_impl.cc:434:20
frame #9: 0x00007f0b3e921516 librocksdb.so.9.10`rocksdb::ErrorHandler::RecoverFromBGError(this=<unavailable>, is_manual=<unavailable>) at error_handler.cc:632:46
```

I suspect this is due to the DB being destructed and reopened during recovery. Specifically, the [ClearBGError() call](https://github.com/facebook/rocksdb/blob/c72e79a262bf696faf5f8becabf92374fc14b464/db/db_impl/db_impl.cc#L425) can release and reacquire the mutex, and the DB can be closed during this time. So it's not safe to access DB state after ClearBGError(). There was a similar story in #9496. [Moving the obsolete files logic after ClearBGError()](#11955) probably makes the seg fault more easily triggered.

This PR updates `ClearBGError()` to guarantee that db close cannot finish until the method is returned and the mutex is released. So that we can safely access DB state after calling it.
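
The commit message does not spell out the mechanism, but the guarantee it describes matches a standard gate pattern: mark the notification as in progress under the mutex, and make the close path wait until it clears. A self-contained illustration using plain C++ primitives (not RocksDB internals; all names here are hypothetical):

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative gate: recovery sets notifying_ before it may drop the lock to
// run listener callbacks; the close path waits until it clears, so close
// cannot finish while an error-recovery notification is still in flight.
class RecoveryCloseGate {
 public:
  void BeginNotify() {
    std::lock_guard<std::mutex> lk(mu_);
    notifying_ = true;
  }
  void EndNotify() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      notifying_ = false;
    }
    cv_.notify_all();
  }
  // Called from the close path before tearing down shared state.
  void WaitUntilQuiescent() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return !notifying_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool notifying_ = false;
};
```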

Pull Request resolved: #13234

Test Plan: I could not trigger the seg fault locally, will just monitor future test failures.

Reviewed By: jowlyzhang

Differential Revision: D67476836

Pulled By: cbi42

fbshipit-source-id: dfb3e9ccd4eb3d43fc596ec10e4052861eeec002
@ti-chi-bot mentioned this pull request Dec 31, 2024