
Deflake DBErrorHandlingFSTest.MultiCFWALWriteError #9496

Closed
wants to merge 3 commits

Conversation

@hx235 (Contributor) commented Feb 3, 2022

Context:
As part of #6949, file deletion is disabled for a faulty database on an IOError of MANIFEST write/sync, and re-enabled again during `DBImpl::Resume()` if all recovery is completed. Before re-enabling file deletion, it calls `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `versions_` from the recovery process.

However, this is not necessarily true, because `s = error_handler_.ClearBGError();`, which happens before that assertion, can unblock some foreground thread via `EventHelpers::NotifyOnErrorRecoveryEnd()` as part of `ClearBGError()`. That foreground thread can do whatever it wants, including closing/reopening the db and cleaning up that same `versions_`.

As a consequence, `assert(versions_->io_status().ok());` will access `io_status()` of a nullptr, and a test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to reopen the db, where `versions_` gets reset to nullptr as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok());` gets executed in the background thread, then we can see an error like

```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

Summary:

  • I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up the foreground, which I think is right before we log `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
    • As context, the original Allow DB resume after background errors #3997, which introduced `DBImpl::Resume()`, calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later First step towards handling MANIFEST write error #6949 distances these two calls a bit.
    • And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);`, at least syntax-wise, since these two functions are orthogonal. It also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which is basically resetting some state variables.
  • In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` clearer, I separated it into its own if-block. A condensed sketch of the resulting ordering follows.
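
Pieced together from the snippets quoted in this PR, here is a condensed sketch of the proposed tail of `DBImpl::ResumeImpl()`; it illustrates the ordering only, not the exact merged diff:

```cpp
// Sketch only (assumes the surrounding ResumeImpl() context): `s` holds the
// overall resume status computed by the earlier recovery steps.
if (file_deletion_disabled) {
  // Re-enable file deletion first; its status is tracked separately so it
  // does not feed into the overall resume status.
  Status enable_file_deletion_s = EnableFileDeletions(/*force=*/true);
  assert(enable_file_deletion_s.ok());
}

if (s.ok()) {
  // Moved to the very end: ClearBGError() may wake a foreground thread via
  // EventHelpers::NotifyOnErrorRecoveryEnd(), and that thread may close or
  // reopen the DB, so no DB state (e.g. versions_) may be touched afterwards.
  s = error_handler_.ClearBGError();
} else {
  error_handler_.GetRecoveryError().PermitUncheckedError();
}

if (s.ok()) {
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");
}
return s;
```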

Test plan:

  • Manually reproduce the assertion failure in `DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting a sleep like below, so that it's more likely for `assert(versions_->io_status().ok());` to execute after reopening the db in the foreground (i.e., testing) thread:

```
sleep(1);
assert(versions_->io_status().ok());
```

`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
#1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
#2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
#3   rocksdb/error_handler_fs_test()
```
  • The assertion failure does not happen with this PR:
    `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
    `[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)`

@facebook-github-bot (Contributor) commented:

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Author) commented Feb 3, 2022

EDIT: see PR summary for more context.

Requested @anand1976 and @riversand963 for review since this fix somehow lies between your #3997 and #6949.

Requested @ajkr because you followed along with my investigation of the test.

@hx235 (Author) commented Feb 3, 2022

TODO: Will update HISTORY.md later once we agree this is a good approach for the fix.

Comment on lines 427 to 436
```cpp
mutex_.Lock();
if (s.ok()) {
  s = error_handler_.ClearBGError();
} else {
  // NOTE: this is needed to pass ASSERT_STATUS_CHECKED
  // in the DBSSTTest.DBWithMaxSpaceAllowedRandomized test.
  // See https://github.com/facebook/rocksdb/pull/7715#issuecomment-754947952
  error_handler_.GetRecoveryError().PermitUncheckedError();
}
mutex_.Unlock();
```
@hx235 (Author) commented:

I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up the foreground, which I think is right before we log `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`

  • As context, the original Allow DB resume after background errors #3997, which introduced `DBImpl::Resume()`, calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later First step towards handling MANIFEST write error #6949 distances these two calls a bit.
  • And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);`, at least syntax-wise, since these two functions are orthogonal. It also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which is basically resetting some state variables.

```diff
@@ -422,16 +414,31 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
   // during previous error handling.
   if (file_deletion_disabled) {
     // Always return ok
-    s = EnableFileDeletions(/*force=*/true);
-    if (!s.ok()) {
+    Status enable_file_deletion_s = EnableFileDeletions(/*force=*/true);
```
@hx235 (Author) commented Feb 5, 2022:

To preserve the previous behavior of #6949, where the status of re-enabling file deletion is not taken into account in the overall status of resuming the db, I separated `enable_file_deletion_s` from the general `s`.

@riversand963 (Contributor) replied:

Since `EnableFileDeletions()` always returns OK here, we can still use `s` and assert that it's OK. This way we don't need the extra variable `enable_file_deletion_s`.
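
A minimal sketch of this suggestion, assuming (as the `// Always return ok` comment states) that `EnableFileDeletions(/*force=*/true)` cannot fail:

```cpp
// Reuse the existing status variable instead of a separate
// enable_file_deletion_s; the assert documents the "always ok" contract.
s = EnableFileDeletions(/*force=*/true);
assert(s.ok());
```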

```cpp
}
mutex_.Unlock();

if (s.ok()) {
```
@hx235 (Author) commented:

To make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` clearer, I separated it into its own if-block.

@ajkr removed their request for review February 6, 2022 23:04
@ajkr (Contributor) commented Feb 6, 2022

Removing myself from reviewers, as I won't have the time/energy to learn auto-recovery anytime soon. @anand1976 probably already knows it, so he can review more efficiently.

@hx235 changed the title from "[draft] Attempt to deflake DBErrorHandlingFSTest.MultiCFWALWriteError" to "Deflake DBErrorHandlingFSTest.MultiCFWALWriteError" Feb 11, 2022
@anand1976 (Contributor) left a comment:

Great catch!

```cpp
}
mutex_.Unlock();

if (s.ok()) {
```
@anand1976 commented:

You're right that after calling ClearBGError(), nothing is guaranteed. That makes this block problematic, since the mutex is unlocked and locked again; in the meantime, the DB may be deleted, and lines 445-456 may access stale memory. We can continue to hold the mutex here to avoid that possibility.
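
A sketch of what holding the mutex across the call could look like (illustrative shape only, not the exact change that landed):

```cpp
// Hold mutex_ across ClearBGError() and the follow-up steps so a foreground
// thread woken by the recovery notification cannot finish closing the DB
// while this block still touches DB state.
mutex_.Lock();
if (s.ok()) {
  s = error_handler_.ClearBGError();
} else {
  error_handler_.GetRecoveryError().PermitUncheckedError();
}
if (s.ok()) {
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");
}
mutex_.Unlock();
```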

@hx235 (Author) replied:

Got it - the unlock/lock-again behavior came from the original PR 52d4c9b#diff-6fdb755f590d9b01ecb89bd8ceb28577e85536d4472f8e4fc3addeb9a65f3645R269, but I will change it. Thanks!

@hx235 (Author) replied:

Fixed

@riversand963 (Contributor) left a comment:

Thanks @hx235 for the fix! Left a few minor comments.

```cpp
}
mutex_.Unlock();

if (s.ok()) {
```
@riversand963 commented:

Also, maybe it's better to log the status anyway even if it's non-ok.
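
For instance, the log could cover both outcomes (hypothetical wording, not the exact message that landed):

```cpp
if (s.ok()) {
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");
} else {
  // Make failed resume attempts visible in the info log as well.
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Failed to resume DB [%s]",
                 s.ToString().c_str());
}
```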

@hx235 (Author) replied:

Fixed

```cpp
  // See https://github.com/facebook/rocksdb/pull/7715#issuecomment-754947952
  error_handler_.GetRecoveryError().PermitUncheckedError();
}
mutex_.Unlock();
```
@riversand963 commented:

Any reason we want to unlock here and later re-lock?


@hx235 (Author) replied:

Fixed


```cpp
mutex_.Lock();
if (s.ok()) {
  s = error_handler_.ClearBGError();
```
@riversand963 commented:

Maybe add some comment here about how the db may be closed?

@hx235 (Author) replied:

Fixed

@hx235 force-pushed the deflake-ubsan-check branch from 9dfc456 to ea4cc03 on February 24, 2022 22:25
@facebook-github-bot (Contributor) commented:

@hx235 has updated the pull request. You must reimport the pull request before landing.

@riversand963 (Contributor) left a comment:

Thanks @hx235 for the fix. Mostly LGTM with two minor comments.

HISTORY.md (outdated)

```diff
@@ -3,6 +3,9 @@
 ### New Features
 * Allow WriteBatchWithIndex to index a WriteBatch that includes keys with user-defined timestamps. The index itself does not have timestamp.

+### Bug Fixes
+* Fixed a data race on `VersionSet` in `DBImpl::ResumeImpl()`
```
@riversand963 commented:

Nit: illegal (stale) memory access on `versions_` caused by a race between db close issued by ... and `DBImpl::ResumeImpl()`.
The '...' can be something like "user-defined listener's OnErrorRecoveryXXX() methods", but I am not sure what is a good succinct way of describing it.

@hx235 (Author) commented Feb 25, 2022:

Good point! And thank you for providing an example.

I also found it hard to write a HISTORY.md entry for a fix like this, since (a) it's not as visible as fixing a bug that directly impacts a user API (so I am not even sure whether to include this in HISTORY.md) and (b) it involves internal implementation logic (so I am not sure how much detail to disclose).

@riversand963 replied:

Yeah, we can keep the entry in HISTORY.md simple and just reference this PR, so that people conveniently know where to look for more context. For example:

* Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and `EventListener::OnErrorRecoveryXX()` (#9496)

@hx235 (Author) replied:

Fixed

```cpp
@@ -423,16 +415,36 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
  if (file_deletion_disabled) {
    // Always return ok
    s = EnableFileDeletions(/*force=*/true);
    assert(s.ok());
```
@riversand963 commented:

Nit:
We can turn this `assert(s.ok())` into an `assert(false)` inside the `if (!s.ok())` branch following line 423.
I understand this (non-ok) should not happen now, but maybe it's better to "try" to log the error and then abort.
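
In other words, something like the following (hypothetical placement and log wording):

```cpp
s = EnableFileDeletions(/*force=*/true);
if (!s.ok()) {
  // "Try" to log the unexpected failure before aborting in debug builds.
  ROCKS_LOG_ERROR(immutable_db_options_.info_log,
                  "Failed to enable file deletions during resume [%s]",
                  s.ToString().c_str());
  assert(false);
}
```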

@hx235 (Author) replied:

Good point!

@hx235 (Author) replied:

Fixed

@facebook-github-bot (Contributor) commented:

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor) commented:

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@anand1976 (Contributor) left a comment:

LGTM

ajkr pushed a commit that referenced this pull request Mar 3, 2022
Summary:
**Context:**
As part of #6949, file deletion is disabled for a faulty database on an IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](e66199d#diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396). Before re-enabling file deletion, it calls `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `versions_` in the recovery process.

However, this is not necessarily true, because `s = error_handler_.ClearBGError();`, which happens before that assertion, can unblock some foreground thread via [`EventHelpers::NotifyOnErrorRecoveryEnd()`](https://github.com/facebook/rocksdb/blob/3122cb435875d720fc3d23a48eb7c0fa89d869aa/db/error_handler.cc#L552-L553) as part of `ClearBGError()`. That foreground thread can do whatever it wants, including closing/reopening the db and cleaning up that same `versions_`.

As a consequence, `assert(versions_->io_status().ok());` will access `io_status()` of a nullptr, and a test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok());` gets executed in the background thread, then we can see an error like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

**Summary:**
- I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up the foreground, which I think is right before we log `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
   - As context, the original #3997, which introduced `DBImpl::Resume()`, calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later #6949 distances these two calls a bit.
   - And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);`, at least syntax-wise, since these two functions are orthogonal. It also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which is basically resetting some state variables.
- In addition, to preserve the previous behavior of #6949, where the status of re-enabling file deletion is not taken into account in the overall status of resuming the db, I separated `enable_file_deletion_s` from the general `s`.
- In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` clearer, I separated it into its own if-block.

Pull Request resolved: #9496

Test Plan:
- Manually reproduce the assertion failure in `DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting a sleep like below, so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e., testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
   `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
   ```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
#1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
#2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
#3   rocksdb/error_handler_fs_test()
   ```
- The assertion failure does not happen with this PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)  `

Reviewed By: riversand963, anand1976

Differential Revision: D33990099

Pulled By: hx235

fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab
ajkr pushed a commit that referenced this pull request Mar 7, 2022
facebook-github-bot pushed a commit that referenced this pull request Dec 20, 2024

Summary:
`DBErrorHandlingFSTest.AtomicFlushNoSpaceError` is flaky due to a seg fault during error recovery:
```
...
frame #5: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>*, std::vector<rocksdb::ObsoleteBlobFileInfo, std::allocator<rocksdb::ObsoleteBlobFileInfo>>*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, unsigned long) [inlined] std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>::begin(this=<unavailable>) at stl_vector.h:812:16
frame #6: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(this=0x0000000000000000, files=size=0, blob_files=size=0, manifest_filenames=size=0, min_pending_output=18446744073709551615) at version_set.cc:7258:18
frame #7: 0x00007f0b3e8ccbc0 librocksdb.so.9.10`rocksdb::DBImpl::FindObsoleteFiles(this=<unavailable>, job_context=<unavailable>, force=<unavailable>, no_full_scan=<unavailable>) at db_impl_files.cc:162:30
frame #8: 0x00007f0b3e85e698 librocksdb.so.9.10`rocksdb::DBImpl::ResumeImpl(this=<unavailable>, context=<unavailable>) at db_impl.cc:434:20
frame #9: 0x00007f0b3e921516 librocksdb.so.9.10`rocksdb::ErrorHandler::RecoverFromBGError(this=<unavailable>, is_manual=<unavailable>) at error_handler.cc:632:46
```

I suspect this is due to the DB being destructed and reopened during recovery. Specifically, the [ClearBGError() call](https://github.com/facebook/rocksdb/blob/c72e79a262bf696faf5f8becabf92374fc14b464/db/db_impl/db_impl.cc#L425) can release and reacquire the mutex, and the DB can be closed during this time. So it's not safe to access DB state after ClearBGError(). There was a similar story in #9496. [Moving the obsolete files logic after ClearBGError()](#11955) probably makes the seg fault more easily triggered.

This PR updates `ClearBGError()` to guarantee that db close cannot finish until the method is returned and the mutex is released. So that we can safely access DB state after calling it.
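
The commit message does not spell out the mechanism, but the guarantee it describes matches a standard gate pattern: mark the notification as in progress under the mutex, and make the close path wait until it clears. A self-contained illustration using plain C++ primitives (not RocksDB internals; all names here are hypothetical):

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative gate: recovery sets notifying_ before it may drop the lock to
// run listener callbacks; the close path waits until it clears, so close
// cannot finish while an error-recovery notification is still in flight.
class RecoveryCloseGate {
 public:
  void BeginNotify() {
    std::lock_guard<std::mutex> lk(mu_);
    notifying_ = true;
  }
  void EndNotify() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      notifying_ = false;
    }
    cv_.notify_all();
  }
  // Called from the close path before tearing down shared state.
  void WaitUntilQuiescent() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return !notifying_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool notifying_ = false;
};
```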

Pull Request resolved: #13234

Test Plan: I could not trigger the seg fault locally, will just monitor future test failures.

Reviewed By: jowlyzhang

Differential Revision: D67476836

Pulled By: cbi42

fbshipit-source-id: dfb3e9ccd4eb3d43fc596ec10e4052861eeec002
@ti-chi-bot mentioned this pull request Dec 31, 2024