Added the test case for compaction recover
zhichao-cao committed Jul 15, 2020
1 parent 8a139f8 commit e2b5396
Showing 5 changed files with 99 additions and 42 deletions.
2 changes: 1 addition & 1 deletion HISTORY.md
@@ -24,6 +24,7 @@
* DB identity (`db_id`) and DB session identity (`db_session_id`) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. The session ID for SstFileWriter (resp., Repairer) resets every time `SstFileWriter::Open` (resp., `Repairer::Run`) is called.
* Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
* `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is added, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kChecksumAndDbSessionId`. By default, `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is set to `kChecksumAndDbSessionId`. In this default case, backup table filenames are of the form `<file_number>_<crc32c>_<db_session_id>.sst` as opposed to `<file_number>_<crc32c>_<file_size>.sst`. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option `kChecksumAndFileSize` is added to allow use of the old naming in case it is needed. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that `share_files_with_checksum_naming` comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.
* Added an auto resume function to automatically recover the DB from background retryable IO errors. When a retryable IOError happens during flush or WAL write, the error is mapped to Hard Error and the DB is put in read-only mode. When a retryable IO Error happens during compaction, the error is mapped to Soft Error and the DB stays in read/write mode. The auto resume function creates a thread for the DB to call DB->ResumeImpl() to attempt recovery from retryable IO errors during flush and WAL write; compaction is rescheduled by itself if a retryable IO Error happens. Auto resume may hit another retryable IO Error during the recovery, in which case the recovery fails. Retrying the auto resume may solve the issue, so max_bgerror_resume_count decides how many resume cycles are tried in total. If it is <= 0, auto resume of retryable IO errors is disabled. The default is INT_MAX, which leads to unlimited auto resume attempts. bgerror_resume_retry_interval decides the time interval between two auto resumes (see the configuration sketch below).
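
A minimal configuration sketch of these two knobs (not part of this commit; the option names match `include/rocksdb/options.h` in this diff, while the DB path and chosen values are illustrative):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Retry background-error recovery up to 5 times instead of the
  // INT_MAX default; a value <= 0 would disable auto resume entirely.
  options.max_bgerror_resume_count = 5;
  // Wait 1 second (in microseconds) between two resume attempts.
  options.bgerror_resume_retry_interval = 1000000;

  rocksdb::DB* db = nullptr;
  // "/tmp/rocksdb_auto_resume_demo" is an arbitrary example path.
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/rocksdb_auto_resume_demo", &db);
  assert(s.ok());

  // ... normal reads/writes; background retryable IO errors during flush
  // or WAL write are now retried automatically by the resume thread ...

  delete db;
  return 0;
}
```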

### Bug Fixes
* Fail recovery and report once hitting a physical log record checksum mismatch while reading MANIFEST; RocksDB should not continue processing the MANIFEST any further.
@@ -63,7 +64,6 @@
* Generate file checksum in SstFileWriter if Options.file_checksum_gen_factory is set. The checksum and checksum function name are stored in ExternalSstFileInfo after the sst file write is finished.
* Add a value_size_soft_limit in read options which limits the cumulative value size of keys read in batches in MultiGet. Once the cumulative value size of found keys exceeds read_options.value_size_soft_limit, all the remaining keys are returned with status Abort without further finding their values. By default, value_size_soft_limit is std::numeric_limits<uint64_t>::max() (see the MultiGet sketch after this list).
* Enable SST file ingestion with file checksum information when calling IngestExternalFiles(const std::vector<IngestExternalFileArg>& args). Added files_checksums and files_checksum_func_names to IngestExternalFileArg so that users can ingest SST files with their file checksum information. Added verify_file_checksum to IngestExternalFileOptions (default is True). To be backward compatible, if the DB does not enable file checksum or the user does not provide checksum information (the vectors files_checksums and files_checksum_func_names are both empty), verification of the file checksum always succeeds. If the DB enables file checksum, the DB will always generate the checksum for each ingested SST file during the Prepare stage of ingestion and store the checksum in the Manifest, unless verify_file_checksum is False and checksum information is provided by the application; in that case, only the checksum function name is verified and the ingested checksum is stored directly in the Manifest. If verify_file_checksum is set to True, the DB will verify the ingested checksum and function name against the generated ones, and any mismatch will fail the ingestion. Note that if IngestExternalFileOptions::write_global_seqno is True, the seqno will be changed in the ingested file and therefore the checksum of the file will change; in this case, a new checksum is generated after the seqno is updated and stored in the Manifest. (An ingestion sketch follows this list.)
* Added auto resume function to automatically recover the DB from background Retryable IO Error. When retryable BGIOError happens during compaction, flush, WAL write, DB will be in read mode. Autoresume function will create a thread for a DB to call DB->ResumeImpl() to try the recover. Auto resume may also cause other Retryable IO Error during the recovery, so the recovery will fail. Retry the auto resume may solve the issue, so we use max_bgerror_resume_count to decide how many resume cycles will be tried in total. If it is <=0, auto resume retryable IO Error is disabled. bgerror_resume_retry_interval decides the time interval between two auto resumes.
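
A minimal MultiGet sketch for the value_size_soft_limit entry above, assuming an already-open DB handle; the 1 MB budget and key names are illustrative, not from this commit:

```cpp
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Assumes `db` is an already-opened rocksdb::DB*.
void MultiGetWithValueBudget(rocksdb::DB* db) {
  rocksdb::ReadOptions read_options;
  // Stop materializing values once ~1 MB of value data has been found;
  // the remaining keys come back with an aborted status instead of a value.
  read_options.value_size_soft_limit = 1 * 1024 * 1024;

  std::vector<rocksdb::Slice> keys = {"key1", "key2", "key3"};
  std::vector<std::string> values;
  std::vector<rocksdb::Status> statuses =
      db->MultiGet(read_options, keys, &values);

  for (size_t i = 0; i < keys.size(); ++i) {
    if (statuses[i].ok()) {
      // values[i] holds the found value.
    } else {
      // Not found, or skipped because the cumulative value size
      // exceeded value_size_soft_limit.
    }
  }
}
```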
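
A hedged sketch of the checksum-aware ingestion described above; the field names follow the IngestExternalFileArg description in this entry, while the file paths and checksum strings are placeholders:

```cpp
#include <string>
#include <vector>

#include "rocksdb/db.h"

// Assumes `db` is an already-opened rocksdb::DB* and that the caller has
// the per-file checksums recorded when the SST files were created.
rocksdb::Status IngestWithChecksums(rocksdb::DB* db) {
  rocksdb::IngestExternalFileArg arg;
  arg.column_family = db->DefaultColumnFamily();
  arg.external_files = {"/tmp/example_1.sst", "/tmp/example_2.sst"};

  // Checksum info captured at file-creation time (illustrative values).
  arg.files_checksums = {"checksum_of_file_1", "checksum_of_file_2"};
  arg.files_checksum_func_names = {"crc32c", "crc32c"};

  // Ask the DB to re-generate checksums and compare them against the
  // provided ones during the Prepare stage of ingestion.
  arg.options.verify_file_checksum = true;

  return db->IngestExternalFiles({arg});
}
```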

### Performance Improvements
* Eliminate redundant key comparisons during random access in block-based tables.
1 change: 0 additions & 1 deletion db/compaction/compaction_job.cc
@@ -621,7 +621,6 @@ Status CompactionJob::Run() {
break;
}
}

if (io_status_.ok()) {
io_status_ = io_s;
}
14 changes: 1 addition & 13 deletions db/error_handler.cc
@@ -7,7 +7,6 @@
#include "db/db_impl/db_impl.h"
#include "db/event_helpers.h"
#include "file/sst_file_manager_impl.h"
#include <iostream>

namespace ROCKSDB_NAMESPACE {

@@ -260,17 +259,6 @@ Status ErrorHandler::SetBGError(const Status& bg_err, BackgroundErrorReason reason)

Status ErrorHandler::SetBGError(const IOStatus& bg_io_err,
BackgroundErrorReason reason) {
if (reason == BackgroundErrorReason::kCompaction) {
std::cout<<"compaction\n";
} else if (reason == BackgroundErrorReason::kFlush) {
std::cout<<"flush\n";
} else if (reason == BackgroundErrorReason::kWriteCallback) {
std::cout<<"write call back\n";
} else if (reason == BackgroundErrorReason::kMemTable) {
std::cout<<"memtable\n";
} else {
std::cout<<"other\n";
}
db_mutex_->AssertHeld();
if (bg_io_err.ok()) {
return Status::OK();
@@ -312,7 +300,7 @@ Status ErrorHandler::SetBGError(const IOStatus& bg_io_err,
if (BackgroundErrorReason::kCompaction == reason) {
Status bg_err(new_bg_io_err, Status::Severity::kSoftError);
if (bg_err.severity() > bg_error_.severity()) {
bg_error_ = bg_err;
bg_error_ = bg_err;
}
return bg_error_;
} else {
107 changes: 88 additions & 19 deletions db/error_handler_fs_test.cc
@@ -192,6 +192,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeError) {
options.env = fault_fs_env.get();
options.create_if_missing = true;
options.listeners.emplace_back(listener);
options.max_bgerror_resume_count = 0;
Status s;

listener->EnableAutoRecovery(false);
@@ -298,6 +299,7 @@ TEST_F(DBErrorHandlingFSTest, ManifestWriteRetryableError) {
options.env = fault_fs_env.get();
options.create_if_missing = true;
options.listeners.emplace_back(listener);
options.max_bgerror_resume_count = 0;
Status s;
std::string old_manifest;
std::string new_manifest;
@@ -467,6 +469,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionManifestWriteRetryableError) {
options.create_if_missing = true;
options.level0_file_num_compaction_trigger = 2;
options.listeners.emplace_back(listener);
options.max_bgerror_resume_count = 0;
Status s;
std::string old_manifest;
std::string new_manifest;
@@ -585,6 +588,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableError) {
options.create_if_missing = true;
options.level0_file_num_compaction_trigger = 2;
options.listeners.emplace_back(listener);
options.max_bgerror_resume_count = 0;
Status s;
DestroyAndReopen(options);

@@ -602,7 +606,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableError) {
{{"DBImpl::FlushMemTable:FlushMemTableFinished",
"BackgroundCallCompaction:0"}});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"BackgroundCallCompaction:0",
"CompactionJob::OpenCompactionOutputFile",
[&](void*) { fault_fs->SetFilesystemActive(false, error_msg); });
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

@@ -611,7 +615,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableError) {
ASSERT_EQ(s, Status::OK());

s = dbfull()->TEST_WaitForCompact();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);

fault_fs->SetFilesystemActive(true);
SyncPoint::GetInstance()->ClearAllCallBacks();
@@ -745,7 +749,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteError) {
WriteBatch batch;

for (auto i = 0; i < 100; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -758,7 +762,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteError) {
int write_error = 0;

for (auto i = 100; i < 199; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

SyncPoint::GetInstance()->SetCallBack(
@@ -808,6 +812,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) {
options.writable_file_max_buffer_size = 32768;
options.listeners.emplace_back(listener);
options.paranoid_checks = true;
options.max_bgerror_resume_count = 0;
Status s;
Random rnd(301);

@@ -821,7 +826,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) {
WriteBatch batch;

for (auto i = 0; i < 100; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -836,7 +841,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) {
int write_error = 0;

for (auto i = 100; i < 200; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

SyncPoint::GetInstance()->SetCallBack(
@@ -872,7 +877,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) {
WriteBatch batch;

for (auto i = 200; i < 300; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -913,7 +918,7 @@ TEST_F(DBErrorHandlingFSTest, MultiCFWALWriteError) {

for (auto i = 1; i < 4; ++i) {
for (auto j = 0; j < 100; ++j) {
batch.Put(handles_[i], Key(j), RandomString(&rnd, 1024));
batch.Put(handles_[i], Key(j), rnd.RandomString(1024));
}
}

@@ -928,7 +933,7 @@ TEST_F(DBErrorHandlingFSTest, MultiCFWALWriteError) {

// Write to one CF
for (auto i = 100; i < 199; ++i) {
batch.Put(handles_[2], Key(i), RandomString(&rnd, 1024));
batch.Put(handles_[2], Key(i), rnd.RandomString(1024));
}

SyncPoint::GetInstance()->SetCallBack(
@@ -1017,7 +1022,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBCompactionError) {
WriteBatch batch;

for (auto j = 0; j <= 100; ++j) {
batch.Put(Key(j), RandomString(&rnd, 1024));
batch.Put(Key(j), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -1032,7 +1037,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBCompactionError) {

// Write to one CF
for (auto j = 100; j < 199; ++j) {
batch.Put(Key(j), RandomString(&rnd, 1024));
batch.Put(Key(j), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -1130,7 +1135,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBVariousErrors) {
WriteBatch batch;

for (auto j = 0; j <= 100; ++j) {
batch.Put(Key(j), RandomString(&rnd, 1024));
batch.Put(Key(j), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -1145,7 +1150,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBVariousErrors) {

// Write to one CF
for (auto j = 100; j < 199; ++j) {
batch.Put(Key(j), RandomString(&rnd, 1024));
batch.Put(Key(j), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -1694,6 +1699,70 @@ TEST_F(DBErrorHandlingFSTest,
Close();
}

TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableErrorAutoRecover) {
// In this test, the FS is set to error during the first round of compaction,
// so the first compaction fails with a retryable IO error, which is mapped
// to a soft error. Compaction is then rescheduled; in the second round the
// FS is set back to active, compaction succeeds, and the test hits the
// CompactionJob::FinishCompactionOutputFile1 sync point.
std::shared_ptr<FaultInjectionTestFS> fault_fs(
new FaultInjectionTestFS(FileSystem::Default()));
std::unique_ptr<Env> fault_fs_env(NewCompositeEnv(fault_fs));
std::shared_ptr<ErrorHandlerFSListener> listener(
new ErrorHandlerFSListener());
Options options = GetDefaultOptions();
options.env = fault_fs_env.get();
options.create_if_missing = true;
options.level0_file_num_compaction_trigger = 2;
options.listeners.emplace_back(listener);
Status s;
std::atomic<bool> fail_first(false);
std::atomic<bool> fail_second(true);
DestroyAndReopen(options);

IOStatus error_msg = IOStatus::IOError("Retryable IO Error");
error_msg.SetRetryable(true);

Put(Key(0), "va;");
Put(Key(2), "va;");
s = Flush();
ASSERT_EQ(s, Status::OK());

listener->OverrideBGError(Status(error_msg, Status::Severity::kHardError));
listener->EnableAutoRecovery(false);
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
{{"DBImpl::FlushMemTable:FlushMemTableFinished",
"BackgroundCallCompaction:0"},
{"CompactionJob::FinishCompactionOutputFile1",
"CompactionWriteRetryableErrorAutoRecover0"}});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"DBImpl::BackgroundCompaction:Start",
[&](void*) { fault_fs->SetFilesystemActive(true); });
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"BackgroundCallCompaction:0", [&](void*) { fail_first.store(true); });
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"CompactionJob::OpenCompactionOutputFile", [&](void*) {
if (fail_first.load() && fail_second.load()) {
fault_fs->SetFilesystemActive(false, error_msg);
fail_second.store(false);
}
});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

Put(Key(1), "val");
s = Flush();
ASSERT_EQ(s, Status::OK());

s = dbfull()->TEST_WaitForCompact();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);

TEST_SYNC_POINT("CompactionWriteRetryableErrorAutoRecover0");
SyncPoint::GetInstance()->ClearAllCallBacks();
SyncPoint::GetInstance()->DisableProcessing();
Destroy(options);
}

TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) {
std::shared_ptr<FaultInjectionTestFS> fault_fs(
new FaultInjectionTestFS(FileSystem::Default()));
@@ -1721,7 +1790,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) {
WriteBatch batch;

for (auto i = 0; i < 100; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -1736,7 +1805,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) {
int write_error = 0;

for (auto i = 100; i < 200; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
{{"RecoverFromRetryableBGIOError:BeforeResume0", "WALWriteError1:0"},
@@ -1778,7 +1847,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) {
WriteBatch batch;

for (auto i = 200; i < 300; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -1825,7 +1894,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover2) {
WriteBatch batch;

for (auto i = 0; i < 100; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

WriteOptions wopts;
@@ -1840,7 +1909,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover2) {
int write_error = 0;

for (auto i = 100; i < 200; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
{{"RecoverFromRetryableBGIOError:BeforeWait0", "WALWriteError2:0"},
@@ -1882,7 +1951,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover2) {
WriteBatch batch;

for (auto i = 200; i < 300; ++i) {
batch.Put(Key(i), RandomString(&rnd, 1024));
batch.Put(Key(i), rnd.RandomString(1024));
}

WriteOptions wopts;
17 changes: 9 additions & 8 deletions include/rocksdb/options.h
@@ -1138,20 +1138,21 @@ struct DBOptions {
// Default: false
bool best_efforts_recovery = false;

// It defines How many times call db resume when retryable BGError happens.
// When BGError happens, SetBGError is called to deal with the Error. If
// the error can be auto-recovered, db resume is called in background to
// recover from the error. If this value is 0 or negative, db resume will
// not be called.
// It defines how many times db resume is called by a separate thread when
// background retryable IO Error happens. When background retryable IO
// Error happens, SetBGError is called to deal with the error. If the error
// can be auto-recovered (e.g., retryable IO Error during Flush or WAL write),
// then db resume is called in background to recover from the error. If this
// value is 0 or negative, db resume will not be called.
//
// Default: 0
int max_bgerror_resume_count = 0;
// Default: INT_MAX
int max_bgerror_resume_count = INT_MAX;

// If max_bgerror_resume_count is >= 2, db resume is called multiple times.
// This option decides how long to wait to retry the next resume if the
// previous resume fails and the conditions for redoing the resume are met.
//
// Default: 10000000 (microseconds).
// Default: 1000000 (microseconds).
uint64_t bgerror_resume_retry_interval = 1000000;
};

Expand Down
