diff --git a/HISTORY.md b/HISTORY.md index 09a05e1004f..52bc6f3a546 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -24,6 +24,7 @@ * DB identity (`db_id`) and DB session identity (`db_session_id`) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. The session ID for SstFileWriter (resp., Repairer) resets every time `SstFileWriter::Open` (resp., `Repairer::Run`) is called. * Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5. * `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is added, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kChecksumAndDbSessionId`. By default, `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is set to `kChecksumAndDbSessionId`. In this default case, backup table filenames are of the form `<file_number>_<crc32c>_<db_session_id>.sst` as opposed to `<file_number>_<crc32c>_<file_size>.sst`. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option `kChecksumAndFileSize` is added to allow use of old naming in case it is needed. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that `share_files_with_checksum_naming` comes into effect only when both `share_files_with_checksum` and `share_table_files` are true. +* Added an auto resume function to automatically recover the DB from a background retryable IO error. When a retryable IO error happens during flush or WAL write, the error is mapped to a hard error and the DB enters read-only mode; when it happens during compaction, the error is mapped to a soft error and the DB stays in read/write mode. For hard errors from flush and WAL write, the auto resume function creates a thread that calls DB::ResumeImpl() to try to recover; a compaction that hits a retryable IO error is simply rescheduled. The recovery itself may hit further retryable IO errors and fail, and retrying the resume may then succeed, so max_bgerror_resume_count decides how many resume cycles will be tried in total. If it is <= 0, auto resume of retryable IO errors is disabled. The default is INT_MAX, which effectively retries the auto resume indefinitely. bgerror_resume_retry_interval decides the time interval between two auto resume attempts. ### Bug Fixes * Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further. @@ -63,7 +64,6 @@ * Generate file checksum in SstFileWriter if Options.file_checksum_gen_factory is set. The checksum and checksum function name are stored in ExternalSstFileInfo after the sst file write is finished. * Add a value_size_soft_limit in read options which limits the cumulative value size of keys read in batches in MultiGet.
Once the cumulative value size of found keys exceeds read_options.value_size_soft_limit, all the remaining keys are returned with status Abort without further finding their values. By default the value_size_soft_limit is std::numeric_limits<uint64_t>::max(). * Enable SST file ingestion with file checksum information when calling IngestExternalFiles(const std::vector<IngestExternalFileArg>& args). Added files_checksums and files_checksum_func_names to IngestExternalFileArg such that user can ingest the sst files with their file checksum information. Added verify_file_checksum to IngestExternalFileOptions (default is True). To be backward compatible, if DB does not enable file checksum or user does not provide checksum information (vectors of files_checksums and files_checksum_func_names are both empty), verification of file checksum is always successful. If DB enables file checksum, DB will always generate the checksum for each ingested SST file during Prepare stage of ingestion and store the checksum in Manifest, unless verify_file_checksum is False and checksum information is provided by the application. In this case, we only verify the checksum function name and directly store the ingested checksum in Manifest. If verify_file_checksum is set to True, DB will verify the ingested checksum and function name with the generated ones. Any mismatch will fail the ingestion. Note that, if IngestExternalFileOptions::write_global_seqno is True, the seqno will be changed in the ingested file. Therefore, the checksum of the file will be changed. In this case, a new checksum will be generated after the seqno is updated and be stored in the Manifest. -* Added auto resume function to automatically recover the DB from background Retryable IO Error. When retryable BGIOError happens during compaction, flush, WAL write, DB will be in read mode. Autoresume function will create a thread for a DB to call DB->ResumeImpl() to try the recover. Auto resume may also cause other Retryable IO Error during the recovery, so the recovery will fail. Retry the auto resume may solve the issue, so we use max_bgerror_resume_count to decide how many resume cycles will be tried in total. If it is <=0, auto resume retryable IO Error is disabled. bgerror_resume_retry_interval decides the time interval between two auto resumes. ### Performance Improvements * Eliminate redundant key comparisons during random access in block-based tables.
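The auto-resume entry in the HISTORY.md hunk above is configured through two new DBOptions fields, max_bgerror_resume_count and bgerror_resume_retry_interval. As a minimal illustrative sketch (the database path and the specific values here are arbitrary examples, not recommendations):

```cpp
// Illustrative sketch: configuring the auto-resume knobs introduced in this
// change. Path and values are placeholders chosen for the example.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // New default is INT_MAX (keep retrying recovery); 0 or a negative value
  // disables auto resume of background retryable IO errors.
  options.max_bgerror_resume_count = 5;
  // Microseconds to wait between two consecutive resume attempts.
  options.bgerror_resume_retry_interval = 1000000;  // 1 second

  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/auto_resume_example", &db);
  if (!s.ok()) {
    return 1;
  }
  delete db;
  return 0;
}
```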
diff --git a/db/compaction/compaction_job.cc b/db/compaction/compaction_job.cc index d75224afde1..d19b719777c 100644 --- a/db/compaction/compaction_job.cc +++ b/db/compaction/compaction_job.cc @@ -621,7 +621,6 @@ Status CompactionJob::Run() { break; } } - if (io_status_.ok()) { io_status_ = io_s; } diff --git a/db/error_handler.cc b/db/error_handler.cc index 06c5849029d..e344e99317a 100644 --- a/db/error_handler.cc +++ b/db/error_handler.cc @@ -7,7 +7,6 @@ #include "db/db_impl/db_impl.h" #include "db/event_helpers.h" #include "file/sst_file_manager_impl.h" -#include <iostream> namespace ROCKSDB_NAMESPACE { @@ -260,17 +259,6 @@ Status ErrorHandler::SetBGError(const Status& bg_err, BackgroundErrorReason reas Status ErrorHandler::SetBGError(const IOStatus& bg_io_err, BackgroundErrorReason reason) { - if (reason == BackgroundErrorReason::kCompaction) { - std::cout<<"compaction\n"; - } else if (reason == BackgroundErrorReason::kFlush) { - std::cout<<"flush\n"; - } else if (reason == BackgroundErrorReason::kWriteCallback) { - std::cout<<"write call back\n"; - } else if (reason == BackgroundErrorReason::kMemTable) { - std::cout<<"memtable\n"; - } else { - std::cout<<"other\n"; - } db_mutex_->AssertHeld(); if (bg_io_err.ok()) { return Status::OK(); @@ -312,7 +300,7 @@ Status ErrorHandler::SetBGError(const IOStatus& bg_io_err, if (BackgroundErrorReason::kCompaction == reason) { Status bg_err(new_bg_io_err, Status::Severity::kSoftError); if (bg_err.severity() > bg_error_.severity()) { - bg_error_ = bg_err; + bg_error_ = bg_err; } return bg_error_; } else { diff --git a/db/error_handler_fs_test.cc b/db/error_handler_fs_test.cc index 4c747e5d0f6..cf1d7189a61 100644 --- a/db/error_handler_fs_test.cc +++ b/db/error_handler_fs_test.cc @@ -192,6 +192,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeError) { options.env = fault_fs_env.get(); options.create_if_missing = true; options.listeners.emplace_back(listener); + options.max_bgerror_resume_count = 0; Status s; listener->EnableAutoRecovery(false); @@ -298,6 +299,7 @@ TEST_F(DBErrorHandlingFSTest, ManifestWriteRetryableError) { options.env = fault_fs_env.get(); options.create_if_missing = true; options.listeners.emplace_back(listener); + options.max_bgerror_resume_count = 0; Status s; std::string old_manifest; std::string new_manifest; @@ -467,6 +469,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionManifestWriteRetryableError) { options.create_if_missing = true; options.level0_file_num_compaction_trigger = 2; options.listeners.emplace_back(listener); + options.max_bgerror_resume_count = 0; Status s; std::string old_manifest; std::string new_manifest; @@ -585,6 +588,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableError) { options.create_if_missing = true; options.level0_file_num_compaction_trigger = 2; options.listeners.emplace_back(listener); + options.max_bgerror_resume_count = 0; Status s; DestroyAndReopen(options); @@ -602,7 +606,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableError) { {{"DBImpl::FlushMemTable:FlushMemTableFinished", "BackgroundCallCompaction:0"}}); ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack( - "BackgroundCallCompaction:0", + "CompactionJob::OpenCompactionOutputFile", [&](void*) { fault_fs->SetFilesystemActive(false, error_msg); }); ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing(); @@ -611,7 +615,7 @@ TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableError) { ASSERT_EQ(s, Status::OK()); s = dbfull()->TEST_WaitForCompact(); - ASSERT_EQ(s.severity(),
ROCKSDB_NAMESPACE::Status::Severity::kHardError); + ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError); fault_fs->SetFilesystemActive(true); SyncPoint::GetInstance()->ClearAllCallBacks(); @@ -745,7 +749,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteError) { WriteBatch batch; for (auto i = 0; i < 100; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } WriteOptions wopts; @@ -758,7 +762,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteError) { int write_error = 0; for (auto i = 100; i < 199; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } SyncPoint::GetInstance()->SetCallBack( @@ -808,6 +812,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) { options.writable_file_max_buffer_size = 32768; options.listeners.emplace_back(listener); options.paranoid_checks = true; + options.max_bgerror_resume_count = 0; Status s; Random rnd(301); @@ -821,7 +826,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) { WriteBatch batch; for (auto i = 0; i < 100; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } WriteOptions wopts; @@ -836,7 +841,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) { int write_error = 0; for (auto i = 100; i < 200; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } SyncPoint::GetInstance()->SetCallBack( @@ -872,7 +877,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableError) { WriteBatch batch; for (auto i = 200; i < 300; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } WriteOptions wopts; @@ -913,7 +918,7 @@ TEST_F(DBErrorHandlingFSTest, MultiCFWALWriteError) { for (auto i = 1; i < 4; ++i) { for (auto j = 0; j < 100; ++j) { - batch.Put(handles_[i], Key(j), RandomString(&rnd, 1024)); + batch.Put(handles_[i], Key(j), rnd.RandomString(1024)); } } @@ -928,7 +933,7 @@ TEST_F(DBErrorHandlingFSTest, MultiCFWALWriteError) { // Write to one CF for (auto i = 100; i < 199; ++i) { - batch.Put(handles_[2], Key(i), RandomString(&rnd, 1024)); + batch.Put(handles_[2], Key(i), rnd.RandomString(1024)); } SyncPoint::GetInstance()->SetCallBack( @@ -1017,7 +1022,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBCompactionError) { WriteBatch batch; for (auto j = 0; j <= 100; ++j) { - batch.Put(Key(j), RandomString(&rnd, 1024)); + batch.Put(Key(j), rnd.RandomString(1024)); } WriteOptions wopts; @@ -1032,7 +1037,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBCompactionError) { // Write to one CF for (auto j = 100; j < 199; ++j) { - batch.Put(Key(j), RandomString(&rnd, 1024)); + batch.Put(Key(j), rnd.RandomString(1024)); } WriteOptions wopts; @@ -1130,7 +1135,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBVariousErrors) { WriteBatch batch; for (auto j = 0; j <= 100; ++j) { - batch.Put(Key(j), RandomString(&rnd, 1024)); + batch.Put(Key(j), rnd.RandomString(1024)); } WriteOptions wopts; @@ -1145,7 +1150,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBVariousErrors) { // Write to one CF for (auto j = 100; j < 199; ++j) { - batch.Put(Key(j), RandomString(&rnd, 1024)); + batch.Put(Key(j), rnd.RandomString(1024)); } WriteOptions wopts; @@ -1694,6 +1699,70 @@ TEST_F(DBErrorHandlingFSTest, Close(); } +TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableErrorAutoRecover) { + // In this test, in the first round of compaction, the FS is set to error. + // So the first compaction fails due to retryable IO error and it is mapped + // to soft error. 
Then, compaction is rescheduled, in the second round of + // compaction, the FS is set to active and compaction is successful, so + // the test will hit the CompactionJob::FinishCompactionOutputFile1 sync + // point. + std::shared_ptr<FaultInjectionTestFS> fault_fs( + new FaultInjectionTestFS(FileSystem::Default())); + std::unique_ptr<Env> fault_fs_env(NewCompositeEnv(fault_fs)); + std::shared_ptr<ErrorHandlerFSListener> listener( + new ErrorHandlerFSListener()); + Options options = GetDefaultOptions(); + options.env = fault_fs_env.get(); + options.create_if_missing = true; + options.level0_file_num_compaction_trigger = 2; + options.listeners.emplace_back(listener); + Status s; + std::atomic<bool> fail_first(false); + std::atomic<bool> fail_second(true); + DestroyAndReopen(options); + + IOStatus error_msg = IOStatus::IOError("Retryable IO Error"); + error_msg.SetRetryable(true); + + Put(Key(0), "va;"); + Put(Key(2), "va;"); + s = Flush(); + ASSERT_EQ(s, Status::OK()); + + listener->OverrideBGError(Status(error_msg, Status::Severity::kHardError)); + listener->EnableAutoRecovery(false); + ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency( + {{"DBImpl::FlushMemTable:FlushMemTableFinished", + "BackgroundCallCompaction:0"}, + {"CompactionJob::FinishCompactionOutputFile1", + "CompactionWriteRetryableErrorAutoRecover0"}}); + ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack( + "DBImpl::BackgroundCompaction:Start", + [&](void*) { fault_fs->SetFilesystemActive(true); }); + ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack( + "BackgroundCallCompaction:0", [&](void*) { fail_first.store(true); }); + ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack( + "CompactionJob::OpenCompactionOutputFile", [&](void*) { + if (fail_first.load() && fail_second.load()) { + fault_fs->SetFilesystemActive(false, error_msg); + fail_second.store(false); + } + }); + ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing(); + + Put(Key(1), "val"); + s = Flush(); + ASSERT_EQ(s, Status::OK()); + + s = dbfull()->TEST_WaitForCompact(); + ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError); + + TEST_SYNC_POINT("CompactionWriteRetryableErrorAutoRecover0"); + SyncPoint::GetInstance()->ClearAllCallBacks(); + SyncPoint::GetInstance()->DisableProcessing(); + Destroy(options); +} + TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) { std::shared_ptr<FaultInjectionTestFS> fault_fs( new FaultInjectionTestFS(FileSystem::Default())); @@ -1721,7 +1790,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) { WriteBatch batch; for (auto i = 0; i < 100; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } WriteOptions wopts; @@ -1736,7 +1805,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) { int write_error = 0; for (auto i = 100; i < 200; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency( {{"RecoverFromRetryableBGIOError:BeforeResume0", "WALWriteError1:0"}, @@ -1778,7 +1847,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover1) { WriteBatch batch; for (auto i = 200; i < 300; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } WriteOptions wopts; @@ -1825,7 +1894,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover2) { WriteBatch batch; for (auto i = 0; i < 100; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } WriteOptions wopts;
@@ -1840,7 +1909,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover2) { int write_error = 0; for (auto i = 100; i < 200; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency( {{"RecoverFromRetryableBGIOError:BeforeWait0", "WALWriteError2:0"}, @@ -1882,7 +1951,7 @@ TEST_F(DBErrorHandlingFSTest, WALWriteRetryableErrorAutoRecover2) { WriteBatch batch; for (auto i = 200; i < 300; ++i) { - batch.Put(Key(i), RandomString(&rnd, 1024)); + batch.Put(Key(i), rnd.RandomString(1024)); } WriteOptions wopts; diff --git a/include/rocksdb/options.h b/include/rocksdb/options.h index 5cdafc7bcb4..74e2e62ff7a 100644 --- a/include/rocksdb/options.h +++ b/include/rocksdb/options.h @@ -1138,20 +1138,21 @@ struct DBOptions { // Default: false bool best_efforts_recovery = false; - // It defines How many times call db resume when retryable BGError happens. - // When BGError happens, SetBGError is called to deal with the Error. If - // the error can be auto-recovered, db resume is called in background to - // recover from the error. If this value is 0 or negative, db resume will - // not be called. + // It defines how many times db resume is called by a separate thread when + // background retryable IO Error happens. When background retryable IO + // Error happens, SetBGError is called to deal with the error. If the error + // can be auto-recovered (e.g., retryable IO Error during Flush or WAL write), + // then db resume is called in background to recover from the error. If this + // value is 0 or negative, db resume will not be called. // - // Default: 0 - int max_bgerror_resume_count = 0; + // Default: INT_MAX + int max_bgerror_resume_count = INT_MAX; // If max_bgerror_resume_count is >= 2, db resume is called multiple times. // This option decides how long to wait to retry the next resume if the // previous resume fails and satisfy redo resume conditions. // - // Default: 10000000 (microseconds). + // Default: 1000000 (microseconds). uint64_t bgerror_resume_retry_interval = 1000000; };
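To make the semantics of the two options in the options.h hunk above concrete, here is a purely conceptual sketch of the bounded retry loop they control. This is not the actual ErrorHandler recovery code; `TryResumeOnce` and `SleepMicros` are hypothetical stand-ins for the internal resume attempt and the wait between attempts.

```cpp
// Conceptual sketch only -- not the RocksDB implementation.
#include <cstdint>
#include <functional>

bool AutoResumeLoop(int max_bgerror_resume_count,
                    uint64_t bgerror_resume_retry_interval,
                    const std::function<bool()>& TryResumeOnce,
                    const std::function<void(uint64_t)>& SleepMicros) {
  // A value <= 0 disables auto resume entirely.
  if (max_bgerror_resume_count <= 0) {
    return false;
  }
  for (int attempt = 0; attempt < max_bgerror_resume_count; ++attempt) {
    if (TryResumeOnce()) {
      return true;  // Background error cleared; the DB is writable again.
    }
    // This resume attempt hit another retryable IO error; wait and retry.
    SleepMicros(bgerror_resume_retry_interval);
  }
  return false;  // Retries exhausted; a manual DB::Resume() is still possible.
}
```

With the new defaults (max_bgerror_resume_count = INT_MAX, bgerror_resume_retry_interval = 1000000 microseconds), such a loop keeps retrying roughly once per second until recovery succeeds.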