RocksDB corruption leads to endless failure #1383

acelyc111 · 2023-03-06T12:09:35Z

Bug Report


What did you do?
Some RocksDB instances became corrupted for some reason, e.g. a disk I/O error or data corruption.
What did you expect to see?
The cluster can recover from the error automatically.
What did you see instead?
The replica server closes the replica when it sees write errors, but then starts the replica again in the same place, where the data is still corrupted. The error then occurs again and again.

For read requests, the replica does not handle the error at all, so read requests fail again and again.

Write:

Read:

What version of Pegasus are you using?
1.12, 2.0

#1383 This is a refactor patch in preparation for fixing #1383. It makes no functional changes and only includes refactors:
1. Moves the functions `load()`, `newr()` and `clear_on_failure()` from class replica to class replica_stub; the first two are renamed to `load_replica()` and `new_replica()`.
2. Encapsulates a new function `move_to_err_path`.
3. Some minor refactors, such as fixing typos.

#1383 To handle the return codes of read and write requests, it helps to first refactor the return codes of the related functions. This patch changes them to use rocksdb::Status::code instead of a meaningless integer, and leaves some TODOs to be dealt with in follow-up patches.
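As a rough illustration of why a typed code is easier to act on than a bare integer, here is a minimal sketch. `StorageCode`, `code_name` and `is_engine_failure` are hypothetical names, not Pegasus or RocksDB APIs; the enum is a local stand-in for a subset of `rocksdb::Status::Code` so that the sketch builds without RocksDB.

```cpp
#include <cstdint>
#include <string>

// Local stand-in for a subset of rocksdb::Status::Code (assumption: only the
// codes discussed in this issue are listed; values mirror RocksDB's enum).
enum class StorageCode : std::uint8_t {
    kOk = 0,
    kNotFound = 1,
    kCorruption = 2,
    kIOError = 5,
};

// Hypothetical helper: a readable name for log messages.
std::string code_name(StorageCode code)
{
    switch (code) {
    case StorageCode::kOk:
        return "kOk";
    case StorageCode::kNotFound:
        return "kNotFound";
    case StorageCode::kCorruption:
        return "kCorruption";
    case StorageCode::kIOError:
        return "kIOError";
    }
    return "unknown";
}

// Hypothetical helper: true when the code says the engine or disk is damaged,
// as opposed to an ordinary per-request outcome such as "not found".
bool is_engine_failure(StorageCode code)
{
    return code == StorageCode::kCorruption || code == StorageCode::kIOError;
}
```

With an `int` return value the caller cannot tell these cases apart without knowing the callee's private conventions; with a typed code the distinction is explicit at the call site.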

#1383 This patch fixes some minor issues, including:
- return early if `FLAGS_fd_disabled` is true in `remove_replica_on_meta_server`, to avoid running meaningless logic
- encapsulate a new function `wait_closing_replicas_finished()` in `replica_stub`
- mark some functions as `const` or `override`
- mark some parameters or variables as `const`
- add a missing lock
- fix some typos
- use short-circuit return style

#1383 A replica instance path is moved to a trash path, a.k.a. `<table_id>.<pid>.<timestamp>.err`, but the move may not complete if the replica server crashes. The path is then left behind while some of its files (e.g. `.init-info`) have already been moved. When the server is restarted after that, it crashes because of a check on the existence of those files. The check is not necessary: the server is able to trash the corrupted path and start normally, and the missing replica can be recovered from other servers automatically. This patch removes the check.

#1383 This patch deals with the `kCorruption` error returned from the storage engine for write requests. After the replica server gets such an error, it trashes the replica to a trash path `<app_id>.<pid>.pegasus.<timestamp>.err`. Note that the replica server may crash because the corrupted replica has been trashed and closed; this is left to be completed by other patches.
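To make the trash path naming concrete, here is a small sketch of renaming a replica directory to a `.err` sibling. `trash_replica_dir` is a hypothetical function, not the `move_to_err_path` mentioned earlier, and the timestamp format is an assumption, so it is passed in by the caller.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: renames e.g. "2.1.pegasus" to "2.1.pegasus.<timestamp>.err"
// in the same parent directory and returns the new path.
// Assumes replica_dir has no trailing separator.
// Throws fs::filesystem_error if the rename fails.
fs::path trash_replica_dir(const fs::path &replica_dir, const std::string &timestamp)
{
    fs::path err_dir = replica_dir;
    err_dir += "." + timestamp + ".err"; // appends to the last path component
    fs::rename(replica_dir, err_dir);
    return err_dir;
}
```

Because the directory no longer exists under its original name, a restarted server does not reopen the corrupted data in place.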

#1383 Commit 9303c3a introduced a flaky test; this patch tries to fix it. This patch also introduces some integration test utilities, which will be helpful for the following patches.

#1383 ReplicaServer doesn't handle the errors returned from the storage engine, so even if the storage engine is corrupted, the server doesn't recognize the situation and keeps running happily. However, the client always gets an error status. This situation does not recover automatically; the only remedy is to stop the server and move away the corrupted RocksDB directories manually. This patch handles the kCorruption error returned from the storage engine: it closes the replica and moves the directory to the ".err" trash path. The replica is then able to recover automatically (if RF > 1).
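The reaction described here can be sketched with a tiny in-memory model. Everything in it is hypothetical (`ReplicaModel`, `on_storage_result`, and the local `StorageCode` stand-in for `rocksdb::Status::Code`); the real close and trash steps involve far more than flipping two flags.

```cpp
// Local stand-in for the relevant rocksdb::Status::Code values (assumption).
enum class StorageCode { kOk, kNotFound, kCorruption };

// Hypothetical model of a replica's lifecycle state.
struct ReplicaModel {
    bool opened = true;   // replica is serving requests
    bool trashed = false; // stands for "directory moved to the .err trash path"
};

// Returns true if the replica stays open and can keep serving.
bool on_storage_result(ReplicaModel &replica, StorageCode code)
{
    if (code != StorageCode::kCorruption) {
        return true; // ordinary outcomes do not affect the replica
    }
    replica.opened = false; // stop serving from corrupted data
    replica.trashed = true; // ensure a restart cannot reopen the same data
    return false;
}
```

Re-creating the replica from other servers (when RF > 1) is outside this sketch.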

#1383 This is a minor refactor of class fs_manager, including:
- use `uint64_t` instead of `unsigned` in the fs_manager module
- remove useless "test" parameters

#1383 This patch moves some functions to fs_manager, since they are more reasonably the responsibility of class fs_manager than of class replica_stub.

#1383 In the prior implementation, every replica has its own "dir_node status". If a dir_node enters an abnormal status (e.g. insufficient space), the "dir_node status" of every replica that references it has to be updated; this is implemented in `replica_stub::update_disks_status`. This makes the "dir_node status" update path too long and somewhat duplicated. A new implementation is completed in #1473: every replica holds a reference to its dir_node directly, so a replica's "dir_node status" can be updated simply by updating the referenced dir_node's status once. Ahead of the new implementation, this patch submits a minor refactor that removes `replica_stub::update_disks_status` and the related functions and variables. Some unit tests have been updated as well.

#1383 This patch removes the replica's duplicated copies (`_disk_tag` and `_disk_status`) of the dir_node it is placed on, and introduces a dir_node pointer for the replica instead. Once the status of the dir_node is updated, the replica's status can be judged more conveniently. Some unit tests have been updated as well, including:
- change the test directory from `./` to `test_dir`
- simplify the logic of the replica_disk_test related tests
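A minimal sketch of the "pointer instead of copies" idea follows. The types are hypothetical stand-ins, not the Pegasus classes, and synchronization is deliberately left out.

```cpp
#include <memory>
#include <string>
#include <utility>

// Stand-in for the disk status values named in this issue (assumption).
enum class DiskStatus { kNormal, kSpaceInsufficient, kIoError };

// Hypothetical dir_node: one per data directory (disk).
struct DirNode {
    std::string tag;
    DiskStatus status = DiskStatus::kNormal;
};

// Hypothetical replica: it keeps a pointer to its dir_node instead of
// private copies of the tag and status. No locking is shown here.
class Replica
{
public:
    explicit Replica(std::shared_ptr<const DirNode> dir_node) : _dir_node(std::move(dir_node)) {}

    DiskStatus disk_status() const { return _dir_node->status; }
    const std::string &disk_tag() const { return _dir_node->tag; }

private:
    std::shared_ptr<const DirNode> _dir_node;
};
```

Updating `DirNode::status` once is then visible through every replica placed on that directory, with no per-replica update loop.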

#1383 A disk (a.k.a. dir_node in Pegasus) can change from NORMAL to SPACE_INSUFFICIENT or IO_ERROR, and it can also recover from SPACE_INSUFFICIENT to NORMAL. So we can keep all dir_nodes in the system, and only refuse to assign replicas to abnormal dir_nodes and refuse to perform write-type operations on them. This patch also updates some unit tests.
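The policy of keeping every directory but skipping abnormal ones might look roughly like this. The types and function names are hypothetical, and the "fewest replicas wins" tie-break is an assumption made only to keep the example complete.

```cpp
#include <optional>
#include <string>
#include <vector>

// Stand-in for the disk status values named in this issue (assumption).
enum class DiskStatus { kNormal, kSpaceInsufficient, kIoError };

// Hypothetical dir_node summary used for placement decisions.
struct DirNode {
    std::string tag;
    DiskStatus status;
    int replica_count;
};

// Picks a directory for a new replica: abnormal directories stay in the list
// but are never chosen. Among NORMAL ones, the least loaded wins (assumption).
std::optional<std::string> pick_dir_for_new_replica(const std::vector<DirNode> &dirs)
{
    const DirNode *best = nullptr;
    for (const auto &dir : dirs) {
        if (dir.status != DiskStatus::kNormal) {
            continue;
        }
        if (best == nullptr || dir.replica_count < best->replica_count) {
            best = &dir;
        }
    }
    if (best == nullptr) {
        return std::nullopt; // no healthy directory available
    }
    return best->tag;
}

// Write-type operations are only allowed on NORMAL directories.
bool allow_write(const DirNode &dir) { return dir.status == DiskStatus::kNormal; }
```

Because abnormal directories are kept rather than dropped, one that recovers from SPACE_INSUFFICIENT to NORMAL becomes eligible again without being re-registered.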

…#1522) #1383 This patch moves some functions to fs_manager, since they are more reasonably the responsibility of class fs_manager than of other classes, including:
- remove `fs_manager::for_each_dir_node`
- minimize some locks
- rename `fs_manager::is_dir_node_available` to `fs_manager::is_dir_node_exist`
- move the `get_disk_infos` code to class `fs_manager` and encapsulate it as a function
- move the `validate_migrate_op` code to class `fs_manager` and encapsulate it as a function
- move `disk_status_to_error_code` from replica_2pc.cpp to class `fs_manager`

…en encounter read/write IO error (#1473) #1383 This patch deals with IO errors propagated from the storage engine during read and write operations: the replica is closed and its dir_node is marked as disk_status::IO_ERROR. A dir_node marked as IO_ERROR will not be selected when new replicas are created, as implemented in patch 4dcbb1e. This patch also adds and updates some unit tests.

…ub' (apache#1384) Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1384/files apache#1383 This is a refactor patch in preparation for fixing apache#1383. It makes no functional changes and only includes refactors:
1. Moves the functions `load()`, `newr()` and `clear_on_failure()` from class replica to class replica_stub; the first two are renamed to `load_replica()` and `new_replica()`.
2. Encapsulates a new function `move_to_err_path`.
3. Some minor refactors, such as fixing typos.

…apache#1422) Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1422/files The unit test integration_test.cpp is not included, because the changes to the whole function test are too large to add conveniently; it will be added separately after everything has been merged. apache#1383 This patch deals with the `kCorruption` error returned from the storage engine for write requests. After the replica server gets such an error, it trashes the replica to a trash path `<app_id>.<pid>.pegasus.<timestamp>.err`. Note that the replica server may crash because the corrupted replica has been trashed and closed; this is left to be completed by other patches.

…pache#1456) Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1456/files apache#1383 This is a refactor patch in preparation for fixing apache#1383. It makes no functional changes and only includes refactors:
1. Moves the functions `load()`, `newr()` and `clear_on_failure()` from class replica to class replica_stub; the first two are renamed to `load_replica()` and `new_replica()`.
2. Encapsulates a new function `move_to_err_path`.
3. Some minor refactors, such as fixing typos.

…ns (apache#1447) Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1447/files Note: the unit test changes are large and are not merged this time. apache#1383 ReplicaServer doesn't handle the errors returned from the storage engine, so even if the storage engine is corrupted, the server doesn't recognize the situation and keeps running happily. However, the client always gets an error status. This situation does not recover automatically; the only remedy is to stop the server and move away the corrupted RocksDB directories manually. This patch handles the kCorruption error returned from the storage engine: it closes the replica and moves the directory to the ".err" trash path. The replica is then able to recover automatically (if RF > 1).

acelyc111 added the type/bug This issue reports a bug. label Mar 6, 2023

acelyc111 mentioned this issue Mar 6, 2023

refactor: Move some functions from 'replica' to 'replica_stub' #1384

Merged

This was referenced Mar 8, 2023

fix: Trash the unrecoverable rocksDB instance to .err path acelyc111/pegasus#71

Closed

refactor: use RocksDB Status code instead of meaningless int acelyc111/pegasus#72

Closed

acelyc111 mentioned this issue Mar 16, 2023

fix: Fault-tolerant storage engine errors for write operations #1399

Closed

This was referenced Mar 25, 2023

refactor: return Status::code instead of meanless int acelyc111/pegasus#74

Closed

refactor: return Status::code instead of meanless integer #1417

Merged

This was referenced Mar 29, 2023

fix: Fix the corruption RocksDB instance will be reused bug acelyc111/pegasus#75

Closed

fix: Fix the corruption RocksDB instance will be reused bug #1422

Merged

refactor: minor refactor on replica module #1423

Merged

acelyc111 mentioned this issue Apr 3, 2023

fix: log error but not crash if found an imcomplete replica path #1428

Merged

This was referenced Apr 11, 2023

fix: Fault-tolerant storage engine errors for read operations acelyc111/pegasus#76

Closed

fix(ut): fix a flaky test integration_test.write_corrupt_db #1442

Merged

This was referenced Apr 17, 2023

fix: Fault-tolerant storage engine errors for read operations #1447

Merged

Start replica server failed due to incomplete created RocksDB directory #1450

Closed

This was referenced May 8, 2023

refactor: fs manager acelyc111/pegasus#78

Closed

feat(replica): close the replica and mark the dir_node as IO_ERROR when encounter read/write IO error #1473

Merged

acelyc111 mentioned this issue May 16, 2023

refactor: minor refactor on class fs_manager #1476

Merged

empiredan pushed a commit that referenced this issue May 17, 2023

refactor: minor refactor on class fs_manager (#1476)

6b9fba3

#1383 This is a minor refactor of class fs_manager, including:
- use `uint64_t` instead of `unsigned` in the fs_manager module
- remove useless "test" parameters

This was referenced May 17, 2023

refactor: improve the single-responsibility of class fs_manager #1477

Merged

refactor: remove some useless code #1480

Merged

acelyc111 mentioned this issue May 25, 2023

refactor: update replica's dir_node status (part1) #1487

Merged

acelyc111 mentioned this issue May 26, 2023

refactor: update replica's dir_node status (part2) #1489

Merged

acelyc111 mentioned this issue Jun 4, 2023

feat: skip IO_ERROR dir_node when assign replicas #1512

Merged

acelyc111 mentioned this issue Jun 8, 2023

refactor: improve the single-responsibility of class fs_manager (2/n) #1522

Merged

acelyc111 closed this as completed Jun 26, 2023
