RocksDB corruption leads to endless failure #1383

Closed
acelyc111 opened this issue Mar 6, 2023 · 0 comments
Labels
type/bug This issue reports a bug.

acelyc111 commented Mar 6, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    Some RocksDB instances became corrupted for various reasons, such as disk driver I/O errors or data corruption.

  2. What did you expect to see?
    The cluster recovers from the error automatically.

  3. What did you see instead?
    The replica server closes the replica when it sees write errors, but then starts the replica again in the same place, where the data is still corrupted. The error then occurs again and again.

For read requests, the replica does not handle the error at all, so read requests fail again and again.

Write:
image

Read:
image

  4. What version of Pegasus are you using?
    1.12, 2.0
@acelyc111 acelyc111 added the type/bug This issue reports a bug. label Mar 6, 2023
empiredan pushed a commit that referenced this issue Mar 8, 2023
#1383

This is a refactoring patch in preparation for fixing #1383. It makes no functional changes and only includes the following refactors:
1. Moves the functions `load()`, `newr()` and `clear_on_failure()` from class replica to class replica_stub, and renames the first two to `load_replica()` and `new_replica()`.
2. Encapsulates a new function `move_to_err_path`.
3. Some minor refactors, such as fixing typos.
acelyc111 added a commit that referenced this issue Mar 27, 2023
#1383

To handle the return codes of read and write requests, it helps to first
refactor the return codes of the related functions.
This patch changes them to use rocksdb::Status::code instead of meaningless integers,
and leaves some TODOs to be dealt with in follow-up patches.
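
As a rough illustration of why a named status code is easier to handle than a bare integer, here is a minimal sketch. `StorageCode`, `is_fatal` and `describe` are hypothetical stand-ins invented for this example; they are not the real `rocksdb::Status` API or Pegasus code, and which codes count as fatal is an assumption based on the errors this issue discusses.

```cpp
#include <string>

// Hypothetical stand-in for a storage-engine status code. The names echo
// RocksDB conventions, but this enum is NOT rocksdb::Status::Code itself.
enum class StorageCode { kOk, kNotFound, kCorruption, kIOError };

// With a named code, callers can branch on meaning instead of a magic number.
// Assumption: corruption and IO errors are the ones that need special handling.
inline bool is_fatal(StorageCode code) {
    return code == StorageCode::kCorruption || code == StorageCode::kIOError;
}

// Human-readable name for logging.
inline std::string describe(StorageCode code) {
    switch (code) {
    case StorageCode::kOk:
        return "OK";
    case StorageCode::kNotFound:
        return "NotFound";
    case StorageCode::kCorruption:
        return "Corruption";
    case StorageCode::kIOError:
        return "IOError";
    }
    return "Unknown";
}
```

A caller can then write `if (is_fatal(code)) { ... }` rather than comparing against an undocumented integer.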
acelyc111 added a commit that referenced this issue Mar 31, 2023
#1383

This patch fixes some minor issues, including:
- return early if `FLAGS_fd_disabled` is true in `remove_replica_on_meta_server`
  to avoid running meaningless logic
- encapsulate a new function `wait_closing_replicas_finished()` in `replica_stub`
- mark some functions as `const` or `override`
- mark some parameters or variables as `const`
- add a missing lock
- fix some typos
- use short-circuit return style
acelyc111 added a commit that referenced this issue Apr 4, 2023
#1383

The replica instance path is moved to a trash path, a.k.a.
`<table_id>.<pid>.<timestamp>.err`, but the move may not complete if
the replica server crashes. The path is then left behind while some files
(e.g. `.init-info`) in it have already been moved. When the server is
restarted after that, it crashes because of a check on the
existence of those files. The check is not necessary: the server is
able to trash the corrupt path and start normally, and the missing
replica can be recovered from other servers automatically.

This patch removes the check.
empiredan pushed a commit that referenced this issue Apr 11, 2023
#1383

This patch deals with the `kCorruption` error returned from the storage
engine for write requests. After the replica server gets such an error,
it moves the replica to a trash path
`<app_id>.<pid>.pegasus.<timestamp>.err`.

Note that the replica server may crash because the corrupted replica
has been trashed and closed; this is left to be addressed by other
patches.
acelyc111 added a commit that referenced this issue Apr 17, 2023
#1383

Commit 9303c3a introduced a flaky test; this
patch tries to fix it.

This patch also introduces some integration test utilities, which will be helpful
for the following patches.
empiredan pushed a commit that referenced this issue Apr 24, 2023
#1383

ReplicaServer doesn't handle the errors returned from the storage engine, so
even if the storage engine is corrupted, the server doesn't recognize the
situation and keeps running happily. However, the client always gets an
error status.
This situation will not recover automatically; the only remedy is to stop the server
and move away the corrupted RocksDB directories manually.

This patch handles the kCorruption error returned from the storage engine, then
closes the replica and moves the directory to the ".err" trash path. The replica is
able to recover automatically (if RF > 1).
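
The close-and-trash flow described in this commit message can be sketched roughly as below. This is a minimal illustration assuming C++17 and `std::filesystem`; `WriteResult`, `ReplicaSketch` and `handle_storage_error` are hypothetical names for this example, not Pegasus functions, and the timestamp is passed in as a plain string because its real format is not stated here.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical result of a storage-engine write; not a RocksDB type.
enum class WriteResult { kOk, kCorruption };

// Hypothetical, heavily simplified replica: just a data directory and a flag.
struct ReplicaSketch {
    fs::path dir;
    bool open = true;
};

// On corruption: close the replica, then rename "<dir>" to
// "<dir>.<timestamp>.err" so a later restart does not reopen the bad data.
// Returns true only if the directory was actually moved to the trash path.
bool handle_storage_error(ReplicaSketch& replica,
                          WriteResult result,
                          const std::string& timestamp) {
    if (result != WriteResult::kCorruption) {
        return false;  // nothing to do for non-corruption results
    }
    replica.open = false;  // stop serving requests from the corrupted data

    fs::path trash = replica.dir;
    trash += "." + timestamp + ".err";

    std::error_code ec;
    fs::rename(replica.dir, trash, ec);  // non-throwing overload
    if (ec) {
        return false;  // replica stays closed, directory was not moved
    }
    replica.dir = trash;
    return true;
}
```

With a directory named `1.0.pegasus` and timestamp `100`, the sketch produces `1.0.pegasus.100.err`, which follows the `<app_id>.<pid>.pegasus.<timestamp>.err` pattern mentioned earlier in this thread. Re-creating the replica from other servers is outside the scope of the sketch.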
empiredan pushed a commit that referenced this issue May 17, 2023
#1383

This is a minor refactor of class fs_manager, including:
- use `uint64_t` instead of `unsigned` in fs_manager module.
- remove useless "test" parameters.
acelyc111 added a commit that referenced this issue May 25, 2023
#1383

This patch moves some functions to fs_manager, since they are more reasonably
responsibilities of class fs_manager than of class replica_stub.
empiredan pushed a commit that referenced this issue May 26, 2023
#1383

In the prior implementation, every replica has a "dir_node status". If a dir_node enters
an abnormal status (e.g. insufficient space), we have to update the "dir_node status"
referenced by every replica, which is implemented in `replica_stub::update_disks_status`.
This makes the "dir_node status" update path too long and somewhat duplicated.

A new implementation is completed in #1473:
every replica holds a reference to its dir_node directly, so a replica's
"dir_node status" is updated simply by updating the referenced dir_node's status once.

Ahead of the new implementation, this patch submits a minor refactor that removes
`replica_stub::update_disks_status` and the related functions and variables. Some unit
tests have been updated as well.
acelyc111 added a commit that referenced this issue May 31, 2023
#1383

This patch removes the replica's duplicated copies of the _disk_tag and _disk_status of the
dir_node it is placed on, and instead introduces a dir_node pointer for the replica. Once the
status of the dir_node is updated, the replica's status can be judged more conveniently.

Some unit tests have been updated as well, including:
- change the test directory from `./` to `test_dir`
- simplify the logic of replica_disk_test related test
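
The idea in this commit message, a replica pointing at its dir_node instead of copying the tag and status, can be sketched as follows. This is only an illustration: `DiskStatus`, `DirNode` and `ReplicaView` are hypothetical simplifications, and the use of `std::shared_ptr` is an assumption, since the message only says "a dir_node pointer".

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical mirror of the statuses named in this thread.
enum class DiskStatus { kNormal, kSpaceInsufficient, kIOError };

// Hypothetical, simplified dir_node: one data directory on one disk.
struct DirNode {
    std::string tag;
    DiskStatus status = DiskStatus::kNormal;
};

// The replica keeps a pointer to its dir_node rather than private copies of
// the tag and status, so there is a single place to update.
class ReplicaView {
public:
    explicit ReplicaView(std::shared_ptr<const DirNode> dir_node)
        : _dir_node(std::move(dir_node)) {}

    DiskStatus disk_status() const { return _dir_node->status; }
    const std::string& disk_tag() const { return _dir_node->tag; }

private:
    std::shared_ptr<const DirNode> _dir_node;
};
```

Updating the shared `DirNode` once is immediately visible to every `ReplicaView` that points at it, which is what removes the need for a per-replica status update pass.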
acelyc111 added a commit that referenced this issue Jun 8, 2023
#1383

A disk (a.k.a. node_dir in Pegasus) may change from NORMAL to SPACE_INSUFFICIENT or
IO_ERROR, and it may also recover from SPACE_INSUFFICIENT
to NORMAL. So we can keep all node_dirs in the system, and only refuse to assign
replicas to abnormal node_dirs and refuse to perform write operations on abnormal
node_dirs.

This patch also updates some unit tests.
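
The policy described in this commit message (keep every node_dir, but reject replica assignment and writes on abnormal ones) can be sketched like this. `NodeDirStatus`, `NodeDir` and the three functions are hypothetical names for illustration only; in particular, picking the first NORMAL directory is an assumption and not the selection strategy Pegasus actually uses.

```cpp
#include <string>
#include <vector>

// Hypothetical mirror of the three statuses named above.
enum class NodeDirStatus { kNormal, kSpaceInsufficient, kIOError };

// Hypothetical, simplified node_dir.
struct NodeDir {
    std::string tag;
    NodeDirStatus status;
};

// Only NORMAL directories may receive new replicas.
inline bool accepts_new_replica(const NodeDir& dir) {
    return dir.status == NodeDirStatus::kNormal;
}

// Only NORMAL directories may serve write operations.
inline bool accepts_write(const NodeDir& dir) {
    return dir.status == NodeDirStatus::kNormal;
}

// Abnormal directories stay in the list; they are merely skipped.
// Returns nullptr when no directory is currently NORMAL.
inline const NodeDir* pick_dir_for_new_replica(const std::vector<NodeDir>& dirs) {
    for (const auto& dir : dirs) {
        if (accepts_new_replica(dir)) {
            return &dir;
        }
    }
    return nullptr;
}
```

Because abnormal directories are never removed from the list, a directory that recovers from SPACE_INSUFFICIENT to NORMAL becomes eligible again without any re-registration step.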
acelyc111 added a commit that referenced this issue Jun 8, 2023
…#1522)

#1383

This patch moves some functions to fs_manager, since they are more reasonably
responsibilities of class fs_manager than of other classes. The changes include:
- remove `fs_manager::for_each_dir_node`
- minimize some locks
- rename `fs_manager::is_dir_node_available` to `fs_manager::is_dir_node_exist`
- move `get_disk_infos` code to class `fs_manager` and encapsulate it as a function
- move `validate_migrate_op` code to class `fs_manager` and encapsulate it as a function
- move `disk_status_to_error_code` from replica_2pc.cpp to class `fs_manager`
acelyc111 added a commit that referenced this issue Jun 14, 2023
…en encounter read/write IO error (#1473)

#1383

This patch deals with IO errors propagated from the storage engine during read and write
operations: the replica is closed and its dir_node is marked as disk_status::IO_ERROR.
A dir_node marked as IO_ERROR will not be selected when new replicas are created, as
implemented in patch 4dcbb1e.
This patch also adds and updates some unit tests.
GehaFearless pushed a commit to GehaFearless/incubator-pegasus that referenced this issue Feb 28, 2024
…ub' (apache#1384)

Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1384/files

apache#1383

This is a refactoring patch in preparation for fixing apache#1383. It makes no functional changes and only includes the following refactors:

1. Moves the functions `load()`, `newr()` and `clear_on_failure()` from class replica to class replica_stub, and renames the first two to `load_replica()` and `new_replica()`.
2. Encapsulates a new function `move_to_err_path`.
3. Some minor refactors, such as fixing typos.
GehaFearless pushed a commit to GehaFearless/incubator-pegasus that referenced this issue Feb 28, 2024
…apache#1422)

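Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1422/files

Note: the unit test integration_test.cpp is not added here, because the changes to the whole function test are too large to include conveniently.
It will be added separately after everything else has been merged.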

apache#1383

This patch deals with the `kCorruption` error returned from the storage engine for
write requests. After the replica server gets such an error, it moves the
replica to a trash path `<app_id>.<pid>.pegasus.<timestamp>.err`.

Note that the replica server may crash because the corrupted replica has been
trashed and closed; this is left to be addressed by other patches.
GehaFearless pushed a commit to GehaFearless/incubator-pegasus that referenced this issue Feb 28, 2024
…pache#1456)

Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1456/files

apache#1383

This is a refactoring patch in preparation for fixing apache#1383. It makes no functional
changes and only includes the following refactors:
1. Moves the functions `load()`, `newr()` and `clear_on_failure()` from class replica
to class replica_stub, and renames the first two to `load_replica()`
and `new_replica()`.
2. Encapsulates a new function `move_to_err_path`.
3. Some minor refactors, such as fixing typos.
GehaFearless pushed a commit to GehaFearless/incubator-pegasus that referenced this issue Feb 28, 2024
…ns (apache#1447)

Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1447/files

Note: the unit test changes are fairly large and are not merged in this commit.

apache#1383

ReplicaServer doesn't handle the errors returned from the storage engine, so
even if the storage engine is corrupted, the server doesn't recognize the
situation and keeps running happily. However, the client always gets an
error status.
This situation will not recover automatically; the only remedy is to stop the server
and move away the corrupted RocksDB directories manually.

This patch handles the kCorruption error returned from the storage engine, then
closes the replica and moves the directory to the ".err" trash path. The replica is
able to recover automatically (if RF > 1).