recover-disk draft #3496

Merged · 16 commits · Feb 10, 2020
11 changes: 11 additions & 0 deletions docs/content/latest/troubleshoot/nodes/_index.md
@@ -49,4 +49,15 @@ menu:
</div>
</a>
</div>
<div class="col-12 col-md-6 col-lg-12 col-xl-6">
  <a class="section-link glyphicon-floppy-disk" href="recover-disk">
<div class="head">
<img class="icon" src="/images/section_icons/troubleshoot/troubleshoot.png" aria-hidden="true" />
<div class="title">Disk failure</div>
</div>
<div class="body">
How to recover a YB-TServer from disk failure
</div>
</a>
</div>
</div>
35 changes: 35 additions & 0 deletions docs/content/latest/troubleshoot/nodes/recover-disk.md
@@ -0,0 +1,35 @@
---
title: Disk failure
linkTitle: Disk failure
description: Recover failing disk
aliases:
- /troubleshoot/nodes/disk-failure/
menu:
latest:
parent: troubleshoot-nodes
weight: 849
isTocNested: true
showAsideToc: true
---

YugabyteDB can be configured to use multiple storage disks by setting the [`--fs_data_dirs`](../../reference/configuration/yb-tserver.md) configuration option.
Using multiple disks introduces the possibility of disk failure and the need for a recovery procedure.
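For example, a YB-TServer can be pointed at two data directories on separate disks. This is a sketch only; the mount points and master address below are illustrative and will differ per deployment:

```shell
# Start a YB-TServer with two data directories, each on its own disk
# (paths and master address are illustrative).
./bin/yb-tserver \
  --tserver_master_addrs 127.0.0.1:7100 \
  --fs_data_dirs "/mnt/disk0,/mnt/disk1"
```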

## Cluster replication recovery
The `yb-tserver` service automatically detects disk failures and attempts to spread the data from the failed disk to other healthy nodes in the cluster.

In a single-zone setup with a replication factor (RF) of `3`, starting with four or more nodes leaves at least three healthy nodes after one fails.
In that case, re-replication starts automatically if a YB-TServer or disk stays down for 10 minutes.

In a multi-zone setup with a replication factor (RF) of `3`, YugabyteDB tries to keep one copy of the data in each zone.
In this case, automatic re-replication requires each zone to have at least two YB-TServers, so that if one fails,
its data can be re-replicated to the other. This implies a cluster of at least six nodes.
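One way to check which tablet servers the masters consider alive while re-replication is in progress is `yb-admin`. This is a sketch; the master addresses are illustrative:

```shell
# List all tablet servers and their liveness as seen by the masters
# (master addresses are illustrative).
./bin/yb-admin \
  --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
  list_all_tablet_servers
```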

## Failed disk replacement
The steps to replace a failed disk are:

1. Stop the `yb-tserver` service on the affected node.
2. Replace the failed disks.
3. Restart the `yb-tserver` service.

On restart, the YB-TServer will see the new empty disk and start replicating tablets from other nodes.
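The steps above can be sketched as follows. This assumes a systemd-managed installation; the service name, device name, mount point, and filesystem are all illustrative and will vary by deployment:

```shell
# 1. Stop the yb-tserver service (service name is illustrative).
sudo systemctl stop yb-tserver

# 2. Replace the failed disk: unmount, physically swap the drive,
#    create a filesystem, and remount (device/mount point illustrative).
sudo umount /mnt/disk1
sudo mkfs.xfs /dev/sdc
sudo mount /dev/sdc /mnt/disk1
sudo chown yugabyte:yugabyte /mnt/disk1

# 3. Restart the service; the empty disk is detected on startup.
sudo systemctl start yb-tserver
```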