Add broken index backup document #2096

Merged 1 commit on Jul 3, 2023
62 changes: 62 additions & 0 deletions docs/user-guides/backup-configuration.md
@@ -247,3 +247,65 @@ Agent Sidecar tries to get the backup file from S3, unpacks it, and starts index

When both the PV and S3 are used, restoration prioritizes the backup file on the PV.
If the backup file does not exist on the PV, the backup file will be retrieved from S3 via the Vald Agent Sidecar and restored.

## Broken index backup

If the backup file of an index is corrupted for some reason, the Vald Agent fails to load the index file, and the index is then identified as a broken index.

> Possible causes of a broken index include an agent crash during a save index operation and partial storage corruption.

When an index is broken, the default behavior is to discard it and continue running the Pod. This saves storage space, but you may sometimes need to inspect the contents of a broken index later. When the `broken index backup` feature is enabled, the agent backs up the broken index instead of deleting it before the Pod continues running, which lets you investigate the cause of the corruption afterwards.

### Settings

To enable this feature, set `agent.ngt.broken_index_history_limit` to 1 or more (default: 0). The system keeps backups of broken indexes for up to the number of generations specified by this value. When a new backup would exceed the limit, the system deletes the oldest backup.

```
agent:
  ngt:
    ...
    broken_index_history_limit: 3
    ...
```
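The retention rule described above can be modeled in a few lines. This is an illustrative Python sketch, not Vald's implementation: the function name `prune_generations` is hypothetical, and it assumes that generation identifiers sort from oldest to newest.

```python
def prune_generations(generations, history_limit):
    """Return (kept, deleted) after applying the history limit.

    generations: identifiers assumed to sort from oldest to newest.
    history_limit: the value of broken_index_history_limit; 0 keeps nothing.
    """
    ordered = sorted(generations)
    if history_limit <= 0:
        # Feature disabled: no broken index is retained.
        return [], ordered
    # Drop the oldest entries until at most history_limit remain.
    excess = max(0, len(ordered) - history_limit)
    return ordered[excess:], ordered[:excess]


kept, deleted = prune_generations([30, 10, 40, 20], history_limit=3)
print(kept)     # [20, 30, 40]
print(deleted)  # [10]
```

With a limit of 3 and four generations, only the oldest one is removed.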

### Backup location

The backup is stored under `${index_path}/broken`. Each directory is named after the Unix time, in nanoseconds, at which the agent attempted to read the broken index.

```
${index_path}/
  origin/
    ngt-meta.kvsdb
    ngt-timestamp.kvsdb
    metadata.json
    prf
    grp
    tre
    obj
  broken/
    1611271735938403848/
      ngt-meta.kvsdb
      ...
    1611271749583028942/
      ngt-meta.kvsdb
      ...
    1611271759849304593/
      ngt-meta.kvsdb
      ...
```
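When inspecting backups by hand, it can help to turn the directory names into readable timestamps. The following Python sketch assumes the layout shown above; the helper names `parse_generation` and `list_broken_backups` are hypothetical and not part of Vald.

```python
from datetime import datetime, timezone
from pathlib import Path


def parse_generation(name):
    """Convert a Unix-nanosecond directory name to (UTC datetime, leftover ns)."""
    # Integer division avoids the precision loss of float seconds.
    seconds, nanos = divmod(int(name), 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos


def list_broken_backups(index_path):
    """Return [(name, datetime)] for generations under <index_path>/broken, oldest first."""
    broken = Path(index_path) / "broken"
    if not broken.is_dir():
        return []
    names = [p.name for p in broken.iterdir() if p.is_dir() and p.name.isdigit()]
    return [(n, parse_generation(n)[0]) for n in sorted(names, key=int)]


when, nanos = parse_generation("1611271735938403848")
print(when.isoformat(), nanos)  # 2021-01-21T23:28:55+00:00 938403848
```

Sorting by the integer value of the name, rather than as a string, keeps the order correct even if names ever differ in length.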

### Restore

#### CoW: disabled

If an index file exists under `${index_path}/origin`, restore is attempted based on that index file. If the restore fails, the index file is backed up as a broken index, and the agent starts in its initial state.

#### CoW: enabled

If an index file exists under `${index_path}/origin`, restore is attempted based on that index file. If the restore fails, `${index_path}/origin` is backed up as a broken index at that point. Then, restore is attempted based on the index file in `${index_path}/backup` (the index file that is one generation older). If that restore also fails, the agent starts in its initial state.
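The two restore paths can be summarized as a small decision function. This is a Python sketch of the flow described above, not the agent's actual code: `try_load` and `backup_broken` are hypothetical callbacks standing in for the agent's index loader and its broken-index backup step, and the sketch assumes an index file exists under `${index_path}/origin`.

```python
def restore(index_path, cow_enabled, try_load, backup_broken):
    """Model of the restore flow; returns the directory loaded from, or None.

    try_load(path) -> bool and backup_broken(path) are hypothetical callbacks.
    None means the agent starts in its initial state.
    """
    origin = f"{index_path}/origin"
    if try_load(origin):
        return origin
    # Loading origin failed: keep it as a broken index generation.
    backup_broken(origin)
    if cow_enabled:
        # With CoW, fall back to the index that is one generation older.
        older = f"{index_path}/backup"
        if try_load(older):
            return older
    # Nothing could be restored.
    return None


backed_up = []
result = restore("/data", True, lambda p: p.endswith("/backup"), backed_up.append)
print(result, backed_up)  # /data/backup ['/data/origin']
```

In this example the origin index fails to load, so it is backed up and the older index under `backup` is used instead.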

### Metrics

The number of broken index generations currently stored is exposed as the metric `agent_core_ngt_broken_index_store_count`.

Reference: [vald/k8s/metrics/grafana/dashboards/01-vald-agent.yaml](https://github.com/vdaas/vald/blob/main/k8s/metrics/grafana/dashboards/01-vald-agent.yaml)