Documentation/op-guide: add "membership mis-reconfiguration", explain "--force-new-cluster" #9177

Closed
22 changes: 22 additions & 0 deletions Documentation/op-guide/recovery.md
@@ -61,3 +61,25 @@ $ etcd \
```

Now the restored etcd cluster should be available and serving the keyspace given by the snapshot.

## Restoring a cluster from membership mis-reconfiguration

Previously, etcd would panic on [membership mis-reconfiguration with wrong URLs](https://github.com/coreos/etcd/issues/9173). v3.2.15 and v3.3.0+ return an [error early on the client side](https://github.com/coreos/etcd/pull/9174), before the etcd server panics.

To fix such a misconfiguration while keeping the original data, the `--force-new-cluster` flag can be used to overwrite the cluster configuration. Be CAUTIOUS when using this flag, because etcd will panic if other members of the previous cluster are still alive. Follow the instructions below.
Contributor

I do not think we should promote the `--force-new-cluster` flag anymore. We should instead promote restoring from a snapshot.

Contributor Author

Ah, that was why it's only documented in v2 docs. Closing.


1. Stop all etcd processes in the cluster.
2. Choose one member to restore data from.
3. Create a separate copy of original data/WAL directories, just in case.
4. Start etcd with the `--force-new-cluster` option, pointing to the original data/WAL directories. This initializes a new, single-member cluster with the default advertised peer URLs (or the given URLs) while preserving the entire contents of the etcd data store. That is, it commits configuration changes that forcibly remove all previous cluster members and add the node itself to a single-node cluster.
```bash
etcd \
--data-dir=${PREV_DATA_DIR} \
--wal-dir=${PREV_WAL_DIR} \
--force-new-cluster
```
5. Verify that this single node is available and serving the original data.
6. Remove the data/WAL directories on the other members.
7. Add those members back with the `etcdctl member add` command.
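
Steps 3, 5, and 7 could be scripted roughly as follows. This is only a sketch: the helper names (`backup_dir`, `verify_cmds`, `member_add_cmd`), the paths, the member name, and the URLs are hypothetical, and the helpers for steps 5 and 7 only print the `etcdctl` commands (v3 API assumed) instead of running them, so they can be reviewed before use.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for steps 3, 5 and 7; names, paths and URLs are
# illustrative and not part of etcd.

# Step 3: copy a data or WAL directory to a backup location before
# starting etcd with --force-new-cluster. Refuses to overwrite a backup.
backup_dir() {
  local src="$1" dest="$2"
  if [ ! -d "$src" ]; then
    echo "missing directory: $src" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "backup already exists: $dest" >&2
    return 1
  fi
  cp -a "$src" "$dest"
}

# Step 5: print the commands that check the single node is healthy and
# that it is the only member left in the cluster.
verify_cmds() {
  local endpoint="$1"
  printf 'ETCDCTL_API=3 etcdctl --endpoints=%s endpoint health\n' "$endpoint"
  printf 'ETCDCTL_API=3 etcdctl --endpoints=%s member list\n' "$endpoint"
}

# Step 7: print the command that adds one former member back.
member_add_cmd() {
  local name="$1" peer_url="$2"
  printf 'ETCDCTL_API=3 etcdctl member add %s --peer-urls=%s\n' "$name" "$peer_url"
}
```

For example, `backup_dir "${PREV_DATA_DIR}" "${PREV_DATA_DIR}.bak"` makes the safety copy, and `member_add_cmd m2 http://10.0.0.2:2380` prints the command for re-adding a member assumed to be named `m2`.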

Optionally, in step 4, start `etcd` with `--force-new-cluster --snapshot-count 1` and verify that the membership configuration is persisted on disk. Then shut the member down and restart it without the `--force-new-cluster` flag.
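
The optional two-phase start might look like the sketch below. The `etcd_cmd` helper is hypothetical and only prints the command line for each phase. It also assumes that the second start drops `--snapshot-count 1` along with `--force-new-cluster`; the text above only requires removing `--force-new-cluster`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the etcd command line for step 4.
# Pass "force" as the third argument for the first (forced) start;
# omit it for the later, normal restart.
etcd_cmd() {
  local data_dir="$1" wal_dir="$2" mode="${3:-normal}"
  local cmd="etcd --data-dir=${data_dir} --wal-dir=${wal_dir}"
  if [ "$mode" = "force" ]; then
    # First start: rewrite membership and snapshot after every entry.
    cmd="${cmd} --force-new-cluster --snapshot-count 1"
  fi
  printf '%s\n' "$cmd"
}
```

Calling `etcd_cmd "${PREV_DATA_DIR}" "${PREV_WAL_DIR}" force` prints the first command; calling it again without `force` prints the command for the restart.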