
Recovery


This page describes how to recover a GD2 cluster from a complete cluster shutdown.

After a complete cluster shutdown, GD2 cannot start up because no etcd servers are available to connect to. To allow startup, the last known etcd servers must be identified and brought up first.

The procedure below is fairly involved right now. We will try to simplify it in future GD2 releases.

Recovery steps

Find last known etcd servers

GD2 saves the etcd information in the store config file. This file is present at DATADIR/glusterd2/store.toml. The usual paths are /var/lib/glusterd2/store.toml or /usr/local/var/lib/glusterd2/store.toml.

An example store.toml file is below.

CAFile = ""
CURLs = ["http://0.0.0.0:2379"]
CertFile = ""
ClntCertFile = ""
ClntKeyFile = ""
ConfFile = "/var/lib/glusterd2/store.toml"
Dir = "/var/lib/glusterd2/store"
Endpoints = ["http://172.17.0.3:2379","http://172.17.0.4:2379","http://172.17.0.4:2379"]
KeyFile = ""
NoEmbed = false
PURLs = ["http://0.0.0.0:2380"]
UseTLS = false

On any of the GD2 peers in the cluster, identify the last known etcd servers from the store.toml file. The last known servers are saved as Endpoints in store.toml.

In the above example, the last known etcd servers are http://172.17.0.3:2379 and http://172.17.0.4:2379, as listed under Endpoints. Keep note of this list.
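For example, the saved endpoint list can be read directly from the config file; this assumes the default /var/lib/glusterd2 path:

# Print the last known etcd servers recorded by GD2
grep Endpoints /var/lib/glusterd2/store.toml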

Restore one of the etcd servers

On one of the last known etcd servers, do the following to import the old data.

NOTE: All of these steps must be performed on only one of the peers that was an etcd server.

  • First take a backup of the old store data under DATADIR/glusterd2/store, and create an empty store directory.
mv /var/lib/glusterd2/store{,.bak}
mkdir /var/lib/glusterd2/store
  • Recreate the etcd data directory DATADIR/glusterd2/store/etcd.data in single-node mode. This requires the etcdctl tool to be available. Get the NODEID from DATADIR/glusterd2/uuid.toml.
ETCDCTL_API=3 etcdctl snapshot restore ../store.bak/etcd.data/member/snap/db --name <NODEID> --initial-cluster <NODEID>=http://<PUBLIC_IP>:2380 --initial-advertise-peer-urls http://<PUBLIC_IP>:2380 --data-dir /var/lib/glusterd2/store/etcd.data --skip-hash-check
  • Now, start GD2 in single-node mode. To do this, move the store.toml file aside before starting GD2.
mv /var/lib/glusterd2/store.toml{,.bak}
glusterd2
# or systemctl start glusterd2

GD2 should now begin running on this node, with the data imported from the snapshot.
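For reference, the restore sequence above can be condensed into the following sketch. It assumes DATADIR is /var/lib (so the store lives under /var/lib/glusterd2), and NODEID and PUBLIC_IP are placeholders you must fill in for the node being restored:

# Fill these in for this node (NODEID comes from uuid.toml, PUBLIC_IP is this node's address)
NODEID="<NODEID from /var/lib/glusterd2/uuid.toml>"
PUBLIC_IP="<this node's public IP>"

# Back up the old store and create an empty one
mv /var/lib/glusterd2/store{,.bak}
mkdir /var/lib/glusterd2/store

# Restore the etcd snapshot into a single-node data dir
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/glusterd2/store.bak/etcd.data/member/snap/db \
    --name "$NODEID" \
    --initial-cluster "$NODEID=http://$PUBLIC_IP:2380" \
    --initial-advertise-peer-urls "http://$PUBLIC_IP:2380" \
    --data-dir /var/lib/glusterd2/store/etcd.data \
    --skip-hash-check

# Move store.toml aside and start GD2 in single-node mode
mv /var/lib/glusterd2/store.toml{,.bak}
glusterd2   # or: systemctl start glusterd2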

Start the other GD2 peers

On every remaining GD2 peer, do the following.

  • Edit store.toml and set Endpoints = ["http://<PUBLIC_IP of restored node>:2379"].
  • Create a backup of the DATADIR/glusterd2/store directory if present.
mv /var/lib/glusterd2/store{,.bak}
  • Start GD2
glusterd2
# or systemctl start glusterd2
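A condensed per-peer sketch of these steps, assuming the default /var/lib/glusterd2 path and using sed to rewrite the Endpoints line (RESTORED_IP is a placeholder for the restored node's address):

# Point this peer at the restored etcd server
RESTORED_IP="<PUBLIC_IP of restored node>"
sed -i "s|^Endpoints = .*|Endpoints = [\"http://$RESTORED_IP:2379\"]|" /var/lib/glusterd2/store.toml

# Back up the old store directory if it exists, then start GD2
[ -d /var/lib/glusterd2/store ] && mv /var/lib/glusterd2/store{,.bak}
glusterd2   # or: systemctl start glusterd2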

NOTE: Bring up the remaining nodes one by one, with a delay between each. This allows elasticetcd to safely detect the new servers as they come up and select etcd servers correctly.
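For example, if the remaining peers are reachable over SSH from one host, a staggered start could look like the sketch below (the host names and the 30-second delay are only illustrative):

# Start the remaining peers one at a time, pausing between each
# so elasticetcd can settle on the new etcd server set.
for peer in gd2-node2 gd2-node3 gd2-node4; do   # hypothetical host names
    ssh "$peer" systemctl start glusterd2
    sleep 30                                    # arbitrary delay; adjust as needed
done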