database file does not match with snapshot #7834

hasbro17 · 2017-04-28T00:07:31Z

Overview

If I create a 1-node cluster from a snapshot-restored data-dir and add a 2nd member, restarting the 2nd member causes it to exit with an error like this:

2017-04-27 16:10:25.544472 C | etcdmain: database file (bin/infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

Steps to reproduce the issue:

Prepare a simple snapshot snapshot.db from a single member etcd cluster:
Start a temporary cluster on localhost:

$ etcd
2017-04-27 16:37:58.667981 I | etcdmain: etcd Version: 3.1.6
2017-04-27 16:37:58.668081 I | etcdmain: Git SHA: e5b7ee2
2017-04-27 16:37:58.668087 I | etcdmain: Go Version: go1.8
2017-04-27 16:37:58.668089 I | etcdmain: Go OS/Arch: darwin/amd64
. . .

Put some data in the cluster and create a snapshot:

$ ETCDCTL_API=3 etcdctl put foo1 bar1
OK
$ ETCDCTL_API=3 etcdctl snapshot save snapshot.db

Kill the existing cluster once we have the snapshot.

Now restore the data directory infra1.etcd for the 1st member of our new cluster from the snapshot.db file:

ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --name infra1 --initial-cluster infra1=http://127.0.0.1:2380 --initial-advertise-peer-urls http://127.0.0.1:2380

Start the 1st member infra1 of the new cluster using the restored data-dir infra1.etcd:

etcd --data-dir=infra1.etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:2380 --initial-advertise-peer-urls http://127.0.0.1:2380 --initial-cluster 'infra1=http://127.0.0.1:2380' --initial-cluster-state new

Add a 2nd member infra2 to the cluster:

ETCDCTL_API=3 etcdctl member add infra2 --peer-urls=http://127.0.0.1:22380

Start the 2nd member:

etcd --data-dir=infra2.etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster 'infra1=http://127.0.0.1:2380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing

Kill and restart the etcd server for the 2nd member infra2:

# ^C to stop the 2nd member

etcd --data-dir=infra2.etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster 'infra1=http://127.0.0.1:2380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing

The 2nd member should exit with the following logs:

2017-04-27 17:01:48.689084 I | etcdmain: etcd Version: 3.1.6
2017-04-27 17:01:48.689187 I | etcdmain: Git SHA: e5b7ee2
2017-04-27 17:01:48.689191 I | etcdmain: Go Version: go1.8
2017-04-27 17:01:48.689195 I | etcdmain: Go OS/Arch: darwin/amd64
2017-04-27 17:01:48.689199 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-04-27 17:01:48.689204 N | etcdmain: failed to detect default host (default host not supported on darwin_amd64)
2017-04-27 17:01:48.689263 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-04-27 17:01:48.689344 I | embed: listening for peers on http://127.0.0.1:22380
2017-04-27 17:01:48.689408 I | embed: listening for client requests on 127.0.0.1:22379
2017-04-27 17:01:48.693297 I | etcdserver: recovered store from snapshot at index 5
2017-04-27 17:01:48.693316 I | etcdserver: name = infra2
2017-04-27 17:01:48.693320 I | etcdserver: data dir = infra2.etcd
2017-04-27 17:01:48.693325 I | etcdserver: member dir = infra2.etcd/member
2017-04-27 17:01:48.693328 I | etcdserver: heartbeat = 100ms
2017-04-27 17:01:48.693332 I | etcdserver: election = 1000ms
2017-04-27 17:01:48.693335 I | etcdserver: snapshot count = 10000
2017-04-27 17:01:48.693345 I | etcdserver: advertise client URLs = http://127.0.0.1:22379
2017-04-27 17:01:48.693827 I | etcdserver: restarting member b5ebdc02693bef64 in cluster 7bdc11851051b492 at commit index 7
2017-04-27 17:01:48.693915 I | raft: b5ebdc02693bef64 became follower at term 18
2017-04-27 17:01:48.693931 I | raft: newRaft b5ebdc02693bef64 [peers: [b5ebdc02693bef64,bf9071f4639c75cc], term: 18, commit: 7, applied: 5, lastindex: 7, lastterm: 18]
2017-04-27 17:01:48.694053 I | etcdserver/api: enabled capabilities for version 3.1
2017-04-27 17:01:48.694069 I | etcdserver/membership: added member bf9071f4639c75cc [http://127.0.0.1:2380] to cluster 7bdc11851051b492 from store
2017-04-27 17:01:48.694073 I | etcdserver/membership: added member b5ebdc02693bef64 [http://127.0.0.1:22380] to cluster 7bdc11851051b492 from store
2017-04-27 17:01:48.694077 I | etcdserver/membership: set the cluster version to 3.1 from store
2017-04-27 17:01:48.700019 C | etcdmain: database file (infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

The snap directories for both data directories look like so:

$ ls infra1.etcd/member/snap/
0000000000000001-0000000000000001.snap  db
$ ls infra2.etcd/member/snap/
0000000000000002-0000000000000005.snap  db
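
For reading those file names, here is a minimal sketch. It assumes the snap files follow a `<term>-<index>.snap` naming scheme with both fields in 16-digit hexadecimal, which is consistent with the "recovered store from snapshot at index 5" line in the log above. `parseSnapName` is a hypothetical helper, not an etcd API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSnapName splits a snapshot file name of the assumed form
// "<term>-<index>.snap", where both fields are hexadecimal numbers.
// It is an illustrative helper, not part of etcd.
func parseSnapName(name string) (term, index uint64, err error) {
	base := strings.TrimSuffix(name, ".snap")
	if base == name {
		return 0, 0, fmt.Errorf("%q does not end in .snap", name)
	}
	parts := strings.SplitN(base, "-", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("%q is not of the form term-index.snap", name)
	}
	if term, err = strconv.ParseUint(parts[0], 16, 64); err != nil {
		return 0, 0, err
	}
	if index, err = strconv.ParseUint(parts[1], 16, 64); err != nil {
		return 0, 0, err
	}
	return term, index, nil
}

func main() {
	names := []string{
		"0000000000000001-0000000000000001.snap", // from infra1.etcd/member/snap/
		"0000000000000002-0000000000000005.snap", // from infra2.etcd/member/snap/
	}
	for _, name := range names {
		term, index, err := parseSnapName(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: term=%d index=%d\n", name, term, index)
	}
}
```

Under that assumption, the infra2 snapshot sits at index 5, which is the index the db file (index 4) is being compared against in the fatal log line.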


fanminshi · 2017-04-29T00:17:58Z

Reproducing these steps at commit c407e09 appears to make the infra2 node hang, or at least not serve any requests, instead of exiting with:

2017-04-27 16:10:25.544472 C | etcdmain: database file (bin/infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

fanminshi · 2017-05-01T21:13:16Z

Commit 0054e7e causes the error etcdmain: database file (bin/infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

If 'StartEtcd' returns before starting the gRPC server (e.g. mismatched snapshot, misconfiguration), receiving from grpcServerC blocks forever. This patch just closes the channel so that Close does not block on grpcServerC and proceeds to the next stop operations. This was masking the issue in etcd-io#7834.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
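
The blocking behavior described in that commit message can be sketched in isolation. This is a toy model, not the embed package: the `server` type, the `closeServer` function, the `string` channel element type, and the timeout are all hypothetical; only the name `grpcServerC` comes from the message above. The timeout exists purely so the demo terminates where the real receive would hang.

```go
package main

import (
	"fmt"
	"time"
)

// closeTimeout only exists so the demo can report "would block forever"
// without actually hanging. It is not something etcd uses.
const closeTimeout = 50 * time.Millisecond

// server is a hypothetical stand-in for the embedded etcd object.
type server struct {
	grpcServerC chan string // would carry the gRPC server once it has started
}

// closeServer mimics a Close that first receives from grpcServerC.
// It reports whether the receive completed before the timeout.
func closeServer(s *server) bool {
	select {
	case <-s.grpcServerC:
		return true
	case <-time.After(closeTimeout):
		return false
	}
}

func main() {
	// Startup failed before anything was sent, and nobody closed the
	// channel: a plain receive here would block forever.
	stuck := &server{grpcServerC: make(chan string)}
	fmt.Println("receive on open channel returned:", closeServer(stuck))

	// Closing the channel on the failure path makes the receive return
	// immediately with the zero value, so Close can carry on.
	fixed := &server{grpcServerC: make(chan string)}
	close(fixed.grpcServerC)
	fmt.Println("receive on closed channel returned:", closeServer(fixed))
}
```

This would also explain the hang seen at c407e09: with Close stuck on the receive, the process never got as far as reporting the database/snapshot mismatch.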

Previously, apply() did not set consistIndex for entries of type EntryConfChange. This caused a misalignment between consistIndex and the applied index: an EntryConfChange entry advanced the applied index but not consistIndex.

Suppose that addMember() is called and the leader applies that change:

1. The applied index and consistIndex are now misaligned.
2. A new follower node joins.
3. The leader sends a snapshot to the follower; the snapshot metadata index is the applied index.
4. The follower saves the snapshot and the database (which includes consistIndex) from the leader.
5. On restart, the follower loads the snapshot and the database.
6. The follower compares the snapshot metadata index (same as the applied index) with the database consistIndex, finds that they do not match, and panics.

FIXES etcd-io#7834
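
The misalignment described in that commit message can be modeled in a few lines. This is a toy simulation, not etcd's code: the `node`, `entry`, and `checkOnRestart` names are hypothetical, the entry layout (four normal entries followed by one conf change) is invented so that the numbers match the "index 4" vs "index 5" in the log, and the strict inequality check is a simplification of whatever comparison the server really performs.

```go
package main

import "fmt"

type entryType int

const (
	entryNormal entryType = iota
	entryConfChange
)

type entry struct {
	index uint64
	typ   entryType
}

// node is a toy model: appliedIndex is what a raft snapshot records,
// consistIndex is what the backend database records.
type node struct {
	appliedIndex uint64
	consistIndex uint64
}

// apply advances appliedIndex for every entry. With fixed == false it
// mirrors the old behavior and skips consistIndex for conf changes;
// with fixed == true it sets consistIndex for any entry type.
func (n *node) apply(ents []entry, fixed bool) {
	for _, e := range ents {
		if e.typ == entryNormal || fixed {
			n.consistIndex = e.index
		}
		n.appliedIndex = e.index
	}
}

// checkOnRestart is a simplified version of the startup comparison
// between the snapshot metadata index and the database index.
func checkOnRestart(snapshotIndex, dbIndex uint64) error {
	if dbIndex != snapshotIndex {
		return fmt.Errorf("database file (index %d) does not match with snapshot (index %d)",
			dbIndex, snapshotIndex)
	}
	return nil
}

func main() {
	// Invented log: the member-add conf change lands at index 5.
	ents := []entry{
		{1, entryNormal}, {2, entryNormal}, {3, entryNormal},
		{4, entryNormal}, {5, entryConfChange},
	}
	for _, fixed := range []bool{false, true} {
		leader := &node{}
		leader.apply(ents, fixed)
		// The follower receives a snapshot at appliedIndex and a database
		// carrying consistIndex, then compares the two on restart.
		err := checkOnRestart(leader.appliedIndex, leader.consistIndex)
		fmt.Printf("fixed=%v applied=%d consist=%d restart error: %v\n",
			fixed, leader.appliedIndex, leader.consistIndex, err)
	}
}
```

In this model the old behavior leaves the database one index behind the snapshot after the conf change, and setting consistIndex for every entry type removes the gap.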


This was referenced Apr 28, 2017

When killed. Etcdv3 node can require manual intervention to bring back #7628

Closed

Upgrade restarts all pods after a restore happens coreos/etcd-operator#1008

Closed

gyuho added the type/bug label Apr 28, 2017

fanminshi self-assigned this Apr 28, 2017

gyuho mentioned this issue May 1, 2017

embed: fix blocking Close before gRPC server start #7848

Merged

fanminshi mentioned this issue May 2, 2017

etcdserver: apply() sets consistIndex for any entry type #7856

Merged

fanminshi closed this as completed in #7856 May 3, 2017

xiang90 mentioned this issue Jun 15, 2017

KV index is smaller than snapshot index #8103

Closed
