database file does not match with snapshot #7834

Closed
hasbro17 opened this issue Apr 28, 2017 · 2 comments

hasbro17 commented Apr 28, 2017

Overview

If I create a 1-node cluster from a data-dir restored from a snapshot and then add a 2nd member, restarting the 2nd member causes it to exit with an error like this:

2017-04-27 16:10:25.544472 C | etcdmain: database file (bin/infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

Steps to reproduce the issue:

Prepare a simple snapshot, snapshot.db, from a single-member etcd cluster:
Start a temporary cluster on localhost:

$ etcd
2017-04-27 16:37:58.667981 I | etcdmain: etcd Version: 3.1.6
2017-04-27 16:37:58.668081 I | etcdmain: Git SHA: e5b7ee2
2017-04-27 16:37:58.668087 I | etcdmain: Go Version: go1.8
2017-04-27 16:37:58.668089 I | etcdmain: Go OS/Arch: darwin/amd64
. . .

Put some data in the cluster and create a snapshot:

$ ETCDCTL_API=3 etcdctl put foo1 bar1
OK
$ ETCDCTL_API=3 etcdctl snapshot save snapshot.db

Kill the existing cluster once we have the snapshot.

Now restore the data directory infra1.etcd for the 1st member of our new cluster from the snapshot.db file:

ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --name infra1 --initial-cluster infra1=http://127.0.0.1:2380 --initial-advertise-peer-urls http://127.0.0.1:2380

Start the 1st member infra1 of the new cluster using the restored data-dir infra1.etcd:

etcd --data-dir=infra1.etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:2380 --initial-advertise-peer-urls http://127.0.0.1:2380 --initial-cluster 'infra1=http://127.0.0.1:2380' --initial-cluster-state new

Add a 2nd member infra2 to cluster:

ETCDCTL_API=3 etcdctl member add infra2 --peer-urls=http://127.0.0.1:22380

Start the 2nd member:

etcd --data-dir=infra2.etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster 'infra1=http://127.0.0.1:2380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing

Kill and restart the etcd server for the 2nd member infra2:

# ^C to stop the 2nd member

etcd --data-dir=infra2.etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster 'infra1=http://127.0.0.1:2380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing

The 2nd member should exit with the following logs:

2017-04-27 17:01:48.689084 I | etcdmain: etcd Version: 3.1.6
2017-04-27 17:01:48.689187 I | etcdmain: Git SHA: e5b7ee2
2017-04-27 17:01:48.689191 I | etcdmain: Go Version: go1.8
2017-04-27 17:01:48.689195 I | etcdmain: Go OS/Arch: darwin/amd64
2017-04-27 17:01:48.689199 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-04-27 17:01:48.689204 N | etcdmain: failed to detect default host (default host not supported on darwin_amd64)
2017-04-27 17:01:48.689263 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-04-27 17:01:48.689344 I | embed: listening for peers on http://127.0.0.1:22380
2017-04-27 17:01:48.689408 I | embed: listening for client requests on 127.0.0.1:22379
2017-04-27 17:01:48.693297 I | etcdserver: recovered store from snapshot at index 5
2017-04-27 17:01:48.693316 I | etcdserver: name = infra2
2017-04-27 17:01:48.693320 I | etcdserver: data dir = infra2.etcd
2017-04-27 17:01:48.693325 I | etcdserver: member dir = infra2.etcd/member
2017-04-27 17:01:48.693328 I | etcdserver: heartbeat = 100ms
2017-04-27 17:01:48.693332 I | etcdserver: election = 1000ms
2017-04-27 17:01:48.693335 I | etcdserver: snapshot count = 10000
2017-04-27 17:01:48.693345 I | etcdserver: advertise client URLs = http://127.0.0.1:22379
2017-04-27 17:01:48.693827 I | etcdserver: restarting member b5ebdc02693bef64 in cluster 7bdc11851051b492 at commit index 7
2017-04-27 17:01:48.693915 I | raft: b5ebdc02693bef64 became follower at term 18
2017-04-27 17:01:48.693931 I | raft: newRaft b5ebdc02693bef64 [peers: [b5ebdc02693bef64,bf9071f4639c75cc], term: 18, commit: 7, applied: 5, lastindex: 7, lastterm: 18]
2017-04-27 17:01:48.694053 I | etcdserver/api: enabled capabilities for version 3.1
2017-04-27 17:01:48.694069 I | etcdserver/membership: added member bf9071f4639c75cc [http://127.0.0.1:2380] to cluster 7bdc11851051b492 from store
2017-04-27 17:01:48.694073 I | etcdserver/membership: added member b5ebdc02693bef64 [http://127.0.0.1:22380] to cluster 7bdc11851051b492 from store
2017-04-27 17:01:48.694077 I | etcdserver/membership: set the cluster version to 3.1 from store
2017-04-27 17:01:48.700019 C | etcdmain: database file (infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

The snap directories for both data directories look like so:

$ ls infra1.etcd/member/snap/
0000000000000001-0000000000000001.snap  db
$ ls infra2.etcd/member/snap/
0000000000000002-0000000000000005.snap  db
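
The listing above can be checked against the error message. As a minimal sketch, assuming the snapshot file names follow a `<term>-<index>.snap` layout with two 16-digit hexadecimal fields (my reading of the names shown, not something stated in this thread), the hypothetical helper `parseSnapName` below decodes both names so the index can be compared with the "snapshot (index 5)" part of the log:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSnapName splits a name of the assumed form "<term>-<index>.snap"
// (both fields hexadecimal) into its term and index.
// The helper name and the layout are assumptions made for illustration.
func parseSnapName(name string) (uint64, uint64, error) {
	base := strings.TrimSuffix(name, ".snap")
	if base == name {
		return 0, 0, fmt.Errorf("%q has no .snap suffix", name)
	}
	parts := strings.Split(base, "-")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("%q is not of the form <term>-<index>.snap", name)
	}
	term, err := strconv.ParseUint(parts[0], 16, 64)
	if err != nil {
		return 0, 0, err
	}
	index, err := strconv.ParseUint(parts[1], 16, 64)
	if err != nil {
		return 0, 0, err
	}
	return term, index, nil
}

func main() {
	// File names copied from the two `ls` outputs in the report.
	names := []string{
		"0000000000000001-0000000000000001.snap", // infra1
		"0000000000000002-0000000000000005.snap", // infra2
	}
	for _, name := range names {
		term, index, err := parseSnapName(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: term=%d index=%d\n", name, term, index)
	}
}
```

Under that assumption, the infra2 snapshot file carries index 5, which is the same value the failing restart reports for the snapshot, while the db file is reported at index 4.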

fanminshi commented Apr 29, 2017

Reproducing the steps with commit c407e09 appears to cause the infra2 node to hang and stop serving requests, instead of exiting with 2017-04-27 16:10:25.544472 C | etcdmain: database file (bin/infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

@fanminshi

Commit 0054e7e causes the error: etcdmain: database file (bin/infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

gyuho added a commit to gyuho/etcd that referenced this issue May 1, 2017
If 'StartEtcd' returns before starting gRPC server
(e.g. mismatch snapshot, misconfiguration),
receiving from grpcServerC blocks forever. This patch
just closes the channel to not block on grpcServerC,
and proceeds to next stop operations in Close.

This was masking the issues in etcd-io#7834

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
fanminshi added a commit to fanminshi/etcd that referenced this issue May 2, 2017
previously, apply() doesn't set consistIndex for EntryConfChange type.
this causes a misalignment between consistIndex and applied index
where EntryConfChange entry results setting applied index but not consistIndex.

suppose that addMember() is called and leader reflects that change.
1. applied index and consistIndex is now misaligned.
2. a new follower node joined.
3. leader sends the snapshot to follower
	where the applied index is the snapshot metadata index.
4. follower node saves the snapshot and database(includes consistIndex) from leader.
5. restarting follower loads snapshot and database.
6. follower checks snapshot metadata index(same as applied index) and database consistIndex,
	finds them don't match, and then panic.

FIXES etcd-io#7834
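
The sequence in this commit message can be modelled with a toy state machine. This is a sketch under stated assumptions, not the etcd implementation: the types `entry` and `node`, the `fixConfChange` switch, and the log contents (four normal entries followed by a conf change at index 5) are all hypothetical, chosen so the numbers line up with the "index 4" versus "index 5" values in the report. The restart check is modelled as a plain inequality, which may differ from the real comparison.

```go
package main

import "fmt"

type entryType int

const (
	entryNormal entryType = iota
	entryConfChange
)

type entry struct {
	index uint64
	typ   entryType
}

// node is a toy model. appliedIndex stands for the value recorded in the
// snapshot metadata; consistIndex stands for the value stored in the db file.
type node struct {
	appliedIndex uint64
	consistIndex uint64
}

// apply advances both indexes. With fixConfChange == false it mimics the
// behaviour described in the commit message: conf-change entries move
// appliedIndex but leave consistIndex untouched.
func (n *node) apply(ents []entry, fixConfChange bool) {
	for _, e := range ents {
		if e.typ == entryNormal || fixConfChange {
			n.consistIndex = e.index
		}
		n.appliedIndex = e.index
	}
}

// checkRestart models step 6: the restarting follower compares the db
// index with the snapshot index (simplified here to an inequality).
func checkRestart(dbIndex, snapIndex uint64) error {
	if dbIndex != snapIndex {
		return fmt.Errorf("database file (index %d) does not match with snapshot (index %d)",
			dbIndex, snapIndex)
	}
	return nil
}

func main() {
	// Hypothetical log: entry 5 is the conf change produced by adding a member.
	ents := []entry{
		{1, entryNormal}, {2, entryNormal}, {3, entryNormal}, {4, entryNormal},
		{5, entryConfChange},
	}
	for _, fixed := range []bool{false, true} {
		leader := &node{}
		leader.apply(ents, fixed)
		// The follower receives the leader's snapshot (appliedIndex) and db
		// (consistIndex), then validates them on restart.
		err := checkRestart(leader.consistIndex, leader.appliedIndex)
		fmt.Printf("fixed=%v applied=%d consist=%d err=%v\n",
			fixed, leader.appliedIndex, leader.consistIndex, err)
	}
}
```

In this model the mismatch only appears on restart, because that is the first time the follower compares the two values it received from the leader.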
gyuho pushed a commit that referenced this issue May 3, 2017
previously, apply() doesn't set consistIndex for EntryConfChange type.
this causes a misalignment between consistIndex and applied index
where EntryConfChange entry results setting applied index but not consistIndex.

suppose that addMember() is called and leader reflects that change.
1. applied index and consistIndex is now misaligned.
2. a new follower node joined.
3. leader sends the snapshot to follower
	where the applied index is the snapshot metadata index.
4. follower node saves the snapshot and database(includes consistIndex) from leader.
5. restarting follower loads snapshot and database.
6. follower checks snapshot metadata index(same as applied index) and database consistIndex,
	finds them don't match, and then panic.

FIXES #7834