Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: renaming db happens after snapshot persists to wal and snap files #7876

Merged
merged 2 commits into from
May 9, 2017

Conversation

fanminshi
Copy link
Member

@fanminshi fanminshi commented May 4, 2017

In the case that follower recieves a snapshot from leader
and crashes before renaming xxx.snap.db to db but after
snapshot has persisted to .wal and .snap, restarting
follower results loading old db, new .wal, and new .snap.
This will cause a index mismatch between snap metadata index
and consistent index from db.

This pr forces an ordering where saving/renaming db must
happen after snapshot is persisted to wal and snap file.
this guarantees wal and snap files are newer than db.
on server restart, etcd server checks if snap index > db consistent index.
if yes, etcd server attempts to load xxx.snap.db where xxx=snap index
if there is any and panic other wise.

FIXES #7628

@@ -95,3 +96,19 @@ func (nc *notifier) notify(err error) {
nc.err = err
close(nc.c)
}

func loadBackend(bepath string, be *backend.Backend, QuotaBackendBytes int64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just return backend.Backend?
And s/QuotaBackendBytes/quotaBackendBytes/?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that.

snap/db.go Outdated
@@ -52,23 +55,26 @@ func (s *Snapshotter) SaveDBFrom(r io.Reader, id uint64) (int64, error) {
return n, err
}

plog.Infof("saved database snapshot to disk [total bytes: %d]", n)

plog.Infof("saved database snapshot %v to disk [total bytes: %d]", fn, n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%s to disk?

// 1. check if xxx.snap.db (xxx==snapshot.Metadata.Index) exists.
// 2. rename xxx.snap.db to db if exists.
// 3. load backend again with the new db file.
snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just use DBFilePath?

Copy link
Member Author

@fanminshi fanminshi May 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to use snap.DBFilePath() function? func (s *Snapshotter) DBFilePath(id uint64) (string, error) is tied to an object. that's why I created another GetDBFilePathByID() that's a class function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok, we can't do s.r.storage.DBFilePath because we haven't created EtcdServer yet.

@fanminshi
Copy link
Member Author

all fixed @gyuho

if snapfn != "" {
if err := os.Rename(snapfn, bepath); err != nil {
plog.Panicf("rename snapshot file error: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we close existing be before we rename and re-open?

Defer to @xiang90 and @heyitsanthony

Thanks!

@heyitsanthony
Copy link
Contributor

Why doesn't etcd rename the db snapshot before the raft snapshot is written down? Patching over a lost rename after the raft snapshot is saved could be masking atomicity problems with the snapshot path.

@fanminshi
Copy link
Member Author

@heyitsanthony

Saving snapshot from the leader and renaming it is not an atomic procedure in the original code.

  1. Follower saves snapshot from leader with format xxx.snap.db at https://github.com/coreos/etcd/blob/master/rafthttp/http.go#L205

  2. Follower then renames xxx.snap.db to db at here.
    https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L812

The node from #7628 has proceeded to step 1 but crashed before step 2 happened.

@heyitsanthony
Copy link
Contributor

@fanminshi yes, that's what the code does. I'm asking if the snapshot saving protocol is correct since raft is saying a snapshot was applied but on reboot it turns out it is not since the rename didn't take.

@fanminshi
Copy link
Member Author

I look at the code again. It appears to me that the saving protocol is not correct.

saving .wal and .snap from incoming snapshot happens in a separate go routine than renaming xxx.snap.db to db; Hence, there is not way to guarantee db, .wal, and .snap are in synced.

// save snapshot to .wal and .snap
if !raft.IsEmptySnap(rd.Snapshot) {
	// gofail: var raftBeforeSaveSnap struct{}
	if err := r.storage.SaveSnap(rd.Snapshot); err != nil {
		plog.Fatalf("raft save snapshot error: %v", err)
	}
	// gofail: var raftAfterSaveSnap struct{}
	r.raftStorage.ApplySnapshot(rd.Snapshot)
	plog.Infof("raft applied incoming snapshot at index %d", rd.Snapshot.Metadata.Index)
	// gofail: var raftAfterApplySnap struct{}
}

https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L227-L231

Saving protocol must guarantees .wal, .snap, db are saved in one atomic operation, or else the situations, (db is new and .wal and .snap is old) and (db is old and .wal and .snap is new as shown in #7628), can happen.

I am not sure if that guarantee can be achieved.
one solution is to minimize the spanning window of saving .wal, .snap, db so the chance of diverge is small.
another solution is optimistically recover on restart.

@heyitsanthony
Copy link
Contributor

Saving protocol must guarantees .wal, .snap, db are saved in one atomic operation

It cannot and that is not necessary. All that's needed is an ordering so that etcd won't try to load the snapshot before it's been entirely applied to disk.

one solution is to minimize the spanning window of saving .wal, .snap, db so the chance of diverge is small.

This will lead to corruption/data loss.

another solution is optimistically recover on restart.

This will lead to corruption/data loss.

@fanminshi
Copy link
Member Author

@heyitsanthony the ordering I am thinking about is to save/renaming db before .wal and .snap is saved.

case: follower didn't not save db and crashed.
Ok: restart follower will use on old db and old .wal and leader will send snapshot again.

case: follower saved new db and crashed before saving .wal and .snap.
db is new but .wal and .snap is old. I am not sure how follower should handle this case. From my testing, I saw follower panic in raft layer if the follower is brand new.(I need to investigate this one more)

case: follower save both db and .wal and .snap:
Ok: follower should restart without any problem.

@heyitsanthony
Copy link
Contributor

@fanminshi whatever write-out policy guaranteeing the reboot the invariant snapshot.Metadata.Index <= db.ConsistentIndex() (checked here: https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L457) when loading from the starting db file should be OK.

@codecov-io
Copy link

codecov-io commented May 4, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@a53a9e1). Click here to learn what that means.
The diff coverage is 60.86%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #7876   +/-   ##
=========================================
  Coverage          ?   75.62%           
=========================================
  Files             ?      332           
  Lines             ?    26316           
  Branches          ?        0           
=========================================
  Hits              ?    19901           
  Misses            ?     4979           
  Partials          ?     1436
Impacted Files Coverage Δ
etcdserver/util.go 78.46% <53.33%> (ø)
etcdserver/server.go 80.43% <66.66%> (ø)
snap/db.go 66.66% <80%> (ø)
etcdserver/raft.go 87.87% <80%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a53a9e1...dfdaf08. Read the comment docs.

@fanminshi fanminshi changed the title etcdserver: renames xxx.snap.db to db in NewServer() etcdserver: renaming db happens before snapshot persists to wal and snap files May 4, 2017
@fanminshi
Copy link
Member Author

All fixed. PTAL

@@ -223,6 +226,8 @@ func (r *raftNode) start(rh *raftReadyHandler) {
// gofail: var raftAfterSave struct{}

if !raft.IsEmptySnap(rd.Snapshot) {
// waits etcd server to finish renaming snap db to db.
<-replaceDBDone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this reuse raftDone instead of needing a separate channel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems to me that the purpose of raftDone is to notified etcdserver that raft has persisted log.

The purpose of replaceDBDone is to notify raft that etcdserver has rename the snapshot db to db.

Both chans serve different purposes. I don't see a good way to control above logic with one chan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The applier will always post to raftDone if there's a snapshot and the raft loop will always wait on raftDone if there's a snapshot. What's to control? I want to avoid the extra channel allocation on this path if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk, it seems to me the raftDone should act as execution flow chan between etcd server and raft since both need to send msg to it. I am not sure if the name raftDone make sense anymore.

@heyitsanthony
Copy link
Contributor

test case?

@fanminshi
Copy link
Member Author

@heyitsanthony I manually injection of panic before renaming db https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L812 to simulate #7628.

is there a simple way to programmatically simulate

  1. leader transfer snap to a follower
  2. follower panic before renaming
  3. follower restart with a new binary that doesn't have the panic.

@heyitsanthony
Copy link
Contributor

@fanminshi it might be easier to use the Recorder mocks to test the order of operations is obeyed. Something like TestSnapshot in server_test.go.

@fanminshi fanminshi force-pushed the fix_7628 branch 2 times, most recently from 7590bf3 to 1e58b07 Compare May 5, 2017 21:05
@fanminshi
Copy link
Member Author

Add a test. PTAL

}
}

// func TestSnapshotOrdering2(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete


testdir, err := ioutil.TempDir(os.TempDir(), "testsnapdir")
if err != nil {
t.Fatalf("Couldn't open tempdir (%v)", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/C/c; error messages are lower case

}
defer os.RemoveAll(testdir)
if err := os.MkdirAll(testdir+"/member/snap", 0755); err != nil {
t.Fatalf("Couldn't make snap dir (%v)", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/C/c

seenDBFilePath = true
case "SaveSnap":
if !seenDBFilePath {
t.Fatalf("expect DBFilePath calls before SaveSnap, but it is other way around")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better error message please

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"SaveSnap() calls before DBFilePath(), but it shouldn't." ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SaveSnap called before DBFilePath

n.readyc <- ready
var seenDBFilePath bool
timer := time.After(5 * time.Second)
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why loop? is the sequence nondeterministic?

Copy link
Member Author

@fanminshi fanminshi May 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Save() https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L217 can be called before or after DBFilePath().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sequence can be [Save(), DBFilePath(), SaveSnap()] or [ DBFilePath(), Save(), SaveSnap()]. all I care is that DBFilePath() happens before SaveSnap()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I got that from the first comment

s.applyV2 = &applierV2store{store: s.store, cluster: s.cluster}

be, tmpPath := backend.NewDefaultTmpBackend()
defer func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer os.RemoveAll(tmpPath)

@@ -83,7 +83,7 @@ type RaftTimer interface {
type apply struct {
entries []raftpb.Entry
snapshot raftpb.Snapshot
raftDone <-chan struct{} // rx {} after raft has persisted messages
notifyc chan struct{} // notifyc acts a bridge between etcdserver and raftNode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// notifyc synchronizes etcd server applies with the raft node
notifyc chan struct{}

@@ -223,6 +223,8 @@ func (r *raftNode) start(rh *raftReadyHandler) {
// gofail: var raftAfterSave struct{}

if !raft.IsEmptySnap(rd.Snapshot) {
// waits etcd server to finish renaming snap db to db.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// wait for snapshot db to be renamed to in-use db

@@ -812,6 +812,8 @@ func (s *EtcdServer) applySnapshot(ep *etcdProgress, apply *apply) {
if err := os.Rename(snapfn, fn); err != nil {
plog.Panicf("rename snapshot file error: %v", err)
}
// notifies raftNode that db has been replaced.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// raft can now claim the snapshot has been applied

@@ -951,6 +951,149 @@ func TestSnapshot(t *testing.T) {
<-ch
}

// TestSnapshotOrdering ensures that when applying snapshot etcdserver renames snap db to db before raft persists snapshot to wal and snap files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TestSnapshotOrdering ensures the snapshot db is applied before acknowledging the snapshot in raft

@fanminshi
Copy link
Member Author

All fixed. PTAL

}

rs := raft.NewMemoryStorage()
p := mockstorage.NewStorageRecorderStream(testdir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p := mockstorage.NewStorageRecorder(testdir)

otherwise hangs on failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess not, NewStorageRecorder doesn't work with the fix...

@fanminshi
Copy link
Member Author

merge when green.

@heyitsanthony
Copy link
Contributor

@fanminshi is this ready to merge?

@xiang90
Copy link
Contributor

xiang90 commented May 8, 2017

@heyitsanthony @fanminshi

I have a concern on this approach. If we blindly pick up the db file with the current snapshot recovery ordering (db -> snap), we might end up with starting the server with an old snap file and a new db.

Then we have a storage state that is ahead of the raft index.

We probably need to either rollback db file or move forward with the unfinished snapshot process during a restart when etcd server failed in the process of snapshot recovery.

@fanminshi
Copy link
Member Author

I think that storage state that's ahead of raft index doesn't make sense; we expect db state <= raft index . I am working on a new fix to address @xiang90 concerns.

The order right now is, 1. save db snapshot xxx.snap.db, 2. rename xxx.snap.db to db, 3. persists snapshot to snap and wal.

If etcd server fails before 3, then db is ahead of raft.

Suppose the new ordering is 1. save db snapshot xxx.snap.db, 2. persists snapshot to snap and wal, 3. rename xxx.snap.db to db.

  • failure at before and after 1 triggers leader to resent snapshot.

  • failure after 2 and before 3 results snap and wal being newer than db. Since step 1 must happen before step 2, all we need to do is to rename xxx.snap.db to db and load that on server restart.

  • failure after 3 is fine since etcd server will load the most recent db and snap on restart.

@fanminshi
Copy link
Member Author

fix db state ahead of raft issue. PTAL

snap/db.go Outdated
return GetDBFilePathByID(s.dir, id)
}

func GetDBFilePathByID(dbPath string, id uint64) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid public functions when not used outside of package

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dbFilePathFromID

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needs to export this function so that etcdserver can use it to find xxx.snap.db.

@fanminshi fanminshi changed the title etcdserver: renaming db happens before snapshot persists to wal and snap files etcdserver: renaming db happens after snapshot persists to wal and snap files May 8, 2017
@@ -95,3 +96,20 @@ func (nc *notifier) notify(err error) {
nc.err = err
close(nc.c)
}

func loadBackend(bepath string, quotaBackendBytes int64) (be backend.Backend) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

openBackend?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change.

@@ -385,6 +372,25 @@ func NewServer(cfg *ServerConfig) (srv *EtcdServer, err error) {
plog.Panicf("recovered store from snapshot error: %v", err)
}
plog.Infof("recovered store from snapshot at index %d", snapshot.Metadata.Index)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably move this part to a new func.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

kv := mvcc.New(be, &lease.FakeLessor{}, &cIndex)
kvindex := kv.ConsistentIndex()
if snapshot.Metadata.Index > kvindex {
snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heyitsanthony usingGetDBFilePathByID() at here.

if snapshot.Metadata.Index > kvindex {
snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index)
if err != nil {
plog.Panic(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a better panic message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"finding xxx.snap.db error: w/e"?

plog.Panic(err)
}
if err := os.Rename(snapfn, bepath); err != nil {
plog.Panicf("rename snapshot file error: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a better panic message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need more context on this error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"renaming xxx.snap.db to db error: w/e"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error recovering etcd from snapshot: failed to...

@fanminshi fanminshi force-pushed the fix_7628 branch 2 times, most recently from 813a678 to cf77919 Compare May 9, 2017 17:56
@fanminshi
Copy link
Member Author

all fixed. PTAL

@@ -15,11 +15,19 @@
package etcdserver

import (
"os"
"time"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gofmt this blank line?

@xiang90
Copy link
Contributor

xiang90 commented May 9, 2017

@fanminshi LGTM in general. Defer to @gyuho and @heyitsanthony

// gofail: var raftAfterSaveSnap struct{}
r.raftStorage.ApplySnapshot(rd.Snapshot)
plog.Infof("raft applied incoming snapshot at index %d", rd.Snapshot.Metadata.Index)
// gofail: var raftAfterApplySnap struct{}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this line?

@heyitsanthony
Copy link
Contributor

lgtm after fixing line formatting suggestions

fanminshi added 2 commits May 9, 2017 14:00
…ap files

In the case that follower recieves a snapshot from leader
and crashes before renaming xxx.snap.db to db but after
snapshot has persisted to .wal and .snap, restarting
follower results loading old db, new .wal, and new .snap.
This will causes a index mismatch between snap metadata index
and consistent index from db.

This pr forces an ordering where saving/renaming db must
happen after snapshot is persisted to wal and snap file.
this guarantees wal and snap files are newer than db.
on server restart, etcd server checks if snap index > db consistent index.
if yes, etcd server attempts to load xxx.snap.db where xxx=snap index
if there is any and panic other wise.

FIXES etcd-io#7628
@fanminshi
Copy link
Member Author

fixed formatting issue. merging when green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants