-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
etcdserver: renaming db happens after snapshot persists to wal and snap files #7876
Conversation
etcdserver/util.go
Outdated
@@ -95,3 +96,19 @@ func (nc *notifier) notify(err error) { | |||
nc.err = err | |||
close(nc.c) | |||
} | |||
|
|||
func loadBackend(bepath string, be *backend.Backend, QuotaBackendBytes int64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just return backend.Backend
?
And s/QuotaBackendBytes/quotaBackendBytes/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do that.
snap/db.go
Outdated
@@ -52,23 +55,26 @@ func (s *Snapshotter) SaveDBFrom(r io.Reader, id uint64) (int64, error) { | |||
return n, err | |||
} | |||
|
|||
plog.Infof("saved database snapshot to disk [total bytes: %d]", n) | |||
|
|||
plog.Infof("saved database snapshot %v to disk [total bytes: %d]", fn, n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%s to disk
?
etcdserver/server.go
Outdated
// 1. check if xxx.snap.db (xxx==snapshot.Metadata.Index) exists. | ||
// 2. rename xxx.snap.db to db if exists. | ||
// 3. load backend again with the new db file. | ||
snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use DBFilePath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to use snap.DBFilePath() function? func (s *Snapshotter) DBFilePath(id uint64) (string, error)
is tied to an object. that's why I created another GetDBFilePathByID() that's a class function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, we can't do s.r.storage.DBFilePath
because we haven't created EtcdServer
yet.
all fixed @gyuho |
etcdserver/server.go
Outdated
if snapfn != "" { | ||
if err := os.Rename(snapfn, bepath); err != nil { | ||
plog.Panicf("rename snapshot file error: %v", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we close existing be
before we rename and re-open?
Defer to @xiang90 and @heyitsanthony
Thanks!
Why doesn't etcd rename the db snapshot before the raft snapshot is written down? Patching over a lost rename after the raft snapshot is saved could be masking atomicity problems with the snapshot path. |
Saving snapshot from the leader and renaming it is not an atomic procedure in the original code.
The node from #7628 has proceeded to step 1 but crashed before step 2 happened. |
@fanminshi yes, that's what the code does. I'm asking if the snapshot saving protocol is correct since raft is saying a snapshot was applied but on reboot it turns out it is not since the rename didn't take. |
I look at the code again. It appears to me that the saving protocol is not correct. saving .wal and .snap from incoming snapshot happens in a separate go routine than renaming xxx.snap.db to db; Hence, there is not way to guarantee db, .wal, and .snap are in synced. // save snapshot to .wal and .snap
if !raft.IsEmptySnap(rd.Snapshot) {
// gofail: var raftBeforeSaveSnap struct{}
if err := r.storage.SaveSnap(rd.Snapshot); err != nil {
plog.Fatalf("raft save snapshot error: %v", err)
}
// gofail: var raftAfterSaveSnap struct{}
r.raftStorage.ApplySnapshot(rd.Snapshot)
plog.Infof("raft applied incoming snapshot at index %d", rd.Snapshot.Metadata.Index)
// gofail: var raftAfterApplySnap struct{}
} https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L227-L231 Saving protocol must guarantees .wal, .snap, db are saved in one atomic operation, or else the situations, (db is new and .wal and .snap is old) and (db is old and .wal and .snap is new as shown in #7628), can happen. I am not sure if that guarantee can be achieved. |
It cannot and that is not necessary. All that's needed is an ordering so that etcd won't try to load the snapshot before it's been entirely applied to disk.
This will lead to corruption/data loss.
This will lead to corruption/data loss. |
@heyitsanthony the ordering I am thinking about is to save/renaming db before .wal and .snap is saved. case: follower didn't not save db and crashed. case: follower saved new db and crashed before saving .wal and .snap. case: follower save both db and .wal and .snap: |
@fanminshi whatever write-out policy guaranteeing the reboot the invariant |
Codecov Report
@@ Coverage Diff @@
## master #7876 +/- ##
=========================================
Coverage ? 75.62%
=========================================
Files ? 332
Lines ? 26316
Branches ? 0
=========================================
Hits ? 19901
Misses ? 4979
Partials ? 1436
Continue to review full report at Codecov.
|
All fixed. PTAL |
etcdserver/raft.go
Outdated
@@ -223,6 +226,8 @@ func (r *raftNode) start(rh *raftReadyHandler) { | |||
// gofail: var raftAfterSave struct{} | |||
|
|||
if !raft.IsEmptySnap(rd.Snapshot) { | |||
// waits etcd server to finish renaming snap db to db. | |||
<-replaceDBDone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this reuse raftDone
instead of needing a separate channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems to me that the purpose of raftDone
is to notified etcdserver that raft has persisted log.
The purpose of replaceDBDone
is to notify raft that etcdserver has rename the snapshot db to db.
Both chans serve different purposes. I don't see a good way to control above logic with one chan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The applier will always post to raftDone if there's a snapshot and the raft loop will always wait on raftDone if there's a snapshot. What's to control? I want to avoid the extra channel allocation on this path if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk, it seems to me the raftDone should act as execution flow chan between etcd server and raft since both need to send msg to it. I am not sure if the name raftDone
make sense anymore.
test case? |
@heyitsanthony I manually injection of panic before renaming db https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L812 to simulate #7628. is there a simple way to programmatically simulate
|
@fanminshi it might be easier to use the Recorder mocks to test the order of operations is obeyed. Something like |
7590bf3
to
1e58b07
Compare
Add a test. PTAL |
etcdserver/server_test.go
Outdated
} | ||
} | ||
|
||
// func TestSnapshotOrdering2(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete
etcdserver/server_test.go
Outdated
|
||
testdir, err := ioutil.TempDir(os.TempDir(), "testsnapdir") | ||
if err != nil { | ||
t.Fatalf("Couldn't open tempdir (%v)", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/C/c; error messages are lower case
etcdserver/server_test.go
Outdated
} | ||
defer os.RemoveAll(testdir) | ||
if err := os.MkdirAll(testdir+"/member/snap", 0755); err != nil { | ||
t.Fatalf("Couldn't make snap dir (%v)", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/C/c
etcdserver/server_test.go
Outdated
seenDBFilePath = true | ||
case "SaveSnap": | ||
if !seenDBFilePath { | ||
t.Fatalf("expect DBFilePath calls before SaveSnap, but it is other way around") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better error message please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"SaveSnap() calls before DBFilePath(), but it shouldn't."
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SaveSnap called before DBFilePath
etcdserver/server_test.go
Outdated
n.readyc <- ready | ||
var seenDBFilePath bool | ||
timer := time.After(5 * time.Second) | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why loop? is the sequence nondeterministic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Save() https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L217 can be called before or after DBFilePath().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sequence can be [Save(), DBFilePath(), SaveSnap()]
or [ DBFilePath(), Save(), SaveSnap()]
. all I care is that DBFilePath()
happens before SaveSnap()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I got that from the first comment
etcdserver/server_test.go
Outdated
s.applyV2 = &applierV2store{store: s.store, cluster: s.cluster} | ||
|
||
be, tmpPath := backend.NewDefaultTmpBackend() | ||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer os.RemoveAll(tmpPath)
etcdserver/raft.go
Outdated
@@ -83,7 +83,7 @@ type RaftTimer interface { | |||
type apply struct { | |||
entries []raftpb.Entry | |||
snapshot raftpb.Snapshot | |||
raftDone <-chan struct{} // rx {} after raft has persisted messages | |||
notifyc chan struct{} // notifyc acts a bridge between etcdserver and raftNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// notifyc synchronizes etcd server applies with the raft node
notifyc chan struct{}
etcdserver/raft.go
Outdated
@@ -223,6 +223,8 @@ func (r *raftNode) start(rh *raftReadyHandler) { | |||
// gofail: var raftAfterSave struct{} | |||
|
|||
if !raft.IsEmptySnap(rd.Snapshot) { | |||
// waits etcd server to finish renaming snap db to db. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// wait for snapshot db to be renamed to in-use db
etcdserver/server.go
Outdated
@@ -812,6 +812,8 @@ func (s *EtcdServer) applySnapshot(ep *etcdProgress, apply *apply) { | |||
if err := os.Rename(snapfn, fn); err != nil { | |||
plog.Panicf("rename snapshot file error: %v", err) | |||
} | |||
// notifies raftNode that db has been replaced. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// raft can now claim the snapshot has been applied
etcdserver/server_test.go
Outdated
@@ -951,6 +951,149 @@ func TestSnapshot(t *testing.T) { | |||
<-ch | |||
} | |||
|
|||
// TestSnapshotOrdering ensures that when applying snapshot etcdserver renames snap db to db before raft persists snapshot to wal and snap files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TestSnapshotOrdering ensures the snapshot db is applied before acknowledging the snapshot in raft
All fixed. PTAL |
} | ||
|
||
rs := raft.NewMemoryStorage() | ||
p := mockstorage.NewStorageRecorderStream(testdir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p := mockstorage.NewStorageRecorder(testdir)
otherwise hangs on failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
guess not, NewStorageRecorder
doesn't work with the fix...
merge when green. |
@fanminshi is this ready to merge? |
I have a concern on this approach. If we blindly pick up the db file with the current snapshot recovery ordering (db -> snap), we might end up with starting the server with an old snap file and a new db. Then we have a storage state that is ahead of the raft index. We probably need to either rollback db file or move forward with the unfinished snapshot process during a restart when etcd server failed in the process of snapshot recovery. |
I think that storage state that's ahead of raft index doesn't make sense; we expect db state <= raft index . I am working on a new fix to address @xiang90 concerns. The order right now is, 1. save db snapshot xxx.snap.db, 2. rename xxx.snap.db to db, 3. persists snapshot to snap and wal. If etcd server fails before 3, then db is ahead of raft. Suppose the new ordering is 1. save db snapshot xxx.snap.db, 2. persists snapshot to snap and wal, 3. rename xxx.snap.db to db.
|
fix db state ahead of raft issue. PTAL |
snap/db.go
Outdated
return GetDBFilePathByID(s.dir, id) | ||
} | ||
|
||
func GetDBFilePathByID(dbPath string, id uint64) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid public functions when not used outside of package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dbFilePathFromID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needs to export this function so that etcdserver can use it to find xxx.snap.db.
etcdserver/util.go
Outdated
@@ -95,3 +96,20 @@ func (nc *notifier) notify(err error) { | |||
nc.err = err | |||
close(nc.c) | |||
} | |||
|
|||
func loadBackend(bepath string, quotaBackendBytes int64) (be backend.Backend) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openBackend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change.
@@ -385,6 +372,25 @@ func NewServer(cfg *ServerConfig) (srv *EtcdServer, err error) { | |||
plog.Panicf("recovered store from snapshot error: %v", err) | |||
} | |||
plog.Infof("recovered store from snapshot at index %d", snapshot.Metadata.Index) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably move this part to a new func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
etcdserver/server.go
Outdated
kv := mvcc.New(be, &lease.FakeLessor{}, &cIndex) | ||
kvindex := kv.ConsistentIndex() | ||
if snapshot.Metadata.Index > kvindex { | ||
snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heyitsanthony usingGetDBFilePathByID()
at here.
etcdserver/server.go
Outdated
if snapshot.Metadata.Index > kvindex { | ||
snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index) | ||
if err != nil { | ||
plog.Panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a better panic message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"finding xxx.snap.db error: w/e"?
etcdserver/server.go
Outdated
plog.Panic(err) | ||
} | ||
if err := os.Rename(snapfn, bepath); err != nil { | ||
plog.Panicf("rename snapshot file error: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a better panic message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need more context on this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"renaming xxx.snap.db to db error: w/e"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error recovering etcd from snapshot: failed to...
813a678
to
cf77919
Compare
all fixed. PTAL |
etcdserver/util.go
Outdated
@@ -15,11 +15,19 @@ | |||
package etcdserver | |||
|
|||
import ( | |||
"os" | |||
"time" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gofmt this blank line?
@fanminshi LGTM in general. Defer to @gyuho and @heyitsanthony |
etcdserver/raft.go
Outdated
// gofail: var raftAfterSaveSnap struct{} | ||
r.raftStorage.ApplySnapshot(rd.Snapshot) | ||
plog.Infof("raft applied incoming snapshot at index %d", rd.Snapshot.Metadata.Index) | ||
// gofail: var raftAfterApplySnap struct{} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this line?
lgtm after fixing line formatting suggestions |
…ap files In the case that follower recieves a snapshot from leader and crashes before renaming xxx.snap.db to db but after snapshot has persisted to .wal and .snap, restarting follower results loading old db, new .wal, and new .snap. This will causes a index mismatch between snap metadata index and consistent index from db. This pr forces an ordering where saving/renaming db must happen after snapshot is persisted to wal and snap file. this guarantees wal and snap files are newer than db. on server restart, etcd server checks if snap index > db consistent index. if yes, etcd server attempts to load xxx.snap.db where xxx=snap index if there is any and panic other wise. FIXES etcd-io#7628
…g wal and snap files
fixed formatting issue. merging when green. |
In the case that follower recieves a snapshot from leader
and crashes before renaming xxx.snap.db to db but after
snapshot has persisted to .wal and .snap, restarting
follower results loading old db, new .wal, and new .snap.
This will cause a index mismatch between snap metadata index
and consistent index from db.
This pr forces an ordering where saving/renaming db must
happen after snapshot is persisted to wal and snap file.
this guarantees wal and snap files are newer than db.
on server restart, etcd server checks if snap index > db consistent index.
if yes, etcd server attempts to load xxx.snap.db where xxx=snap index
if there is any and panic other wise.
FIXES #7628