
fix(raft/log): truncate file and reset offset correctly #830

Merged 1 commit into etcd-io:master on Jun 5, 2014

Conversation

yichengq (Contributor) commented Jun 5, 2014

@xiangli-cmu

xiang90 (Contributor) commented Jun 5, 2014

lgtm

@philips This fixes the problem that probably causes #829.

yichengq added a commit that referenced this pull request Jun 5, 2014
fix(raft/log): truncate file and reset offset correctly
yichengq merged commit 757bb3a into etcd-io:master on Jun 5, 2014
philips (Contributor) commented Jun 6, 2014

oh, dang. good fix.

ongardie commented

Sorry I'm a bit late on this one and completely off-topic, but I noticed the notification for this PR and got worried: it would be dangerous for a server with a corrupt disk to truncate entries off the end of its log and then continue participating in the cluster.

What's this code trying to do? What are the errors in decoding that result in the file being truncated? A few more comments could help.

yichengq (Contributor, Author) commented

@ongardie
This code mainly fixes an implementation bug: even after the file is truncated, its I/O offset doesn't change, so new log entries are appended at the old position. This could be a big problem the next time the log is loaded, because the stale data may confuse the protobuf parsing.
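
To make the problem concrete, here is a minimal Go sketch of the pattern the fix needs; the function and parameter names are hypothetical and this is not the actual etcd code:

```go
package raftlog

import (
	"io"
	"os"
)

// truncateAt cuts the on-disk log back to the last well-formed entry and,
// crucially, resets the file's I/O offset so that the next append starts
// at the new end of the file rather than the old position.
func truncateAt(f *os.File, lastGoodOffset int64) error {
	if err := f.Truncate(lastGoodOffset); err != nil {
		return err
	}
	// Without this Seek, writes would continue from the pre-truncation
	// offset and leave a gap of stale bytes in the file.
	_, err := f.Seek(lastGoodOffset, io.SeekStart)
	return err
}
```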

The error comes from a protobuf parse failure. It is highly likely that the machine was rebooted while syncing the latest entries to disk, so only part of them was recorded.
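
For illustration, a hedged sketch of how such a partial write might be detected while loading the log, assuming a simple length-prefixed record layout (an assumption made for this example; etcd's actual on-disk encoding may differ):

```go
package raftlog

import (
	"encoding/binary"
	"errors"
	"io"
	"os"
)

// recoverOffset returns the offset just past the last record that could be
// read in full. A reboot in the middle of a sync typically leaves a short
// final record, which shows up here as an (unexpected) EOF.
func recoverOffset(f *os.File) (int64, error) {
	var good int64
	for {
		var size uint64
		if err := binary.Read(f, binary.LittleEndian, &size); err != nil {
			if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
				return good, nil // partial length prefix: cut the log here
			}
			return good, err
		}
		// Real code would sanity-check size before allocating.
		buf := make([]byte, size)
		if _, err := io.ReadFull(f, buf); err != nil {
			if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
				return good, nil // partial record body: cut the log here
			}
			return good, err
		}
		// A real implementation would also protobuf-unmarshal buf and stop
		// at the first record that fails to decode.
		off, err := f.Seek(0, io.SeekCurrent)
		if err != nil {
			return good, err
		}
		good = off
	}
}
```

The returned offset would then feed a truncate-and-seek step like the one sketched above.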

In most cases this should not be a problem for the cluster, because the other members can recover the log.

But it would be terrible if some entries were really lost across the whole cluster. We could print a warning on truncation, but I don't know whether there is a better solution. One way I can think of is to add a flag that indicates missing log entries when rejoining the cluster, to ensure safety.

ongardie commented

Yeah, it's really hard to tell whether the disk was corrupted before or after acking the write. I'd definitely add a warning here as a minimum.

yichengq deleted the 98 branch on December 7, 2014 at 17:20