-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock between Shutdown() and takeSnapshot() #277
Fix deadlock between Shutdown() and takeSnapshot() #277
Conversation
This change makes sure that a takeSnapshot() call which is run about at the same time as Shutdown() does not block indefinitely on a ConfigurationFuture request. The situation can happen if takeSnapshot() manages to push the request to configurationCh (which is buffered) but the follower/candidate/leader loop exists before processing it, in turn making the shutdownFuture.Error() returned by Shutdown() block indefinitely (since it indirectly waits for takeSnapshot to finish). With this change the sample program attached to hashicorp#268 doesn't fail anymore.
@freeekanayaka I think I understand the issue.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code looks to work but why not do something like this:
diff --git a/future.go b/future.go
index fac59a5..448b7ed 100644
--- a/future.go
+++ b/future.go
@@ -81,6 +81,7 @@ type deferError struct {
err error
errCh chan error
responded bool
+ ShutdownCh chan struct{}
}
func (d *deferError) init() {
@@ -97,7 +98,11 @@ func (d *deferError) Error() error {
if d.errCh == nil {
panic("waiting for response on nil channel")
}
- d.err = <-d.errCh
+ select {
+ case d.err = <- d.errCh:
+ case <- d.ShutdownCh:
+ d.err = ErrRaftShutdown
+ }
return d.err
}
diff --git a/snapshot.go b/snapshot.go
index 5287ebc..57c336e 100644
--- a/snapshot.go
+++ b/snapshot.go
@@ -146,6 +146,7 @@ func (r *Raft) takeSnapshot() (string, error) {
// We have to use the future here to safely get this information since
// it is owned by the main thread.
configReq := &configurationsFuture{}
+ configReq.ShutdownCh = r.shutdownCh
configReq.init()
select {
case r.configurationsCh <- configReq:
That diff should allow Raft to cancel arbitrary deferError when shutting down (if the ShutdownCh is set).
@mkeeler your diff sounds fine to me (although if I had to make a nit, it'd be probably better to wait for at least one more use case of something like deferError.ShutdownCh before introducing such an ad-hoc API). Please make sure that the sample program I attached to #277 does not crash anymore with your diff applied. |
Hey there, |
Hey there, |
I don't know if it still applies, it surely did when I pushed the PR. I had even attached a program to reproduce the issue, see #268, which you can use to check it. I don't use hashicorp/raft anymore these days. |
Hey guys! Are there any plans to fix this issue? example code mentioned in #268, still can reproduce the issue from actual master branch, and I found this bug while searching for some clues on why nomad server is freezing sometimes in our system, seems like a go routine responsible for dealing with allocations hangs, and then found raft calls inside. Not sure if this is a reason of our nomad freezes, but at least this bug still can be reproduced. Thanks! |
Will look at this soon. |
This change makes sure that a takeSnapshot() call which is run about
at the same time as Shutdown() does not block indefinitely on a
ConfigurationFuture request.
The situation can happen if takeSnapshot() manages to push the request
to configurationCh (which is buffered) but the
follower/candidate/leader loop exists before processing it, in turn
making the shutdownFuture.Error() returned by Shutdown() block
indefinitely (since it indirectly waits for takeSnapshot to finish).
With this change the sample program attached to #268 doesn't fail
anymore.