Fix deadlock between Shutdown() and takeSnapshot() #277

Conversation

freeekanayaka (Contributor)

This change makes sure that a takeSnapshot() call which runs at about
the same time as Shutdown() does not block indefinitely on a
ConfigurationFuture request.

The situation can happen if takeSnapshot() manages to push the request
to configurationsCh (which is buffered) but the
follower/candidate/leader loop exits before processing it, in turn
making the shutdownFuture.Error() returned by Shutdown() block
indefinitely (since it indirectly waits for takeSnapshot() to finish).

With this change the sample program attached to #268 doesn't fail
anymore.
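For illustration, here is a minimal, self-contained sketch of the deadlock shape (hypothetical, simplified names — not the real raft types): the main loop exits on shutdown without draining the buffered configurations channel, so the snapshot path blocks forever waiting for a reply.

package main

import (
	"fmt"
	"time"
)

// configFuture stands in for raft's configurationsFuture/deferError pair.
type configFuture struct {
	errCh chan error // reply channel the main loop is supposed to answer on
}

func main() {
	configurationsCh := make(chan *configFuture, 1) // buffered, as in raft
	shutdownCh := make(chan struct{})
	loopDone := make(chan struct{})

	// Stand-in for the follower/candidate/leader loop.
	go func() {
		defer close(loopDone)
		for {
			select {
			case req := <-configurationsCh:
				req.errCh <- nil // normally the request is answered here
			case <-shutdownCh:
				return // exits without draining configurationsCh
			}
		}
	}()

	// Shutdown(): the loop observes shutdownCh and exits.
	close(shutdownCh)
	<-loopDone

	// takeSnapshot(): the buffered send still succeeds, but nothing will
	// ever answer on errCh, so waiting on it (the Error() call) hangs.
	req := &configFuture{errCh: make(chan error)}
	configurationsCh <- req

	select {
	case err := <-req.errCh:
		fmt.Println("configuration received:", err)
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: request accepted but never answered")
	}
}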

mkeeler (Member) commented Aug 15, 2018

@freeekanayaka I think I understand the issue.

takeSnapshot() blocks because configurationsCh is no longer being handled by the main raft routine, while shutdown is halted because waitShutdown() waits for all the goroutines started with the state's goFunc function to finish.
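For reference, a hedged sketch of the wait pattern described above, assuming the usual sync.WaitGroup-based implementation behind goFunc/waitShutdown (the field name routinesGroup is an assumption):

package raft

import "sync"

type raftState struct {
	routinesGroup sync.WaitGroup
}

// goFunc runs f in a goroutine tracked by the state's wait group.
func (r *raftState) goFunc(f func()) {
	r.routinesGroup.Add(1)
	go func() {
		defer r.routinesGroup.Done()
		f()
	}()
}

// waitShutdown blocks until every goroutine started via goFunc returns;
// if takeSnapshot is stuck waiting on its configurations future, the
// snapshot goroutine never returns and this call hangs.
func (r *raftState) waitShutdown() {
	r.routinesGroup.Wait()
}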

mkeeler (Member) left a review comment:

Your code looks like it works, but why not do something like this:

diff --git a/future.go b/future.go
index fac59a5..448b7ed 100644
--- a/future.go
+++ b/future.go
@@ -81,6 +81,7 @@ type deferError struct {
 	err       error
 	errCh     chan error
 	responded bool
+	ShutdownCh chan struct{} 
 }
 
 func (d *deferError) init() {
@@ -97,7 +98,11 @@ func (d *deferError) Error() error {
 	if d.errCh == nil {
 		panic("waiting for response on nil channel")
 	}
-	d.err = <-d.errCh
+	select {
+	case d.err = <- d.errCh:
+	case <- d.ShutdownCh:
+		d.err = ErrRaftShutdown
+	}
 	return d.err
 }
 
diff --git a/snapshot.go b/snapshot.go
index 5287ebc..57c336e 100644
--- a/snapshot.go
+++ b/snapshot.go
@@ -146,6 +146,7 @@ func (r *Raft) takeSnapshot() (string, error) {
 	// We have to use the future here to safely get this information since
 	// it is owned by the main thread.
 	configReq := &configurationsFuture{}
+	configReq.ShutdownCh = r.shutdownCh
 	configReq.init()
 	select {
 	case r.configurationsCh <- configReq:

That diff should allow Raft to cancel an arbitrary deferError when shutting down (if its ShutdownCh is set).
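With that diff applied, the call site in takeSnapshot() would presumably behave like this (a sketch only: the shutdown case of the select and the trailing Error() call are assumed from the surrounding upstream code, not shown in the diff):

	configReq := &configurationsFuture{}
	configReq.ShutdownCh = r.shutdownCh // new: lets Error() observe shutdown
	configReq.init()
	select {
	case r.configurationsCh <- configReq:
	case <-r.shutdownCh:
		return "", ErrRaftShutdown
	}
	// Previously this wait could block forever once the main loop had
	// exited; with ShutdownCh wired up it unblocks with ErrRaftShutdown.
	if err := configReq.Error(); err != nil {
		return "", err
	}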

freeekanayaka (Contributor, Author)

@mkeeler your diff looks fine to me (although if I had to nitpick, it would probably be better to wait for at least one more use case of something like deferError.ShutdownCh before introducing such an ad-hoc API).

Please make sure that the sample program I attached to #277 does not crash anymore with your diff applied.

stale bot commented Oct 18, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation, please feel free to let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

stale bot commented Dec 17, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation, please feel free to let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

stale bot added the waiting-reply label Dec 17, 2019

freeekanayaka commented Dec 17, 2019

I don't know if it still applies; it surely did when I pushed the PR. I even attached a program to reproduce the issue (see #268), which you can use to check it. I don't use hashicorp/raft anymore these days.

stale bot removed the waiting-reply label Dec 17, 2019

Poluect commented Jan 20, 2020

Hey guys! Are there any plans to fix this issue? The example code mentioned in #268 still reproduces the issue on the current master branch. I found this bug while searching for clues on why a Nomad server sometimes freezes in our system: it seems like a goroutine responsible for dealing with allocations hangs, and I then found raft calls inside it. Not sure if this is the cause of our Nomad freezes, but at least this bug can still be reproduced. Thanks!

hanshasselberg (Member)

Will look at this soon.
