Deadlock in consul server between leadership reconcile channel and barrier write #3230

Closed
preetapan opened this issue Jul 5, 2017 · 1 comment

preetapan commented Jul 5, 2017

consul version for both Client and Server

Client: 0.8.5
Server: 0.8.5

Operating system and Environment details

Tested on Linux, but should be reproducible in other environments.

Description of the Issue (and unexpected/desired result)

Reproduction steps

I found this when working on #1744.

Run a Consul server with bootstrap-expect=1 and make it lose leadership by filling up the disk (exhausting any other resource should work too, not just disk). After a few iterations, leader election messages stop being logged, and debug/pprof/goroutine?debug=1 should show the deadlocked goroutines.

When the node gives up leadership because of a disk write error, Raft's runLeader waits for the reconcile channel to be free so it can write to it. Consul's monitorLeadership method cannot read from the reconcile channel to drain it, because it's waiting for the barrier write to finish. The barrier write blocks in turn: even though the node's state is already set to follower, the apply channel containing the barrier write is only processed once the runLeader loop finishes and the runFollower goroutine takes over. Thus the deadlock!
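For anyone reproducing this without HTTP access to the agent, a goroutine dump of the same kind can be captured in-process with the standard `runtime/pprof` package. This is a minimal sketch, not Consul code: `goroutineDump` is a hypothetical helper name, and the parked goroutine in `main` only exists so the report has something to show.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump (hypothetical helper) renders the "goroutine" profile as
// text with debug level 1, the same level requested by ?debug=1.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Park one goroutine on a channel receive so it appears in the dump.
	blocked := make(chan struct{})
	go func() { <-blocked }()

	dump, err := goroutineDump()
	if err != nil {
		fmt.Println("dump failed:", err)
		return
	}
	fmt.Print(dump)
}
```

In a deadlocked server, the stacks to look for are the ones parked in a channel send or receive inside the leader loop and the leadership monitor.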

I discussed solution ideas with @slackpad, and the simplest thing to do is to not let the barrier write block forever. It should time out after a conservative period, and the caller should return rather than keep waiting if there is a barrier error.

Log Fragments or Link to gist

(see goroutine dump attachment)

@preetapan preetapan added the type/bug Feature does not function as expected label Jul 5, 2017
@preetapan preetapan self-assigned this Jul 5, 2017
@preetapan
Contributor Author

goroutine.txt

@preetapan preetapan added this to Internal Cleanup in Consul 0.9.0 Jul 6, 2017
@slackpad slackpad added this to Internal Cleanup in Consul 0.9.0 Jul 7, 2017
@slackpad slackpad moved this from Internal Cleanup to Done in Consul 0.9.0 Jul 7, 2017
slackpad added a commit to hashicorp/nomad that referenced this issue Oct 17, 2017
There was a deadlock issue we fixed under hashicorp/consul#3230,
and then discovered a follow-up issue under hashicorp/consul#3545. This
PR ports over those fixes, and also makes the revoke actions happen only if leadership was
established. This brings the Nomad leader loop in line with Consul's.