Barrier write timeout can cause permanently degraded leader #3545

Closed
slackpad opened this issue Oct 5, 2017 · 2 comments · Fixed by #3546

slackpad commented Oct 5, 2017

We encountered a case where a user's Consul leader ran into a network issue that lasted several minutes. Over the span of about 20 minutes there were errors queueing writes, with messages like consul.coordinate: Batch update failed: timed out enqueuing operation. Finally, one of these happened during a leader reconcile loop and we saw consul: failed to wait for barrier: timed out enqueuing operation.

Since the leader had been established for several days, this hit the return statement in the error handling logic here:

https://github.com/hashicorp/consul/blob/v1.0.0-beta2/agent/consul/leader.go#L97-L101

Which hit the deferred revokeLeadership() call here:

https://github.com/hashicorp/consul/blob/v1.0.0-beta2/agent/consul/leader.go#L111-L116

The result was that the leader loop exited and the consistent read flag got reset, so the application started getting "Not ready to serve consistent reads" errors. Since the server never actually lost leadership, it was stuck in this state; restarting the leader fixed it. This was found in 0.9.2, but the code is the same in master today.
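
To make the failure mode concrete, here is a minimal, self-contained sketch of the shape of that code path. The types and names (fakeServer, barrierWrite, and so on) are illustrative stand-ins rather than the actual Consul source:

```go
// Minimal, self-contained sketch of the failure mode described above.
package main

import (
	"errors"
	"fmt"
)

type fakeServer struct {
	consistentReads bool // stands in for the leader's consistent-read flag
	isRaftLeader    bool // Raft's own view of leadership
}

// revokeLeadership resets leader-only state; after this, consistent reads
// fail with "Not ready to serve consistent reads".
func (s *fakeServer) revokeLeadership() {
	s.consistentReads = false
}

// barrierWrite simulates the barrier write timing out during the network blip.
func barrierWrite() error {
	return errors.New("timed out enqueuing operation")
}

// leaderLoop mirrors the shape of the buggy code: a deferred revoke plus an
// early return when the barrier write fails.
func (s *fakeServer) leaderLoop() {
	defer s.revokeLeadership()

	if err := barrierWrite(); err != nil {
		fmt.Println("failed to wait for barrier:", err)
		return // the leader loop exits, but Raft never demotes this server
	}
	s.consistentReads = true
	// ... reconcile loop would run here ...
}

func main() {
	s := &fakeServer{isRaftLeader: true}
	s.leaderLoop()
	fmt.Printf("still Raft leader: %v, serving consistent reads: %v\n",
		s.isRaftLeader, s.consistentReads)
}
```

Running this prints still Raft leader: true, serving consistent reads: false, which is the degraded state described above: the deferred revoke tears down leader-only state even though Raft never demoted the server.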

The change that added the return was #3230, which was working under the assumption that exiting the leader loop would cause the server to step down. In this unfortunate set of circumstances, the network errors were such that we hit the barrier write error, but we were still able to service the Raft requirements to stay leader. This is a bit of a manifestation of having a separate path for pipelined Raft replication and a separate path for the AE messages that maintain leadership (I think), so it stems from our implementation.

In any case, the right thing to do here is to keep the timeout part of the deadlock fix from #3230, but to do the old goto WAIT instead of the return. This would have let the leader make another attempt at the barrier later, and would have kept the leadership-tracking goroutine running.
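
A stand-alone sketch of that retry shape, assuming an illustrative leaderLoop with RECONCILE and WAIT labels; the retry interval and the simulated barrier failures are made up for the example, and this is not the exact Consul code:

```go
// Sketch: on a barrier timeout, retry via goto WAIT instead of returning.
package main

import (
	"errors"
	"fmt"
	"time"
)

var attempts int

// barrierWrite fails a couple of times, simulating the transient network
// issue, then succeeds once connectivity recovers.
func barrierWrite() error {
	attempts++
	if attempts < 3 {
		return errors.New("timed out enqueuing operation")
	}
	return nil
}

func leaderLoop(stopCh <-chan struct{}) {
RECONCILE:
	// Keep the timeout on the barrier write (the deadlock fix from #3230),
	// but retry later instead of abandoning the leader loop.
	if err := barrierWrite(); err != nil {
		fmt.Println("failed to wait for barrier:", err)
		goto WAIT
	}
	fmt.Println("barrier succeeded, leadership established")
	return

WAIT:
	select {
	case <-stopCh:
		return
	case <-time.After(50 * time.Millisecond): // interval is illustrative
		goto RECONCILE
	}
}

func main() {
	leaderLoop(make(chan struct{}))
}
```

The point is that a barrier timeout now sends the loop back to WAIT (where it can still observe the stop channel), so the leader gets another chance at the barrier instead of tearing down its leader-only state while Raft still considers it the leader.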

There's no good API right now to ask Raft to step down, so later we might want to consider adding one; if the leadership-tracking goroutine hasn't made progress, we could then voluntarily step down. For now, the simple fix outlined above should be sufficient, deferring to Raft's own health checking. There are no other return statements of this type in leaderLoop().

slackpad added this to the 1.0 milestone Oct 5, 2017
slackpad added a commit that referenced this issue Oct 6, 2017
There were two issues here. First, we needed to not exit when there
was a timeout trying to write the barrier, because Raft might not
step down, so we'd be left as the leader but having run all the step
down actions.

Second, we didn't close over the stopCh correctly, so it was possible
to nil that out and have the leaderLoop never exit. We close over it
properly AND sequence the nil-ing of it AFTER the leaderLoop exits for
good measure, so the code is more robust.

Fixes #3545
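
To illustrate the second point, here is a stand-alone sketch of the pattern the commit message describes: the stop channel is passed to the leader-loop goroutine as an argument so the goroutine does not close over the mutable variable, and that variable is only nil-ed out after the leader loop has fully exited. The names (monitorLeadership, weAreLeaderCh) follow the surrounding discussion, but the code is illustrative rather than the actual Consul implementation:

```go
// Sketch of safe stop-channel handling for the leader loop.
package main

import (
	"fmt"
	"sync"
	"time"
)

func leaderLoop(stopCh chan struct{}) {
	<-stopCh // block until leadership is revoked
	fmt.Println("leaderLoop exited")
}

func monitorLeadership(notifyCh <-chan bool) {
	var weAreLeaderCh chan struct{}
	var leaderLoopWG sync.WaitGroup
	for isLeader := range notifyCh {
		if isLeader {
			if weAreLeaderCh != nil {
				continue // already running a leader loop
			}
			weAreLeaderCh = make(chan struct{})
			leaderLoopWG.Add(1)
			// Pass the channel as an argument so the goroutine holds its own
			// reference instead of capturing the mutable weAreLeaderCh variable.
			go func(ch chan struct{}) {
				defer leaderLoopWG.Done()
				leaderLoop(ch)
			}(weAreLeaderCh)
		} else {
			if weAreLeaderCh == nil {
				continue
			}
			close(weAreLeaderCh)
			// Only nil the variable AFTER the leader loop has fully exited.
			leaderLoopWG.Wait()
			weAreLeaderCh = nil
		}
	}
}

func main() {
	notifyCh := make(chan bool)
	go monitorLeadership(notifyCh)
	notifyCh <- true  // gain leadership
	notifyCh <- false // lose it; the leader loop is stopped and awaited
	time.Sleep(100 * time.Millisecond)
}
```
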
slackpad added a commit that referenced this issue Oct 6, 2017
* Fixes handling of stop channel and failed barrier attempts.

* Cleans up based on code review feedback.

* Tweaks comments.

* Renames variables and removes comments.
slackpad added a commit to hashicorp/nomad that referenced this issue Oct 17, 2017
There was a deadlock issue we fixed under hashicorp/consul#3230, and then discovered an issue with that fix under hashicorp/consul#3545. This PR ports over those fixes, and also makes the revoke actions happen only if leadership was established. This brings the Nomad leader loop in line with Consul's.

cshabi commented Oct 31, 2017

I would like to upgrade my 0.7.5 cluster to version 0.9.3.
Is it possible to provide a separate package for that version with a fix for this bug?

@slackpad

Hi @cshabi, we don't have plans to backport this fix, so you'd need to cherry-pick it into a local branch and build Consul.
