Raft.Stats call takes too long because of unstable AppendEntries.StoreLogs latency #302

Closed
maksm90 opened this issue Dec 26, 2018 · 1 comment · Fixed by #379

maksm90 commented Dec 26, 2018

I have run into a problem on virtualized cloud storage: the Raft.Stats RPC called from Consul periodically stalls. As a consequence, the Consul leader logs messages about unhealthy followers, even though the cluster itself remains safe.

Initial investigation revealed that Raft.Stats stalls while fetching the node's Raft configuration (the ConfigurationFuture wrapper), which is served by the follower loop (through the configurationsCh channel inside the runFollower routine). As far as I can tell, every request to a follower, including heartbeats and Raft RPCs, is handled sequentially. RPC timing metrics showed that the appendEntries RPC, or more precisely its storeLogs stage, has a wide latency spread. This is caused by unstable latency when syncing logs to persistent storage (the fdatasync syscall in the BoltDB storage backend). A separate measurement of fdatasync latency confirmed this hypothesis.

This is a problem on the cloud provider's side. But the issue exposes a weakness in the architecture: lightweight read requests to a follower have to wait for blocking ones. What about non-blocking reads served from a snapshot taken before the logs are committed (as in an MVCC scheme)? Is that possible, and could it be implemented?
