[FIXED] Clustering: leadership acquired actions could get stuck #1287
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
If a leadership changed occurred while leadership actions were
executed, before the raft.Barrier() call was made, the server
would be stuck in that call. This is because RAFT library
notifies the Streaming server code that a leadership changed
through a go channel that was just of size 1. Since the
streaming server read from the channel and then executes
the leadership acquired code, it could not read from the
notification channel that caused the RAFT library to block
on a go channel send, which then made the Barrier() call
block.
I believe the right approach is to have a bigger notification
go channel instead of making Barrier() time out. If it does
timeout, the server should then transfer leadership, which
I am afraid could cause a cascading effect if all servers
getting elected need longer that the chosen timeout to
apply all the preceding entries to the FSM.
Signed-off-by: Ivan Kozlovic ivan@synadia.com