-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document some Apply error cases better #561
Conversation
@otoolep does this address your question on hashicorp/consul#2893? Open to suggestions about improvements to the docs or code. I think in general though there is no way to make a stronger guarantee in all cases than this. |
Before I comment on this, I want to be as clear as I can on what is going on. I re-read the Raft paper, to refresh my memory on all this.
This implies that the leader that is about to "lose leadership" received a "true" response from at least one follower for the given log command. As a result 2 nodes of the 3 node cluster (the leader and the follower) consider the log committed. What would normally happen next is that each node would then apply the committed log to the state machine. So it sounds like what you're saying is in the narrow window of time between "commit index" being updated and "applied index" being updated, the leader loses leadership, and falls back to being follower. Is that what you're saying? If that's the case, the new leader comes up and starts interrogating nodes to determine which logs are in a majority of cluster nodes -- and finds the command we just spoke about, and then each node applies it to its own FSM. This seems reasonable. I do want to confirm that the Hashicorp implementation will only apply the command once however, to each FSM under each node, even in this scenario. Is that the case here? |
I know what I'm suspicious about. It's this statement: // If the node discovers it is no longer the leader while applying the command,
// it will return ErrLeadershipLost. "Applying the command" implies that the leader already considers the log committed. If it did not consider the log committed, it should never be in the business of applying it. So which is it? Is leadership lost during replication, or is leadership lost during application of committed logs? |
Thinking about this more, I think it would be helpful if
(the comment for OR
However this second option doesn't make sense, since every node applies every committed command to the its own FSM, and Leadership state has nothing to do with applying to the FSM. Of course, the Raft function I think the fundamental doc change for this should be something as follows: // ErrLeadershipLost is returned when a leader cannot confirm that a log entry was
// committed because it's been deposed while waiting for confirmation from the Followers
// that the log was replicated. When this error is returned a client cannot be certain if the log
// was replicated successfully or not. See section 8 of the Raft paper to read more about this
// type of scenario.
ErrLeadershipLost = errors.New("leadership lost while committing log") This whole situation seems similar to the following situation: a client is talking to a Raft system, the Leader commits and applies the request successfully, but the response from the Leader back to the client is dropped (perhaps the Leader crashes just after applying the log). The client is truly uncertain about whether the request was successful. This case is called out in the Raft paper (see section 8). But the most important thing is that the system remains safe and correct, and each node will always agree on the committed log entries in the log. |
Related issues on the rqlite GitHub repo: rqlite/rqlite#1123 |
Thanks for the detailed thoughts @otoolep. I think you've called out a few other good points of confusion. I'll see if I can pick out the questions in what you posted first and then we can see what the best approach is here.
Hmm not quite, it's actually a bit more subtle than that. For example a leader might write to it's disk and then replicate a log to 2 followers, they might both also write to disk - at this point the log is "committed" in the cluster, but the leader doesn't yet know it. If the leader is partitioned from both followers right after it sent the AppendEntries message it will not know that the log is actually committed or not, but it can't just keep waiting forever to find out because then the cluster won't make progress when there are network partitions. So it has to return an error that says "i'm not the leader any more, but I don't know if this operation was committed or not.". If it does receive at least one (assuming 3 node cluster) ack for the AppendEntries then it does know that it's committed and so it could (but doesn't currently) return a different error. But the thing is it's still not been a successful write in the sense that a subsequent read to the leader only might not see the result of the write as its not yet been applied to the FSM and we rely on the "leader lease" optimization to allow read-after-write consistency on the leader without read always going to a quorum of nodes.
Almost. You described one particular case but the other is like I mentioned above where the leader might not even have observed enough acks to know it was committed yet at the point it has to step down. We could treat these all differently but so far haven't seen a case where the client could make use of that distinction in our implementations. We still can't say it was a "succesful" write because we define that to be one which can immediately be observed by reading the result from the leader which is only true after the FSM apply has happened as well.
Yes Raft guarantees that each op that is committed in the cluster will be applied exactly one to each FSM and in the same order. What we don't make any guarantees about though is if you client gets and
Ah yes, I see why this is confusing. I was going to explain but you already figured it out:
We use
So this ^ is using "applying" in the more general sense of "has called
I think this is the crux of the issue. Thanks for calling it out explicitly. The thing is, right now it means (almost) any one of the situations you described and some others. Technically it means "this request was accepted from the chan by the leader loop but some time before it was fully committed and applies to the FSM we lost leadership. It may or may not have been committed. If committed it may or may not have been applied to the new leader's FSM yet, if not committed it may or may not commit in the future and later be applied to the new leader's FSM". I think it would be possible to distinguish some of these cases but I'm not convinced it's a good idea for the reasons hinted at above - even if the leader has not seen it commit yet, that doesn't mean it's not committed so the client can't safely take a different action in that case anyway. In fact, even if it's not technically committed (because no follower wrote it to disk yet) it might still be in the future (because at least one follower has the packets in it's recv buffer and will process them before noticing the connection has been closed). Similarly, if it has committed but hasn't been applied to the leader's FSM yet (since that is asynchronous from updating the commit index) we could return an error that indicates that we know it's committed but this is not the same as "OK" since no node has actually applied it to FSM so a subsequent read to this old leader or the new leader might not reflect the change. About the only useful distinction might be a client that doesn't care about reading the value back right away as long as it's been persistent and will eventually be available on all correct replicas, could know not to retry... But since all clients might need to retry for other cases and will need to correctly handle both committed and uncommitted cases, I don't think it would ever be useful to actually have different behaviour in that one case. Do you see cases where that is important and would be a significant advantage to clients?
Right applying to the FSM is not something that is exclusive to leaders and I think would not stop due to loss of leadership. Once the leader processes enough responses to know it's committed it will pushed the log onto the FSM apply chan before it gets a chance to loop again and decide to step down. In that case when it does step down it still doesn't know if the FSM actually applied it or not, but it will certainly and likely even before this node becomes a follower and starts replication from the new leader. Again, it's hard to reason about how a client would make good use of that information with all these subtleties!
Yep exactly that. Great call out. I really like your proposed wording. I'll update this PR to use that. |
That's what I did. Since I only hit this issue in my testing, I changed my test data so that the inserted records include explicit primary keys. When this error occurs, my test resends the record. Thanks to the use of explicit primary keys, all inserts are now idempotent, and my test case ( |
It isn't clear what
ErrLeadershipLost
means in terms of write success or failure. This aims to clarify that it's indeterminate and there is nothing stronger we can say about that in general.It could be possible to avoid this case in some types of leadership loss or shutdown, but we can't make a stronger guarantee in all cases so there will always be at least some cases where the result is "we don't know if this committed or not".
For example, if the leader in a three node cluster has been partitioned from both followers, it will eventually lose leadership. When it does, any in-flight commands are in an indeterminate state. If they have been received and written to disk by at least one follower, then it is likely that that follower will become the new leader and "commit" them in it's new term. If none of the followers has received the request before the partition then the write will be lost completely. Since the partitioned leader can't communicate with the followers, it can't find out which outcome is right and so must return
ErrLeadershipLost
which implies "I don't have enough information to say whether this committed or not but I can't make progress as I'm not the leader any more".