document consistency guarantees #5
Hi Tobias, There are three consistency levels that a server can operate in: low, medium, or high.
Writes are always consistent, but reads are optional based on the configured consistency level.

SummitDB uses Finn to handle all of the Raft logic; Finn in turn relies on Hashicorp Raft. The primary code that handles applying commands is nodeApplier.Apply. This function handles both read and write commands. Write commands are always sent through the Raft log. For read commands, the behavior depends on the configured consistency level.
When a server receives a command that it cannot apply because it is not the leader, that server will suggest that the client try it on the leader node (a -TRY response containing the leader's address).
This is only a suggestion based on what is known about the cluster at the time the command was attempted to be applied. If the client retries the command on the suggested server and, for whatever reason, that server is no longer the leader, then another suggestion is sent back.

I feel the solution I have in place is working well, but if you see a fault in my logic please let me know right away. I spent quite a bit of time around this specific problem. Thanks for your feedback.
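For illustration only, a client could follow the -TRY suggestion along these lines. The sendCommand helper, package name, and the bounded-retry loop are assumptions of this sketch, not SummitDB's actual client code:

```go
package summitclient

import (
	"fmt"
	"strings"
)

// sendCommand is a hypothetical transport call returning the server's raw
// reply line, e.g. "+OK", "-ERR unknown command", or "-TRY 10.0.0.2:7481".
func sendCommand(addr, cmd string) (string, error) {
	// ... network I/O elided ...
	return "+OK", nil
}

// doCommand sends cmd to addr and follows -TRY redirects, bounded by maxHops
// so a cluster in the middle of repeated leadership changes cannot redirect
// the client forever.
func doCommand(addr, cmd string, maxHops int) (string, error) {
	for i := 0; i < maxHops; i++ {
		reply, err := sendCommand(addr, cmd)
		if err != nil {
			return "", err
		}
		if strings.HasPrefix(reply, "-TRY ") {
			// The server believes another node is the leader; retry there.
			addr = strings.TrimSpace(strings.TrimPrefix(reply, "-TRY "))
			continue
		}
		return reply, nil
	}
	return "", fmt.Errorf("too many -TRY redirects for %q", cmd)
}
```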
I'm closing this issue for now. Please feel free to reopen this issue if you run into any problems regarding this topic.
Thanks @tidwall, and apologies for not getting back to you earlier. I didn't realize you were sending reads through the Raft log in high consistency mode.
The interesting bit here is that a node may think it's the leader but isn't in fact the leader. I assume that this is the case with the consistency levels below high.
That's right. When set to high, reads are sent through the Raft log just like writes.

Even in the case when a client sends a read to a node that is perceived as the leader by both the node and the client (but is in fact in the middle of a leadership change), the node will attempt to apply the command to Raft, which in turn will fail with a "not the leader" error. The node will then discover the new leader and notify the client with a -TRY response.

In the case with the lower consistency levels, reads are served without going through the log.

You ask very good questions. I should probably create a wiki page describing the process in more detail. Until then I think I'll keep this issue open in case others might find it useful.
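As a rough sketch of the routing just described (the package, type, and method names here are invented for illustration; this is not the actual Finn/SummitDB code):

```go
package consistency

import (
	"time"

	"github.com/hashicorp/raft"
)

// Level mirrors the low/medium/high setting discussed in this thread.
type Level int

const (
	Low Level = iota
	Medium
	High
)

type node struct {
	raft    *raft.Raft
	level   Level
	timeout time.Duration
}

// readLocal stands in for serving a read from the node's local state without
// touching the Raft log.
func (n *node) readLocal(cmd []byte) (string, error) {
	// ... local FSM lookup elided ...
	return "+OK", nil
}

// applyCommand routes a command: writes always go through the Raft log, and
// reads only do so when the level is High.
func (n *node) applyCommand(cmd []byte, isWrite bool) (string, error) {
	if isWrite || n.level == High {
		// Apply fails on a node that is not (or no longer) the leader,
		// which is what ultimately produces the -TRY response.
		f := n.raft.Apply(cmd, n.timeout)
		if err := f.Error(); err != nil {
			return "", err
		}
		return "+OK", nil
	}
	// Medium/low: the read skips the log. (Exactly which nodes may serve
	// such reads is an assumption of this sketch.)
	return n.readLocal(cmd)
}
```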
I forgot to respond to this:
In both cases, "dropped" and "apply multiple times" should be covered, because commands are only ever applied by going through Raft.Apply on the leader and the client is notified whenever that fails.
I'm not sure I follow. More concretely, assume the following:
If one instead allows all nodes to propose commands (by relaying to the leader), there are various situations in which the command could be dropped or applied multiple times.

These are hopefully concerns which are handled by the underlying Raft implementation (i.e. not in your code), which I haven't looked at.
The underlying implementation is Hashicorp Raft. SummitDB sends commands using Raft.Apply, which the Hashicorp docs describe as applying a command to the FSM in a highly consistent manner, and which must be run on the leader or it will fail.
This is the basic flow that SummitDB uses for handling a client command:

```go
future := raft.Apply(cmd, timeout)
if err := future.Error(); err != nil {
	if isNotLeaderError(err) {
		// Figure out who the leader is.
		respond("-TRY leader_addr")
	} else {
		respond("-ERR " + err.Error())
	}
} else {
	respond("+OK")
}
```
As I understand it, Raft.Apply only returns successfully once the command has been committed and applied on the leader.

I'm placing a fair amount of trust in the Hashicorp Raft implementation here. Hopefully I'm not misinterpreting their documentation. I haven't run into any consistency issues... yet
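A minimal version of the isNotLeaderError helper referenced in the snippet above could be written against hashicorp/raft's exported errors roughly like this; whether Finn/SummitDB implements it this way is an assumption:

```go
package raftutil

import (
	"errors"

	"github.com/hashicorp/raft"
)

// isNotLeaderError reports whether err means the command should be retried
// on the leader, as opposed to a genuine failure.
func isNotLeaderError(err error) bool {
	return errors.Is(err, raft.ErrNotLeader) ||
		errors.Is(err, raft.ErrLeadershipLost)
}
```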
Ah, I see. One caveat here is that when you receive an error, the proposal could still have applied. That is, the burden of figuring out what happened is shifted to the client. That's reasonable if that situation is rare enough. However, I think even in the case in which the leader loses leadership before responding, the command can still end up committed and applied.
Correct me if I'm wrong, but wouldn't the log entry for that command be uncommitted at the point that the leader steps down? Then when a new leader goes online, the term is incremented. The followers join the new leader and any uncommitted log entries are rolled back to match the new leader's committed log. In the meantime the client would have received an error?
Consider the following:

1. The leader appends the client's command to its log and starts replicating it to the followers.
2. Before the entry is committed, the leader loses leadership (for example it is partitioned away), and the client receives an error.
3. A follower that already received the entry wins the election. Its log is not rolled back.
4. The new leader commits entries in its new term, and the earlier entry is committed along with them, so the command is applied even though the client saw an error.
The key here is that the node which is elected leader doesn't purge its log (it can only commit previous entries when it manages to commit an entry in its own term, but that's a technicality that doesn't touch this argument at all).
Got it, and I'm in full agreement. After perusing the Raft paper and the Hashicorp implementation, it's clear that there can be fringe cases where a client sends a command to a server, receives an error, and yet the command is fully applied to the log. While rare, it's possible that a log entry may be duplicated if the client retries the command. A retry is quite often not a problem for most SummitDB commands such as SET, DEL, and GET, but there are commands such as APPEND and BITOP for which being applied more than once would be very bad.

What is certain is that when a client sends a command to a leader and the leader provides a successful response, that command is fully replicated to the cluster. So false positives don't seem to be possible. False negatives, on the other hand...

The Raft paper (pg. 13, sec. 8) suggests assigning a unique serial number to every command, which may be the way to go. I'll need to investigate the various options.

SummitDB has the FENCE command, which generates a unique token for distributed tasks. It's sorta like a log index, but for the client application. It's intended to help solve the Redis distributed locking problem, yet I can see it being helpful for generating a unique serial number for commands too.

Anyhow... there's lots to think about here. Thanks a ton for your insights and time on the matter.
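As a rough illustration of the serial-number idea from sec. 8 (not something SummitDB implements today; all names below are hypothetical), the FSM could remember the last serial applied per client and skip replays:

```go
package dedup

// command is a hypothetical log entry carrying a client-chosen ID and a
// monotonically increasing serial number, per the Raft paper's suggestion.
type command struct {
	ClientID string
	Serial   uint64
	Op       string // e.g. "APPEND key value"
}

// fsm tracks the highest serial applied for each client so that a retried
// command (same client, same serial) is applied at most once.
type fsm struct {
	lastSerial map[string]uint64
}

func newFSM() *fsm {
	return &fsm{lastSerial: make(map[string]uint64)}
}

func (f *fsm) apply(c command) {
	if c.Serial <= f.lastSerial[c.ClientID] {
		// A duplicate delivery of a command we already applied; skip it so
		// non-idempotent operations like APPEND or BITOP are not repeated.
		return
	}
	f.lastSerial[c.ClientID] = c.Serial
	// ... apply c.Op to the actual data store ...
}
```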
The leadership changes section made me nervous because it looks like there is some reliance on the node's perceived Raft status (which may not be the actual Raft status). I traced the code, and indeed on proposal the node's local Raft status is consulted to decide whether that node is the leader or not. During network partitions, however, a node may believe it is (still) the leader when in fact it isn't. To accommodate that, one usually employs a system of time-based leadership leases (which you then pay for with mandatory downtime under some scenarios such as the above), but I didn't see that here.

I haven't dug deeper, but there are likely issues in this project when reads or writes take place in this state, jeopardizing correctness. If those issues are handled anywhere, I'd appreciate a pointer.
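For completeness, one way to narrow (though not fully close) the stale-leader window for reads that skip the log is hashicorp/raft's VerifyLeader, which confirms leadership with a quorum before answering. Treating this as applicable to SummitDB, and the surrounding function, are assumptions of this sketch:

```go
package reads

import "github.com/hashicorp/raft"

// readIfVerifiedLeader serves a local (non-log) read only after confirming
// with a quorum that this node is still the leader. This narrows, but does
// not eliminate, the window in which a deposed leader serves stale data.
func readIfVerifiedLeader(r *raft.Raft, read func() (string, error)) (string, error) {
	if err := r.VerifyLeader().Error(); err != nil {
		// Not the leader (or leadership could not be verified); the caller
		// should redirect the client, e.g. with a -TRY response.
		return "", err
	}
	return read()
}
```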