document consistency guarantees #5

tbg · 2016-10-20T09:30:50Z

The leadership changes section made me nervous because it looks like there is some reliance on the nodes perceived Raft status (which may not be the Raft status). I traced the code and indeed on proposal, the node's local Raft status is determined to decide whether that node is the leader or not. During network partitions however, a node may believe it is (still) the leader when in fact it isn't. To accommodate that, one usually employs a system of time-based leadership leases (which you then pay for with mandatory downtime under some scenarios as the above), but I didn't see that here.

I haven't dug deeper, but there are likely issues in this project when reads or writes take place in this state, jeopardizing correctness. If those issues are handled anywhere, I'd appreciate a pointer.

tidwall · 2016-10-20T10:58:00Z

Hi Tobias,

There are three consistency levels that a server can operate in: low, medium, or high.

In low, all reads are served regardless of leadership status.
In medium, only the leader serves reads, but there is a soft leader check. This can cause a read to be stale in small cases, like during a leadership change.
In high, the leader will apply the read command through the raft log just like a write, except the command will contain only enough data to increment the log index. This ensures that the read happens in sync with any write command and that there's a consensus prior to responding to the client.

Writes are always consistent, but reads are optional based on the --consistency flag. By default SummitDB sets this flag to high.

SummitDB uses Finn to handle all raft logic, Finn in turn relies on Hashicorp Raft.

The primary code that handles applying commands is done by nodeApplier.Apply. This function handles read or write commands. Write commands are always sent through the raft log. For read commands the mutate param is nil and the logic is then handed by Node. raftLevelGuard. This function determines how to handle the read based on the required consistency level. When the level is High the read is sent through the log. not the entire read command is sent through the log though, only enough data to increment the log index

The leadership changes section made me nervous because it looks like there is some reliance on the nodes perceived Raft status (which may not be the Raft status).

When a server receives a command that it cannot apply because it is not the leader, that server will suggest that the client try it on the leader node:

> SET x y
-TRY 127.0.0.1:7481

This is only a suggestion based on what is known about the cluster at the time the command was attempted to be applied.

When the client retries the command on the suggested server, and for whatever reason that that server is no longer the leader, then another TRY response will occur. This could go on until the correct leader handles the command. In most cases a TRY response would happen infrequently and typically following leadership changes.

I feel the solution I have in place is working well, but if you see a fault in my logic please let me know right away. I spent quite a bit of time around this specific problem.

Thanks for your feedback.

tidwall · 2016-11-05T02:44:48Z

I'm closing this issue for now. Please feel free to reopen this issue if you run into any problems regarding this topic.

tbg · 2016-11-05T10:01:57Z

Thanks @tidwall and apologies for not getting back to you earlier. I didn't realize you were sending reads through the Raft log in high, but that makes sense.

When a server receives a command that it cannot apply because it is not the leader, that server will suggest that the client try it on the leader node:

The interesting bit here is that a node may think it's the leader but isn't in fact the leader. I assume that this is the case with medium for which you may serve a stale read. For high this shouldn't matter, correct? Even if you propose to Raft from the non-leader (who thinks it's the leader) the command will be assigned a log index. I'm asking because that case is sometimes subtle depending on how the implementation works. Nodes need to hold on to pending proposals and periodically resubmit them until they show up in a log entry, which then necessitates handling of entry duplication (only relevant for non-idempotent writes). If that doesn't happen, proposals could be dropped (i.e. they never show up in the Raft log) or apply multiple times (giving unexpected results that the client didn't ask for). Anyway, you've addressed my fundamental concerns, thanks for that.

tidwall · 2016-11-05T10:45:44Z

I assume that this is the case with medium for which you may serve a stale read. For high this shouldn't matter, correct?

That's right. When set to high the leader increments the log index for all read commands ensuring no stale reads.

Even in the case when a client sends a read to a node that is perceived the leader by both the node and the client (but in fact is in the middle of a leadership change), the node will attempt to apply the command to Raft, which in turn will fail with "not the leader" error. The node will then discover the new leader and notify the client with a -TRY response.

In the case with medium, being perceived as the leader by both the node and client is enough to process the read.

You ask very good questions. I should probably create a wiki page describing the process in more detail. Until then I think I'll keep this issue open in case others might find it useful.

tidwall · 2016-11-05T10:59:01Z

I forgot to respond to this:

Nodes need to hold on to pending proposals and periodically resubmit them until they show up in a log entry, which then necessitates handling of entry duplication (only relevant for non-idempotent writes). If that doesn't happen, proposals could be dropped (i.e. they never show up in the Raft log) or apply multiple times (giving unexpected results that the client didn't ask for).

In both cases "dropped" and "apply multiple times" should be covered because high requires a new log index for all read/write commands prior to responding to the client.

tbg · 2016-11-05T11:53:10Z

In both cases "dropped" and "apply multiple times" should be covered because high requires a new log index for all read/write commands prior to responding to the client.

I'm not sure I follow. More concretely, assume the following:

node 1 thinks it's the leader, but it isn't (i.e. a majority of the cluster is at a higher term and has elected a different leader)
a client wants to execute Increment(key1) on node 1 at consistency level high.
node 1 proposes the command. What that means exactly depends on the implementation. I would assume it adds it to its local log and tries to send appends to what it still thinks are its followers.
its appends are denied, node 1 learns of the real leader, and it then discards its uncommitted log (including the proposal our client is waiting for).
client never hears back about its proposal, unless there's some clever mechanism that notifies the client when the proposal is dropped (i.e. it would receive a TRY message).

If one instead allows all nodes to propose commands (by relaying to the leader) there are various situations in which the Increment could end up in the logs multiple times.

These are hopefully concerns which are handled by the underlying Raft implementation (i.e. not in your code), which I haven't looked at.

tidwall · 2016-11-05T13:27:53Z

node 1 proposes the command. What that means exactly depends on the implementation. I would assume it adds it to its local log and tries to send appends to what it still thinks are its followers.

The underlying implementation is Hashicorp Raft. SummitDB sends commands using Raft.Apply, which the Hashicorp doc state:

"Apply is used to apply a command to the FSM in a highly consistent manner. This returns a future that can be used to wait on the application. An optional timeout can be provided to limit the amount of time we wait for the command to be started. This must be run on the leader or it will fail."

This is the basic flow that summitdb uses for handling a client command:

future := raft.Apply(cmd, timeout)
if err := future.Error(); err != nil{
    if isNotLeaderError(err){
        // figure out who the leader is.
        respond("-TRY leader_addr")
    } else{
        respond("-ERR "+err.Error())
    }
} else{
    respond("+OK")
}

client never hears back about its proposal, unless there's some clever mechanism that notifies the client when the proposal is dropped (i.e. it would receive a TRY message).

As I understand it Raft.Apply should gracefully handle the dropped command scenario that you describe. ApplyFuture.Error() is then immediately called and it waits for Raft to fully process the command. The godoc states:

"ApplyFuture is used for Apply() and can returns the FSM response."

I'm placing a fair amount of trust on the Apply function as it specifically states: "Apply is used to apply a command to the FSM in a highly consistent manner" and "This must be run on the leader or it will fail".

Hopefully I'm not misinterpreting their documentation. I haven't run into any consistency issues... yet

tbg · 2016-11-05T13:42:01Z

Ah, I see. One caveat here is that when you receive an error, the proposal could still have applied. That is, the burden of figuring out what happened is shifted to the client. That's reasonable if that situation is rare enough. However, I think even in the case in which -TRY is returned the command could have applied (and the client would follow the advice to retry and apply it again). This (or something like this) would happen if the leader manages to append the command on some followers, but then steps down and can't complete the proposal, unless the Raft implementation gets very fancy (I don't think so, but haven't checked).

tidwall · 2016-11-05T14:17:08Z

This (or something like this) would happen if the leader manages to append the command on some followers, but then steps down and can't complete the proposal

Correct me if I'm wrong, but wouldn't the log entry for that command be uncommitted at the point that the leader steps down? Then when a new leader goes online, the term is incremented. The followers join the new leader and any uncommitted log entries are rolled backed to match the new leaders committed log. In the meantime the client would have received an error?

tbg · 2016-11-07T00:35:17Z

Consider the following:

Node 1 is leader in Term 11 and successfully sends appends to the other two nodes (but does not hear back)
Node 2 campaigns and wins the election for Term 2 (all nodes have the same log, in particular containing the just-received entry). In particular, Node 1 votes for Node 2 (and will presumably tell the client "I'm not the leader")
Node 3 commits a new command (which in particular also commits the first entry by leader completeness property)

The key here is that the node which is elected leader doesn't purge its log (it can only commit previous entries when it manages to commit an entry in its own term, but that's a technicality that doesn't touch this argument at all).

tidwall · 2016-11-07T16:24:36Z

Got it and I'm in full agreement. After perusing the Raft paper and the Hashicorp implementation it's clear that there can be fringe cases where a client sends a command to a server, receive an error, and yet the command is fully applied to the logs. While rare it's possible that a log entry may be duplicated if the client retries the command.

A retry is quite often not a problem for most summitdb commands such as SET, DEL, GET. But there are commands such as APPEND and BITOP that it would be very bad if it were to be applied more than once.

Diego Ongaro states:

the only outcomes that clients see are "don't know" and "completed"

What is certain is that when a client sends a command to a leader and the leader provides a successful response, that command is fully replicated to the cluster. So false-positives don't seem to be possible.

False-negatives on the other hand... The raft paper (pg. 13, sec. 8) suggests serializing every command. Which may be the way to go. I'll need to investigate the various options.

SummitDB has the FENCE command which generates a unique token for distributed tasks. It's sorta like a log index, but for the client application. It's intended to help solve the Redis distributed locking problem. Yet I can see it as being helpful for generating a unique serial number for commands too.

Anyhow... there's lots to think about here.

Thanks a ton for your insights and time on the matter.

tbg mentioned this issue Oct 20, 2016

Implemention of Redis Like Datastructures cockroachdb/cockroach#587

Closed

tidwall closed this as completed Nov 5, 2016

tidwall reopened this Nov 5, 2016

tidwall added the enhancement label Nov 5, 2016

tbg changed the title ~~invalid assumptions about leader status~~ document consistency guarantees Nov 5, 2016

glycerine mentioned this issue Jan 19, 2017

length option to FENCE, to support batch updates and dedup #14

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

document consistency guarantees #5

document consistency guarantees #5

tbg commented Oct 20, 2016

tidwall commented Oct 20, 2016 •

edited

Loading

tidwall commented Nov 5, 2016

tbg commented Nov 5, 2016 •

edited

Loading

tidwall commented Nov 5, 2016 •

edited

Loading

tidwall commented Nov 5, 2016

tbg commented Nov 5, 2016

tidwall commented Nov 5, 2016

tbg commented Nov 5, 2016

tidwall commented Nov 5, 2016 •

edited

Loading

tbg commented Nov 7, 2016 •

edited

Loading

tidwall commented Nov 7, 2016

document consistency guarantees #5

document consistency guarantees #5

Comments

tbg commented Oct 20, 2016

tidwall commented Oct 20, 2016 • edited Loading

tidwall commented Nov 5, 2016

tbg commented Nov 5, 2016 • edited Loading

tidwall commented Nov 5, 2016 • edited Loading

tidwall commented Nov 5, 2016

tbg commented Nov 5, 2016

tidwall commented Nov 5, 2016

tbg commented Nov 5, 2016

tidwall commented Nov 5, 2016 • edited Loading

tbg commented Nov 7, 2016 • edited Loading

tidwall commented Nov 7, 2016

tidwall commented Oct 20, 2016 •

edited

Loading

tbg commented Nov 5, 2016 •

edited

Loading

tidwall commented Nov 5, 2016 •

edited

Loading

tidwall commented Nov 5, 2016 •

edited

Loading

tbg commented Nov 7, 2016 •

edited

Loading