Summary:
- Fixed:
- 63e69b9a restore replication progress when a leader starts up (#884).
- c7725c79 Do not report snapshot.last_log_id to metrics until snapshot is finished building/installing.
- f469878c AsyncReadExt::read_buf() only reads at most 2MB per call.
- 2c715d6e End `tick_loop()` when the receiver is gone; by Ivan Schréter
Detail:
-
Fixed: 63e69b9a restore replication progress when a leader starts up (#884); by 张炎泼; 2023-06-29
As a leader, the replication progress to itself should be restored upon startup.
And if this leader is the only node in a cluster, it should re-apply all of the logs to state machine at once.
- Fix: #883
- Fixed: c7725c79 Do not report snapshot.last_log_id to metrics until snapshot is finished building/installing; by 张炎泼; 2023-10-18
Before this commit, `RaftMetrics.snapshot` contained the last log id of a snapshot that was about to be installed. Therefore there was a chance that the metrics were updated but the store was not.
In this commit, `RaftMetrics.snapshot` is only updated when a snapshot has finished building or installing, by adding a new field `snapshot` to `IOState` for tracking persisted snapshot metadata.
- Fix: #912
- Fixed: f469878c AsyncReadExt::read_buf() only reads at most 2MB per call; by 张炎泼; 2023-11-08
When streaming a snapshot chunk, it should repeatedly call `read_buf()` until `snapshot_max_chunk_size` is full or EOF is reached.
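The read-until-full loop can be sketched as follows. This is a minimal synchronous analogue that uses `std::io::Read` instead of tokio's `AsyncReadExt`; the function name `read_chunk` and its parameters are assumptions made for illustration and are not Openraft's code.

```rust
use std::io::{self, Read};

/// Reads from `src` until `max_chunk_size` bytes are collected or EOF is hit.
fn read_chunk<R: Read>(src: &mut R, max_chunk_size: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; max_chunk_size];
    let mut filled = 0;
    while filled < max_chunk_size {
        // A single read may return fewer bytes than requested, so keep going.
        let n = src.read(&mut buf[filled..])?;
        if n == 0 {
            break; // EOF
        }
        filled += n;
    }
    buf.truncate(filled);
    Ok(buf)
}
```

A reader that returns short reads still produces full chunks, and only the final chunk may be shorter than `max_chunk_size`.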
- Fixed: 2c715d6e End `tick_loop()` when the receiver is gone.; by Ivan Schréter; 2023-11-13
Currently, `tick_loop()` keeps printing the following trace message on every tick, even when the receiver (the Raft main loop) is gone:
`INFO openraft::core::tick: .../tick.rs:70: Tick fails to send, receiving end quit: channel closed`
If the tick message fails to send, terminate the loop, since every future message will fail to send as well.
Also adjust the trace message to better describe what happened.
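The idea of stopping on the first failed send can be sketched with a standard-library channel. This is an illustration only: the real loop is asynchronous and timer driven, and the signature below (including `max_ticks`) is an assumption.

```rust
use std::sync::mpsc::Sender;

/// Sends up to `max_ticks` ticks and returns how many were delivered.
fn tick_loop(tx: Sender<u64>, max_ticks: u64) -> u64 {
    let mut sent = 0;
    for i in 0..max_ticks {
        // `send` fails only when the receiver has been dropped, which is permanent,
        // so there is no point in retrying on the next tick.
        if tx.send(i).is_err() {
            break;
        }
        sent += 1;
    }
    sent
}
```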
Summary:
- Added:
- e39da9f0 define custom `Entry` type for raft log.
- 87d62d56 add feature flag storage-v2 to enable `RaftLogStorage` and `RaftStateMachine`.
- 229f3368 add backoff strategy for unreachable nodes.
- f0dc0eb7 add `RaftMetrics::purged` to report the last purged log id.
- 37e96482 add `Wait::purged()` to wait for purged to become the expected.
- 0b419eb6 add `RaftMetrics.vote`, `Wait::vote()`.
- ee7b8853 new `RaftNetwork` API with argument `RPCOption`.
- 1ee82cb8 `RaftNetwork::send_append_entries()` can return `PartialSuccess`.
- e9eed210 leader lease.
- 269d221c add `SnapshotPolicy::Never`.
- 9e7195a1 add `Raft::purge_log()`.
- Improved:
- dbac91d5 send `AppendEntries` response before committing entries.
- 6769cdd0 move state machine operations to another task.
- dcd18c53 getting a snapshot does not block `RaftCore` task.
- 6eeb5246 reduce rate to flush metrics.
- 1f3bf203 create a channel `notify` specifically for internal messages.
- 54154202 move receiving snapshot chunk to sm::Worker.
- fa4085b9 build snapshot in another task.
- 47048a53 `IntoIterator::IntoIter` should be `Send`.
- 8dae9ac6 impl `Clone` for `Raft` does not require `Clone` for its type parameters.
- Fixed:
- cd31970d trait `RaftLogId` should be public.
- 26dc8837 `ProgressEntry::is_log_range_inflight()` checks a log range, not a log entry.
- 04e40606 if the application does not persist snapshot, build a snapshot when starting up.
- 3433126c `compat07::SnapshotMeta` should decode v08 `SnapshotMeta`.
- b97edb49 incorrect debug level log.
- d012705d replication should be able to shutdown when replicating snapshot to unreachable node.
- f505d7e6 `Raft::add_learner()` should block forever.
- Changed:
- a92499f2 `StoreBuilder` does not need to run a test, it only needs to build a store.
- 6e9d3573 remove `Clone` from trait `AppData`.
- 285e6225 instead of a slice, `RaftStorage::append_to_log()` now accepts an `IntoIterator`.
- e0569988 remove unused trait `RaftStorageDebug`.
- 88f947a6 remove defensive check utilities.
- eaf45dfa move `RaftStateMachine` out of `RaftStorage`.
- 9f8ae43e `RaftMetrics.replication` type to `BTreeMap<NodeId, Option<LogId>>`.
- 84539cb0 move snapshot type definition from storage traits to `RaftTypeConfig`.
- e78bbffe remove unused error `CommittedAdvanceTooMany`.
Detail:
- Added: e39da9f0 define custom `Entry` type for raft log; by 张炎泼; 2023-03-16
This commit introduces a new feature that allows applications to define a custom type for Raft log entries in `RaftTypeConfig`. By setting `Entry = MyEntry`, where `MyEntry` implements the `RaftEntry` trait, an application can define its own log entry type that reflects its architecture. The default implementation, the `Entry` type, is still available.
This change provides more flexibility for applications to tailor the Raft log entry to their specific needs.
- Fix: #705
- Changes: `RaftStorage::append_to_log()` and `RaftStorage::apply_to_state_machine()` accept a slice of entries instead of a slice of entry references.
- Added: 87d62d56 add feature flag storage-v2 to enable `RaftLogStorage` and `RaftStateMachine`; by 张炎泼; 2023-04-19
`storage-v2`: enables `RaftLogStorage` and `RaftStateMachine` as the v2 storage. This is a temporary feature flag and will be removed when v2 storage is stable. This feature disables `Adapter`, which lets a v1 storage be used as v2. V2 storage separates the log store and the state machine store so that log IO and state machine IO can be parallelized naturally.
- Added: 229f3368 add backoff strategy for unreachable nodes; by 张炎泼; 2023-04-21
Implements a backoff strategy for temporarily or permanently unreachable nodes. If the `Network` implementation returns an `Unreachable` error, Openraft will back off for a while before sending the next RPC to this target. This mechanism prevents flooding the error log.
Adds a new method `backoff()` to `RaftNetwork` to let an application return a customized backoff policy; the default backoff simply sleeps for a constant 500ms.
Adds an `unreachable_nodes` setting to the testing router `TypedRaftRouter` to emulate unreachable nodes. Adds a new error `Unreachable` and an `RPCError` variant `Unreachable`.
- Fix: #462
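A backoff policy can be modelled as an endless sequence of sleep durations. The sketch below is hypothetical: both function names are invented for illustration, and the exact type that `RaftNetwork::backoff()` returns is not shown in this changelog. The first function mirrors the constant 500ms default described above when called with that interval; the second shows one possible customized policy.

```rust
use std::time::Duration;

/// Hypothetical policy: always wait the same interval before the next RPC.
fn constant_backoff(interval: Duration) -> impl Iterator<Item = Duration> {
    std::iter::repeat(interval)
}

/// Hypothetical alternative: double the delay each time, capped at `max`.
fn exponential_backoff(start: Duration, max: Duration) -> impl Iterator<Item = Duration> {
    let mut next = start;
    std::iter::from_fn(move || {
        let cur = next;
        next = (next * 2).min(max);
        Some(cur)
    })
}
```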
- Added: f0dc0eb7 add `RaftMetrics::purged` to report the last purged log id; by 张炎泼; 2023-05-01
- Added: 37e96482 add `Wait::purged()` to wait for purged to become the expected; by 张炎泼; 2023-05-01
- Added: 0b419eb6 add `RaftMetrics.vote`, `Wait::vote()`; by 张炎泼; 2023-05-02
The latest approved value of `Vote`, which has been saved to disk, is referred to as `RaftMetrics.vote`. Additionally, a new `vote()` method has been included in `Wait` to enable the application to wait for `vote` to reach the anticipated value.
- Added: ee7b8853 new `RaftNetwork` API with argument `RPCOption`; by 张炎泼; 2023-05-02
- `RaftNetwork` introduces 3 new APIs, `append_entries`, `install_snapshot` and `vote`, which accept an additional argument `RPCOption`, and deprecates the old APIs `send_append_entries`, `send_install_snapshot` and `send_vote`.
- The old APIs will be removed in `0.9`. An application can still implement the old APIs without any changes. Openraft calls only the new APIs, and their default implementations delegate to the old APIs.
- Implementing the new APIs will disable the old APIs.
- The new APIs accept an additional argument `RPCOption`, to let an application control the networking behavior based on the parameters in `RPCOption`.
The `hard_ttl()` and `soft_ttl()` in `RPCOption` set the hard limit and the moderate limit of the duration for which an RPC should run. Once the `soft_ttl()` ends, the RPC implementation should start to gracefully cancel the RPC, and once the `hard_ttl()` ends, Openraft will terminate the ongoing RPC at once.
- Fix: #819
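The relationship between the two limits can be sketched as a small state function. The `Ttl` struct and `RpcPhase` enum are hypothetical stand-ins, not Openraft types; only the soft/hard semantics come from the description above.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the two limits carried by `RPCOption`.
struct Ttl {
    soft: Duration,
    hard: Duration,
}

#[derive(Debug, PartialEq)]
enum RpcPhase {
    Running,
    CancelGracefully,
    Terminated,
}

/// Decides what an RPC should be doing after running for `elapsed`.
fn phase(elapsed: Duration, ttl: &Ttl) -> RpcPhase {
    if elapsed >= ttl.hard {
        RpcPhase::Terminated // the caller aborts the RPC at once
    } else if elapsed >= ttl.soft {
        RpcPhase::CancelGracefully // the implementation should wind down
    } else {
        RpcPhase::Running
    }
}
```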
- Added: 1ee82cb8 `RaftNetwork::send_append_entries()` can return `PartialSuccess`; by 张炎泼; 2023-05-03
If there are too many log entries and the `RPCOption.ttl` is not sufficient, an application can opt to send only a portion of the entries and return `AppendEntriesResponse::PartialSuccess(Option<LogId>)` to inform Openraft of the last replicated log id. Thus replication can still make progress.
For example, when asked to send log entries `[1-2..3-10]`, the application is allowed to send just `[1-2..1-3]` and return `PartialSuccess(1-3)`.
The returned matching log id must be greater than or equal to the first log id (`AppendEntriesRequest::prev_log_id`) of the entries to send. If no RPC reply is received, `RaftNetwork::send_append_entries` must return an `RPCError` to inform Openraft that the first log id (`AppendEntriesRequest::prev_log_id`) may not match on the remote target node.
- Fix: #822
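The partial-send rule can be sketched with simplified types. `LogId`, `Response` and `send_within_budget` below are hypothetical models written for this example (a log id is shown as `term-index` in the text above); the `budget` parameter stands in for whatever the remaining ttl allows.

```rust
/// Hypothetical log id: (term, index).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

#[derive(Debug, PartialEq)]
enum Response {
    Success,
    PartialSuccess(Option<LogId>),
}

/// Sends at most `budget` entries and reports how far replication got.
fn send_within_budget(prev: Option<LogId>, entries: &[LogId], budget: usize) -> Response {
    if budget >= entries.len() {
        return Response::Success;
    }
    // Only a prefix fits: report the last sent entry, or `prev` if nothing was sent,
    // so the reported id is never smaller than `prev`.
    let matching = entries[..budget].last().copied().or(prev);
    Response::PartialSuccess(matching)
}
```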
- Added: e9eed210 leader lease; by 张炎泼; 2023-05-19
The leader records the most recent time point when an RPC is initiated towards a target node. The highest timestamp associated with RPCs made to a quorum serves as the starting time point for a leader lease.
Improve: use tokio::Instant to replace TimeState
Use `Instant` for timekeeping instead of a custom `TimeState` struct, because multiple components need to generate timestamps: not only the `RaftCore` but also, e.g., the `ReplicationCore`. Generating a timestamp is not in the hot path, therefore caching it introduces unnecessary complexity.
- Added: 269d221c add `SnapshotPolicy::Never`; by 张炎泼; 2023-05-24
With `SnapshotPolicy::Never`, Openraft will not build snapshots automatically based on a policy. Instead, the application has full control over when snapshots are built. In this scenario, the application can call the `Raft::trigger_snapshot()` API at the desired times to manually trigger Openraft to build a snapshot.
Rename integration tests:
- `log_compaction -> snapshot_building`
- `snapshot -> snapshot_streaming`
- Fix: #851
- Added: 9e7195a1 add `Raft::purge_log()`; by 张炎泼; 2023-05-24
This method allows users to purge logs when required. It initiates the log purge up to and including the given `upto` log index.
Logs that are not included in a snapshot will NOT be purged. In such a scenario it will delete as many logs as possible. The `max_in_snapshot_log_to_keep` config is not taken into account when purging logs.
Openraft won't purge logs at once; e.g., it may be delayed by several seconds, because if it is a leader and a replication task has been replicating the logs to a follower, the logs can't be purged until the replication task is finished.
- Fix: #852
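The rule that only logs covered by a snapshot are purged can be written as a small clamp. The function below is a hypothetical model of that rule, not Openraft's implementation.

```rust
/// Returns the index up to which logs may actually be purged, if any.
fn effective_purge_upto(requested_upto: u64, snapshot_last_index: Option<u64>) -> Option<u64> {
    // Without a snapshot nothing can be purged; otherwise purge as much as the
    // snapshot covers, but never beyond what was requested.
    snapshot_last_index.map(|snap| requested_upto.min(snap))
}
```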
- Improved: dbac91d5 send `AppendEntries` response before committing entries; by 张炎泼; 2023-04-04
When a follower receives an append-entries request that includes a series of log entries to append and the log id that the leader has committed, it responds with an append-entries response after committing and applying the entries.
However, this is not strictly necessary. The follower could simply send the response as soon as the log entries have been appended and flushed to disk, without waiting for them to be committed.
- Improved: 6769cdd0 move state machine operations to another task; by 张炎泼; 2023-04-13
State machine operations, such as applying log entries and building/installing/getting a snapshot, are moved to `core::sm::Worker`, which runs in a standalone task other than the one running `RaftCore`. In this way, log IO operations (mostly appending log entries) and state machine IO operations (mostly applying log entries) can be parallelized.
- Log IO is still running in the `RaftCore` task.
- Snapshot receiving/streaming is removed from `RaftCore`.
- Add `IOState` to `RaftState` to track the applied log id.
This field is used to determine whether a certain command, such as sending a response, can be executed after a specific log has been applied.
- Refactor: `leader_step_down()` can only be run when the response of the second change-membership is sent. Before this commit, updating the `committed` was done atomically with sending back the response. Since this commit, these two steps are done separately, because applying log entries is moved to another task. Therefore `leader_step_down()` must wait for these two steps to be finished.
- Improved: dcd18c53 getting a snapshot does not block `RaftCore` task; by 张炎泼; 2023-04-16
`RaftCore` no longer blocks on receiving a snapshot from the state-machine worker while replicating a snapshot. Instead, it sends the `Receiver` to the replication task and the replication task blocks on receiving the snapshot.
- Improved: 6eeb5246 reduce rate to flush metrics; by 张炎泼; 2023-04-23
The performance increases by 40% with this optimization:

| clients | put/s | ns/op | Changes |
| --- | --- | --- | --- |
| 64 | 652,000 | 1,532 | Reduce metrics report rate |
| 64 | 467,000 | 2,139 | |
- Improved: 1f3bf203 create a channel `notify` specifically for internal messages; by 张炎泼; 2023-04-25
`tx_notify` will be used by components such as the state-machine worker or a replication stream to send back a notification when an action is done. `tx_api` is left for receiving only external messages, such as append-entries requests or client-write requests.
A `Balancer` is added to prevent one channel from starving the others.
The benchmark shows better performance with 64 clients:

| clients | put/s | ns/op | Changes |
| --- | --- | --- | --- |
| 64 | 730,000 | 1,369 | This commit |
| 64 | 652,000 | 1,532 | Previous commit |
- Improved: 54154202 move receiving snapshot chunk to sm::Worker; by 张炎泼; 2023-04-27
Receiving a snapshot chunk should not be run in the RaftCore task. Otherwise it will block RaftCore.
In this commit this work is moved to sm::Worker, running in another task. The corresponding responding command will not be run until sm::Worker notifies RaftCore that receiving is finished.
- Improved: fa4085b9 build snapshot in another task; by 张炎泼; 2023-05-02
Before this commit, a snapshot was built in the `sm::Worker`, which blocked other state-machine writes, such as applying log entries.
This commit parallelizes applying log entries and building a snapshot: a snapshot is built in another `tokio::task`.
Because building a snapshot is a read operation, it does not have to block the entire state machine. Instead, it only needs a consistent view of the state machine, or to hold a lock on the state machine.
- Fix: #596
- Improved: 47048a53 `IntoIterator::IntoIter` should be `Send`; by 张炎泼; 2023-06-16
The `RaftStateMachine::apply()` and `RaftLogStorage::append_to_log()` methods have a `Send` bound on the `IntoIterator` passed to them. However, the actual iterator returned from `IntoIterator` doesn't have it. Thus, it's impossible to iterate across awaits in the implementation.
The correct API should be:
```rust
async fn apply<I>(&mut self, entries: I) -> Result<...>
where
    I: IntoIterator<Item = C::Entry> + Send,
    I::IntoIter: Send;
```
Thanks to schreter
- Fix: #860
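Why the extra bound matters can be shown with an analogous, hypothetical example that uses a thread instead of an await point: once `into_iter()` has been called, it is the iterator (not the original collection) that has to cross the boundary, so the iterator type itself must be `Send`. The function name is invented for illustration.

```rust
use std::thread;

/// Sums the entries on a worker thread; the iterator itself crosses the thread boundary.
fn sum_on_worker<I>(entries: I) -> u64
where
    I: IntoIterator<Item = u64> + Send + 'static,
    I::IntoIter: Send + 'static,
{
    let iter = entries.into_iter();
    // Moving `iter` into the closure compiles only because `I::IntoIter: Send`.
    thread::spawn(move || iter.sum::<u64>())
        .join()
        .expect("worker thread panicked")
}
```

Removing the `I::IntoIter: Send` bound makes the `thread::spawn` call fail to compile, which is the same class of error an implementation hits when it holds the iterator across an `.await` in a future that must be `Send`.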
- Improved: 8dae9ac6 impl `Clone` for `Raft` does not require `Clone` for its type parameters; by 张炎泼; 2023-06-16
Thanks to xDarksome
- Fix: #870
- Fixed: cd31970d trait `RaftLogId` should be public; by 张炎泼; 2023-03-21
- Fixed: 26dc8837 `ProgressEntry::is_log_range_inflight()` checks a log range, not a log entry; by 张炎泼; 2023-04-12
This bug causes replication to try to send purged logs.
- Fixed: 04e40606 if the application does not persist snapshot, build a snapshot when starting up; by 张炎泼; 2023-04-15
- Fixed: 3433126c `compat07::SnapshotMeta` should decode v08 `SnapshotMeta`; by 张炎泼; 2023-04-15
- Fixed: b97edb49 incorrect debug level log; by 张炎泼; 2023-04-22
This results in unnecessary debug log output.
- Fixed: d012705d replication should be able to shutdown when replicating snapshot to unreachable node; by 张炎泼; 2023-05-01
If a replication is sending a snapshot, it should periodically verify the input channel's status. When the input channel is closed during replication rebuilding, it should immediately exit the loop instead of attempting retries indefinitely.
- Fix: #808
- Fixed: f505d7e6 `Raft::add_learner()` should block forever; by 张炎泼; 2023-05-20
The `Raft::add_learner()` method, when invoked with the `blocking` parameter set to `true`, should block forever until the learner synchronizes its logs with the leader.
In its current implementation, `add_learner()` calls the `Raft::wait()` method, which has a default timeout of `500ms`. To achieve the desired blocking behavior, the default timeout should be increased significantly.
- Fix: #846
- Changed: a92499f2 `StoreBuilder` does not need to run a test, it only needs to build a store; by 张炎泼; 2023-03-21
`StoreBuilder::run_test()` is removed, and a new method `build()` is added. To test a `RaftStorage` implementation, implementing `StoreBuilder::build()` is now enough. It returns a store instance and a guard; for a disk-backed store, the data is cleaned up when the guard is dropped.
- Changed: 6e9d3573 remove `Clone` from trait `AppData`; by 张炎泼; 2023-03-26
Application data `AppData` does not have to be `Clone`.
Upgrade tip:
Nothing needs to be done. The default `RaftEntry` implementation `Entry` provided by openraft is still `Clone` if the `AppData` in it is `Clone`.
- Changed: 285e6225 instead of a slice, `RaftStorage::append_to_log()` now accepts an `IntoIterator`; by 张炎泼; 2023-03-27
Using an `IntoIterator` is more generic than using a slice, and could avoid potential memory allocation for the slice.
Upgrade tip:
Update the method signature in the `RaftStorage` implementation and ensure that it compiles without errors. The method body may require minimal modifications, as the new input type is just a more general type.
- Changed: e0569988 remove unused trait `RaftStorageDebug`; by 张炎泼; 2023-04-10
`RaftStorageDebug` has only one method, `get_state_machine()`, and the state machine is entirely a user-defined struct. Obtaining a state machine does not imply anything about its structure or behavior.
- Changed: 88f947a6 remove defensive check utilities; by 张炎泼; 2023-04-11
Most defensive checks are replaced with `debug_assert!` embedded in Engine. `StoreExt`, a `RaftStorage` wrapper that implements defensive checks, is no longer needed. `StoreExt` is mainly used for testing, and it is too slow to be used in production.
- Remove structs: `StoreExt`, `DefensiveStoreBuilder`
- Remove traits: `Wrapper`, `DefensiveCheckBase`, `DefensiveCheck`
- Changed: eaf45dfa move `RaftStateMachine` out of `RaftStorage`; by 张炎泼; 2023-04-01
In Raft, the state machine is an independent storage component that operates separately from the log store. As a result, accessing the log store and accessing the state machine can be naturally parallelized.
This commit replaces the type parameter `RaftStorage` in `Raft<.., S: RaftStorage>` with two type parameters: `RaftLogStorage` and `RaftStateMachine`.
- Add: `RaftLogReaderExt` to provide additional log access methods based on a `RaftLogReader` implementation. Some of the methods are moved from `StorageHelper` to this trait.
- Add: `Adapter` to let an application use the separated log/state-machine framework without rewriting its `RaftStorage` implementation.
- Refactor: shorten type names for the 2 example crates
Use an adapter to wrap `RaftStorage`:
```rust
// Before:
let store = MyRaftStorage::new();
Raft::new(..., store);

// After:
let store = MyRaftStorage::new();
let (log_store, sm) = Adaptor::new(store);
Raft::new(..., log_store, sm);
```
Changed: 9f8ae43e
RaftMetrics.replication
type toBTreeMap<NodeId, Option<LogId>>
; by 张炎泼; 2023-04-24The
RaftMetrics.replication
used to be of typeReplicationMetrics{ replication: BTreeMap<NodeId, ReplicationTargetMetrics> }
which contained an atomic log index value for each ReplicationTargetMetrics stored in theBTreeMap
. The purpose of this type was to reduce the cost of copying a metrics instance. However, since the metrics report rate has been significantly reduced, this cost is now negligible. As a result, these complicated implementations have been removed. When reporting metrics, they can simply be cloned from the progress information maintained byEngine
.Replace usage of
RaftMetrics.replication.data().replication.get(node_id)
withRaftMetrics.replication.get(node_id)
. -
- Changed: 84539cb0 move snapshot type definition from storage traits to `RaftTypeConfig`; by 张炎泼; 2023-04-26
Similar to `NodeId` or `Entry`, `SnapshotData` is also a data type that is specified by the application and needs to be defined in `RaftTypeConfig`, which is a collection of all application types.
Public types changes:
- Add `SnapshotData` to `RaftTypeConfig`:
```rust
pub trait RaftTypeConfig {
    /// ...
    type SnapshotData: AsyncRead + AsyncWrite + AsyncSeek + Send + Sync + Unpin + 'static;
}
```
- Remove associated type `SnapshotData` from `storage::RaftStorage`.
- Remove associated type `SnapshotData` from `storage::v2::RaftStateMachine`.
Corresponding API changes:
- Change `storage::RaftSnapshotBuilder<C: RaftTypeConfig, SNAPSHOT_DATA>` to `RaftSnapshotBuilder<C>`
- Change `storage::Snapshot<NID: NodeId, N: Node, SNAPSHOT_DATA>` to `storage::Snapshot<C>`
Upgrade tip:
Update generic type parameter in application types to pass compilation.
- Changed: e78bbffe remove unused error `CommittedAdvanceTooMany`; by 张炎泼; 2023-05-14
Upgrade tip:
Do not use it.
- Improved: 23f4a73b AppDataResponse does not need a Clone trait bound; by 张炎泼; 2023-03-09
- Fix: #703
- Improved: 664635e0 loosen validity check with RaftState.snapshot_last_log_id(); by 张炎泼; 2023-03-10
An application may not persist snapshots. When it restarts, the last-purged-log-id is not `None` but `snapshot_last_log_id()` is None. This is a valid state and should not emit an error.
- Improved: 54aea8a2 fix: delay election if a greater last log id is seen; by 张炎泼; 2023-03-14
If this node sees a greater last-log-id on another node, it will be less likely to be elected as a leader. In this case, it is necessary to sleep for a longer period of time, `smaller_log_timeout`, so that other nodes with a greater last-log-id have a chance to elect themselves.
Fix: such a state should be kept until the next election, i.e., it should be a field of `Engine` instead of a `field` of `internal_server_state`. And this value should be greater than the `election_timeout` of every other node.
- Changed: 9ddb5715 RaftState: make `RaftState.vote` private. Accesses vote via 2 new public methods: `vote_ref()` and `vote_last_modified()`.; by 张炎泼; 2023-03-12
- Changed: 3b4f4e18 move log id related traits to mod `openraft::log_id`; by 张炎泼; 2023-03-14
Move trait `RaftLogId`, `LogIdOptionExt` and `LogIndexOptionExt` from `openraft::raft_types` to mod `openraft::log_id`
- Changed: 342d0de2 rename variants in ChangeMembers, add `AddVoters`; by 张炎泼; 2023-03-01
Rename `ChangeMembers::AddVoter` to `AddVoterIds`, because it just updates voter ids.
Rename `ChangeMembers::RemoveVoter` to `RemoveVoters`.
Add `ChangeMembers::AddVoters(BTreeMap)` to add voters with corresponding `Node`, i.e., it adds nodes as learners and updates the voter-ids in a `Membership`.
- Added: 50821c37 impl PartialEq for Entry; by 张炎泼; 2023-03-02
- Fixed: 97fa1581 discard blank log heartbeat, revert to the standard heartbeat; by 张炎泼; 2023-03-04
The blank log heartbeat design has two problems:
- The heartbeat that sends a blank log introduces additional I/O, as a follower has to persist every log to maintain correctness.
- Although `(term, log_index)` serves as a pseudo time in Raft, measuring whether a node has caught up with the leader and is capable of becoming a new leader, leadership is not solely determined by this pseudo time. Wall clock time is also taken into account.
There may be a case where the pseudo time is not up to date but the clock time is, and the node should not become the leader. For example, in a cluster of three nodes, if the leader (node-1) is busy sending a snapshot to node-2 (it has not yet replicated the latest logs to a quorum, but node-2 has received messages from the leader (node-1), thus it knows there is an active leader), node-3 should not seize leadership from node-1. This is why there need to be two types of time, pseudo time `(term, log_index)` and wall clock time, to protect leadership.
In the following graph:
- node-1 is the leader, has 4 log entries, and is sending a snapshot to node-2,
- node-2 received several chunks of the snapshot, and it perceived an active leader and thus extended the leader lease.
- node-3 tried to send a vote request to node-2; although node-2 does not have as many logs as node-3, it should still reject node-3's vote request because the leader lease has not yet expired.
In the obsolete design, which extends pseudo time `(term, index)` with a `tick`, node-3 will seize the leadership from node-1 in this case.
```text
Ni: Node i
Ei: log entry i

N1 E1 E2 E3 E4
      |
      v
N2    snapshot
      +-----------------+
               ^        |
               |        leader lease
               |
N3 E1 E2 E3    | vote-request
---------------+----------------------------> clock time
               now
```
The original document is presented below for reference.
- Fixed: b5caa44d Wait::members() should not count learners as members; by 张炎泼; 2023-03-04
`Wait::members()` waits until membership becomes the expected value. It should not check against all nodes. Instead, it should only check voters, excluding learners.
- Added: b3c2ff7e add Membership methods: voter_ids(), learner_ids(), get_node(); by 张炎泼; 2023-02-28
- Fixed: 86e2ccd0 a single Candidate should be able to vote itself.; by 张炎泼; 2022-01-20
A Candidate should check if it is the only member in a cluster before sending vote requests. Otherwise a single-node cluster does not work.
- Fixed: 4015cc38 a Candidate should revert to Follower at once when a higher vote is seen; by 张炎泼; 2022-02-03
When a Candidate sees a higher vote, it stores it at once. From then on, no further granted votes are valid for this candidate, because the vote they were granted for has changed.
Thus it was wrong to compare `last_log_id` before deciding whether to revert to Follower. The right way is to revert to Follower at once and stop the voting procedure.
- Fixed: 1219a880 consistency issue between ReplicationCore.last_log_id and last_log_state.last_log_id; by 张炎泼; 2022-02-28
- Fixed: efdc321d a leader should report leader metrics with value `Update::AsIs` instead of `Update::Update(None)`. Otherwise it mistakenly purges metrics about replication; by 张炎泼; 2022-04-01
- Fixed: 797fb9b1 update replication metrics only when the replication task stopped, to provide a consistent view of RaftMetrics; by 张炎泼; 2022-06-04
- Fixed: 918b48bc #424 wrong range when searching for membership entries: `[end-step, end)`.; by 张炎泼; 2022-07-03
The iterating range searching for membership log entries should be `[end-step, end)`, not `[start, end)`. With this bug it will return duplicated membership entries.
- Related: #424
- Fixed: 8594807c metrics has to be updated last; by 张炎泼; 2022-07-13
Otherwise the application receives updated metrics while the internal raft state is still stale.
- Fixed: 59ddc982 avoid creating log-id with uninitialized `matched.leader_id`.; by 张炎泼; 2022-07-26
When waiting for a newly added learner to become up to date, it tries to compare last-log-id and the reported `matched` replication state. But the `matched` may not have received any update yet and is uninitialized. In such a case, it tries to create a temp LogId with `leader_id(0, 0)`, which is illegal.
The fix is simple: do not use log-id. Just calculate replication lag by log index.
Add test to reproduce it: openraft/tests/membership/t99_issue_471_adding_learner_uses_uninit_leader_id.rs
- Fix: #471
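Computing the lag from indexes alone avoids building a placeholder log id when nothing has been matched yet. The sketch below is a hypothetical formulation: the function name and the convention of mapping `None` to position 0 are assumptions, not the code from the commit.

```rust
/// Treats `None` (nothing there yet) as position 0 and `Some(i)` as position `i + 1`,
/// then returns how many entries the learner is behind.
fn replication_lag(last_log_index: Option<u64>, matched_index: Option<u64>) -> u64 {
    let next = |i: Option<u64>| i.map_or(0, |i| i + 1);
    next(last_log_index).saturating_sub(next(matched_index))
}
```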
- Fixed: 43dd8b6f when leader reverts to follower, send error to waiting clients; by 张炎泼; 2022-08-06
When a leader reverts to follower, e.g., if a higher vote is seen, it should inform waiting clients that leadership is lost.
- Fixed: 71a290cd when handling append-entries, if `prev_log_id` is purged, it should not treat it as a conflict; by 张炎泼; 2022-08-14
When handling append-entries, if `prev_log_id` is purged, it should not treat it as a conflicting log and should not delete any log.
This bug is caused by using `committed` as `last_applied`. `committed` may be smaller than `last_applied` when a follower just starts up.
The solution is merging `committed` and `last_applied` into one field: `committed`, which is always greater than or equal to the actually committed (applied).
- Fixed: 674e78aa potential inconsistency when installing snapshot; by 张炎泼; 2022-09-21
The conflicting logs that are before `snapshot_meta.last_log_id` should be deleted before installing a snapshot.
Otherwise there is a chance that the snapshot is installed but conflicting logs are left in the store when a node crashes.
- Fixed: 4ea66acd stop tick task when shutting down Raft; by Matthias Wahl; 2022-09-27
- Fixed: 56486a60 Error after change_membership: `assertion failed: value > prev`: #584; by 张炎泼; 2022-10-29
Problem:
An error occurs after calling `change_membership()`: `assertion failed: value > prev`, when changing membership by converting a learner to a voter.
Because the replication streams are re-spawned, progress reverts to zero. The reverted progress then causes the panic.
Solution:
When re-spawning replications, remember the previous progress.
- Fix: #584
- Fixed: 678af4a8 when responding ForwardToLeader, make `leader_id` a None if the leader is no longer in the cluster; by 张炎泼; 2022-11-02
- Fixed: 0023cff1 delay leader step down; by 张炎泼; 2022-11-06
When a membership that removes the leader is committed, the leader continues to work for a short while before reverting to a learner. This way, the leader can replicate the `membership-log-is-committed` message to followers.
Otherwise, if the leader steps down at once, the followers might have to re-commit the membership log again.
After committing the membership log that does not contain the leader, the leader will step down in the next `tick`.
- Fixed: ff9a9335 it should make a node non-leader when restarting single node cluster; by 张炎泼; 2022-12-03
A node should not set `server_state` to `Leader` when just starting up, even when it's the only voter in a cluster. It still needs several steps to initialize leader-related fields to become a leader.
- Fix: #607
- Fixed: 0e7ab5a7 workaround cargo leaking SSL_CERT_FILE issue; by 张炎泼; 2022-12-09
On Linux, the command `cargo run` pollutes environment variables: it leaks `SSL_CERT_FILE` and `SSL_CERT_DIR` to the testing subprocess it runs, which causes `reqwest` to spend ~50 ms loading the certificates for every RPC.
We just extend the RPC timeout to work around it.
- Fix: #550
- Fixed: cc8af8cd last_purged_log_id is not loaded correctly; by 张炎泼; 2023-01-08
- Fix: `last_purged_log_id` should be `None`, not `LogId{index=0, ..}`, when raft starts up with a store with a log at index 0.
This is fixed by adding another field `next_purge` to distinguish the `last_purged_log_id` values `None` and `LogId{index=0, ..}`, because `RaftState.log_ids` stores `LogId` but not `Option<LogId>`.
- Add a wrapper `Valid<RaftState>` of `RaftState` to check if the state is valid every time it is accessed. This check is done only when `debug_assertions` is turned on.
-
-
Fixed: 9dbbe14b check_is_leader() should return at once if encountering StorageError; by 张炎泼; 2023-02-12
Refactor: ExtractFatal is not used any more. Fatal error should only be raised by Command executor, no more by API handler. There is no need to extract Fatal error from an API error.
-
Fixed: a80579ef a stepped down leader should ignore replication progress message; by 张炎泼; 2023-02-12
- Fixed: c8fccb22 when adding a learner, ensure the last membership is committed; by 张炎泼; 2023-02-19
Previously, when adding a learner to a Raft cluster, the last membership was not always marked as committed, which could cause issues when a follower tried to truncate logs by reverting to the last committed membership. To prevent this issue, we have updated the code to ensure the last membership is committed when adding a learner.
In addition to this fix, we have also made several refactoring changes, including refining method names for trait `Coherent`, renaming `Membership::next_safe()` to `next_coherent()` for consistency, and updating enum `ChangeMembers` to include more variants for adding and removing learners. We have also removed `RaftCore::add_learner()` in favor of using `change_membership()` for all membership operations, and added a `ChangeHandler` to build new membership configurations for change-membership requests.
Finally, we have updated the `Membership` API with a new method `new_with_nodes()` for building a new membership configuration, and moved the validation check out into a separate function, `ensure_valid()`. Validation is now done only when needed.
-
Changed: 86e2ccd0
Wait::log_at_least()
useOption<u64>
as the input log index, instead of using u64; by 张炎泼; 2022-01-20 -
Changed: 71a290cd remove
RaftState.last_applied
, usecommitted
to represent the already committed and applied log id; by 张炎泼; 2022-08-14 -
Changed: 2254ffc5 add sub error types of ReplicationError; by 张炎泼; 2022-01-20
-
Add sub errors such as Timeout and NetworkError.
-
Remove ReplicationError::IO, use StorageError instead.
-
-
Changed: f08a3e6d RaftNetwork return RPCError instead of anyhow::Error; by 张炎泼; 2022-01-23
-
When a remote error is encountered during replication, the replication is stopped at once.
-
Fix: #140
-
-
Changed: d55fa625 add ConfigError sub error; remove anyhow; by 张炎泼; 2022-01-23
- Fix: #144
-
Changed: 58f2491f RaftStorage: use Vote to replace HardState; by 张炎泼; 2022-01-25
-
Rename: save_hard_state() and read_hard_state() to save_vote() and read_vote().
-
Replace term, node_id pair with Vote in RaftCore and RPC structs.
-
-
Changed: a68a9a9a use term, node_id, index to identify a log entry; by 张炎泼; 2022-01-26
-
Changed: 0b753622 Raft::add_learner() accepts optional arg Node.; by 张炎泼; 2022-02-17
When adding a learner, an optional Node can be provided to store additional info of a node in Membership.
A common usage is to store the node address in the Membership so that an application does not need another component to get the address of a node when implementing RaftNetwork.
-
Changed: 5ba730c9 Replace replication state in RaftMetrics with a reference to atomic values; by Ivan Schréter; 2022-02-22
-
Changed: a76f41ac Extract RaftLogReader, RaftSnapshotBuilder from RaftStorage, split RaftNetwork and RaftNetworkFactory; by Ivan Schréter; 2022-02-22
RaftStorage is now refactored to:
- RaftLogReader to read data from the log in parallel tasks independent of the main Raft loop
- RaftStorage to modify the log and the state machine (implements also RaftLogReader) intended to be used in the main Raft loop
- RaftSnapshotBuilder to build the snapshot in background independent of the main Raft loop
The RaftStorage API offers to create new RaftLogReader or RaftSnapshotBuilder on it.
RaftNetwork is also refactored to:
- RaftNetwork responsible for sending RPCs
- RaftNetworkFactory responsible for creating instances of RaftNetwork for sending data to a particular node
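The factory/client split described above can be sketched with simplified, synchronous traits. This is only an illustration: `Network`, `NetworkFactory`, `send_rpc()`, `EchoClient` and `EchoFactory` are hypothetical names, and the real openraft traits are async and carry more type parameters.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a per-target network client.
trait Network {
    /// Send an RPC payload to the node this client was built for.
    fn send_rpc(&mut self, payload: &str) -> Result<String, String>;
}

/// Hypothetical, simplified stand-in for a factory that builds one client per target node.
trait NetworkFactory {
    type Client: Network;
    fn new_client(&mut self, target: u64) -> Self::Client;
}

/// A toy client that only formats the message it would send.
struct EchoClient {
    target: u64,
    addr: String,
}

impl Network for EchoClient {
    fn send_rpc(&mut self, payload: &str) -> Result<String, String> {
        if self.addr.is_empty() {
            return Err(format!("node {} has no address", self.target));
        }
        Ok(format!("{}@{}: {}", self.target, self.addr, payload))
    }
}

/// A toy factory that looks up node addresses in a map.
struct EchoFactory {
    addrs: BTreeMap<u64, String>,
}

impl NetworkFactory for EchoFactory {
    type Client = EchoClient;

    fn new_client(&mut self, target: u64) -> EchoClient {
        // Building a client never fails; a missing address only shows up when sending.
        let addr = self.addrs.get(&target).cloned().unwrap_or_default();
        EchoClient { target, addr }
    }
}
```

The point of the split is that the factory owns shared setup (here, the address map), while each client is bound to exactly one target node.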
-
Changed: f40c2055 Add a RaftTypeConfig trait to configure common types; by Ivan Schréter; 2022-02-25
-
Changed: 650e2352 Membership remove redundant field learners: the node ids that are in Membership.nodes but not in Membership.configs are learners; by 张炎泼; 2022-03-07
-
Changed: 81cd3443 EffectiveMembership.log_id to Option; by 张炎泼; 2022-04-05
-
Changed: 67375a2a RaftStorage: use EffectiveMembership instead of Option<_>; by 张炎泼; 2022-04-05
-
Changed: ffc82682 rename ReplicationMetrics and methods in MetricsChangeFlags; by 张炎泼; 2022-04-05
-
Change: rename ReplicationMetrics to ReplicationTargetMetrics
-
Change: rename LeaderMetrics to ReplicationMetrics
-
-
Changed: 7b1d4660 rename RaftMetrics.leader_metrics to replication; by 张炎泼; 2022-04-06
-
Changed: 30b485b7 rename State to ServerState; by 张炎泼; 2022-04-16
-
Changed: ca8a09c1 rename InitialState to RaftState; by 张炎泼; 2022-04-16
-
Changed: 8496a48a add error Fatal::Panicked, storing RaftCore panic; by 张炎泼; 2022-05-09
Changes:
-
Add committed_membership to RaftState, to store the previous committed membership config.
-
Change: RaftStorage::get_membership() returns a vec of at most 2 memberships.
-
Change: RaftStorage::last_membership_in_log() returns a vec of at most 2 memberships.
-
-
Changed: 1f645feb add last_membership to SnapshotMeta; by 张炎泼; 2022-05-12
-
Changed: bf4e0497 Make serde optional; by devillve084; 2022-05-22
-
Changed: b96803cc external_request() replace the 1st arg ServerState with RaftState; by 张炎泼; 2022-06-08
This change lets users do more things with an external fn request.
-
Changed: d81c7279 after shutdown(), it should return an error when accessing Raft, instead of panicking.; by devillve084; 2022-06-16
-
Changed: 0de003ce remove RaftState.last_log_id and RaftState.last_purged_log_id; by 张炎泼; 2022-06-22
Remove these two fields, which are already included in RaftState.log_ids; use last_log_id() and last_purged_log_id() instead.
-
Changed: 7f00948d API: cleanup APIs in Membership and EffectiveMembership; by 张炎泼; 2022-06-29
-
Refactor: move impl of QuorumSet from Membership to EffectiveMembership.
Add a field EffectiveMembership.quorum_set, to store a QuorumSet built from the Membership config. This quorum set can have a different structure from the Membership, to optimize quorum checks.
-
Refactor: impl methods in Membership or EffectiveMembership with Iterator if possible.
-
Refactor: use term voter and learner for methods and fields.
-
-
Changed: 01a16d08 remove tx from spawn_replication_stream(); by 张炎泼; 2022-07-01
Replication should not be responsible for invoking the callback when replication becomes up to date. It makes the logic dirty. Such a job can be done by watching the metrics change.
- Change: API: AddLearnerResponse has a new field membership_log_id which is the log id of the membership log that contains the newly added learner.
-
Changed: 6b9ae52f remove error AddLearnerError::Exists; by 张炎泼; 2022-07-01
Even when the learner to add already exists, the caller may still want to block until the replication catches up. Thus it does not expect an error.
And Exists is not an issue the caller has to deal with, it does not have to be an error.
-
Changed: d7afc721 move default impl methods in RaftStorage to StorageHelper.; by 张炎泼; 2022-07-01
get_initial_state()
get_log_id()
get_membership()
last_membership_in_log()
In the trait RaftStorage, these methods have default implementations that users do not need to care about. They should no longer be methods that a user may need to implement.
To upgrade:
If you have been using these methods, replace sto.xxx() with StorageHelper::new(&mut sto).xxx().
-
Changed: a010fddd Stop replication to removed node at once when new membership is seen; by 张炎泼; 2022-07-12
Before this commit, when membership changes, e.g., from a joint config [(1,2,3), (3,4,5)] to uniform config [3,4,5] (assuming the leader is 3), the leader stops replication to 1,2 when [3,4,5] is committed.
This is an unnecessarily complicated solution. It is OK for the leader to stop replication to 1,2 as soon as config [3,4,5] is seen, instead of when config [3,4,5] is committed.
- If the leader(3) finally committed [3,4,5], it will eventually stop replication to 1,2.
- If the leader(3) crashes before committing [3,4,5]:
  - And a new leader sees the membership config log [3,4,5], it will continue to commit it and finally stop replication to 1,2.
  - Or a new leader does not see membership config log [3,4,5], it will re-establish replication to 1,2.
In any case, stopping replication at once is OK.
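As a rough sketch of "stop as soon as the new config is seen": the targets to stop are simply the current replication targets that no longer appear in the new config. `targets_to_stop()` is a hypothetical helper for illustration, not an openraft API, and it ignores learners.

```rust
use std::collections::BTreeSet;

/// Return the replication targets to stop once `new_config` is seen:
/// every current target that is absent from the new config.
/// Hypothetical helper; learners are not modeled here.
fn targets_to_stop(current_targets: &BTreeSet<u64>, new_config: &BTreeSet<u64>) -> BTreeSet<u64> {
    current_targets.difference(new_config).copied().collect()
}
```

With leader 3 replicating to 1, 2, 4 and 5 under the joint config, seeing [3,4,5] yields the set {1, 2}, without waiting for the config to be committed.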
One of the considerations about this modification is: The nodes, e.g., 1,2 do not know they have been removed from the cluster:
-
A removed node will enter the candidate state and keep increasing its term and electing itself. This won't affect the working cluster:
-
The nodes in the working cluster have greater logs; thus, the election will never succeed.
-
The leader won't try to communicate with the removed nodes thus it won't see their higher term.
-
-
Removed nodes should be shut down finally. No matter whether the leader replicates the membership without these removed nodes to them, there should always be an external process that shuts them down. Because there is no guarantee that a removed node can receive the membership log in a finite time.
Changes:
-
Change: remove config remove_replication, since replication will be removed at once.
-
Refactor: Engine outputs Command::UpdateReplicationStream to inform the Runtime to update replication, when membership changes.
-
Refactor: remove ReplicationState.failures, replication does not need to count failures to remove it.
-
Refactor: remove ReplicationState.matched: the matched log id has been tracked by Engine.state.leader.progress.
-
Fix: #446
-
Changed: 2d1aff03 error InProgress: add field committed; by 张炎泼; 2022-07-15
- Refactor: Simplify Engine command executor
-
Changed: 8c7f0857 remove ClientWriteRequest; by 张炎泼; 2022-08-01
Remove struct ClientWriteRequest. ClientWriteRequest is barely a wrapper that does not provide any additional function.
Raft::client_write(ClientWriteRequest) is changed to Raft::client_write(app_data: D), where D is application defined AppData implementation.
-
Changed: 565b6921 ErrorSubject::Snapshot(SnapshotSignature); by 张炎泼; 2022-08-02
Change ErrorSubject::Snapshot(SnapshotMeta) to ErrorSubject::Snapshot(SnapshotSignature).
SnapshotSignature is the same as SnapshotMeta except it does not include Membership information. This way errors do not have to depend on type Node, which is used in Membership and is an application-specific type.
Then when a user-defined generic type NodeData is introduced, error types do not need to change.
- Part of: #480
-
Changed: e4b705ca Turn Node into a trait (#480); by Heinz N. Gies; 2022-08-03
Structs that depend on Node now have to implement trait Node, or use a predefined basic implementation BasicNode. E.g., struct Membership now has two type parameters: impl<NID, N> Membership<NID, N> where N: Node, NID: NodeId.
-
Changed: c836355a Membership.nodes remove Option from value; by 张炎泼; 2022-08-04
Before this commit, the value of Membership.nodes is Option<N: Node>: Membership.nodes: BTreeMap<NID, Option<N>>
The value does not have to be an Option. If an application does not need openraft to store the Node data, it can just implement trait Node with an empty struct, or just use BasicNode as a placeholder.
- Using Option<N> as the value is a legacy and since #480 is merged, we do not need the Option any more.
-
Changed: 70e3318a SnapshotMeta.last_log_id from LogId to Option of LogId; by 张炎泼; 2022-08-17
SnapshotMeta.last_log_id should be the same type as StateMachine.last_applied.
By making SnapshotMeta.last_log_id an Option of LogId, a snapshot can be built on an empty state machine (in which last_applied is None).
-
Changed: d0d04b28 only purge logs that are in snapshot; by 张炎泼; 2022-08-28
Let snapshot+logs be a complete state of a raft node.
The assumption before was that state_machine+logs is a complete state of a raft node. This requires the state machine to persist the state every time a log is applied, which would be an unnecessary overhead.
-
Change: remove ENV config entries. Do not let a lib be affected by environment variables.
-
Change: remove Config.keep_unsnapshoted_log: now by default, logs not included in snapshot won't be deleted.
Rename Config.max_applied_log_to_keep to max_in_snapshot_log_to_keep.
-
-
Changed: 3111e7e6 RaftStorage::install_snapshot() does not need to return state changes; by 张炎泼; 2022-08-28
The caller of RaftStorage::install_snapshot() knows what changes have been made, so the return value is unnecessary.
-
Changed: a12fd8e4 remove error MissingNodeInfo; by 张炎泼; 2022-11-02
Because in a membership the type Node is not an Option any more, MissingNodeInfo error will never occur.
-
Changed: dbeae332 rename IntoOptionNodes to IntoNodes; by 张炎泼; 2022-11-02
-
Changed: e8ec9c50 EffectiveMembership::get_node() should return an Option; by 张炎泼; 2022-11-02
EffectiveMembership::get_node() should return an Option<&Node> instead of a &Node. Otherwise it panics if the node is not found.
-
Changed: 93116312 remove error NodeNotFound; by 张炎泼; 2022-12-28
A node is stored in Membership thus it should always be found. Otherwise it is a bug of openraft. In either case, there is no need for an application to deal with RPCError::NodeNotFound error.
An application that needs such an error should define it as an application error.
-
Migration guide: if you have been using it, you could just replace NodeNotFound with NetworkError.
-
Fix: #623
-
-
Changed: e1238428 RaftState: add field snapshot_meta; by 张炎泼; 2022-12-30
Snapshot meta should be part of the RaftState. Move it from Engine to RaftState
-
Changed: 2dd81018 make Raft::new() async and let it return error during startup; by 张炎泼; 2023-01-02
-
Change: move startup process from RaftCore::do_main() to Raft::new(), so that an error during startup can be returned earlier.
Upgrade guide: application has to consume the returned future with Raft::new().await, and the error returned by the future.
-
Refactor: move id from Engine.id to Engine.config.id, so that accessing constant attribute does not depend on a reference to Engine.
-
-
Changed: 3d5e0016 A restarted leader should enter leader state at once, without another round of election; by 张炎泼; 2023-01-04
-
Test: single-node restart test does not expect the node to run election any more.
-
Refactor: add VoteHandler to handle vote related operations.
-
Change: make ServerState default value Learner.
-
Fix: #607
-
-
Changed: 77e87a39 remove InitializeError::NotAMembershipEntry error; by 张炎泼; 2023-02-12
Such an error can only be caused by internal calls. An application does not need to handle it.
-
Changed: fbb3f211 add RaftError as API return error type.; by 张炎泼; 2023-02-12
Add RaftError<E> as error type returned by every Raft::xxx() API. RaftError has two variants: Fatal error or API specific error. This way every API error such as AppendEntriesError does not have to include a Fatal in it.
Upgrade tip:
The affected type is mainly trait RaftNetwork; an application should replace AppendEntriesError, VoteError, InstallSnapshotError with RaftError<_>, RaftError<_>, and RaftError<_, InstallSnapshotError>.
The same applies to other parts, e.g., Raft::append_entries() now returns Result<AppendEntriesResponse, RaftError<_>>; an application should also rewrite error handling that calls these APIs.
See changes in examples/.
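A minimal sketch of the two-variant shape described in this entry, and of how calling code might match on it. The variant names `APIError` and `Fatal`, the `Fatal` enum contents, `ForwardToLeader` as the example API error, and `next_action()` are all assumptions made for illustration; check the crate documentation for the actual definitions.

```rust
/// Hypothetical stand-in for an unrecoverable error.
#[derive(Debug, PartialEq)]
enum Fatal {
    Stopped,
    Panicked,
}

/// Sketch of an error type with two variants: an API specific error or a fatal error.
/// Variant names are assumed for this example.
#[derive(Debug, PartialEq)]
enum RaftError<E> {
    APIError(E),
    Fatal(Fatal),
}

/// Hypothetical API specific error used by this example.
#[derive(Debug, PartialEq)]
struct ForwardToLeader {
    leader_id: Option<u64>,
}

/// Decide what a caller does next by matching on the two variants.
fn next_action(res: Result<u64, RaftError<ForwardToLeader>>) -> String {
    match res {
        Ok(index) => format!("applied at {}", index),
        Err(RaftError::APIError(ForwardToLeader { leader_id: Some(id) })) => {
            format!("retry on node {}", id)
        }
        Err(RaftError::APIError(ForwardToLeader { leader_id: None })) => "retry later".to_string(),
        // A fatal error is handled once here, instead of in every API error type.
        Err(RaftError::Fatal(f)) => format!("give up: {:?}", f),
    }
}
```

Because the fatal case lives in the wrapper, each API specific error type only has to describe failures of its own API.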
-
Changed: d1b3b232 remove RaftNetworkFactory::ConnectionError and AddLearnerError::NetworkError; by 张炎泼; 2023-02-12
RaftNetworkFactory::new_client() does not return an error because openraft can only ignore it. Therefore it should not create a connection but rather a client that will connect when required. Thus there is a chance it will build a client that is unable to send out anything, e.g., in case the Node network address is configured incorrectly.
Because of the above change, AddLearnerError will not include a NetworkError any more, because when adding a learner, the connectivity can not be effectively detected.
Upgrade tip:
Just update the application network implementation so that it compiles.
-
Changed: 0161a3d2 remove AddLearnerResponse and AddLearnerError; by 张炎泼; 2023-02-17
In openraft, adding a learner is done by committing a membership config log, which is almost the same as committing any log.
AddLearnerResponse contains a field matched to indicate the replication state to the learner, which is not included in ClientWriteResponse. This information can be retrieved via Raft::metrics().
Therefore to keep the API simple, replace AddLearnerResponse with ClientWriteResponse.
Behavior change: adding a learner always commits a new membership config log, no matter if it already exists in membership. To avoid duplicated add, an application should check existence first by examining Raft::metrics()
- Fix: #679
Upgrade tips:
- Replace AddLearnerResponse with ClientWriteResponse
- Replace AddLearnerError with ClientWriteError
Passes the application compilation.
See the changes in examples/.
-
Changed: 9906d6e9 remove non-blocking membership change; by 张炎泼; 2023-02-18
When changing membership in nonblocking mode, the leader submits a membership config log but does not wait for the log to be committed.
This is useless because the caller has to assert the log is committed, by periodically querying the metrics of a raft node, until it is finally committed. Which actually makes it a blocking routine.
API changes:
- Removes allow_lagging parameter from Raft::change_membership()
- Removes error LearnerIsLagging
Upgrade tip:
Adjust API calls to make it compile.
Refactor: move leader_append_entries() to LeaderHandler.
Changed: f591726a trait IntoNodes adds two new method has_nodes() and node_ids(); by 张炎泼; 2023-02-19
trait IntoNodes converts types T such as Vec or BTreeSet into BTreeMap<NID, Node>.
This patch changes the functionality of the IntoNodes trait to provide two new methods has_nodes() and node_ids(), in addition to the existing into_nodes() method. The has_nodes() method returns true if the type T contains any Node objects, and node_ids() returns a Vec of the NodeId objects associated with the Node objects in T.
Refactor:
The patch also refactors the Membership::next_safe() method to return an Err(LearnerNotFound) if it attempts to build a Membership object containing a voter_id that does not correspond to any Node.
-
Changed: 55217aa4 move default implemented method from trait RaftLogReader to StorageHelper; by 张炎泼; 2023-02-21
Function get_log_entries() and try_get_log_entry() are provided by trait RaftLogReader with default implementations. However, they do not need to be part of this trait and an application does not have to implement them.
Therefore in this patch they are moved to StorageHelper struct, which provides additional storage access methods that are built based on the RaftStorage trait.
-
Changed: 0a1dd3d6 replace EffectiveMembership with StoredMembership in RaftStorage; by 张炎泼; 2023-02-26
EffectiveMembership is a struct used at runtime, which contains additional information such as an optimized QuorumSet implementation that has different structure from a Membership.
To better separate concerns, a new struct called StoredMembership has been introduced specifically for storage purpose. It contains only the information that needs to be stored in storage. Therefore, StoredMembership is used instead of EffectiveMembership in RaftStorage.
Upgrade tip:
Replace EffectiveMembership with StoredMembership in an application.
Fields in EffectiveMembership are made private and can be accessed via corresponding methods such as: EffectiveMembership.log_id and EffectiveMembership.membership should be replaced with EffectiveMembership::log_id() and EffectiveMembership::membership().
-
Added: 966eb287 use a version to track metrics change; by 张炎泼; 2022-03-03
Add Versioned<D> to track changes of an Arc<D>.
In openraft, some frequently updated objects such as metrics are wrapped in an Arc, and some modifications are made in place: by storing an AtomicU64.
-
Added: 80f89134 Add support for external requests to be executed inside of Raft core loop; by Ivan Schréter; 2022-03-05
The new feature is also exposed via RaftRouter test fixture and tested in the initialization test (in addition to the original checks).
-
Added: 2a5c1b9e add feature-flag: bt enables backtrace; by 张炎泼; 2022-03-12
-
Added: 16406aec add error NotAllowed; by 张炎泼; 2022-04-06
-
Added: a8655446 InitialState: add last_purged_log_id; by 张炎泼; 2022-04-07
-
Added: 6f20e1fc add trait RaftPayload RaftEntry to access payload and entry without the need to know about user data, i.e., AppData or AppDataResponse.; by 张炎泼; 2022-04-07
-
Added: 67c870e2 add err: NotAMembershipEntry, NotInMembers; by 张炎泼; 2022-04-16
-
Added: 675a0f8f Engine stores log ids; by 张炎泼; 2022-04-16
-
Added: 2262c79f LogIdList: add method purge() to delete log ids; by 张炎泼; 2022-05-08
-
Added: ff898cde Engine: add method: purge_log(); by 张炎泼; 2022-05-08
-
Added: 4d0918f2 Add rocks based example; by Heinz N. Gies; 2022-07-05
-
Added: 86eb2981 Raft::enable_tick() to enable or disable election timeout; by 张炎泼; 2022-07-31
-
Added: 956177df use blank log for heartbeat (#483); by 张炎泼; 2022-08-01
- Feature: use blank log for heartbeat
Heartbeat in standard raft is the way for a leader to assert it is still alive.
- A leader sends a heartbeat at a regular interval.
- A follower that receives a heartbeat believes there is an active leader thus it rejects election request (send_vote) from another node unreachable to the leader, for a short period.
Openraft heartbeat is a blank log
Such a heartbeat mechanism depends on clock time. But raft as a distributed consensus already has its own pseudo time defined very well. The pseudo time in openraft is a tuple (vote, last_log_id), compared in dictionary order.
Why it works
To refuse the election by a node that does not receive recent messages from the current leader, just let the active leader send a blank log to increase the pseudo time on a quorum.
Because the leader must have the greatest pseudo time, by comparing the pseudo time, a follower automatically refuses election requests from a node unreachable to the leader.
And comparing the pseudo time is already done by handle_vote_request(), there is no need to add another timer for the active leader.
Other changes:
-
Feature: add API to switch timeout based events:
- Raft::enable_tick(): switch on/off election and heartbeat.
- Raft::enable_heartbeat(): switch on/off heartbeat.
- Raft::enable_elect(): switch on/off election.
These methods make some testing codes easier to write. The corresponding Config entries are also added:
- Config::enable_tick
- Config::enable_heartbeat
- Config::enable_elect
-
Refactor: remove Engine Command::RejectElection. Rejecting election now is part of handle_vote_req() as blank-log heartbeat is introduced.
-
Refactor: heartbeat is removed from ReplicationCore. Instead, heartbeat is emitted by RaftCore.
-
Fix: when failing to send append-entries, do not clear need_to_replicate flag.
-
CI: add test with higher network delay.
-
Doc: explain why using blank log as heartbeat.
-
Fix: #151
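The effect of the blank-log heartbeat in this entry can be sketched with the log part of the pseudo time only. The example assumes the standard Raft rule that a candidate's log must be at least as up to date as the follower's; `LogId` is a simplified local type and `candidate_log_is_up_to_date()` is a hypothetical helper, not the body of handle_vote_request().

```rust
/// Simplified log id. Deriving `Ord` on (term, index) in this field order
/// gives the dictionary-order comparison.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

/// Hypothetical check: a follower only considers a candidate whose last log id
/// is not behind its own. `None` (an empty log) sorts before any `Some`.
fn candidate_log_is_up_to_date(candidate_last: Option<LogId>, local_last: Option<LogId>) -> bool {
    candidate_last >= local_last
}
```

Once a follower has stored the blank heartbeat log, its last log id has advanced, so a node that missed that log (because it cannot reach the leader) fails this check without any extra timer.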
-
Added: b6817758 add Raft::trigger_elect() and Raft::trigger_heartbeat() to let user manually trigger an election or send a heartbeat log; by 张炎泼; 2022-08-06
-
Added: f437cda0 add Raft::trigger_snapshot() to manually trigger to build snapshot at once; by 张炎泼; 2022-08-07
-
Added: eae08515 Added sled store example based on rocks example; by kus; 2022-08-16
-
Added: 07a2a677 adding a snapshot finalize timeout config; by Zach Schoenberger; 2022-11-09
-
Added: 2877be0c add config: send_snapshot_timeout; by Zach Schoenberger; 2022-11-09
-
Added: 541e9d36 add "Inflight" to store info about inflight replication data; by 张炎泼; 2023-01-17
-
Added: 4a85ee93 feature flag "single-term-leader": standard raft mode; by 张炎泼; 2023-02-13
With this feature on: only one leader can be elected in each term, but reduce LogId size from LogId:{term, node_id, index} to LogId{term, index}.
Add CommittedLeaderId as the leader-id type used in LogId: The leader-id used in LogId can be different (smaller) from leader-id used in Vote, depending on LeaderId definition. CommittedLeaderId is the smallest data that can identify a leader after the leadership is granted by a quorum (committed).
Change: Vote stores a LeaderId in it.
    // Before
    pub struct Vote<NID> {
        pub term: u64,
        pub node_id: NID,
        pub committed: bool,
    }
    // After
    pub struct Vote<NID> {
        #[cfg_attr(feature = "serde", serde(flatten))]
        pub leader_id: LeaderId<NID>,
        pub committed: bool,
    }
Upgrade tip:
If you manually serialize Vote, i.e. without using serde, the serialization part should be rewritten.
Otherwise, nothing needs to be done.
- Fix: #660
-
Added: 4f4b05f6 add v07-v08 compatible store rocksstore-compat07; by 张炎泼; 2023-02-25
-
Changed: 1bd22edc remove AddLearnerError::Exists, which is not actually used; by 张炎泼; 2022-09-30
-
Changed: c6fe29d4 change-membership does not return error when replication lags; by 张炎泼; 2022-10-22
If blocking is true, Raft::change_membership(..., blocking) will block until replication to new nodes becomes up to date. But it won't return an error when proposing change-membership log.
-
Change: remove two errors: LearnerIsLagging and LearnerNotFound.
-
Fix: #581
-
-
Fixed: 2896b98e changing membership should not remove replication to all learners; by 张炎泼; 2022-09-30
When changing membership, replications to the learners(non-voters) that are not added as voter should be kept.
E.g.: with a cluster of voters {0} and learners {1, 2, 3}, changing membership to {0, 1, 2} should not remove replication to node 3.
Only replications to removed members should be removed.
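The rule in this entry can be sketched with plain sets. Both functions are hypothetical helpers written for this example, not openraft APIs: replication is removed only for voters that are dropped, while learners that are not promoted stay learners and keep their replication.

```rust
use std::collections::BTreeSet;

/// Replications to remove: old voters that are absent from the new voter set.
fn replication_to_remove(old_voters: &BTreeSet<u64>, new_voters: &BTreeSet<u64>) -> BTreeSet<u64> {
    old_voters.difference(new_voters).copied().collect()
}

/// Learners that remain after the change: old learners that were not promoted to voter.
fn learners_after_change(old_learners: &BTreeSet<u64>, new_voters: &BTreeSet<u64>) -> BTreeSet<u64> {
    old_learners.difference(new_voters).copied().collect()
}
```

For voters {0}, learners {1, 2, 3} and new voters {0, 1, 2}, nothing is removed and node 3 remains a learner that still receives replication.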
- Added: 9a22bb03 add rocks-store as a RaftStorage implementation based on rocks-db; by 张炎泼; 2023-02-22
-
Changed: 25e94c36 InstallSnapshotResponse: replies the last applied log id; Do not install a smaller snapshot; by 张炎泼; 2022-09-22
A snapshot may not be installed by a follower if it already has a higher last_applied log id locally. In such a case, it just ignores the snapshot and responds with its local last_applied log id.
This way the applied state (i.e., last_applied) will never revert back.
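A minimal sketch of this decision, using simplified local types. `LogId`, `SnapshotDecision` and `decide()` are hypothetical names for illustration; treating an equal log id as installable is an assumption of the sketch, since the entry only rules out a smaller snapshot.

```rust
/// Simplified log id, ordered by (term, index).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

/// What a follower does with an incoming snapshot (hypothetical type).
#[derive(Debug, PartialEq)]
enum SnapshotDecision {
    /// Install the snapshot.
    Install,
    /// Ignore it and reply with the local last applied log id.
    Ignore { last_applied: Option<LogId> },
}

/// Ignore a snapshot that is smaller than what has already been applied locally,
/// so that the applied state never moves backwards.
fn decide(local_last_applied: Option<LogId>, snapshot_last_log_id: Option<LogId>) -> SnapshotDecision {
    if snapshot_last_log_id < local_last_applied {
        SnapshotDecision::Ignore { last_applied: local_last_applied }
    } else {
        SnapshotDecision::Install
    }
}
```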
-
Fixed: 21684bbd potential inconsistency when installing snapshot; by 张炎泼; 2022-09-22
The conflicting logs that are before snapshot_meta.last_log_id should be deleted before installing a snapshot.
Otherwise there is a chance the snapshot is installed but conflicting logs are left in the store, when a node crashes.
- Added: 568ca470 add Raft::remove_learner(); by 张炎泼; 2022-09-02
-
Added: ea696474 add feature-flag: bt enables backtrace; by 张炎泼; 2022-03-12
--features bt enables backtrace when generating errors. By default errors do not contain backtrace info.
Thus openraft can be built on stable rust by default.
To use on stable rust with backtrace, set RUSTC_BOOTSTRAP=1, e.g.:
RUSTUP_TOOLCHAIN=stable RUSTC_BOOTSTRAP=1 make test
- Changed: f99ade30 API: move default impl methods in RaftStorage to StorageHelper; by 张炎泼; 2022-07-04
-
Fixed: 44381b0c when handling append-entries, if prev_log_id is purged, it should not delete any logs.; by 张炎泼; 2022-08-14
When handling append-entries, if the local log at prev_log_id.index is purged, a follower should not believe it is a conflict and should not delete all logs. Otherwise committed logs will be lost.
To fix this issue, use last_applied instead of committed: last_applied is always the committed log id, while committed is not persisted and may be smaller than the actually applied, when a follower is restarted.
-
Fixed: 30058c03 #424 wrong range when searching for membership entries: [end-step, end).; by 张炎泼; 2022-07-03
The iterating range searching for membership log entries should be [end-step, end), not [start, end). With this bug it will return duplicated membership entries.
- Bug: #424
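The corrected iteration can be sketched as a backward scan in non-overlapping chunks. `backward_ranges()` is a hypothetical helper for illustration, not the function changed by this commit; it only shows why [end-step, end) visits each log index once, whereas re-reading [start, end) on every step would visit early entries repeatedly.

```rust
/// Split `[start, end)` into chunks of at most `step` entries,
/// produced from the end of the log backwards. Each chunk is `[end-step, end)`,
/// clamped to `start`, so no index is visited twice.
fn backward_ranges(start: u64, end: u64, step: u64) -> Vec<(u64, u64)> {
    assert!(step > 0, "step must be positive");
    let mut ranges = Vec::new();
    let mut end = end;
    while end > start {
        let chunk_start = end.saturating_sub(step).max(start);
        ranges.push((chunk_start, end));
        end = chunk_start;
    }
    ranges
}
```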
-
Fixed: d836d85c if there may be more logs to replicate, continue to call send_append_entries in next loop, no need to wait heartbeat tick; by lichuang; 2022-01-04
-
Fixed: 5a026674 defensive_no_dirty_log hangs tests; by YangKian; 2022-01-08
-
Fixed: 8651625e save leader_id if a higher term is seen when handling append-entries RPC; by 张炎泼; 2022-01-10
Problem:
A follower saves hard state (term=msg.term, voted_for=None) when msg.term > local.term when handling append-entries RPC.
This is quite enough to be correct but not perfect. Correct because:
-
In one term, only an established leader will send append-entries;
-
Thus, there is a quorum voted for this leader;
-
Thus, no matter what voted_for is saved, it is still correct. E.g. when handling append-entries, a follower node could save hard state (term=msg.term, voted_for=Some(ANY_VALUE)).
The problem is that a follower already knows the legal leader for a term but still does not save it. This leads to an unstable cluster state: The test sometimes fails.
Solution:
A follower always saves hard state with the id of a known legal leader.
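A small sketch of this solution, with a simplified `HardState` and a hypothetical `on_append_entries()` helper (neither is the project's actual code): when a higher term arrives in an append-entries message, the follower records the sender as the leader it voted for instead of saving None.

```rust
/// Simplified persistent state of a node.
#[derive(Debug, Clone, PartialEq)]
struct HardState {
    term: u64,
    voted_for: Option<u64>,
}

/// Return the hard state to persist when an append-entries message arrives,
/// or `None` if nothing needs to be saved.
/// Hypothetical helper: only an established leader sends append-entries,
/// so its id is saved together with the new term.
fn on_append_entries(local: &HardState, msg_term: u64, leader_id: u64) -> Option<HardState> {
    if msg_term > local.term {
        Some(HardState { term: msg_term, voted_for: Some(leader_id) })
    } else {
        None
    }
}
```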
-
-
Fixed: 1a781e1b when lack entry, the snapshot to build has to include at least all purged logs; by 张炎泼; 2022-01-18
-
Fixed: a0a94af7 span.enter() in async loop causes memory leak; by 张炎泼; 2022-06-17
It is explained in: https://onesignal.com/blog/solving-memory-leaks-in-rust/
-
Changed: c9c8d898 trait RaftStore: remove get_membership_config(), add last_membership_in_log() and get_membership() with default impl; by drdr xp; 2022-01-04
Goal: minimize the work for users to implement a correct raft application.
Now RaftStorage provides default implementations for get_membership() and last_membership_in_log().
These two methods can be implemented with other basic user impl methods.
- fix: #59
-
Changed: abda0d10 rename RaftStorage methods do_log_compaction: build_snapshot, delete_logs_from: delete_log; by 张炎泼; 2022-01-15
-
Changed: a52a9300 RaftStorage::get_log_state() returns last purge log id; by 张炎泼; 2022-01-16
-
Change: get_log_state() returns the last_purged_log_id instead of the first_log_id. Because there are some cases in which logs are empty: when a snapshot is installed that covers all logs, or when max_applied_log_to_keep is 0.
Returning None is not clear about if there are no logs at all or all logs are deleted.
In such cases, raft still needs to maintain log continuity when replicating. Thus the last log id that once existed is important. Previously this is done by checking the last_applied_log_id, which is dirty and buggy.
Now an implementation of RaftStorage has to maintain the last_purged_log_id in its store.
-
Change: Remove first_id_in_log(), last_log_id(), first_known_log_id(), because concepts are changed.
-
Change: Split delete_logs() into two methods for clarity:
delete_conflict_logs_since() for deleting conflict logs when the replication receiving end finds a conflict log.
purge_logs_upto() for cleaning applied logs
-
Change: Rename finalize_snapshot_installation() to install_snapshot().
-
Refactor: Remove initial_replicate_to_state_machine(), which does nothing more than a normal applying-logs.
-
Refactor: Remove enum UpdateCurrentLeader. It is just a wrapper of Option.
-
-
Changed: 7424c968 remove unused error MembershipError::Incompatible; by 张炎泼; 2022-01-17
-
Changed: beeae721 add ChangeMembershipError sub error for reuse; by 张炎泼; 2022-01-17
-
Fixed: 4d58a51e a non-voter not in joint config should not block replication; by drdr xp; 2021-08-31
-
Fixed: eed681d5 race condition of concurrent snapshot-install and apply.; by drdr xp; 2021-09-01
Problem:
Concurrent snapshot-install and apply mess up last_applied.
finalize_snapshot_installation runs in the RaftCore thread. apply_to_state_machine runs in a separate tokio task (thread).
Thus there is a chance of last_applied being reset to a previous value:
-
apply_to_state_machine is called and finished in a thread.
-
finalize_snapshot_installation is called in RaftCore thread and finished with last_applied updated.
-
RaftCore thread finished waiting for apply_to_state_machine, and updated last_applied to a previous value.
    RaftCore: -.    install-snapshot,         .-> replicate_to_sm_handle.next(),
               |    update last_applied=5     |   update last_applied=2
               |                              |
               v                              |
    task:      apply 2------------------------'
    --------------------------------------------------------------------> time
Solution:
Rule: All changes to state machine must be serialized.
A temporary simple solution for now is to call all methods that modify state machine in RaftCore thread. But this way it blocks RaftCore thread.
A better way is to move all tasks that modify state machine to a standalone thread, and send update request back to RaftCore to update its fields such as last_applied
-
Fixed: a48a3282 handle-vote should compare last_log_id in dictionary order, not in vector order; by drdr xp; 2021-09-09
A log {term:2, index:1} is definitely greater than log {term:1, index:2} in raft spec. Comparing log id in the way of term1 >= term2 && index1 >= index2 blocks election: no one can become a leader.
-
Fixed: 228077a6 a restarted follower should not wait too long to elect. Otherwise the entire cluster hangs; by drdr xp; 2021-11-19
-
Fixed: 6c0ccaf3 consider joint config when starting up and committing.; by drdr xp; 2021-12-24
-
Change: MembershipConfig support more than 2 configs
-
Makes fields in MembershipConfig private. Provides methods to manipulate membership.
-
Fix: commit without replication only when membership contains only one node. Previously it just checks the first config, which results in data loss if the cluster is in a joint config.
-
Fix: when starting up, count all nodes but not only the nodes in the first config to decide if it is a single node cluster.
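The single-node check described in the last two fixes can be sketched as follows. `is_single_node()` is a hypothetical helper for illustration; it counts the distinct node ids across all configs of a (possibly joint) membership rather than looking at the first config only.

```rust
use std::collections::BTreeSet;

/// Return true only if the union of all configs contains exactly one node.
/// Checking just the first config would wrongly report a joint config
/// such as [{1}, {1, 2, 3}] as a single-node cluster.
fn is_single_node(configs: &[BTreeSet<u64>]) -> bool {
    let all_nodes: BTreeSet<u64> = configs.iter().flatten().copied().collect();
    all_nodes.len() == 1
}
```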
-
-
Fixed: b390356f first_known_log_id() should return the min one in log or in state machine; by drdr xp; 2021-12-28
-
Fixed: cd5a570d clippy warning; by lichuang; 2022-01-02
-
Changed: deda6d76 remove PurgedMarker. keep logs clean; by drdr xp; 2021-09-09
Changing log (adding a PurgedMarker, originally SnapshotPointer) makes it difficult to impl install-snapshot for a RaftStore without a lock protecting both logs and state machine.
Adding a PurgedMarker and installing the snapshot has to be atomic in storage layer. But usually logs and state machine are separated stores, e.g., logs are stored on fast flash disk and state machine is stored somewhere else.
To get rid of the big lock, PurgedMarker is removed and installing a snapshot does not need to keep consistent with logs any more.
-
Changed: 734eec69 VoteRequest: use last_log_id:LogId to replace last_log_term and last_log_index; by drdr xp; 2021-09-09
-
Changed: 74b16524 introduce StorageError. RaftStorage gets rid of anyhow::Error; by drdr xp; 2021-09-13
`StorageError` is an `enum` of DefensiveError and StorageIOError. An error a RaftStorage impl returns could be a defensive check error or an actual io operation error.
Why:
anyhow::Error is not enough to support the flow control in RaftCore. It is typeless, thus RaftCore can not decide what to do next depending on the returned error.
Inside raft, anyhow::Error should never be used, although it could be used as the `source()` of some other error types.
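To illustrate why a typed error helps flow control, here is a rough sketch. The variant payloads, the `Reaction` enum and the `react` function are all hypothetical; the real `StorageError`, `DefensiveError` and `StorageIOError` types are richer, and how RaftCore actually reacts to each is not stated here.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Simplified stand-in for a typed storage error.
#[derive(Debug)]
enum StorageError {
    /// A defensive check failed (hypothetical payload: a message).
    Defensive(String),
    /// An actual io operation failed; the cause is kept as `source()`.
    Io(io::Error),
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StorageError::Defensive(msg) => write!(f, "defensive check failed: {}", msg),
            StorageError::Io(err) => write!(f, "storage io failed: {}", err),
        }
    }
}

impl Error for StorageError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            StorageError::Defensive(_) => None,
            StorageError::Io(err) => Some(err),
        }
    }
}

/// Hypothetical reactions, only to show that the caller can branch on the type.
#[derive(Debug, PartialEq)]
enum Reaction {
    ReportBug,
    Shutdown,
}

fn react(err: &StorageError) -> Reaction {
    match err {
        StorageError::Defensive(_) => Reaction::ReportBug,
        StorageError::Io(_) => Reaction::Shutdown,
    }
}
```

With an opaque error the `match` in `react` would not be possible; the caller could only log the error and guess.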
Changed: 46bb3b1c `RaftStorage::finalize_snapshot_installation` is no more responsible to delete logs included in snapshot; by drdr xp; 2021-09-13
A RaftStorage should be as simple and intuitive as possible.
One should be able to correctly impl a RaftStorage without reading the guide, just by guessing what a trait method should do.
RaftCore is able to do the job of deleting logs that are included in the state machine; RaftStorage should just do what is asked.
-
Changed: 2cd23a37 use structopt to impl config default values; by drdr xp; 2021-09-14
-
Changed: ac4bf4bd InitialState: rename last_applied_log to last_applied; by drdr xp; 2021-09-14
-
Changed: 74283fda RaftStorage::do_log_compaction() do not need to delete logs any more raft-core will delete them.; by drdr xp; 2021-09-14
-
Changed: 112252b5 RaftStorage add 2 API: last_id_in_log() and last_applied_state(), remove get_last_log_id(); by drdr xp; 2021-09-15
-
Changed: 7f347934 simplify membership change; by drdr xp; 2021-09-16
- Change: if leadership is lost, the cluster is left with the joint config. A caller that does not receive a response to the change-membership request should always re-send it to ensure the membership config is applied.
- Change: remove joint-uniform logic from RaftCore, which brings a lot of complexity to the raft impl. This logic is now done in Raft (which is a shell to control RaftCore).
- Change: RaftCore.membership is changed to `ActiveMembership`, which includes a log id and a membership config. This change lets raft check whether a membership is committed by comparing the log index and its committed index.
- Change: when adding an existent non-voter, it returns an `Ok` value instead of an `Err`.
- Change: add arg `blocking` to `add_non_voter` and `change_membership`. A blocking `change_membership` still waits for the two config change logs to commit. `blocking` only indicates whether to wait for replication to the non-voter to be up to date.
- Change: remove `non_voters`. Merge it into `nodes`. Now both voters and non-voters share the same replication handle.
- Change: remove field `ReplicationState.is_ready_to_join`; it can just be calculated when needed.
- Change: remove `is_stepping_down`; `membership.contains()` is quite enough.
- Change: remove `consensus_state`.
-
-
Changed: df684131 bsearch to find matching log between leader and follower; by drdr xp; 2021-12-17
- Refactor: simplify algo to find matching log between leader and follower. It adopts a binary-search like algo: The leader tracks the max matched log id (`self.matched`) and the least unmatched log id (`self.max_possible_matched_index`). The follower just responds whether the `prev_log_id` in `AppendEntriesRequest` matches the log at `prev_log_id.index` in its store. Remove the case-by-case algo.
- Change: RaftStorage adds 3 new APIs: `try_get_log_entries()`, `first_id_in_log()` and `first_known_log_id()`. These are not stable and may be removed soon.
- Fix: the timeout for `Wait()` should be a total timeout. Otherwise a `Wait()` never quits.
- Fix: when sending an append-entries request, if a log is not found, it should retry loading, but not enter snapshot state, because a log may be deleted by RaftCore just after Replication reads `prev_log_id` from the store.
- Refactor: The two replication loops, the line-rate loop and the snapshot loop, should not change the `ReplicationState`, but instead return an error. Otherwise the state has to be checked everywhere.
- Refactor: simplify receiving RaftCore messages: split `drain_raft_rx()` into `process_raft_event()` and `try_drain_raft_rx()`.
- Feature: a store impl has to add an initial log at index 0 to make the store mathematics complete.
- Feature: add `ReplicationError` to describe all errors that are emitted when replicating entries or snapshot.
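The binary-search-like matching from the first item of this entry can be sketched as below. This is a simplified model, not the crate's code: each log is a slice of terms where position `i` is the term at index `i`, `follower_matches` stands in for the follower's reply, and index 0 is assumed to always match (the initial log at index 0).

```rust
/// Follower side: does the entry at `index` carry `term`?
fn follower_matches(follower: &[u64], index: usize, term: u64) -> bool {
    follower.get(index) == Some(&term)
}

/// Leader side: returns the highest index at which both logs agree.
/// Assumes `leader` is non-empty and that index 0 matches.
fn find_matched(leader: &[u64], follower: &[u64]) -> usize {
    let mut matched = 0usize; // largest index known to match
    let mut max_possible = leader.len() - 1; // largest index that may still match
    while matched < max_possible {
        // Probe the upper middle so the range shrinks on every round.
        let mid = matched + (max_possible - matched + 1) / 2;
        if follower_matches(follower, mid, leader[mid]) {
            matched = mid;
        } else {
            max_possible = mid - 1;
        }
    }
    matched
}
```

Each probe corresponds to one append-entries round trip, so the number of round trips grows with the logarithm of the log length rather than with the number of conflicting entries.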
-
-
Changed: 6625484c remove EntryNormal; by drdr xp; 2021-12-23
-
Changed: 61551178 remove EntryMembership; by drdr xp; 2021-12-23
-
Changed: c61b4c49 remove ConflictOpt, which is a wrapper of log_id; add matched log id in AppendEntriesResponse; by drdr xp; 2021-12-23
-
Changed: 3511e439 rename MembershipConfig to Membership; by drdr xp; 2021-12-27
-
Changed: b43c085a track committed log id instead of just a commit index; by drdr xp; 2021-12-29
-
Changed: 8506102f remove unused field SnapshotMeta::membership; by drdr xp; 2021-12-29
-
Dependency: 7848c219 update pretty_assertions requirement from 0.7.2 to 1.0.0; by dependabot[bot]; 2021-09-28
Updates the requirements on pretty_assertions to permit the latest version.
updated-dependencies:
- dependency-name: pretty_assertions dependency-type: direct:production ...
Signed-off-by: dependabot[bot] support@github.com
-
Dependency: cd080192 update tracing-subscriber requirement from 0.2.10 to 0.3.3; by dependabot[bot]; 2021-11-30
Updates the requirements on tracing-subscriber to permit the latest version.
updated-dependencies:
- dependency-name: tracing-subscriber dependency-type: direct:production ...
Signed-off-by: dependabot[bot] support@github.com
-
Added: 1451f962 Membership provides method is_majority() and simplify quorum calculation for voting; by drdr xp; 2021-12-25
-
Added: a2a48c56 make DefensiveCheck a reuseable trait; by drdr xp; 2021-12-26
- Defensive checks in `MemStore` are moved out into a trait `DefensiveCheck`.
- Let user impl a base `RaftStorage`. Then raft wraps it with a `StoreExt`, thus the defensive checks apply to every impl of `RaftStorage`.
-
-
Changed: 79a39970 to get last_log and membership, Storage should search for both logs and state machines.; by drdr xp; 2021-08-24
Why:
depending on the impl, a RaftStore may have logs that are included in the state machine still present. This may be caused by a non-transactional impl of the store, e.g. installing snapshot and removing logs are not atomic.
Thus when searching for last_log or last membership, a RaftStore should search both logs and state machine, and return the greater one that is found.
- Test: add tests to prove these behaviors, which include `get_initial_state()` and `get_membership()`.
- Refactor: Make store tests a suite that could be applied to other impls.
-
-
Changed: 07d71c67 RaftStore::delete_logs_from() use range instead of (start, end); by drdr xp; 2021-08-28
-
Changed: 1c46a712 RaftStore::get_log_entries use range as arg; add try_get_log_entry() that does not return error even when defensive check is on; by drdr xp; 2021-08-28
-
Added: 420cdd71 add defensive check to MemStore; by drdr xp; 2021-08-28
-
Added: ab6689d9 RaftStore::get_last_log_id() to get the last known log id in log or state machine; by drdr xp; 2021-08-29
-
Fixed: 6d53aa12 too many(50) inconsistent log should not live lock append-entries; by drdr xp; 2021-08-31
- Reproduce the bug that when append-entries, if there are more than 50 inconsistent logs, the responded `conflict` is always set to `self.last_log`, which blocks replication forever. Because the next append-entries uses the same `prev_log_id`, it never actually searches backward for the first consistent log entry.
The test to reproduce it fakes a cluster of nodes 0, 1, 2: R0 has 100 uncommitted logs at term 2. R2 has 100 uncommitted logs at term 3.
R0 ... 2,99 2,100
R1
R2 ... 3,99 3,100
Before this fix, when the cluster is brought up, R2 becomes leader and will never sync any log to R0.
The fix is also quite simple:
- Search backward instead of searching forward, to find the last log entry that matches `prev_log_id.term`, and respond with this log id to the leader to let it send the next `append_entries` RPC starting from this log id.
- If no such matching term is found, use the first log id it sees, e.g., the entry at index `prev_log_id.index - 50`, for the next `append_entries`.
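The backward search can be sketched like this. It is a simplified model under stated assumptions: the follower's log is a non-empty slice of terms indexed by log index, and `LogId`, `MAX_SCAN` and `conflict_hint` are hypothetical names, with 50 taken from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LogId {
    term: u64,
    index: u64,
}

/// How many entries to scan backward at most (the "50" in the text).
const MAX_SCAN: u64 = 50;

/// Follower side: pick the log id the leader should retry from.
/// `log_terms[i]` is the term of the local entry at index `i`.
fn conflict_hint(log_terms: &[u64], prev: LogId) -> LogId {
    let last = (log_terms.len() as u64).saturating_sub(1);
    let start = prev.index.min(last);
    let stop = start.saturating_sub(MAX_SCAN);

    // Walk backward looking for the last entry with the same term.
    let mut i = start;
    loop {
        if log_terms[i as usize] == prev.term {
            return LogId { term: prev.term, index: i };
        }
        if i == stop {
            break;
        }
        i -= 1;
    }
    // No matching term within the window: report the oldest entry scanned,
    // so the next request starts further back instead of repeating itself.
    LogId { term: log_terms[stop as usize], index: stop }
}
```

In the scenario from the test (100 local entries at term 2, leader at term 3), each reply moves the probe back by 50 entries, so replication makes progress instead of looping on the same `prev_log_id`.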
-
-
Fixed: 9540c904 when append-entries, deleting entries after prev-log-id causes committed entry to be lost; by drdr xp; 2021-08-31
Problem:
When append-entries, raft core removes old entries after `prev_log_id.index`, then appends new logs sent from the leader.
Since deleting then appending entries is not atomic (two calls to `RaftStore`), deleting consistent entries may cause loss of committed entries if the server crashes after the delete.
E.g., an example cluster state with logs as follows, where R1 is now the leader:
R1 1,1 1,2 1,3
R2 1,1 1,2
R3
Committed entry `{1,2}` gets lost after the following steps:
- R1 to R2: `append_entries(entries=[{1,2}, {1,3}], prev_log_id={1,1})`
- R2 deletes 1,2
- R2 crashes
- R2 is elected as leader with R3, and only sees 1,1; the committed entry 1,2 is lost.
Solution:
The safe way is to skip every entry that is consistent with the leader, and delete only the inconsistent entries.
Another issue with this solution is that:
Because we can not just delete `log[prev_log_id.index..]`, the commit index:
- must be updated only after append-entries,
- and must point to a log entry that is consistent with the leader.
Otherwise there is a chance of applying an uncommitted entry:
R0 1,1 1,2 3,3
R1 1,1 1,2 2,3
R2 1,1 1,2 3,3
- R0 to R1: `append_entries: entries=[{1,2}], prev_log_id = {1,1}, commit_index = 3`
- R1 accepts this `append_entries` request but is not aware that entry {2,3} is inconsistent with the leader. Updating the commit index to 3 allows it to apply an uncommitted entry `{2,3}`.
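A minimal sketch of "skip consistent entries, delete only inconsistent ones", assuming an in-memory log where position `i` holds the entry with index `i` and incoming entries are contiguous. `LogId` and `append_entries` here are hypothetical helpers, not the crate's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LogId {
    term: u64,
    index: u64,
}

/// Appends `entries` to `log` without ever deleting an entry that already
/// agrees with the leader.
fn append_entries(log: &mut Vec<LogId>, entries: &[LogId]) {
    for (n, entry) in entries.iter().enumerate() {
        let pos = entry.index as usize;
        // Consistent with the leader: keep it and move on.
        if log.get(pos) == Some(entry) {
            continue;
        }
        // First missing or conflicting position: drop only the conflicting
        // tail (a no-op when the log is simply shorter), then append the rest.
        log.truncate(pos);
        log.extend_from_slice(&entries[n..]);
        return;
    }
}
```

In the first example above, R2 already holds `{1,2}`, so the entry is skipped rather than deleted and re-written, and a crash between the two storage calls can no longer lose it.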
-
Changed: a1a05bb4 rename Network methods to send_xxx; by drdr xp; 2021-08-23
-
Changed: f168696b rename RaftStorage::Snapshot to RaftStorage::SnapshotData; by drdr xp; 2021-08-23
-
Changed: fea63b2f rename CurrentSnapshotData to Snapshot; by drdr xp; 2021-08-23
-
Changed: fabf3e74 rename RaftStorage::create_snapshot() to RaftStorage::begin_receiving_snapshot; by drdr xp; 2021-08-23
-
Changed: 90329fbf RaftStorage: merge append_entry_to_log and replicate_to_log into one method append_to_log; by drdr xp; 2021-08-24
-
Changed: daf2ed89 RaftStorage: remove apply_entry_to_state_machine; by drdr xp; 2021-08-24
-
Changed: a18b98f1 SnapshotPointer do not store any info.; by drdr xp; 2021-08-24
Using SnapshotPointer to store membership is a bad idea. It makes consistency hard to prove, e.g.:
- When concurrent `do_log_compaction()` is called (it is not possible for now, but may be possible in the future; a correctness proof involving multiple components is a nightmare).
- Proving the consistency between `StateMachine.last_membership` and `SnapshotPointer.membership` is complicated.
What we need is actually:
- At least one committed log is left in the log storage,
- and info in the purged log must be present in the state machine, e.g. membership
-
-
Changed: 5d0c0b25 rename SnapshotPointer to PurgedMarker; by drdr xp; 2021-08-24
-
Changed: 72b02249 rename replicate_to_state_machine to apply_to_state_machine; by drdr xp; 2021-08-24
-
Fixed: eee8e534 snapshot replication does not need to send a last 0 size chunk; by drdr xp; 2021-08-22
-
Fixed: 8cd24ba0 RaftCore.entries_cache is inconsistent with storage. removed it.; by drdr xp; 2021-08-23
- When leader changes, `entries_cache` is cleared. Thus there may be cached entries that won't be applied to the state machine.
- When applying finishes, the applied entries are not removed from the cache. Thus there could be entries being applied more than once.
-
-
Fixed: 2eccb9e1 install snapshot req with offset GE 0 should not start a new session.; by drdr xp; 2021-08-22
An install-snapshot always ends with a req whose data len is 0 and whose offset is GE 0. If such a req is re-sent, e.g., on timeout, the receiver will try to install a snapshot with empty data if it has just finished the previous install snapshot req (`snapshot_state` is None) and does not reject an install snapshot req with offset GE 0. This results in a `fatal storage error`, since the storage tries to decode empty snapshot data.
- feature: add config `install_snapshot_timeout`.
-
Fixed: beb0302b leader should not commit when there is no replication to voters.; by drdr xp; 2021-08-18
When there is no replication to voters but there are replications to non-voters, the leader did not check non-voters for a quorum but just committed a log at once.
This causes the membership change log from a single node to always commit. E.g. start node 0 and non-voters 1, 2; then on `change_membership({0, 1, 2})`, it just commits the joint-log at once. But according to the raft paper, it should await a quorum of {0} and a quorum of {0, 1, 2}.
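The joint-quorum rule can be sketched as below. The helper names are hypothetical and node ids are plain `u64`; this only illustrates the check "a majority in every config", not the crate's actual code.

```rust
use std::collections::BTreeSet;

/// True if more than half of `config` is contained in `replicated`.
fn is_majority(config: &BTreeSet<u64>, replicated: &BTreeSet<u64>) -> bool {
    config.intersection(replicated).count() > config.len() / 2
}

/// A log is committed under a joint config only if it reached a majority
/// of every config in the joint membership.
fn is_committed(configs: &[BTreeSet<u64>], replicated: &BTreeSet<u64>) -> bool {
    configs.iter().all(|c| is_majority(c, replicated))
}
```

For the example above, with configs `{0}` and `{0, 1, 2}`, a log held only by node 0 is not committed; it needs at least one of nodes 1 and 2 as well.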
-
Changed: 6350514c change-membership should be log driven but not channel driven; by drdr xp; 2021-08-18
A membership change involves two steps: the joint config phase and the final config phase. Each phase has a corresponding log involved.
Previously raft set up several channels to organize this workflow, which made the logic hard to understand and introduced complexity on restart or leadership transfer: it needed to re-establish the channels and tasks.
According to the gist of raft, all workflow should be log driven. Thus the new approach:
- Write two logs (the joint and the final) at once when it receives a change-membership request.
- All following jobs are done according to just what log is committed.
This simplifies the workflow and makes it more reliable and intuitive to understand.
Related changes:
- When `change_membership` is called, append 2 logs at once.
- Introduce a universal response channel type to send back a message when some internal task is done: `ResponseTx`, and a universal response error type: `ResponseError`.
- Internal response channel is now an `Option<ResponseTx>`, since the first step of membership change does not need to respond to the caller.
- When a new leader is established, if the last log is a joint config log, append a final config log to let the partial change-membership complete. A test for this is added.
- Removed membership related channels.
- Refactor: convert several funcs from async to sync.
- Changed: 8b59966d MembershipConfig.member type is changed from HashSet to BTreeSet; by drdr xp; 2021-08-17
-
Changed: adc24f55 pass all logs to apply_entry_to_state_machine(), not just Normal logs.; by drdr xp; 2021-08-16
Pass `Entry<D>` to `apply_entry_to_state_machine()`, not just the `EntryPayload::Normal(normal_log)`.
Thus the state machine is able to save the membership changes if it prefers to.
Why:
In practice, a snapshot contains info about all applied logs, including the membership config log. Before this change, the state machine did not receive any membership log, thus when making a snapshot, one needed to walk through all applied logs to get the last membership that is included in the state machine.
By letting the state machine remember the membership log applied, snapshot creation becomes more convenient and intuitive: it does not need to scan the applied logs any more.
-
Changed: 82a3f2f9 use LogId to track last applied instead of using just an index.; by drdr xp; 2021-07-19
Using LogId to track the last applied log provides more info. E.g. without it, creating a snapshot needs to walk through logs to find the term of the last applied log, just like the memstore impl did.
Using LogId{term, index} is a more natural way in every aspect.
changes: RaftCore: change type of `last_applied` from u64 to LogId.
-
Fixed: fc8e92a8 typo; by drdr xp; 2021-07-12
-
Fixed: 447dc11c when finalize_snapshot_installation, memstore should not load membership from its old log that are going to be overridden by snapshot.; by drdr xp; 2021-07-13
-
Fixed: dba24036 after 2 log compaction, membership should be able to be extract from prev compaction log; by drdr xp; 2021-07-14
-
Changed: 7792cccd add CurrentSnapshotData.meta: SnapshotMeta, which is a container of all meta data of a snapshot: last log id included, membership etc.; by drdr xp; 2021-07-13
-
Changed: 0c870cc1 reduce one unnecessary snapshot serialization; by drdr xp; 2021-07-14
- Change: `get_current_snapshot()`: remove double-serialization: convert MemStoreSnapshot to CurrentSnapshotData instead of serializing MemStoreSnapshot:
Before:
MemStoreSnapshot.data = serialize(state-machine)
CurrentSnapshotData.data = serialize(MemStoreSnapshot)
After:
MemStoreSnapshot.data = serialize(state-machine)
CurrentSnapshotData.data = MemStoreSnapshot.data
When `finalize_snapshot_installation`, extract snapshot meta info from `InstallSnapshotRequest`. Reduce one unnecessary deserialization.
- Change: InstallSnapshotRequest: merge `snapshot_id`, `last_log_id`, `membership` into one field `meta`.
- Refactor: use SnapshotMeta(`snapshot_id`, `last_log_id`, `membership`) as a container of metadata of a snapshot. Reduce parameters.
- Refactor: remove redundant param `delete_through` from `finalize_snapshot_installation`.
-
-
Changed: 954c67a9 InstallSnapshotRequest: merge last_included{term,index} into last_included; by drdr xp; 2021-07-08
-
Changed: 933e0b32 use snapshot-id to identify a snapshot stream; by drdr xp; 2021-07-09
A snapshot stream should be identified by some id, since the server end should not assume messages arrive in the correct order. Without an id, two `install_snapshot` requests belonging to different snapshot data may corrupt the snapshot data, explicitly or, even worse, silently.
- Add SnapshotId to identify a snapshot stream.
- Add SnapshotSegmentId to identify a segment in a snapshot stream.
- Add field `snapshot_id` to snapshot related data structures.
- Add error `RaftError::SnapshotMismatch`.
- `Storage::create_snapshot()` does not need to return an id, since the receiving end only keeps one snapshot stream session at most. Instead, `Storage::do_log_compaction()` should build a unique id every time it is called.
- When the raft node receives an `install_snapshot` request, the id must match to continue. A request with a different id should be rejected. A new id with offset=0 indicates the sender has started a new stream. In this case, the old unfinished stream is dropped and cleaned.
- Add test for `install_snapshot` API.
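The receiver-side rule for snapshot ids can be sketched as a tiny state machine. `SnapshotReceiver`, `Decision` and the string-typed id are hypothetical; the sketch only mirrors the three cases described in the list above.

```rust
/// What the receiver does with an incoming install-snapshot request.
#[derive(Debug, PartialEq)]
enum Decision {
    /// Same stream: accept the chunk.
    Continue,
    /// New id at offset 0: drop the old stream and start a new one.
    StartNew,
    /// Unknown id in the middle of a stream: reject (snapshot mismatch).
    Reject,
}

/// Keeps at most one snapshot stream session.
struct SnapshotReceiver {
    current: Option<String>,
}

impl SnapshotReceiver {
    fn on_request(&mut self, snapshot_id: &str, offset: u64) -> Decision {
        let same_stream = self.current.as_deref() == Some(snapshot_id);
        if same_stream {
            Decision::Continue
        } else if offset == 0 {
            // The sender restarted: forget the unfinished stream.
            self.current = Some(snapshot_id.to_string());
            Decision::StartNew
        } else {
            Decision::Reject
        }
    }
}
```

A delayed chunk from an older stream carries a stale id and a non-zero offset, so it is rejected instead of being spliced into the current snapshot.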
-
-
Changed: 85859d07 CurrentSnapshotData: merge `term` and `index` into `included`.; by drdr xp; 2021-07-09
-
Changed: 5eb9d3af RaftCore: replace `snapshot_index` with `snapshot_last_included: LogId`. Keep tracks of both snapshot last log term and index.; by drdr xp; 2021-07-09
Also `SnapshotUpdate::SnapshotComplete` now contains a LogId instead of a u64 index.
-
Changed: 9c5f3d7e RaftCore: merge last_log_{term,index} into last_log: LogId; by drdr xp; 2021-07-09
-
Changed: 58d8e3a2 AppendEntriesRequest: merge prev_log_{term,index} into prev_log: LogId; by drdr xp; 2021-07-10
-
Changed: 9e4fb64f InitialState: last_log_{term,index} into last_log: LogId; by drdr xp; 2021-07-10
-
Changed: 24e38130 Entry: merge term and index to log_id: LogId; by drdr xp; 2021-07-11
-
Added: 8e0b0df9 report snapshot metrics to RaftMetrics::snapshot, which is a LogId: (term, index) that a snapshot includes; by drdr xp; 2021-07-09
- Add: `Wait.snapshot()` to watch snapshot changes.
- Test: replace `sleep()` with `wait_for_snapshot()` to speed up tests.
- Dependency: b351c87f upgrade tokio from 1.7 to 1.8; by drdr xp; 2021-07-08
-
Fixed: cf4badd0 leader should re-create and send snapshot when `threshold/2 < last_log_index - snapshot < threshold`; by drdr xp; 2021-07-08
The problem:
If `last_log_index` advances too far beyond `snapshot.applied_index`, i.e.: `threshold/2 < last_log_index - snapshot < threshold` (e.g., `10/2 < 16-10 < 20` in the test that reproduces this bug), the leader tries to re-create a new snapshot. But when `last_log_index < threshold`, it won't create one, which results in a dead loop.
Solution:
In such a case, force creating a snapshot without considering the threshold.
- Dependency: 70e1773e adapt to changes of rand-0.8: gen_range() accepts a range instead of two args; by drdr xp; 2021-06-21
-
Added: 32a67e22 add metrics about leader; by drdr xp; 2021-06-29
In LeaderState, it also reports metrics about the replication to other nodes when reporting metrics.
When switched to another state, LeaderState will be destroyed along with the cached replication metrics.
Other states report a `None` to raft core to override the previous metrics data.
At some points the raft core, without knowing the state, just reports metrics with an `Update::Ignore` to indicate that replication metrics should be left intact.
- Fixed: d60f1e85 client_read has using wrong quorum=majority-1; by drdr xp; 2021-07-02
-
Added: 1ad17e8e move wait_for_xxx util into metrics.; by drdr xp; 2021-06-16
Introduce struct `Wait` as a wrapper of the metrics channel to impl wait-for utils:
- `log()`: wait for log to apply.
- `current_leader()`: wait for known leader.
- `state()`: wait for the role.
- `members()`: wait for membership_config.members.
- `next_members()`: wait for membership_config.members_after_consensus.
E.g.:
// wait forever for raft node's current leader to become 2:
r.wait(None).current_leader(2).await?;
The timeout is now an optional arg to all wait_for_xxx functions in fixtures. wait_for_xxx_timeout are all removed.
-
Added: 3388f1a2 link to discord server.; by Anthony Dodd; 2021-05-21
-
Added: bcc246cc a pull request template.; by Anthony Dodd; 2021-05-26
-
Added: ea539069 wait_for_nodes_log(); by drdr xp; 2021-05-24
-
Added: 668ad478 some wait_for func:; by drdr xp; 2021-05-24
- wait_for_log()
- wait_for_log_timeout()
- wait_for_state()
- wait_for_state_timeout()
-
Fixed: 89bb48f8 last_applied should be updated only when logs actually applied.; by drdr xp; 2021-05-20
-
Fixed: e9f40450 usage of get_storage_handle; by drdr xp; 2021-05-23
-
Fixed: 22cd1a0c clippy complains; by drdr xp; 2021-05-23
-
Fixed: 6202138f a conflict is expected even when appending empty entries; by drdr xp; 2021-05-24
-
Fixed: f449b64a discarded log in replication_buffer should be finally sent.; by drdr xp; 2021-05-22
Internally, when replication goes to LaggingState (a non-leader lacks a lot of logs), the ReplicationCore purges `outbound_buffer` and `replication_buffer` and then sends all committed logs found in storage.
Thus if there are uncommitted logs in `replication_buffer`, these logs will never have a chance to be replicated, even when replication goes back to LineRateState, since LineRateState only replicates logs from `ReplicationCore.outbound_buffer` and `ReplicationCore.replication_buffer`.
This test ensures that when replication goes to LineRateState, it tries to re-send all logs found in storage (including those that are removed from the two buffers).
-
Fixed: 6d680484 #112 : when a follower is removed, leader should stops sending log to it.; by drdr xp; 2021-05-21
A leader adds all follower replication states to a hashset `nodes` when the leader is established. But the leader does not do so when membership changes. Thus when a follower is removed, the leader can not stop replication to it because the follower is not in `nodes`.
The solution is to move replication state from `non_voters` to `nodes`, so that the next time a follower is removed the leader is able to remove the replication from `nodes`.
-
Fixed: 39690593 a NonVoter should stay as NonVoter instead of Follower after restart; by drdr xp; 2021-05-14
-
Fixed: d882e743 when calc quorum, the non-voter should be count; by drdr xp; 2021-06-02
Counting only the follower(nodes) as quorum for new config(c1) results in unexpected log commit. E.g.: change from 012 to 234, when 3 and 4 are unreachable, the first log of joint should not be committed.
-
Fixed: a10d9906 when handle_update_match_index(), non-voter should also be considered, because when member change a non-voter is also count as a quorum member; by drdr xp; 2021-06-16
-
Fixed: 11cb5453 doc-include can only be used in nightly build; by drdr xp; 2021-06-16
-
Simplify CI test: test all in one action.
-
Disable clippy: it suggests inappropriate assert_eq to assert conversion which is used in a macro.
-
Add makefile
-
Build only with nightly rust. Add rust-toolchain to specify toolchain version.
-
- Dependency: 919d91cb upgrade tokio from 1.0 to 1.7; by drdr xp; 2021-06-16
- Fixed #105 where function `set_target_state` was missing an `else` condition.
- Fixed #106 which ensures that counting of replicas to determine a new commit value only considers entries replicated as part of the current term.
- Fixed a bug where Learner nodes could be restarted and come back as voting members.
The big news for this release is that we are now based on Tokio 1.0! Big shoutout to @xu-cheng for doing all of the heavy lifting for the Tokio 1.0 update, along with many other changes which are part of this release.
It is important to note that 0.6.0 does include two breaking changes from 0.5: the new `RaftStorage::ShutdownError` associated type, and Tokio 1.0. Both of these changes are purely code related, and it is not expected that they will negatively impact running systems.
- Updated to Tokio 1.0!
- BREAKING: this introduces a `RaftStorage::ShutdownError` associated type. This allows for the Raft system to differentiate between fatal storage errors which should cause the system to shutdown vs errors which should be propagated back to the client for application specific error handling. These changes only apply to the `RaftStorage::apply_entry_to_state_machine` method.
- A small change to Raft startup semantics. When a node comes online and successfully recovers state (the node was already part of a cluster), the node will start with a 30 second election timeout, ensuring that it does not disrupt a running cluster.
- #89 removes the `Debug` bounds requirement on the `AppData` & `AppDataResponse` types.
- The `Raft` type can now be cloned. The clone is very cheap and helps to facilitate async workflows while feeding client requests and Raft RPCs into the Raft instance.
- The `Raft.shutdown` interface has been changed slightly. Instead of returning a `JoinHandle`, the method is now async and simply returns a result.
- The `ClientWriteError::ForwardToLeader` error variant has been modified slightly. It now exposes the data (generic type `D` of the type) of the original client request directly. This ensures that the data can actually be used for forwarding, if that is what the parent app wants to do.
- Implemented #12. This is a pretty old issue and a pretty solid optimization. The previous implementation of this algorithm would go to storage (typically disk) for every process of replicating entries to the state machine. Now, we are caching entries as they come in from the leader, and using only the cache as the source of data. There are a few simple measures needed to ensure this is correct, as the leader entry replication protocol takes care of most of the work for us in this case.
- Updated / clarified the interface for log compaction. See the guide or the updated `do_log_compaction` method docs for more details.
- #97 adds the new `Raft.current_leader` method. This is a convenience method which builds upon the Raft metrics system to quickly and easily identify the current cluster leader.
- Fixed #98 where heartbeats were being passed along into the log consistency check algorithm. This had the potential to cause a Raft node to go into shutdown under some circumstances.
- Fixed a bug where the timestamp of the last received heartbeat from a leader was not being stored, resulting in degraded cluster stability under some circumstances.
- Updated async-raft dependency to `0.6.0` & updated storage interface as needed.
- Fixed #76 by moving the process of replicating log entries to the state machine off of the main task. This ensures that the process never blocks the main task. This also includes a few nice optimizations mentioned below.
- Added `#[derive(Serialize, Deserialize)]` to `RaftMetrics`, `State`.
- Fixed #82 where client reads were not behaving correctly for single node clusters. Single node integration tests have been updated to ensure this functionality is working as needed.
- Fixed #79 ... for real this time! Add an integration test to prove it.
- Fixed #79. The Raft core state machine was not being properly updated in response to shutdown requests. That has been addressed and shutdowns are now behaving as expected.
- `ChangeMembershipError::NodeNotLeader` now returns the ID of the current cluster leader if known.
- Fix off-by-one error in `get_log_entries` during the replication process.
- Added `#[derive(Serialize, Deserialize)]` to `Config`, `ConfigBuilder` & `SnapshotPolicy`.
The only thing which hasn't changed is that this crate is still an implementation of the Raft protocol. Pretty much everything else has changed.
- Everything is built directly on Tokio now.
- The guide has been updated.
- Docs have been updated.
- The `Raft` type is now the primary API of this crate, and is a simple struct with a few public methods.
- Lots of fixes to the implementation of the protocol, ranging from subtle issues in joint consensus to non-voter syncing.
- Implemented `Error` for `config::ConfigError`
Added a few convenience derivations.
- Derive `Eq` on `messages::MembershipConfig`.
- Derive `Eq` on `metrics::State`.
- Derive `PartialEq` & `Eq` on `metrics::RaftMetrics`.
- Update development dependencies.
- Fixed bug #41 where nodes were not starting a new election timeout task after coming down from leader state. Thanks @lionesswardrobe for the report!
A few QOL improvements.
- Fixed an issue where the value for `current_leader` was not being set to `None` when becoming a candidate. This isn't really a bug per se, as no functionality depended on this value as far as Raft is concerned, but it is an issue that impacts the metrics system. This value is now being updated properly.
- Made the `messages::ClientPayload::new_base` constructor `pub(crate)` instead of `pub`, which is what the intention was originally, but I was apparently tired `:)`.
- Implemented #25. Implementing Display+Error for the admin error types.
A few bug fixes.
- Fixed an issue where a node in a single-node Raft was not resuming as leader after a crash.
- Fixed an issue where hard state was not being saved after a node becomes leader in a single-node Raft.
- Fixed an issue where the client request pipeline (a `Stream` with the `actix::StreamFinish`) was being closed after an error was returned during processing of client requests (which should not cause the stream to close). This was unexpected and undocumented behavior, very simple fix though.
This changeset introduces a new `AppDataResponse` type which represents a concrete data type which must be sent back from the `RaftStorage` impl from the `ApplyEntryToStateMachine` handler. This provides a more direct path for returning application level data from the storage impl. Often times this is needed for responding to client requests in a timely / efficient manner.
- `AppDataResponse` type has been added (see above).
- A few handlers have been updated in the `RaftStorage` type. The handlers are now separated based on where they are invoked from the Raft node. The three changed handlers are:
  - `AppendEntryToLog`: this is the same. It is the initial step of handling client requests to apply an entry to the log. This is still where application level errors may be safely returned to the client.
  - `ReplicateToLog`: this is for replicating entries to the log. This is part of the replication process.
  - `ApplyEntryToStateMachine`: this is for applying an entry to the state machine as the final part of a client request. This is where the new `AppDataResponse` type must be returned.
  - `ReplicateToStateMachine`: this is for replicating entries to the state machine. This is part of the replication process.
Overhauled the election timeout mechanism. This uses an interval job instead of juggling a rescheduling process. Seems to offer quite a lot more stability. Along with the interval job, we are using `std::time::Instant`s for performing the comparisons against the last received heartbeat.
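The comparison an interval job would perform on each tick can be sketched as below. `election_timed_out` is a hypothetical helper written for illustration; the actual job in the crate is not shown in this changelog.

```rust
use std::time::{Duration, Instant};

/// Returns true when no heartbeat has been seen for at least `timeout`.
/// `now` is passed in so the check is easy to exercise without sleeping.
fn election_timed_out(last_heartbeat: Instant, now: Instant, timeout: Duration) -> bool {
    // Saturates to zero if `now` is earlier than `last_heartbeat`.
    now.saturating_duration_since(last_heartbeat) >= timeout
}
```

Because the check is a pure comparison of two instants, a heartbeat only has to update `last_heartbeat`; nothing needs to be cancelled or rescheduled.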
Another backwards incompatible change to the `RaftStorage` trait. It is now using associated types to better express the needed trait constraints. These changes were the final bit of work needed to get the entire actix-raft system to work with a Synchronous `RaftStorage` impl. Async impls continue to work as they have, the `RaftStorage` impl block will need to be updated to use the associated types though. The recommended pattern is as follows:
impl RaftStorage<..., ...> for MyStorage {
type Actor = Self;
type Context = Context<Self>; // Or SyncContext<Self>;
}
My hope is that this will be the last backwards incompatible change needed before a 1.0 release. This crate is still young though, so we will see.
- Made a few backwards incompatible changes to the `RaftStorage` trait. Overwrite its third type parameter with `actix::SyncContext<Self>` to enable sync storage.
- Also removed the `RaftStorage::new` constructor, as it is a bit restrictive. Just added some docs instead describing what is needed.
- Added a few additional top-level exports for convenience.
- Changes to the README for docs.rs.
- Changes to the README for docs.rs.
- Initial release!