# Home
FOROS is an open source ROS2 framework that provides redundancy for availability-critical nodes. It helps eliminate single points of failure in a system by using the Raft consensus algorithm to organize nodes with the same mission into a cluster. A cluster can tolerate a number of fail-stop failures equal to the cluster size minus the quorum.
| Cluster size (N) | Quorum (Q = ⌊N/2⌋ + 1) | Fault tolerance (N - Q) |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
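The quorum column is just integer arithmetic; here is a minimal sketch (not part of the FOROS API) that reproduces the table:

```cpp
#include <cstdint>

// Quorum: the smallest majority of a cluster of size n (integer division).
constexpr uint32_t quorum(uint32_t n) { return n / 2 + 1; }

// Fault tolerance: how many fail-stop failures the cluster survives.
constexpr uint32_t fault_tolerance(uint32_t n) { return n - quorum(n); }

static_assert(quorum(4) == 3 && fault_tolerance(4) == 1, "matches the table");
static_assert(quorum(5) == 3 && fault_tolerance(5) == 2, "matches the table");
```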
When a cluster is configured, one active node is automatically elected through consensus-based leader election.
Specifically, nodes created with FOROS are always in one of three states: Follower, Candidate, or Leader, starting as a Follower. If there is no Leader, a Follower becomes a Candidate, and the Candidate that receives a majority of the votes becomes the Leader. The detailed transitions follow the state machine below.
Fortunately, this complex leader election process is handled entirely within the FOROS framework, and developers only need to consider the Active and Standby states. FOROS also provides a mechanism that filters out all ROS2 topics published by Standby nodes and all ROS2 service requests received by Standby nodes.
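For reference, the sketch below shows how an application might react to these state changes. It assumes `ClusterNode` exposes activation callbacks named `register_on_activated` and `register_on_deactivated`; check the FOROS headers of your version for the exact API.

```cpp
#include "rclcpp/rclcpp.hpp"
#include "akit/failover/foros/cluster_node.hpp"  // assumed header path

void setup_state_callbacks(akit::failover::foros::ClusterNode::SharedPtr node) {
  auto logger = node->get_logger();
  // Assumption: invoked when this node wins the election and becomes active.
  node->register_on_activated([logger]() { RCLCPP_INFO(logger, "now active"); });
  // Assumption: invoked when this node falls back to standby.
  node->register_on_deactivated([logger]() { RCLCPP_INFO(logger, "now standby"); });
}
```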
The active node can use the FOROS API to store sequential runtime data in the cluster. This lets the active node replicate its data across the cluster, so the data can easily be restored even if another node later becomes the active node.
The process of storing data is as follows:

1. When the active node requests to store data, the request is forwarded to the other nodes.
2. If a majority of the nodes accept the request, the request succeeds; if not, the request fails.
ROS2 applications typically create node instances for messaging. To create nodes that belong to a specific cluster, you can use FOROS's `ClusterNode` class instead of the `rclcpp::Node` class. Unlike `rclcpp::Node`, which receives a node name as an argument, `ClusterNode` receives a cluster name, a node ID, and the IDs of all nodes in the cluster as arguments.
```cpp
auto node = akit::failover::foros::ClusterNode::make_shared(
    "Test_cluster",                           // Cluster name
    0,                                        // Node ID
    std::initializer_list<uint32_t>{0, 1, 2}  // IDs of all nodes in the cluster
);
```
FOROS uses LevelDB to manage data internally and provides APIs for committing data, querying data, and registering data change callbacks.
Data is managed through a class called `Command`. The basic usage is as follows.

Example: create one byte of data holding the value 1 and read it back with the getter:
```cpp
auto command = akit::failover::foros::Command::make_shared(
    std::initializer_list<uint8_t>{1});
command->data();  // raw data getter
```
The active node can request to store byte-array data using the `commit_command` API of the `ClusterNode` class and receive the result of the request through a callback function.

Example: request to commit one byte of data holding the value 1:
```cpp
node->commit_command(
    // Create one byte of data holding the value 1
    akit::failover::foros::Command::make_shared(std::initializer_list<uint8_t>{1}),
    // Response callback
    [&](akit::failover::foros::CommandCommitResponseSharedFuture response_future) {
      auto response = response_future.get();
      if (response->result() == true) {
        // On success, print the data ID and data contents
        RCLCPP_INFO(node->get_logger(), "commit completed: %lu %d",
                    response->id(), response->command()->data()[0]);
      }
    });
```
Any node can query the data with a specific ID using the `get_command` API of the `ClusterNode` class.

Example: query the data with ID 0:
```cpp
auto command = node->get_command(0);
```
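The returned object is the same `Command` type shown earlier, so its contents can be read back with the `data()` getter. A small sketch; the null check assumes `get_command` returns an empty pointer when no data exists for the given ID, which is worth verifying against your FOROS version:

```cpp
auto command = node->get_command(0);
if (command != nullptr) {  // assumption: empty pointer when the ID is unknown
  RCLCPP_INFO(node->get_logger(), "data[0] = %d", command->data()[0]);
}
```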
Let's check leader election and log replication in the environment below through a demo.
| Cluster size | Quorum | Fault tolerance | Node IDs |
|---|---|---|---|
| 4 | 3 | 1 | { 0, 1, 2, 3 } |
Let's check the leader election status by launching and shutting down redundant nodes.
| Step | Action | # of running nodes | Result |
|---|---|---|---|
| 1 | Launch node 0 | 1 | |
| 2 | Launch node 1 | 2 | |
| 3 | Launch node 2 | 3 | Leader elected (node 1) |
| 4 | Launch node 3 | 4 | |
| 5 | Terminate node 1 | 3 | Leader terminated -> Leader re-elected (node 2) |
| 6 | Terminate node 2 | 2 | Leader terminated -> Failed nodes exceed fault tolerance -> No leader |
| 7 | Launch node 2 | 3 | Leader elected (node 3) |
Let's set up all redundant nodes to periodically store one byte of data and check the data commit process.
| Step | Action | # of running nodes | Result | Data commit enabled |
|---|---|---|---|---|
| 1 | Launch node 0 | 1 | | X |
| 2 | Launch node 1 | 2 | | X |
| 3 | Launch node 2 | 3 | Leader elected (node 1) | O |
| 4 | Launch node 3 | 4 | | O |
| 5 | Terminate node 1 | 3 | Leader terminated -> Leader re-elected (node 2) | O |
| 6 | Launch node 1 | 4 | Node 1 syncs the data it missed while it was down | O |
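For reference, the periodic one-byte commit used by each node in this demo could be set up as in the sketch below. It assumes `ClusterNode` supports `create_wall_timer` like a regular rclcpp node and that `commit_command` simply fails on standby nodes; both assumptions are worth verifying against your FOROS version.

```cpp
#include <chrono>
#include "rclcpp/rclcpp.hpp"
#include "akit/failover/foros/cluster_node.hpp"  // assumed header path

using namespace std::chrono_literals;

// Returns the timer so the caller can keep it alive for the node's lifetime.
rclcpp::TimerBase::SharedPtr start_periodic_commit(
    akit::failover::foros::ClusterNode::SharedPtr node) {
  // Assumption: ClusterNode exposes create_wall_timer like rclcpp::Node.
  return node->create_wall_timer(1s, [node]() {
    node->commit_command(
        // One byte of data holding the value 1, as in the example above
        akit::failover::foros::Command::make_shared(
            std::initializer_list<uint8_t>{1}),
        [node](akit::failover::foros::CommandCommitResponseSharedFuture future) {
          auto response = future.get();
          if (response->result()) {
            RCLCPP_INFO(node->get_logger(), "committed data ID %lu", response->id());
          }
        });
  });
}
```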