This issue was moved to a discussion.

Support in-library Multi-Raft #131

Closed
akiradeveloper opened this issue Jan 3, 2021 · 5 comments
@akiradeveloper
Owner

Multi-Raft is a data-sharding technique for Raft in which the key space is split into N groups and each group has its own Raft log. This technique is used in KV stores such as TiKV.

Multi-Raft in TiKV

The current lol 0.6.2 can achieve Multi-Raft by using the socket port as the group ID, but this is inefficient for the following reasons:

  1. Wasted heartbeats: a node can be the leader of both group A and group B. In that case it would be enough to send heartbeats for only one of the groups. In other words, the two groups can share node liveness information.
  2. State machines can't share common data such as configuration.

Also, keeping connections to 1000 nodes in a cluster may make the system unstable.

Since the implementation will be very hard and there is no demand for this feature at the moment, the priority isn't high, but it is at least worth noting.

@akiradeveloper
Owner Author

Here is my initial design sketch.

[Image: Screenshot 2021-02-01 16 55 39]

We define a MultiRaftAdapter so that requests to each Raft group are routed properly. On the receiver side this looks OK to me, but the open question is the sender side.

The text in red is an example of a request to send. The problem here is that RaftCore knows its group ID. Ideally, each RaftCore would be ignorant of its group ID: that knowledge would be managed outside, e.g. in the MultiRaftAdapter.
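
A minimal sketch of that idea, not the actual lol design: the adapter tags outgoing requests with a group ID and strips the tag on receipt, so the per-group core never sees it. Only the names `MultiRaftAdapter` and `RaftCore` come from the comment above; `Request`, `Envelope`, and every method here are hypothetical, and the real `RaftCore` is replaced by a stand-in that just records what it receives.

```rust
use std::collections::HashMap;

/// Hypothetical group identifier type.
type GroupId = u64;

/// A request as seen by a single Raft group. It carries no group ID.
#[derive(Debug, Clone, PartialEq)]
struct Request {
    payload: String,
}

/// A request on the wire (hypothetical): the adapter adds the group ID
/// on the sender side and removes it on the receiver side.
#[derive(Debug, Clone, PartialEq)]
struct Envelope {
    group_id: GroupId,
    request: Request,
}

/// Stand-in for the real RaftCore; it never stores or inspects a group ID.
#[derive(Default)]
struct RaftCore {
    received: Vec<Request>,
}

impl RaftCore {
    fn handle(&mut self, request: Request) {
        self.received.push(request);
    }
}

/// Owns every group on this node and is the only place that knows group IDs.
#[derive(Default)]
struct MultiRaftAdapter {
    groups: HashMap<GroupId, RaftCore>,
}

impl MultiRaftAdapter {
    fn add_group(&mut self, group_id: GroupId) {
        self.groups.insert(group_id, RaftCore::default());
    }

    /// Sender side: tag an outgoing request on behalf of a group.
    fn wrap(&self, group_id: GroupId, request: Request) -> Envelope {
        Envelope { group_id, request }
    }

    /// Receiver side: route the request to its group.
    /// Returns the unknown group ID as an error if no such group exists.
    fn dispatch(&mut self, envelope: Envelope) -> Result<(), GroupId> {
        match self.groups.get_mut(&envelope.group_id) {
            Some(core) => {
                core.handle(envelope.request);
                Ok(())
            }
            None => Err(envelope.group_id),
        }
    }
}
```

In this sketch the sender-side question is answered by having each core hand its outgoing request to the adapter, which already knows which group the core belongs to. Whether that fits the real RaftCore is left open here.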

@akiradeveloper
Owner Author

Another possible motivation is that this extension will also benefit the performance of ordinary applications (not only sharded ones like a KVS). An application may run two independent consensus instances in the same cluster. For example, one log for a distributed sequencer and one for the application log; they are logically independent. If we have Multi-Raft distinguished by group_id, we can use gid=1 for the former and gid=2 for the latter.

Now I am willing to do this. But it will be a big change that will take a very long time.

@LegNeato

I would love to use this feature fwiw

@akiradeveloper
Owner Author

akiradeveloper commented Apr 14, 2021

With k8s we can deploy many Raft nodes in separate namespaces. (Container technology is a revolution in distributed computing.) This means we no longer need to distinguish nodes by their ports when they are located on the same physical server.

However, as they are independent Raft nodes, each has its own Tonic server (and Tokio runtime). Obviously, each instance adds to the memory footprint. More importantly, hundreds of thread pools will be created and the OS scheduler has to manage thousands of threads; in my experience this overhead is not negligible. In conclusion, we can't put many (100 or more) nodes on the same server, and the number of shards is restricted accordingly.

With in-library Multi-Raft, only one Raft node is placed on each server and the shards coexist in that node, distinguished by 64-bit IDs. As a result, we can create 10000 or more shards to boost performance.

@akiradeveloper
Owner Author

[Image: Screenshot 2021-04-16 11 29 19]

This is a very simple example: there are 3 nodes and 3 shards, each with 2 replicas.

Imagine we split the key space into many more shards (say 10000), each also maintaining 2 replicas.

Apart from the resource problem (memory, too many pooled threads), this incurs a massive number of heartbeats. The number of heartbeats per interval is proportional to the number of Raft nodes, so in this case it is on the order of 10k. If the heartbeat interval is 100 ms, that is 100 heartbeats per ms, or one heartbeat every 10 µs, each involving TCP send/recv and protobuf encode/decode. I think this is pure waste.
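
The arithmetic above can be checked with a short sketch. It assumes, as the comment does, roughly one heartbeat per Raft node per interval; the function names are hypothetical and not part of lol.

```rust
/// Heartbeats sent per second across the cluster, assuming one heartbeat
/// per Raft node per interval (the simplification used in the comment).
fn heartbeats_per_second(raft_nodes: u64, interval_ms: u64) -> u64 {
    raft_nodes * 1000 / interval_ms
}

/// Average gap between two consecutive heartbeats, in microseconds,
/// under the same assumption.
fn gap_between_heartbeats_us(raft_nodes: u64, interval_ms: u64) -> u64 {
    interval_ms * 1000 / raft_nodes
}
```

With 10000 Raft nodes and a 100 ms interval, `heartbeats_per_second(10_000, 100)` gives 100000 per second (100 per ms) and `gap_between_heartbeats_us(10_000, 100)` gives 10 µs, matching the figures in the comment.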

My idea is to move the heartbeat layer from the Raft node level to the physical node level.
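
One way to picture this, as a sketch under assumptions rather than the planned implementation: instead of one heartbeat per (group, follower) pair, a physical node sends a single heartbeat to each distinct peer node. Carrying the list of covered groups in that message is an assumption of this sketch, and `LedGroups` and its methods are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical identifier types.
type GroupId = u64;
type NodeId = u32;

/// For each group this physical node leads, the follower nodes holding a replica.
struct LedGroups {
    followers: BTreeMap<GroupId, Vec<NodeId>>,
}

impl LedGroups {
    /// Per-group heartbeats: one message per (group, follower) pair.
    fn per_group_message_count(&self) -> usize {
        self.followers.values().map(|nodes| nodes.len()).sum()
    }

    /// Node-level heartbeats: one message per distinct follower node,
    /// listing the groups that message vouches for.
    fn node_level_messages(&self) -> BTreeMap<NodeId, BTreeSet<GroupId>> {
        let mut messages: BTreeMap<NodeId, BTreeSet<GroupId>> = BTreeMap::new();
        for (&group_id, nodes) in &self.followers {
            for &node in nodes {
                messages.entry(node).or_default().insert(group_id);
            }
        }
        messages
    }
}
```

With node-level heartbeats the message count is bounded by the number of peer nodes rather than by the number of shards, which is the property the comment is after.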

Repository owner locked and limited conversation to collaborators Sep 9, 2021
