orchestrator-raft: next steps. #246
thanks @renecannao @dbussink @sjmudd
@shlomi-noach We chatted about this briefly on Slack, but I wanted to bring the conversation here. I think that https://serf.io would pair well with Raft (both Hashicorp products) to solve the dynamic group problem. It allows you to run hooks on membership events, and a node can join the group by simply contacting any of the members; Serf handles all the membership propagation. This keeps Orchestrator standalone, without needing to add an extra service like Consul for discovery. It would also significantly simplify the Kubernetes installation, since static IPs won't need to be assigned per Orchestrator node at config time.
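For context, a minimal sketch of what consuming Serf membership events could look like in Go, using the hashicorp/serf library. The node name, the seed address, and the idea of wiring these events into Orchestrator are assumptions for illustration, not existing Orchestrator code:

```go
// Sketch only: how an orchestrator node might consume Serf membership
// events. Names and addresses here are hypothetical.
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

func main() {
	eventCh := make(chan serf.Event, 16)

	conf := serf.DefaultConfig()
	conf.NodeName = "orchestrator-1" // hypothetical node name
	conf.EventCh = eventCh

	s, err := serf.Create(conf)
	if err != nil {
		log.Fatalf("serf.Create: %v", err)
	}

	// Joining only requires contacting any single existing member;
	// Serf gossips the full membership from there.
	if _, err := s.Join([]string{"10.0.0.2:7946"}, true); err != nil {
		log.Printf("join failed (perhaps this is the first node): %v", err)
	}

	// Membership events would be the hook point for updating the raft group.
	for e := range eventCh {
		me, ok := e.(serf.MemberEvent)
		if !ok {
			continue
		}
		switch me.EventType() {
		case serf.EventMemberJoin:
			for _, m := range me.Members {
				log.Printf("member joined: %s (%s)", m.Name, m.Addr)
			}
		case serf.EventMemberLeave, serf.EventMemberFailed:
			for _, m := range me.Members {
				log.Printf("member gone: %s (%s)", m.Name, m.Addr)
			}
		}
	}
}
```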
@derekperkins Will this pair with Raft or replace Raft? Can you please elaborate?
This would pair with, not replace, Raft. All that Serf manages is membership, so this would allow you to dynamically populate the node addresses that are now statically specified in the configuration.
How will the group tell the difference between a node that has left and a node that has died? If it cannot, then the dynamic nature of the group is IMHO unsolvable. Consider a group of three, where two nodes are joining and two are gone. Have the two nodes died, such that we are at quorum 3/5 and a single further node dying would take us down? Or have they left, such that we are at 3/3 and a single node dying would still leave a happy quorum of 2/3?
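To make the arithmetic explicit, here is a tiny sketch of the majority size (quorum = ⌊n/2⌋ + 1) under the two interpretations described above; it is illustrative only:

```go
package main

import "fmt"

// quorum returns the majority size for a raft group of n voting members.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// Two of five members are unreachable: did they die or leave?
	fmt.Println(quorum(5)) // 3: with 3 nodes alive we are at the bare minimum; one more failure loses quorum
	fmt.Println(quorum(3)) // 2: with 3 nodes alive we can still tolerate one more failure
}
```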
cc @enisoc |
That's a good point. I believe we could hook into the Kubernetes event system to notify Orchestrator on scaling events. Alternatively, it would be simple to poll the StatefulSet / ReplicaSet to see how many replicas were configured. Orchestrator would then just have to expose an endpoint to set the total node count, which would determine quorum size. If that's required, it raises the question of whether Serf is necessary at all: Orchestrator could stay agnostic and just provide add/drop endpoints, or a full membership endpoint.
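As a sketch of the polling alternative, something along these lines could read the configured replica count from the StatefulSet using client-go. The namespace, the StatefulSet name, and the notion of a quorum-size endpoint are assumptions for illustration:

```go
// Sketch only, assuming an in-cluster deployment as a StatefulSet named
// "orchestrator" in the "default" namespace.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	sts, err := clientset.AppsV1().StatefulSets("default").Get(
		context.Background(), "orchestrator", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	replicas := int32(1)
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	// A hypothetical Orchestrator endpoint could then be called with this
	// count to adjust the expected group size / quorum.
	log.Printf("configured replicas: %d", replicas)
}
```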
This introduces further complications. When you add/remove nodes, you may only communicate that to the leader: you need to announce to the leader "this node has left". (BTW this will require an upgrade of the raft library, but that's beside the point.) And will Kubernetes or your external tool orchestrate that well? What if the leader itself is one of the nodes to leave? It will take a few seconds for a new leader to step up. Will you then initiate the membership change again against the new leader?

Another question which leaves me confused is how you would bootstrap it in the first place. The first node to run -- will it be the leader of a single-node group? (You must enable that.) Which one of the nodes, as you bootstrap a new cluster? How do you then disable the option of being the leader of a single-node group? (Because you don't want that; and at this time it seems to me like something that is fixed and cannot be changed dynamically.)

I'm seeing a lot of open questions here, which just get moved from one place to another. I'm wondering, can we look at something that works similarly? Do you have a consul setup where Kubernetes is free to remove and add nodes? Do you have an external script to run those change updates? Does that work? If we can mimic something that exists, that would be best.
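For reference, a hedged sketch of what leader-only membership changes look like with the newer hashicorp/raft API (AddVoter / RemoveServer); as noted above, orchestrator would need a raft library upgrade for this, and the helper functions here are hypothetical:

```go
// Sketch only: leader-only membership changes with hashicorp/raft.
// If this node is not the leader, the caller must redirect the request
// to whoever is (or retry after a new leader steps up) -- which is
// exactly the complication discussed above.
package membership

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// removeNode asks the leader to drop a departed member from the group.
func removeNode(r *raft.Raft, id raft.ServerID) error {
	if r.State() != raft.Leader {
		return fmt.Errorf("not the leader; current leader is %s", r.Leader())
	}
	return r.RemoveServer(id, 0, 10*time.Second).Error()
}

// addNode asks the leader to add a new voting member.
func addNode(r *raft.Raft, id raft.ServerID, addr raft.ServerAddress) error {
	if r.State() != raft.Leader {
		return fmt.Errorf("not the leader; current leader is %s", r.Leader())
	}
	return r.AddVoter(id, addr, 0, 10*time.Second).Error()
}
```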
I'm going to chat later with someone who has more experience with distributed systems, who may be able to clarify some points. |
I had a discussion with @oritwas, who has worked on, and still works on, distributed systems and has implemented Paxos in the past. Disclosure: my partner. She was able to present further failure and split-brain scenarios with dynamic consensus groups. In her experience, going static was the correct approach, possibly with host changes hidden behind a proxy/DNS (which is what ClusterIP does). Pursuing dynamic membership to close those gaps complicates the algorithm and code to an extent where it is neither worthwhile to implement nor feasible to maintain, especially at small group sizes, where a rolling restart would prove simpler and more reliable. This would be especially true with orchestrator/raft.
I don't know how they do it, or what guarantees they make, but etcd-operator is able to dynamically manage members in an etcd cluster: https://github.com/coreos/etcd-operator#resize-an-etcd-cluster
Had a discussion with @lefred from Oracle about InnoDB Cluster's group communication. It seems there's a safe way to add/remove nodes there; I need to dig into this.
The orchestrator-raft PR will soon be merged. It has been out in the open for a few months now.
Noteworthy docs are:
There's already an interesting list of (mostly operational) issues/enhancements to orchestrator/raft that would not make it into #183. The next comment will list some of those enhancements, and I will add more comments as time goes by. Once I've reviewed all potential enhancements, I'll create an Issue per enhancement.