Add initial implementation of ZooKeeper-based discovery service. #1057
Conversation
Interesting. Is the main idea here to make ES use ZK instead of its own Zen?
Yes, this is an alternative to Zen discovery. It requires some initial setup, but it should prevent the split-brain condition that sometimes occurs on unreliable networks and overloaded nodes. It still uses a small portion of Zen Discovery for state publishing (I am planning to add ZooKeeper-based state publishing in the next iteration), but master election and fault detection are done using ZooKeeper. It's also possible to post ES settings into a ZooKeeper node.
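As an illustration of that last point, here is a minimal hypothetical sketch of reading shared settings stored under a ZooKeeper node with the standard ZooKeeper Java client. The connect string and znode path are made-up examples, not the plugin's actual layout:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ReadSharedSettings {
    public static void main(String[] args) throws Exception {
        // Hypothetical connect string and znode path, for illustration only.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, null);
        // Read the settings blob stored as the znode's data (no watch set).
        byte[] raw = zk.getData("/es/my-cluster/settings", false, null);
        System.out.println(new String(raw, StandardCharsets.UTF_8));
        zk.close();
    }
}
```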
Heya, even with this ZooKeeper module, there is still work left to be done when it comes to handling cases where there aren't enough masters in the cluster (or ZooKeeper decides so). I am working on it now in the Zen discovery, which can then be applied to the ZK module; will ping when I push it. Basically, what I am working on now will also make the ZooKeeper one a bit less relevant, since Zen will be more resilient to split brains (what we talked about before on minimum master nodes in a cluster). One interesting aspect of ZK is the option to break down the cluster state into its discrete elements, and only apply and listen to delta changes (where now, the whole state is published in Zen), though it would be interesting to see at what cluster size (indices/shards wise) it really makes a difference, especially with the improvements in master and local gateway.
ZooKeeper uses a fixed list of ZooKeeper nodes, so it's quite easy for it to decide whether a quorum is present. It doesn't have to do discovery, only leader election, which simplifies things. In other words, split brain is solved at the ZooKeeper level, and the cluster can work even if only one potential master is available. I think fixing split brain in Zen would be great. As you remember, I tried to do it by implementing quorum in Zen Discovery. I went through several iterations, starting with preventing master election when quorum is not met and ending with an attempt to implement a modified version of the Bully algorithm that wouldn't cause master reelection every time a node with a higher ID shows up. I still have some code from these attempts: https://github.com/imotov/elasticsearch/commits/election-quorum. This code is pretty useless from an implementation perspective, but it has a number of potentially interesting tests that simulate different network failure scenarios. In the end, I came to the conclusion that for an election algorithm to be robust, it has to have several phases, with nodes committing to a potential master, the master then announcing its election, and some retry logic for when nodes die in the middle of the process. Otherwise, the same node might be counted in the quorum twice by two potential masters, or the election might never end. And these phases complicate things a lot. On top of that, the whole idea of specifying quorum in a config file was somewhat at odds with the idea of easy and dynamic cluster resizing by simply adding or shutting down a node and waiting for the green cluster state. With ZooKeeper Discovery, I don't have to specify quorum, I can easily modify the recover_after_data_nodes value (or any other global setting) in one place, and I don't have to worry about updating the ping.unicast.hosts list. So, for me, it looked like an easy solution for the problems that we had.
By the way, here is a short description of the ZooKeeper Discovery process. In short, the ES ZooKeeper Discovery plug-in uses ZooKeeper as an atomic, reliable, distributed database for node fault detection, master election, and sharing common settings. It relies on the fact that ZooKeeper can maintain a tree of nodes in which each node can have an associated value as well as child nodes; these nodes can be created and read atomically, and clients can be notified about updates. The ZooKeeper discovery process can be divided into 3 logical steps:

1. Initialization and Registration. Upon ES node start, the ZooKeeper discovery plugin connects to ZooKeeper and creates the following persistent node if it doesn't already exist: /es/cluster name/nodes. Then the ES node creates an ephemeral ZK node /es/cluster name/nodes/node-id with the serialized content of the local DiscoveryNode.

2. Master Discovery. After registration, the ES node first tries to read the ZK node /es/cluster name/leader and starts watching it. Existence of this node indicates that the cluster already has an elected master; in this case, the ES node stops the master discovery process and waits for the cluster state to be sent to it by the master. If the ZK node /es/cluster name/leader doesn't exist and the ES node is eligible to be elected as master, the ES node tries to create the ephemeral ZK node /es/cluster name/leader. This is an atomic operation: if multiple ES nodes try to create this node at the same time, only one of them succeeds. If the ZK node creation was successful, the ES node that created it assumes the role of cluster master. If the ZK node creation failed, the ES node goes back to trying to read the ZK node /es/cluster name/leader. If the ZK node /es/cluster name/leader doesn't exist and the ES node is not eligible to be elected as master, it sets NO_MASTER_BLOCK (if it's not the initial discovery) and waits for the node creation watcher to be triggered. As soon as the ZK node is created, the ES node restarts the master discovery process. The watcher that is set during the read operation on the leader ZK node is also triggered when the node disappears, and leader node disappearance restarts the master discovery process. Because the /es/cluster name/leader node is ephemeral, it disappears as soon as the ES master node that created it disconnects from the system or stops responding to pings. This ensures that a cluster master is always available.

3. Node List Update. If the ES node is elected as master, it reads the list of children of the /es/cluster name/nodes node and starts watching for any changes in this list. Then ES updates the list of nodes in the cluster state to match the list of nodes in ZK. When the watcher on the /es/cluster name/nodes node is triggered, the Node List Update operation is repeated. In other words, the ephemeral nodes in /es/cluster name/nodes perform the role of node fault detection.
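For illustration only, here is a minimal sketch of steps 1 and 2 using the standard Apache ZooKeeper Java client. The class and method names are hypothetical (this is not the plugin's actual code), and the watch/retry loop and most error handling are omitted:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of registration and master election, not the plugin's actual code.
public class ZkDiscoverySketch {
    private final ZooKeeper zk;
    private final String nodesPath;   // e.g. /es/<cluster name>/nodes
    private final String leaderPath;  // e.g. /es/<cluster name>/leader

    public ZkDiscoverySketch(ZooKeeper zk, String clusterName) {
        this.zk = zk;
        this.nodesPath = "/es/" + clusterName + "/nodes";
        this.leaderPath = "/es/" + clusterName + "/leader";
    }

    // Step 1: create the persistent parent if needed (a real implementation
    // would create /es and /es/<cluster name> first), then register this
    // ES node as an ephemeral child holding the serialized DiscoveryNode.
    public void register(String nodeId, byte[] serializedDiscoveryNode)
            throws KeeperException, InterruptedException {
        try {
            zk.create(nodesPath, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException e) {
            // another node already created the parent, which is fine
        }
        zk.create(nodesPath + "/" + nodeId, serializedDiscoveryNode,
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Step 2: attempt to become master by atomically creating the ephemeral
    // leader node; when several nodes race, exactly one create() succeeds.
    public boolean tryBecomeMaster(byte[] serializedLocalNode)
            throws KeeperException, InterruptedException {
        try {
            zk.create(leaderPath, serializedLocalNode, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;  // this node is now the master
        } catch (KeeperException.NodeExistsException e) {
            return false; // lost the race; go back to reading and watching the leader node
        }
    }
}
```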
A Few Important ZooKeeper Notes: A watcher set during any operation is only triggered once. For example, if an ES node reads the list of nodes and sets a watcher for node list changes, this watcher will be triggered only once, even if two child nodes are added. ES uses the trigger only as an indicator that a change occurred, and rereads the complete list of children while setting a new watcher at the same time. There are two exceptions that can be caused by loss of connectivity with the ZK cluster: ConnectionLossException and SessionExpiredException. The first can be thrown during any network hiccup, and the ZK client simply retries the connection when it occurs. When SessionExpiredException is thrown, the ZK session gets closed. In this case, all ephemeral nodes associated with the disconnected node are removed, and the disconnected node has to restart the registration and discovery process. There is a potential issue that can occur when the master loses its connection with ZK but doesn't lose its connection with the other nodes. In this case, ZK may decide that the master is dead and remove its ephemeral node, causing reelection, but the disconnected master might not know about it yet and might try to publish state to the other nodes. This will not be a problem once we switch to ZK-based state publishing. Meanwhile, I am thinking of adding a ZK connection check before publishing state to prevent this problem.
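Because watchers fire only once, the node list handling described in the notes above boils down to a read-and-re-arm pattern: every notification triggers a full reread of the children, which also registers the next watch. A minimal hypothetical sketch of that pattern (again, not the plugin's actual code):

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of the master's node-list watch, not the plugin's actual code.
public class NodeListWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String nodesPath; // e.g. /es/<cluster name>/nodes

    public NodeListWatcher(ZooKeeper zk, String nodesPath) {
        this.zk = zk;
        this.nodesPath = nodesPath;
    }

    // ZooKeeper watchers are one-shot, so every notification triggers a
    // full reread of the child list, which re-arms the watcher.
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
            readAndWatch(); // a real implementation would diff this against the cluster state
        }
    }

    public List<String> readAndWatch() {
        try {
            // getChildren with this watcher both returns the current node
            // list and registers for the next change notification.
            return zk.getChildren(nodesPath, this);
        } catch (KeeperException | InterruptedException e) {
            // A real implementation retries on ConnectionLossException and
            // restarts registration/discovery on SessionExpiredException.
            throw new RuntimeException(e);
        }
    }
}
```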
G'day @imotov, what became of this feature?
This is an initial implementation of a ZooKeeper-based discovery plugin.
Usage: