Slow cluster startup with zen discovery. #5232
Please note that this happens on an empty cluster with no indexes at all, and slow means > 30 minutes. More details are here:
@bluelu I suspect that this might happen because the join request was being too aggressive in its timeouts when waiting for the cluster to ack the join, and got into a continuous joining state. Now the timeout is bigger (#6342), so I think it will help. If not, I would love to help figure out what takes too long.
Thanks. We will observe both when we test and will let you know. Still, the issue with the traffic remains (#6295) (e.g. when you have more than 500 nodes and a few thousand indexes...). Then also larger installations will run quite nicely :-)
@dakrone We see this with 1.2.1. Will test 1.3.0 when it lands.
@avleen have you had a chance to test this on 1.3?
Hey Clinton! We haven't upgraded to 1.3 yet. We plan on doing that in about On Fri, Aug 1, 2014 at 5:33 AM, Clinton Gormley notifications@github.com
Unfortunately we haven't yet had time to update our cluster and test with the new version. We also tried this on a smaller cluster and didn't see the slow startup there, so we must test it with more servers.
We upgraded to 1.3. It starts to take 15 or 30 seconds for each node to join again. Not 15 to
Further update. Is there a 15s timeout here which might be coming into play? On Tue, Aug 12, 2014 at 10:07 AM, Avleen Vig avleen@gmail.com wrote:
/cc @kimchy
The improve_zen branch just landed in master/1.x; this includes a lot of improvements when it comes to forming a cluster. We ran tests with 100s of nodes (with no data in the one I refer to, just to see how quickly a cluster is formed) and the results were very good (less than 30s to form clusters of 100s of nodes). Even with improve_zen, the logic is still similar though: when a node joins, a full cluster state cycle is needed (more lightweight, batched, but still). Maybe the 30s comes from the publish timeout, that's the one that can explain it, but then I don't understand why one of the nodes that is part of the cluster is not answering in the proper time. Maybe next time you can set
@miccon @avleen just pinging to hear about your experiences with cluster startup on v1.4.
I'm working together with @miccon. Except for the slow allocation of unallocated shards (#6372), the joining of nodes is now very fast. This issue can be closed.
thanks for letting us know @bluelu
This happened a while ago but we just upgraded to 1.4.2 and found that joining is pretty much instant now. Thanks everyone!
@avleen happy to hear.
When a cluster with a large number of nodes starts up, the joining of the nodes becomes slow, as the cluster state update is blocking. The master node adds the nodes one by one and waits after every join (zen-disco-receive) for the updated cluster state to be distributed.
This issue occurs in Elasticsearch version 1.0 and is related to #3736, which introduced the wait during the processing of cluster state updates.
When setting discovery.zen.publish_timeout: 0, the startup of the cluster is as fast as before, as the master node does not wait for the individual updates to be acked by the client nodes.
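To make the effect of the blocking publish concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch, not Elasticsearch code: the function name `startup_wait`, the 500-node count, and the 5-second ack delay are all made-up illustration values. It assumes the master waits after each join for the slowest ack, capped by the publish timeout.

```python
def startup_wait(ack_delays, publish_timeout):
    """Seconds the master spends blocked while nodes join one by one.

    ack_delays[i] is a hypothetical time the slowest node needs to ack
    the cluster state published after join number i.
    """
    total = 0.0
    for delay in ack_delays:
        # The master blocks until the acks arrive or the timeout expires,
        # whichever comes first.
        total += min(delay, publish_timeout)
    return total


# Assumed scenario: 500 joins, each publish takes 5 s to be fully acked.
delays = [5.0] * 500
blocking = startup_wait(delays, publish_timeout=30.0)
non_blocking = startup_wait(delays, publish_timeout=0.0)
print(blocking, non_blocking)
```

Under these assumed numbers the waits add up linearly with the number of joins, while a timeout of 0 removes the wait entirely, which matches the behaviour described above.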
One solution might be for the master to process all pending cluster state updates first and distribute the state only after all of them have been applied. Another would be for the master not to wait for the state to be acked by the client nodes during startup.
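The first proposal, applying several joins before publishing once, can be sketched with the same kind of toy model. Again this is an illustration under assumptions, not how Elasticsearch implements it: `publishes_needed`, `batched_wait`, the batch sizes, and the 5-second ack delay are hypothetical.

```python
def publishes_needed(num_joins, batch_size):
    """Number of cluster state publishes when joins are grouped into batches."""
    return -(-num_joins // batch_size)  # ceiling division


def batched_wait(num_joins, batch_size, ack_delay, publish_timeout):
    """Seconds the master is blocked if it publishes once per batch of joins."""
    per_publish = min(ack_delay, publish_timeout)
    return publishes_needed(num_joins, batch_size) * per_publish


# Assumed scenario: 500 joins, 5 s until a publish is acked, 30 s timeout.
one_by_one = batched_wait(500, batch_size=1, ack_delay=5.0, publish_timeout=30.0)
all_at_once = batched_wait(500, batch_size=500, ack_delay=5.0, publish_timeout=30.0)
print(one_by_one, all_at_once)
```

In this model the blocked time scales with the number of publishes rather than the number of nodes, so larger batches shrink the startup wait without disabling the ack wait altogether.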