A method to reduce the time cost to update cluster state #46941

Closed
kkewwei opened this issue Sep 20, 2019 · 8 comments · Fixed by #47817
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes)

Comments

@kkewwei
Contributor

kkewwei commented Sep 20, 2019

ES_VERSION: 5.6.8
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:
       As is well known, updating the cluster state on the master node can take a long time, which seriously limits the size and stability of a cluster. In our production cluster of 50 nodes with 3,000 indices and 60,000 shards, a cluster state update takes 15s or more, so creating or deleting an index is a very poor experience.
       To find out why updating the cluster state takes so long, I captured the thread stack of the updateTask thread:

"elasticsearch[node1][clusterService#updateTask][T#1]" #32 daemon prio=5 os_prio=0 tid=0x00007f5d703a2800 nid=0x8252 runnable [0x00007f5c22b71000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1041)
        at org.elasticsearch.cluster.routing.RoutingNode.shardsWithState(RoutingNode.java:148)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:90)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.getDiskUsage(DiskThresholdDecider.java:320)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canRemain(DiskThresholdDecider.java:265)
        at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canRemain(AllocationDeciders.java:105)
        at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.decideMove(BalancedShardsAllocator.java:687)
        at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.moveShards(BalancedShardsAllocator.java:648)
        at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:123)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:329)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:100)

       I tried this several times and got the same thread stack. It seems that DiskThresholdDecider.sizeOfRelocatingShards takes most of the time; the code is as follows:

static long sizeOfRelocatingShards(RoutingNode node, RoutingAllocation allocation,
                                   boolean subtractShardsMovingAway, String dataPath) {
    ClusterInfo clusterInfo = allocation.clusterInfo();
    long totalSize = 0;
    for (ShardRouting routing : node.shardsWithState(ShardRoutingState.RELOCATING,
                                                     ShardRoutingState.INITIALIZING)) {
        ......
    }
    ......
}

       It shows that, to decide whether a shard can remain on its node, we compute the size of the relocating shards, which means iterating over all the shards on that node (about 6,000 shards per node in our cluster) and checking whether each one is RELOCATING or INITIALIZING. That is for a single shard; with 60,000 shards to decide on, this amounts to roughly 60,000 * 6,000 checks, which is far too expensive.
       I found that the setting "cluster.routing.allocation.disk.include_relocations": "false" avoids this check. When I set it to false, the time to update the cluster state drops from 15s to 3s, which is a much better result.
       Could we either set cluster.routing.allocation.disk.include_relocations to false by default (most users never change the default), or keep track of the RELOCATING and INITIALIZING shards of every node in the cluster state, so that we do not have to scan for them on every cluster state update?
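
(For illustration, a rough, self-contained model of the cost described above; the class and names below are invented for this sketch and are not the actual Elasticsearch code:)

```java
import java.util.ArrayList;
import java.util.List;

// A toy cost model, not the real DiskThresholdDecider code: every allocation
// decision scans all shards on the node to find the few that are moving.
public class ScanCostSketch {

    enum State { STARTED, INITIALIZING, RELOCATING }

    static final class Shard {
        final long sizeInBytes;
        final State state;
        Shard(long sizeInBytes, State state) {
            this.sizeInBytes = sizeInBytes;
            this.state = state;
        }
    }

    // Mirrors the shape of sizeOfRelocatingShards: O(shards on the node) per call.
    static long sizeOfRelocatingShards(List<Shard> shardsOnNode) {
        long totalSize = 0;
        for (Shard shard : shardsOnNode) { // full scan on every call
            if (shard.state == State.RELOCATING || shard.state == State.INITIALIZING) {
                totalSize += shard.sizeInBytes;
            }
        }
        return totalSize;
    }

    public static void main(String[] args) {
        List<Shard> shardsOnNode = new ArrayList<>();
        for (int i = 0; i < 6_000; i++) {                           // ~6,000 shards per node
            shardsOnNode.add(new Shard(1_000_000L, State.STARTED)); // almost all are STARTED
        }
        // The allocator makes one decision per shard in the cluster, so with
        // 60,000 shards each reroute performs roughly 60,000 * 6,000 state checks.
        long decisions = 60_000L;
        System.out.println("state checks per reroute ~= " + decisions * shardsOnNode.size());
        System.out.println("relocating bytes on this node = " + sizeOfRelocatingShards(shardsOnNode));
    }
}
```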

@DaveCTurner DaveCTurner added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 20, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

DaveCTurner commented Sep 22, 2019

>         at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:329)
>         at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:100)

Since #44433 (i.e. 7.4.0) applyStartedShards no longer calls reroute.

> cluster of 50 nodes with 3,000 indices and 60,000 shards ... about 6,000 shards per node

I don't fully understand how these numbers add up, but I can say that this is an unreasonably large number of shards per node. Since #34892 (i.e. 7.0.0) you are forbidden from having so many shards in a cluster of this size.

> set cluster.routing.allocation.disk.include_relocations to false by default

I can't recommend this and don't think it will be acceptable as a default. If you set this setting to false you will continue to experience the kinds of overshoot fixed by #46079.

There may be improvements to make in this area, but we need to investigate whether the issue described here still exists in clusters running a more recent version and with a configuration that's within recommended limits.

@DaveCTurner
Contributor

The time spent finding the initialising and relocating shards when computing relocations is heavily dependent on the number of shards per node. Measuring no-op reroutes in which there are no shards moving (the common case) and the disk threshold decider is in play:

| Shards | Nodes | Shards per node | Reroute time without relocations | Reroute time with relocations |
|--------|-------|-----------------|----------------------------------|-------------------------------|
| 60000  | 10    | 6000            | ~250ms                           | ~15000ms                      |
| 60000  | 60    | 1000            | ~250ms                           | ~4000ms                       |
| 10000  | 10    | 1000            | ~60ms                            | ~250ms                        |

This shows that the savings are significant when you have too many shards per node, but much less dramatic in clusters that respect the 1000-shards-per-node limit.

We could for instance precompute the INITIALIZING and RELOCATING shards on each RoutingNode rather than scanning for them afresh each time. Given the data above I'm not sure the extra complexity is worth it, but I'm marking this for team discussion to get others' inputs.
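
(For illustration, a minimal sketch of such precomputation in plain Java; the names below are invented for this sketch, and it is not the actual RoutingNode implementation, which also has to handle shard state transitions and relocation sources and targets:)

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// A minimal sketch of the precomputation idea, not the actual RoutingNode code:
// keep per-state shard sets up to date as shards are added and removed, so that
// callers such as the disk threshold decider can read the INITIALIZING and
// RELOCATING shards directly instead of scanning every shard on the node.
public class PrecomputedRoutingNodeSketch {

    enum State { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }

    static final class Shard {
        final String id;
        final State state;
        Shard(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    private final Map<String, Shard> shardsById = new LinkedHashMap<>();
    private final Map<State, Set<Shard>> shardsByState = new EnumMap<>(State.class);

    void add(Shard shard) {
        shardsById.put(shard.id, shard);
        Set<Shard> sameState = shardsByState.get(shard.state);
        if (sameState == null) {
            sameState = new LinkedHashSet<>();
            shardsByState.put(shard.state, sameState);
        }
        sameState.add(shard);
    }

    void remove(String shardId) {
        Shard removed = shardsById.remove(shardId);
        if (removed != null) {
            shardsByState.get(removed.state).remove(removed);
        }
    }

    // O(size of the matching sets) instead of O(all shards on the node).
    Set<Shard> shardsWithState(State... states) {
        Set<Shard> result = new LinkedHashSet<>();
        for (State state : states) {
            Set<Shard> matching = shardsByState.get(state);
            if (matching != null) {
                result.addAll(matching);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PrecomputedRoutingNodeSketch node = new PrecomputedRoutingNodeSketch();
        for (int i = 0; i < 6_000; i++) {
            node.add(new Shard("shard-" + i, State.STARTED));
        }
        node.add(new Shard("moving-shard", State.RELOCATING));
        // A no-op reroute now only touches the one moving shard, not all 6,001.
        System.out.println(node.shardsWithState(State.RELOCATING, State.INITIALIZING).size());
    }
}
```

The trade-off is a little extra bookkeeping on every shard add and remove in exchange for cheap lookups in the common no-op reroute case.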

@kkewwei
Contributor Author

kkewwei commented Sep 25, 2019

It's my pleasure to wait for your reply.

@DaveCTurner
Contributor

@kkewwei we discussed this today and think it could be worthwhile to explore adding some precomputation to RoutingNode to track the INITIALIZING and RELOCATING shards for easy access by the disk-based shard allocator. We were not convinced that it is definitely worth pursuing; it depends on how simple the change turns out to be. Would you be interested in opening a PR for that?

@DaveCTurner
Contributor

We also agreed that setting include_relocations to false is a bad idea, so I've opened #47443 to deprecate this setting and will remove it in a future release.

@kkewwei
Contributor Author

kkewwei commented Oct 3, 2019

@DaveCTurner, I would like to do it.

@DaveCTurner
Contributor

Thanks for volunteering @kkewwei, looking forward to seeing your PR.

DaveCTurner pushed a commit that referenced this issue Oct 31, 2019
Today a couple of allocation deciders iterate through all the shards on a node
to find the `INITIALIZING` or `RELOCATING` ones, and this can slow down cluster
state updates in clusters with very high-density nodes holding many thousands
of shards even if those shards belong to closed or frozen indices. This commit
pre-computes the sets of `INITIALIZING` and `RELOCATING` shards to speed up
this search.

Closes #46941
Relates #48579

Co-authored-by: "hongju.xhj" <hongju.xhj@alibaba-inc.com>