Constraints to de-prioritize nodes from becoming shard allocation targets #43350
Comments
Pinging @elastic/es-distributed
Thanks for the suggestion @vigyasharma. You've clearly put some effort into this idea. We discussed this recently as a team and drew two conclusions:
To expand on this second point, it sounds like you might be misusing the `total-shards-per-node` setting. Would you please describe your cluster and its configuration in significantly more detail? Also, would you prepare a failing test case (e.g. an …)?
Thanks @DaveCTurner for discussing this with the team, and apologies for the delay in response (I was on vacation). I don't think this should weaken the … Let me provide more context on where this change helps.

The problem with … In such scenarios, overall cluster workload may go down because one team scales down its traffic, while other teams are still unaware of these changes. This makes revising the setting at the index level non-trivial and a cluster-management overhead.

Most users add this setting at the index level to avoid colocating shards on one node during node replacements and new node additions (i.e. as the workaround to the problem addressed in this issue). If the shard balancer can handle this implicitly, it removes the cognitive and management overhead of setting or revising these limits. Admins can scale down their fleet if net workload is reduced, without individual indexes getting impacted.
Welcome back @vigyasharma, hope you had a good break. The concerns you describe are broadly addressed by the plan for finer-grained constraints laid out in #17213 (comment) and (as I commented there in reply to you a few months back) predicting the constraints dynamically is out of scope for now. Let's walk before we start to try and run, and let's try and keep this conversation focussed on this particular issue. Just a reminder that we're waiting for more details about your proposal.
Closing this as no further details were provided.
The Problem
In clusters where each node has a reasonable number of shards, when a node is replaced (or a small number of nodes added), all shards of newly created indexes get allocated on the new, empty node(s). This puts a lot of stress on the single (or few) new node(s), deteriorating overall cluster performance.
This happens because the `index-balance` factor in the weight function is unable to offset `shard-balance` when the other nodes are filled with shards. The new node therefore always has the minimum weight and gets picked as the target for allocation, until it holds approximately the mean number of shards per node in the cluster. Further, `index-balance` does kick in once the node is relatively full (compared to other nodes), and it then moves out the recently allocated shards. Thus for the same shards, which are newly created and usually receiving indexing traffic, we end up doing the allocation work twice.
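For context, the balancer weight behind this behaviour is, roughly, a weighted sum of how far a node's total shard count and its per-index shard count sit from the cluster averages. A minimal sketch with made-up numbers (simplified from `BalancedShardsAllocator`; 0.45 and 0.55 are the default shard-balance and index-balance factors) shows why a new, empty node keeps the minimum weight:

```java
public class BalancerWeightExample {

    // Simplified form of the balancer weight: a weighted sum of how far the node's
    // total shard count and its per-index shard count are from the cluster averages.
    static float weight(float thetaShard, float thetaIndex,
                        int nodeShards, float avgShardsPerNode,
                        int nodeIndexShards, float avgIndexShardsPerNode) {
        return thetaShard * (nodeShards - avgShardsPerNode)
             + thetaIndex * (nodeIndexShards - avgIndexShardsPerNode);
    }

    public static void main(String[] args) {
        // Hypothetical cluster: three nodes with 30 shards each plus one new, empty node.
        // A brand-new index is being allocated, so per-index counts are 0 everywhere.
        float avgShardsPerNode = (30 + 30 + 30 + 0) / 4.0f;   // 22.5
        float thetaShard = 0.45f / (0.45f + 0.55f);            // default shard-balance factor
        float thetaIndex = 0.55f / (0.45f + 0.55f);            // default index-balance factor

        float emptyNode = weight(thetaShard, thetaIndex, 0, avgShardsPerNode, 0, 0f);
        float fullNode  = weight(thetaShard, thetaIndex, 30, avgShardsPerNode, 0, 0f);

        // Prints roughly -10.1 vs 3.4: the empty node keeps the minimum weight even
        // after receiving several shards of the new index, so it keeps being
        // selected as the allocation target.
        System.out.println("empty node: " + emptyNode + ", existing node: " + fullNode);
    }
}
```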
Current Workaround
A workaround for this is to set the `total-shards-per-node` index/cluster setting, which limits the number of shards of an index on a single node. This, however, is a hard limit enforced by allocation deciders. If it is breached on all nodes, the shards go unassigned, causing yellow/red clusters. Configuring this setting requires careful calculation around the number of nodes, and it must be revised when the cluster is scaled down.
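For illustration, assuming this refers to the standard `index.routing.allocation.total_shards_per_node` setting (there is also a cluster-level `cluster.routing.allocation.total_shards_per_node` variant), the index-level limit is applied like this:

```
PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}
```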
Proposal
We propose an allocation constraint mechanism that de-prioritizes nodes from getting picked for allocation if they breach certain constraints. Whenever an allocation constraint is breached for a shard (or index) on a node, we add a high positive constant to that node's weight. The increased weight makes the node less preferable for allocating the shard. Unlike deciders, however, this is not a hard filter: if no other nodes are eligible to accept shards (say, due to deciders like disk watermarks), the shard can still be allocated on a node with breached constraints.
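A minimal sketch of this mechanism, using hypothetical names (`AllocationConstraint`, `NodeShardCounts`, `ConstraintAwareWeight`, `CONSTRAINT_WEIGHT`) that are purely illustrative and not taken from the actual Elasticsearch code:

```java
import java.util.List;
import java.util.Map;

// Hypothetical: a constraint is just a boolean predicate over (node, index).
interface AllocationConstraint {
    boolean isBreached(NodeShardCounts node, String index);
}

// Minimal per-node view: total shards on the node and shards per index on the node.
record NodeShardCounts(int totalShards, Map<String, Integer> shardsPerIndex) {}

class ConstraintAwareWeight {
    // Large enough that one breach dominates ordinary shard-count differences,
    // but finite: a node with breached constraints can still be chosen if the
    // allocation deciders rule out every other node.
    private static final float CONSTRAINT_WEIGHT = 1_000_000f;

    private final List<AllocationConstraint> constraints;

    ConstraintAwareWeight(List<AllocationConstraint> constraints) {
        this.constraints = constraints;
    }

    float weight(NodeShardCounts node, String index, float balancerWeight) {
        long breached = constraints.stream().filter(c -> c.isBreached(node, index)).count();
        // Each breached constraint pushes the node one "step" up; nodes on a lower
        // step always sort before nodes on a higher step.
        return balancerWeight + breached * CONSTRAINT_WEIGHT;
    }
}
```

Keeping the penalty finite, rather than treating a breach as a hard filter, is what preserves the behaviour described above: a constrained node remains a valid last resort.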
Index Shards Per Node Constraint
This constraint controls the number of shards of an index on a single node. `indexShardsPerNodeConstraint` is breached if the number of shards of an index allocated on a node exceeds the average shards per node for that index. When `indexShardsPerNodeConstraint` is breached, shards of the newly created index get assigned to other nodes instead, preventing indexing hot spots. After unassigned-shard allocation, rebalancing fills up the node with other index shards, without having to move out the already allocated shards. Since this does not prevent allocation, we do not run into unassigned shards due to breached constraints.

The allocator flow now becomes:
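A rough sketch of one way this flow could look, reusing the hypothetical `AllocationConstraint`, `NodeShardCounts`, and `ConstraintAwareWeight` types from the sketch above (deciders still act as the hard filter; the constraint only reorders the remaining candidates):

```java
import java.util.Comparator;
import java.util.Map;

class ConstraintAwareAllocator {

    // Hypothetical constraint: breached when a node already holds more shards of
    // the index than the average number of that index's shards per data node.
    static AllocationConstraint indexShardsPerNode(int indexShardCount, int dataNodeCount) {
        float avg = (float) indexShardCount / dataNodeCount;
        return (node, index) -> node.shardsPerIndex().getOrDefault(index, 0) > avg;
    }

    // For each unassigned shard: among the nodes the deciders left eligible, pick
    // the one with the lowest constraint-aware weight. A node with breached
    // constraints is only chosen when every "clean" node has been filtered out.
    static String pickTarget(String index,
                             Map<String, NodeShardCounts> eligibleNodes,
                             Map<String, Float> balancerWeights,
                             ConstraintAwareWeight weightFunction) {
        return eligibleNodes.entrySet().stream()
            .min(Comparator.comparingDouble((Map.Entry<String, NodeShardCounts> e) ->
                weightFunction.weight(e.getValue(), index, balancerWeights.get(e.getKey()))))
            .map(Map.Entry::getKey)
            .orElseThrow(); // no eligible nodes at all: the shard stays unassigned
    }
}
```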
Extension
This framework can be extended with other rules by modeling them as boolean constraints. One possible example is primary shard density on a node, as required by issue #41543: a constraint that requires `#primaries on a node <= avg-primaries on a node` could prevent scenarios with all primaries on a few nodes and all replicas on others.
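In the same hypothetical shape as the sketches above, and assuming the per-node view were extended with a primary-shard count, such a rule would be just another predicate:

```java
// Hypothetical: breached when a node holds more primaries than the average number
// of primaries per data node (the skew described in #41543). Assumes the node
// view also exposes a primaryShards() count alongside the per-index counts.
static AllocationConstraint primariesPerNode(int totalPrimaries, int dataNodeCount) {
    float avgPrimaries = (float) totalPrimaries / dataNodeCount;
    return (node, index) -> node.primaryShards() > avgPrimaries;
}
```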
Comparing Multiple Constraints
For every constraint breached, we add a high positive constant to the weight function. This is a step-function approach where all nodes on a lower step (lower weight) are considered more eligible than those on a higher step. The high positive constant ensures that a node's weight doesn't reach the next step unless it breaches another constraint.

Since the constant is added for every constraint breached, i.e. `c * HIGH_CONSTANT`, nodes with one constraint breached are preferred over nodes with two constraints breached, and so on. All nodes with the same number of constraints breached simply fall back to the first part of the weight function, which is based on shard count.

We could potentially assign different weights to different constraints, with some sort of ranking among them, but this can quickly become a maintenance and extension nightmare: adding a new constraint would require going through every other constraint's weight and deciding on the right weight and place in the priority order.
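A small worked example of the resulting ordering, with made-up base weights:

```java
public class ConstraintOrderingExample {
    public static void main(String[] args) {
        final float HIGH_CONSTANT = 1_000_000f;
        // Made-up balancer base weights and breach counts for three candidate nodes.
        float nodeA =   3.4f + 0 * HIGH_CONSTANT;  //         3.4  -> no breaches
        float nodeB = -10.1f + 1 * HIGH_CONSTANT;  //   999 989.9  -> one breach
        float nodeC = -12.0f + 2 * HIGH_CONSTANT;  // 1 999 988.0  -> two breaches
        // Breach count dominates the ordering (A before B before C); among nodes
        // with the same breach count, the ordinary shard-count-based weight decides.
        System.out.printf("A=%.1f  B=%.1f  C=%.1f%n", nodeA, nodeB, nodeC);
    }
}
```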
Next Steps
If this idea makes sense and seems like a useful addition for the Elasticsearch community, we can raise a PR for these changes.
This change is targeted at solving the problems listed in #17213, #37638, #41543, #12279, and #29437.