Long balance computation should not delay new index primary assignment #115511

pxsalehi · 2024-10-24T10:09:20Z

A long desired balance computation can delay the assignment of a newly created index's shards, because the computation has to finish before the assignments are published and the shards are assigned. This change adds a new setting that sets a maximum time for a computation when there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance.

Closes ES-9616
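To make the idea in the description concrete, here is a minimal sketch of an iterative computation that stops early once a time budget is exceeded, but only when stopping would unblock newly created primaries. Every name in it (`EarlyExitComputationSketch`, `Step`, `Outcome`, the parameter names) is hypothetical and chosen for illustration; it is not the code in `DesiredBalanceComputer`.

```java
import java.util.function.LongSupplier;

/** Hypothetical, simplified stand-in for an iterative balance computation. */
final class EarlyExitComputationSketch {

    /** Why the loop stopped. */
    enum Outcome { CONVERGED, STOPPED_EARLY_FOR_NEW_PRIMARIES }

    /** One iteration of work; returns true once the computation has converged. */
    interface Step {
        boolean runAndCheckConverged();
    }

    static Outcome compute(
        Step step,
        LongSupplier timeSupplierMillis,
        long maxTimeDuringIndexCreationMillis,
        boolean assignedSomeNewlyCreatedShards
    ) {
        final long startMillis = timeSupplierMillis.getAsLong();
        while (true) {
            if (step.runAndCheckConverged()) {
                return Outcome.CONVERGED;
            }
            final long elapsedMillis = timeSupplierMillis.getAsLong() - startMillis;
            // Cut the computation short only when that unblocks newly created primaries.
            if (assignedSomeNewlyCreatedShards && elapsedMillis > maxTimeDuringIndexCreationMillis) {
                return Outcome.STOPPED_EARLY_FOR_NEW_PRIMARIES;
            }
        }
    }
}
```

In this sketch the time budget has no effect unless the flag is set, so a computation that is not holding up a new index still runs to convergence.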


pxsalehi · 2024-10-24T11:46:31Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

    private TimeValue progressLogInterval;
+    private long maxBalanceComputationTimeDuringIndexCreationMillis;

    public DesiredBalanceComputer(ClusterSettings clusterSettings, ThreadPool threadPool, ShardsAllocator delegateAllocator) {


In a follow up, I'll just remove this constructor and replace its usages. Here it would add too much noise.

👍 I wish we had a dedicated interface for supplying time.
Often it is not clear whether we actually submit tasks or just need the time.

Agreed. If we're going to use a LongSupplier however, I think it should specify somewhere in the name what unit it's in (e.g. timeSupplierInMilliseconds), but a dedicated interface would be much better.

Opened #116333.
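As an illustration of the suggestion in this thread, a dedicated time interface with the unit in the method name could look roughly like the sketch below. `TimeProviderSketch` and `ComputationTimerSketch` are hypothetical names invented here; this is not the interface that #116333 introduces.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical time source; the unit is part of the method name to avoid ambiguity. */
interface TimeProviderSketch {
    long relativeTimeInMillis();

    /** A provider backed by the JVM's monotonic clock. */
    static TimeProviderSketch system() {
        return () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }
}

/** Hypothetical consumer that needs the time only, not a thread pool. */
final class ComputationTimerSketch {
    private final TimeProviderSketch timeProvider;
    private final long startMillis;

    ComputationTimerSketch(TimeProviderSketch timeProvider) {
        this.timeProvider = timeProvider;
        this.startMillis = timeProvider.relativeTimeInMillis();
    }

    long elapsedMillis() {
        return timeProvider.relativeTimeInMillis() - startMillis;
    }
}
```

A test can pass a lambda over a counter as the provider, which makes elapsed-time logic deterministic without constructing a thread pool.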

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

pxsalehi · 2024-10-24T11:48:34Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

@@ -339,6 +361,20 @@ public DesiredBalance compute(
                    iterations
                )
            );
+
+            if (assignedSomeNewlyCreatedShards


This seemed to be the simplest fix, AFAICT. Similar to how we interrupt the computation when there is a new cluster state.

I'm still wandering around the balancer code, but this bit of code in the ContinuousComputation instance makes me wonder whether no new computation task will be queued if we exit like this. If we exit early with more computation still to be done, then we need to make sure another computation is scheduled. I don't fully understand things yet, so I could be wrong.

Yeah, this seems to be an issue. Today, the computer exits in two ways:

Fresh -> reconcile

Not fresh -> no reconciliation but recompute on new input.

It seems to me we need a 3rd option here where we reconcile and recompute?

I am not sure the problem is related to the referenced bit of code, but yeah, the assumption here is that we're deferring to the next allocate call to possibly re-attempt the long computation we broke out of. If we think that could be a problem, then we'd need something like Henning mentioned where we are able to re-trigger a new computation immediately.

Not sure how much of an issue that is in practice. @idegtiarenko @DaveCTurner what do you think?

(Without digging further into this, it seems that returning a flag from the compute call, to signal that we'd want to give the same input another compute round, might be a solution.)
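The three exit paths discussed in this thread can be sketched as an outcome type that tells the caller whether to reconcile, recompute, or both. The enum and its constants are hypothetical names for illustration only; the thread does not settle on a design.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical outcome of one compute round; names are illustrative only. */
enum ComputeOutcomeSketch {
    // Converged on the latest input: reconcile, nothing more to compute.
    FRESH(true, false),
    // Newer input arrived: skip reconciliation and recompute on the new input.
    NOT_FRESH(false, true),
    // Proposed third option: reconcile the partial result and compute again.
    STOPPED_EARLY(true, true);

    final boolean reconcile;
    final boolean recompute;

    ComputeOutcomeSketch(boolean reconcile, boolean recompute) {
        this.reconcile = reconcile;
        this.recompute = recompute;
    }

    /** Lists the follow-up actions a caller would take for this outcome. */
    List<String> followUpActions() {
        List<String> actions = new ArrayList<>();
        if (reconcile) {
            actions.add("reconcile");
        }
        if (recompute) {
            actions.add("recompute");
        }
        return actions;
    }
}
```

Modelling the result as two independent flags is one way to express the "reconcile and recompute" case without special-casing it in the caller.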

pxsalehi · 2024-10-24T11:51:44Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

@@ -293,6 +310,11 @@ public DesiredBalance compute(
                for (final var shardRouting : routingNode) {
                    if (shardRouting.initializing()) {
                        hasChanges = true;
+                        if (shardRouting.primary()


For now this only considers primaries. Do we need to consider search replicas for serverless (at least one search replica as we did in #113847)? They could have the same issue, I think.

Yeah, sounds reasonable. It would be a good idea to include that.

Ok. Will try in a follow up.

...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java

pxsalehi · 2024-10-24T11:55:06Z

...g/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceShardsAllocatorTests.java

@@ -336,6 +341,143 @@ protected long currentNanoTime() {
        }
    }

+    public void testIndexCreationDuringLongDesiredComputation() {


An IT doesn't seem to be straightforward (or feasible without a larger refactoring), so I went with this. If there is a simple way of emulating a long desired balance computation, let me know.

pxsalehi · 2024-10-24T11:59:16Z

...g/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceShardsAllocatorTests.java

+
+        var gatewayAllocator = createGatewayAllocator((shardRouting, allocation, unassignedAllocationHandler) -> {
+            if (shardRouting.getIndexName().equals("index-ignored")) {
+                unassignedAllocationHandler.removeAndIgnore(UnassignedInfo.AllocationStatus.NO_ATTEMPT, allocation.changes());


I am not sure if this way is the best to produce an ignored shard, but seems consistent with the suite.

elasticsearchmachine · 2024-10-24T12:08:17Z

Pinging @elastic/es-distributed (Team:Distributed)

idegtiarenko · 2024-10-25T08:56:47Z

I suggest also merging this to the 8.x branch.

idegtiarenko · 2024-10-25T09:04:14Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

@@ -293,6 +310,11 @@ public DesiredBalance compute(
                for (final var shardRouting : routingNode) {
                    if (shardRouting.initializing()) {
                        hasChanges = true;
+                        if (shardRouting.primary()
+                            && shardRouting.unassignedInfo() != null
+                            && shardRouting.unassignedInfo().reason() == UnassignedInfo.Reason.INDEX_CREATED) {


I wonder if we should consider other reasons?
Could it be an issue for grow/shrink or snapshot restore?
Regardless, this could be a follow-up after a dedicated discussion.

hmm, yeah good point. I was under the impression we're mostly addressing cases where this long wait for the assignment could e.g. affect indexing requests, e.g. when the index is auto-created. I guess pending reads/searches could be a compelling reason for some of the other Reasons.

Yep, exactly, the index creation happening during the ILM rollover.
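The condition under review, and the question of whether more reasons should qualify, can be sketched as a predicate over a configurable set of reasons. The `Reason` enum below is a local stand-in with a few illustrative values, and `NewShardPredicateSketch` is a hypothetical name; none of this is the Elasticsearch `UnassignedInfo` API.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical predicate mirroring the shape of the check in the diff. */
final class NewShardPredicateSketch {

    /** Local stand-in for unassigned reasons; only a few illustrative values. */
    enum Reason { INDEX_CREATED, NEW_INDEX_RESTORED, NODE_LEFT }

    /** The set matching the diff as reviewed: index creation only. */
    static final Set<Reason> INDEX_CREATION_ONLY = EnumSet.of(Reason.INDEX_CREATED);

    /** True for a primary whose unassigned reason is in the eligible set. */
    static boolean shouldPublishEarly(boolean primary, Reason unassignedReason, Set<Reason> eligibleReasons) {
        return primary && unassignedReason != null && eligibleReasons.contains(unassignedReason);
    }
}
```

Passing the eligible reasons as a set would let a follow-up widen the condition, for example to restores, without touching the call site.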

idegtiarenko · 2024-10-25T09:10:00Z

...g/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceShardsAllocatorTests.java

+        // Make sure the computation takes at least a few iterations
+        final int minIterations = between(3, 10);


Could you please explain why this is important?

Just to simulate a long computation that spans multiple iterations, with each round taking a bit of time.
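One way to simulate such a computation deterministically in a unit test is a clock that advances by a fixed step on every read. The sketch below uses hypothetical names (`SteppingClockSketch`, `LongComputationSimulation`) and is not taken from `DesiredBalanceShardsAllocatorTests`.

```java
import java.util.function.LongSupplier;

/** Hypothetical fake clock: every read moves time forward by a fixed step. */
final class SteppingClockSketch implements LongSupplier {
    private final long stepMillis;
    private long nowMillis;

    SteppingClockSketch(long stepMillis) {
        this.stepMillis = stepMillis;
    }

    @Override
    public long getAsLong() {
        nowMillis += stepMillis;
        return nowMillis;
    }
}

/** Hypothetical loop that runs until its time budget is exceeded. */
final class LongComputationSimulation {

    /** Returns how many iterations ran before the elapsed time exceeded the budget. */
    static int iterationsUntilBudgetExceeded(LongSupplier clockMillis, long budgetMillis) {
        final long startMillis = clockMillis.getAsLong();
        int iterations = 0;
        do {
            iterations++;
        } while (clockMillis.getAsLong() - startMillis <= budgetMillis);
        return iterations;
    }
}
```

With a step of 10 ms and a budget of 25 ms, the loop runs three iterations, so the number of rounds follows directly from the chosen step and budget.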

henningandersen

Left a couple comments. I think we need a 3rd compute-exit option?

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

henningandersen · 2024-10-28T10:08:31Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

@@ -339,6 +361,20 @@ public DesiredBalance compute(
                    iterations
                )
            );
+
+            if (assignedSomeNewlyCreatedShards



A minor coding style opinion difference is no reason to block a PR! :)

pxsalehi · 2024-11-06T07:55:47Z

I've marked this for merging as there is no blocking issue. If there are any follow-ups you'd like to see, please let me know.


pxsalehi · 2024-11-06T08:48:05Z

The failure was in a Windows packaging test: #116299.

pxsalehi · 2024-11-06T09:58:00Z

Unfortunately, :qa:packaging:destructiveDistroTest.default-windows-archive seems to have some setup issues. I've muted some tests and asked delivery whether the entire suite should be muted. It is not related to this change.

elasticsearchmachine · 2024-11-06T09:59:40Z

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 115511`


Relates #115511 (comment). `ThreadPool` is used here only to get time. (I've extracted this out of #116333).

An attempt to use a basic interface for time supplier based on #115511 (comment). (TLDR: sometimes we pass around a ThreadPool instance just to be able to get time. It might be more reasonable to separate those use cases)


Long balance computation should not delay new index primary assignment

f6be547

pxsalehi added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Oct 24, 2024

elasticsearchmachine added the v9.0.0 label Oct 24, 2024

pxsalehi added 2 commits October 24, 2024 13:44

fix test

f5d8137

Merge remote-tracking branch 'upstream/main' into ps241024-longBalanc…

6c4a6f1

…eCompDuringIndexCreation

pxsalehi commented Oct 24, 2024

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

pxsalehi commented Oct 24, 2024

...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java

pxsalehi commented Oct 24, 2024

pxsalehi requested review from DaveCTurner, idegtiarenko, DiannaHohensee and henningandersen October 24, 2024 12:07

pxsalehi marked this pull request as ready for review October 24, 2024 12:07

elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 24, 2024

idegtiarenko reviewed Oct 25, 2024

idegtiarenko previously approved these changes Oct 25, 2024

add comments

e8477f0

pxsalehi added v8.17.0 auto-backport Automatically create backport pull requests when merged labels Oct 25, 2024

spotless

15a35eb

henningandersen reviewed Oct 28, 2024

Merge remote-tracking branch 'upstream/main' into ps241024-longBalanc…

3be4d07

…eCompDuringIndexCreation

pxsalehi added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Nov 6, 2024

Merge remote-tracking branch 'upstream/main' into ps241024-longBalanc…

53436bf

…eCompDuringIndexCreation

pxsalehi merged commit 8f24e8e into elastic:main Nov 6, 2024
14 of 17 checks passed

elasticsearchmachine added the backport pending label Nov 6, 2024

pxsalehi deleted the ps241024-longBalanceCompDuringIndexCreation branch November 6, 2024 09:59

This was referenced Nov 6, 2024

[8.x] Long balance computation should not delay new index primary assignment #116316

Merged

Refactor DesiredBalanceComputer#hasComputationConverged #116331

Merged

Use a time supplier interface instead of passing ThreadPool #116333

Merged

elasticsearchmachine pushed a commit that referenced this pull request Nov 6, 2024

Refactor hasComputationConverged (#116331)

8db9181

See #115511 (comment).

pxsalehi mentioned this pull request Nov 11, 2024

Do not pass ThreadPool to DesiredBalanceComputer #116590

Merged

elasticsearchmachine pushed a commit that referenced this pull request Nov 11, 2024

Do not pass ThreadPool to DesiredBalanceComputer (#116590)

2cbc657

Relates #115511 (comment). `ThreadPool` is used here only to get time. (I've extracted this out of #116333).

jozala pushed a commit that referenced this pull request Nov 13, 2024

Refactor hasComputationConverged (#116331)

9dfb331

See #115511 (comment).

jozala pushed a commit that referenced this pull request Nov 13, 2024

Do not pass ThreadPool to DesiredBalanceComputer (#116590)

8c8ad7a

Relates #115511 (comment). `ThreadPool` is used here only to get time. (I've extracted this out of #116333).

alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024

Refactor hasComputationConverged (elastic#116331)

b74e50a

See elastic#115511 (comment).


