Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long balance computation should not delay new index primary assignment #115511

Merged

Conversation

pxsalehi
Copy link
Member

@pxsalehi pxsalehi commented Oct 24, 2024

A long desired balance computation could delay a newly created index shard from being assigned since first the computation has to finish for the assignments to be published and the shards getting assigned. With this change we add a new setting which allows setting a maximum time for a computation in case there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance.

Closes ES-9616

@pxsalehi pxsalehi added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Oct 24, 2024
private TimeValue progressLogInterval;
private long maxBalanceComputationTimeDuringIndexCreationMillis;

public DesiredBalanceComputer(ClusterSettings clusterSettings, ThreadPool threadPool, ShardsAllocator delegateAllocator) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow up, I'll just remove this constructor and replace its usages. Here it would add too much noise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I wish we have a dedicated interface for supplying time.
Often it is not clear if we actually submit tasks or just need a time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. If we're going to use a LongSupplier however, I think it should specify somewhere in the name what unit it's in (e.g. timeSupplierInMilliseconds), but a dedicated interface would be much better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #116333.

@@ -339,6 +361,20 @@ public DesiredBalance compute(
iterations
)
);

if (assignedSomeNewlyCreatedShards
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to be the simplest fix, AFAICT. Similar to how we interrupt the computation when there is a new cluster state.

Copy link
Contributor

@DiannaHohensee DiannaHohensee Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wandering around the balancer code, but I saw this bit of code in the ContinuousComputation instance that makes me wonder if no new computation task will be queued up, if we exit the code like this? I'm thinking if we exit early, with more computation to be done, then we need to make sure another computation is scheduled. I don't fully understand things yet, so I could be wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this seems to be an issue. Today, the computer exits in two ways:

  1. Fresh -> reconcile
  2. Not fresh -> no reconciliation but recompute on new input.

It seems to me we need a 3rd option here where we reconcile and recompute?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the problem is related to the referenced bit of code, but yeah, the assumption here is that we're deferring to the next allocate call to possibly re-attempt the long computation we broke out of. If we think that could be a problem, then we'd need something like Henning mentioned where we are able to re-trigger a new computation immediately.

Not sure how much of an issue that is in practice. @idegtiarenko @DaveCTurner what do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(w/o further digging into this, seems that returning a flag from the compute to signal we'd want to give the same input another compute round might be a solution).

@@ -293,6 +310,11 @@ public DesiredBalance compute(
for (final var shardRouting : routingNode) {
if (shardRouting.initializing()) {
hasChanges = true;
if (shardRouting.primary()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now this only considers primaries. Do we need to consider search replicas for serverless (at least one search replica as we did in #113847)? They could have the same issue, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sounds reasonable. It would be a good idea to include that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Will try in a follow up.

@@ -336,6 +341,143 @@ protected long currentNanoTime() {
}
}

public void testIndexCreationDuringLongDesiredComputation() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An IT doesn't seem to be straight forward (or feasible w/o larger refactoring). I went with this. If there is a simple way of emulating a long desired balance computation, let me know.


var gatewayAllocator = createGatewayAllocator((shardRouting, allocation, unassignedAllocationHandler) -> {
if (shardRouting.getIndexName().equals("index-ignored")) {
unassignedAllocationHandler.removeAndIgnore(UnassignedInfo.AllocationStatus.NO_ATTEMPT, allocation.changes());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this way is the best to produce an ignored shard, but seems consistent with the suite.

@pxsalehi pxsalehi marked this pull request as ready for review October 24, 2024 12:07
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 24, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@idegtiarenko
Copy link
Contributor

I suggest to also merge this to the 8.x branch

@@ -293,6 +310,11 @@ public DesiredBalance compute(
for (final var shardRouting : routingNode) {
if (shardRouting.initializing()) {
hasChanges = true;
if (shardRouting.primary()
&& shardRouting.unassignedInfo() != null
&& shardRouting.unassignedInfo().reason() == UnassignedInfo.Reason.INDEX_CREATED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should consider another reasons?
Could it be an issue for grow/shrink or snapshot restore?
Regardless this could be a followup after a dedicated discussion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah good point. I was under the impression we're mostly addressing cases where this long wait for the assignment could e.g. affect indexing requests, e.g. when the index is auto-created. I guess pending reads/searches could be a compelling reason for some of the other Reasons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, exactly, the index creation happening during the ILM rolover

Comment on lines 388 to 389
// Make sure the computation takes at least a few iterations
final int minIterations = between(3, 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why this is important?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To just simulate a long computation which is multiple iterations (and each round taking a bit of time).

idegtiarenko
idegtiarenko previously approved these changes Oct 25, 2024
@pxsalehi pxsalehi added v8.17.0 auto-backport Automatically create backport pull requests when merged labels Oct 25, 2024
Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple comments. I think we need a 3rd compute-exit option?

@@ -339,6 +361,20 @@ public DesiredBalance compute(
iterations
)
);

if (assignedSomeNewlyCreatedShards
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this seems to be an issue. Today, the computer exits in two ways:

  1. Fresh -> reconcile
  2. Not fresh -> no reconciliation but recompute on new input.

It seems to me we need a 3rd option here where we reconcile and recompute?

@pxsalehi pxsalehi dismissed DiannaHohensee’s stale review November 6, 2024 07:54

A minor coding style opinion difference is no reason to block a PR! :)

@pxsalehi pxsalehi added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Nov 6, 2024
@pxsalehi
Copy link
Member Author

pxsalehi commented Nov 6, 2024

I've marked this for merging as there is no blocking issue. If there is any follow ups you'd like to see, please let me know.

@pxsalehi
Copy link
Member Author

pxsalehi commented Nov 6, 2024

Failure was some windows packaging test: #116299.

@pxsalehi
Copy link
Member Author

pxsalehi commented Nov 6, 2024

Unfortunately, :qa:packaging:destructiveDistroTest.default-windows-archive seems to have some setup issues. I've muted some tests, and asked delivery if the entire suite should be muted. It is not related to his change.

@pxsalehi pxsalehi merged commit 8f24e8e into elastic:main Nov 6, 2024
14 of 17 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 115511

@pxsalehi pxsalehi deleted the ps241024-longBalanceCompDuringIndexCreation branch November 6, 2024 09:59
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Nov 6, 2024
elastic#115511)

A long desired balance computation could delay a newly created index shard from being assigned since first the computation has to finish for the assignments to be published and the shards getting assigned. With this change we add a new setting which allows setting a maximum time for a computation in case there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance.

Closes ES-9616
pxsalehi added a commit that referenced this pull request Nov 7, 2024
#115511) (#116316)

A long desired balance computation could delay a newly created index shard from being assigned since first the computation has to finish for the assignments to be published and the shards getting assigned. With this change we add a new setting which allows setting a maximum time for a computation in case there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance.

Closes ES-9616

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
elasticsearchmachine pushed a commit that referenced this pull request Nov 11, 2024
Relates
#115511 (comment).
`ThreadPool` is used here only to get time. (I've extracted this out of
#116333).
elasticsearchmachine pushed a commit that referenced this pull request Nov 12, 2024
An attempt to use a basic interface for time supplier based on
#115511 (comment).
(TLDR: sometimes we pass around a ThreadPool instance just to be able to
get time. It might be more reasonable to separate those use cases)
javanna pushed a commit to javanna/elasticsearch that referenced this pull request Nov 12, 2024
…116333)

An attempt to use a basic interface for time supplier based on
elastic#115511 (comment).
(TLDR: sometimes we pass around a ThreadPool instance just to be able to
get time. It might be more reasonable to separate those use cases)
jozala pushed a commit that referenced this pull request Nov 13, 2024
#115511)

A long desired balance computation could delay a newly created index shard from being assigned since first the computation has to finish for the assignments to be published and the shards getting assigned. With this change we add a new setting which allows setting a maximum time for a computation in case there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance.

Closes ES-9616
jozala pushed a commit that referenced this pull request Nov 13, 2024
jozala pushed a commit that referenced this pull request Nov 13, 2024
Relates
#115511 (comment).
`ThreadPool` is used here only to get time. (I've extracted this out of
#116333).
jozala pushed a commit that referenced this pull request Nov 13, 2024
An attempt to use a basic interface for time supplier based on
#115511 (comment).
(TLDR: sometimes we pass around a ThreadPool instance just to be able to
get time. It might be more reasonable to separate those use cases)
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 14, 2024
Relates
elastic#115511 (comment).
`ThreadPool` is used here only to get time. (I've extracted this out of
elastic#116333).
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 14, 2024
…116333)

An attempt to use a basic interface for time supplier based on
elastic#115511 (comment).
(TLDR: sometimes we pass around a ThreadPool instance just to be able to
get time. It might be more reasonable to separate those use cases)
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
elastic#115511)

A long desired balance computation could delay a newly created index shard from being assigned since first the computation has to finish for the assignments to be published and the shards getting assigned. With this change we add a new setting which allows setting a maximum time for a computation in case there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance.

Closes ES-9616
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
Relates
elastic#115511 (comment).
`ThreadPool` is used here only to get time. (I've extracted this out of
elastic#116333).
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
…116333)

An attempt to use a basic interface for time supplier based on
elastic#115511 (comment).
(TLDR: sometimes we pass around a ThreadPool instance just to be able to
get time. It might be more reasonable to separate those use cases)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport pending :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants