Long balance computation should not delay new index primary assignment #115511
private TimeValue progressLogInterval;
private long maxBalanceComputationTimeDuringIndexCreationMillis;

public DesiredBalanceComputer(ClusterSettings clusterSettings, ThreadPool threadPool, ShardsAllocator delegateAllocator) {
In a follow-up, I'll remove this constructor and replace its usages. Doing it here would add too much noise.
👍 I wish we had a dedicated interface for supplying time.
Often it is not clear whether we actually submit tasks or just need the time.
Agreed. If we're going to use a LongSupplier, however, I think it should specify somewhere in the name what unit it's in (e.g. timeSupplierInMilliseconds), but a dedicated interface would be much better.
Opened #116333.
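To illustrate the idea discussed in this thread, here is a minimal sketch of a time-only interface, assuming nothing about what #116333 actually introduced. The names `TimeSource`, `SystemTimeSource`, `FakeTimeSource`, and `ElapsedTimer` are hypothetical and are not Elasticsearch APIs. The point is that a consumer declares it only needs the time, and the unit appears in the method name.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical interface (not an Elasticsearch API): supplies time and nothing else. */
interface TimeSource {
    /** Monotonic time in milliseconds; the unit is part of the method name. */
    long relativeTimeInMillis();
}

/** Production-style implementation backed by System.nanoTime(). */
final class SystemTimeSource implements TimeSource {
    @Override
    public long relativeTimeInMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }
}

/** Test double whose clock only moves when the test advances it. */
final class FakeTimeSource implements TimeSource {
    private long nowMillis;

    void advanceMillis(long millis) {
        nowMillis += millis;
    }

    @Override
    public long relativeTimeInMillis() {
        return nowMillis;
    }
}

/** Hypothetical consumer: it needs the time but never schedules tasks. */
final class ElapsedTimer {
    private final TimeSource timeSource;
    private final long startMillis;

    ElapsedTimer(TimeSource timeSource) {
        this.timeSource = timeSource;
        this.startMillis = timeSource.relativeTimeInMillis();
    }

    long elapsedMillis() {
        return timeSource.relativeTimeInMillis() - startMillis;
    }
}
```

With this shape, a test can pass a `FakeTimeSource` and advance it deterministically instead of constructing a full thread pool just to read the clock.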
...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java
@@ -339,6 +361,20 @@ public DesiredBalance compute(
iterations
)
);

if (assignedSomeNewlyCreatedShards
This seemed to be the simplest fix, AFAICT. Similar to how we interrupt the computation when there is a new cluster state.
I'm still finding my way around the balancer code, but this bit of code in the ContinuousComputation instance makes me wonder whether any new computation task will be queued if we exit like this. If we exit early with more computation still to be done, we need to make sure another computation is scheduled. I don't fully understand things yet, so I could be wrong.
Yeah, this seems to be an issue. Today, the computer exits in two ways:
- Fresh -> reconcile
- Not fresh -> no reconciliation but recompute on new input.
It seems to me we need a 3rd option here where we reconcile and recompute?
I am not sure the problem is related to the referenced bit of code, but yeah, the assumption here is that we're deferring to the next allocate call to possibly re-attempt the long computation we broke out of. If we think that could be a problem, then we'd need something like Henning mentioned, where we are able to re-trigger a new computation immediately.
Not sure how much of an issue that is in practice. @idegtiarenko @DaveCTurner what do you think?
(w/o further digging into this, it seems that returning a flag from the compute to signal we'd want to give the same input another compute round might be a solution).
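The three exit modes discussed in this thread can be sketched as a small enum. This is only an illustration of the proposal, not code from the PR: `ComputeOutcome` and its members are hypothetical names, and the mapping from flags to outcomes is an assumption based on the comments above.

```java
/** Hypothetical result type (not from the PR) describing how a compute round ended. */
enum ComputeOutcome {
    /** Input still current and computation finished: reconcile, nothing more to compute. */
    FRESH(true, false),
    /** Newer input arrived: skip reconciliation and recompute on the new input. */
    STALE(false, true),
    /** Proposed third option: publish what we have, then give the same input another round. */
    EARLY_EXIT(true, true);

    final boolean reconcile;
    final boolean recompute;

    ComputeOutcome(boolean reconcile, boolean recompute) {
        this.reconcile = reconcile;
        this.recompute = recompute;
    }

    /** Assumed mapping: staleness wins, otherwise an early exit asks for another round. */
    static ComputeOutcome of(boolean inputIsFresh, boolean exitedEarly) {
        if (inputIsFresh == false) {
            return STALE;
        }
        return exitedEarly ? EARLY_EXIT : FRESH;
    }
}
```

A caller could then reconcile when `reconcile` is true and re-queue a computation when `recompute` is true, which covers the "reconcile and recompute" case without relying on the next allocate call.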
@@ -293,6 +310,11 @@ public DesiredBalance compute(
for (final var shardRouting : routingNode) {
    if (shardRouting.initializing()) {
        hasChanges = true;
        if (shardRouting.primary()
For now this only considers primaries. Do we need to consider search replicas for serverless (at least one search replica as we did in #113847)? They could have the same issue, I think.
Yeah, sounds reasonable. It would be a good idea to include that.
Ok. Will try in a follow up.
...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java
@@ -336,6 +341,143 @@ protected long currentNanoTime() {
}
}

public void testIndexCreationDuringLongDesiredComputation() {
An IT doesn't seem to be straightforward (or feasible w/o larger refactoring), so I went with this. If there is a simple way of emulating a long desired balance computation, let me know.
var gatewayAllocator = createGatewayAllocator((shardRouting, allocation, unassignedAllocationHandler) -> {
    if (shardRouting.getIndexName().equals("index-ignored")) {
        unassignedAllocationHandler.removeAndIgnore(UnassignedInfo.AllocationStatus.NO_ATTEMPT, allocation.changes());
I am not sure this is the best way to produce an ignored shard, but it seems consistent with the suite.
Pinging @elastic/es-distributed (Team:Distributed)
I suggest also merging this to the 8.x branch.
@@ -293,6 +310,11 @@ public DesiredBalance compute(
for (final var shardRouting : routingNode) {
    if (shardRouting.initializing()) {
        hasChanges = true;
        if (shardRouting.primary()
            && shardRouting.unassignedInfo() != null
            && shardRouting.unassignedInfo().reason() == UnassignedInfo.Reason.INDEX_CREATED) {
I wonder if we should consider other reasons as well.
Could it be an issue for grow/shrink or snapshot restore?
Regardless, this could be a follow-up after a dedicated discussion.
Hmm, yeah, good point. I was under the impression we're mostly addressing cases where this long wait for the assignment could affect indexing requests, e.g. when the index is auto-created. I guess pending reads/searches could be a compelling reason for some of the other Reason values.
Yep, exactly: the index creation that happens during the ILM rollover.
// Make sure the computation takes at least a few iterations
final int minIterations = between(3, 10);
Could you please explain why this is important?
Just to simulate a long computation that spans multiple iterations (with each round taking a bit of time).
Left a couple comments. I think we need a 3rd compute-exit option?
A minor coding style opinion difference is no reason to block a PR! :)
I've marked this for merging as there is no blocking issue. If there are any follow-ups you'd like to see, please let me know.
Failure was some Windows packaging test: #116299.
Unfortunately,
💔 Backport failed
The backport operation could not be completed due to the following error:
You can use sqren/backport to manually backport by running
elastic#115511) A long desired balance computation could delay a newly created index shard from being assigned since first the computation has to finish for the assignments to be published and the shards getting assigned. With this change we add a new setting which allows setting a maximum time for a computation in case there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance. Closes ES-9616
#115511) (#116316) A long desired balance computation could delay a newly created index shard from being assigned since first the computation has to finish for the assignments to be published and the shards getting assigned. With this change we add a new setting which allows setting a maximum time for a computation in case there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance. Closes ES-9616 Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Relates #115511 (comment). `ThreadPool` is used here only to get time. (I've extracted this out of #116333).
An attempt to use a basic interface for time supplier based on #115511 (comment). (TLDR: sometimes we pass around a ThreadPool instance just to be able to get time. It might be more reasonable to separate those use cases)
…116333) An attempt to use a basic interface for time supplier based on elastic#115511 (comment). (TLDR: sometimes we pass around a ThreadPool instance just to be able to get time. It might be more reasonable to separate those use cases)
A long desired balance computation could delay the assignment of a newly created index's shards, because the computation has to finish before the assignments are published and the shards get assigned. This change adds a new setting that caps the computation time when there are unassigned primary shards. Note that this is similar to how a new cluster state causes early publishing of the desired balance.
Closes ES-9616
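As a rough illustration of the behaviour described above, the following is a simplified sketch of an iteration loop that stops early once a time budget is exceeded, but only when newly created shards have been assigned. It is not the actual `DesiredBalanceComputer` code: the full condition in the diff is truncated on this page, so the time check, the class `BalanceLoopSketch`, and all parameter names other than `assignedSomeNewlyCreatedShards` are assumptions.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Simplified, hypothetical model of the early-exit idea; not the real computer. */
final class BalanceLoopSketch {

    /**
     * Runs balancing rounds until they converge, the iteration cap is hit, or
     * (only if newly created shards were assigned) the time budget is exceeded.
     *
     * @param currentTimeMillis              assumed millisecond clock
     * @param maxMillisDuringIndexCreation   assumed time budget in milliseconds
     * @param assignedSomeNewlyCreatedShards whether the early exit applies at all
     * @param runOneRound                    performs one round, returns true if it changed anything
     * @param maxIterations                  safety cap for this sketch
     * @return the number of rounds executed
     */
    static int run(
        LongSupplier currentTimeMillis,
        long maxMillisDuringIndexCreation,
        boolean assignedSomeNewlyCreatedShards,
        BooleanSupplier runOneRound,
        int maxIterations
    ) {
        final long startMillis = currentTimeMillis.getAsLong();
        int iterations = 0;
        while (iterations < maxIterations) {
            iterations++;
            boolean hasChanges = runOneRound.getAsBoolean();
            if (hasChanges == false) {
                break; // converged: nothing left to move
            }
            long elapsedMillis = currentTimeMillis.getAsLong() - startMillis;
            if (assignedSomeNewlyCreatedShards && elapsedMillis >= maxMillisDuringIndexCreation) {
                break; // publish early so the new shards are not kept waiting
            }
        }
        return iterations;
    }
}
```

In this sketch the time budget is ignored when no newly created shards were assigned, so ordinary rebalancing still runs to convergence.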