Cancel recoveries even if all shards assigned #46520

howardhuanghua · 2019-09-10T07:48:53Z

Issue

Sometimes we need to perform rolling restart for some static configurations to take effect, or rolling restart to upgrade whole cluster. We have tested one of the cluster that has 6 nodes, total 10TB data, 6000+ shards with 1 replica. Before restarting, we have done sync flush, each node cost 10+ mins to get cluster GREEN, all nodes rolling restart cost more than 1 hour. And for 100+ nodes, need more than one day to upgrade.

After sorting related logic, we found an issue, take 3 nodes A, B, C as an example:
Test version: 5.6.4, 6.4.3, 7.3.1.
Related settings:
"index.unassigned.node_left.delayed_timeout": 3m
"cluster.routing.allocation.node_concurrent_recoveries": 30
"indices.recovery.max_bytes_per_sec": 40mb

One node (A) restart flow:

Restart node A. All shards on node A become unassigned.
Before node A gets back, also before delay allocation timeout (3m), no unassigned shards gets relocated.
After node A gets started, all unassigned shards start to recovery from node A, and node A gets throttled (30) soon.
Then some unassigned shards start to be relocated to other nodes (B, C), as they are not throttled. Since peer recovery from remote node would copy segment files and translogs, shard around 30GB would cost 10+ mins (40mb/sec by default).

Solution

With this PR optimization, one node restarting time in above case could reduce from 10+ mins to around 1 min. The main logic:
After restarted node gets back, if it gets throttled, do not relocate unassigned shards to other nodes before delay allocation timeout. Then in most case would not cause segment files copy from remote node.

elasticcla · 2019-09-10T07:48:58Z

Hi @howardhuanghua, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

elasticmachine · 2019-09-10T09:02:15Z

Pinging @elastic/es-distributed

DaveCTurner · 2019-09-10T09:23:11Z

Thanks for the suggestion @howardhuanghua. I think we need to understand the situation you're describing a little more clearly. If your indices are successfully synced-flushed then it shouldn't matter if some of them start to recover onto other nodes because those recoveries should quickly be cancelled by recoveries onto the restarted node. Are you saying this is not the case? Can you share logs from such a restart with logger.org.elasticsearch.gateway.GatewayAllocator: TRACE so we can see why the recoveries are not being cancelled?

Also can you supply tests that support your change?

DaveCTurner · 2019-09-10T14:24:51Z

I think I see an issue:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

Lines 399 to 403 in 67e5ad2

    
           // now allocate all the unassigned to available nodes 
        
           if (allocation.routingNodes().unassigned().size() > 0) { 
        
               removeDelayMarkers(allocation); 
        
               gatewayAllocator.allocateUnassigned(allocation); 
        
           }

We only consider cancelling ongoing recoveries if there are unassigned shards. I think this might explain why Elasticsearch spends time completing a recovery despite the synced-flushed shard on the restarted node. Is it the case that most of the recoveries are cancelled, but the last few run to completion?

howardhuanghua · 2019-09-10T15:13:12Z

@DaveCTurner Yes, only few shards run to completion.
I opened logger.org.elasticsearch.gateway.GatewayAllocator: TRACE in our test cluster. And cannot see any useful log. But after the restarted node gets back, before delay timeout, I could see following INITIALIZING shards, 9.28.82.208 is the restarted node, 9.28.82.74 is the node that should not allocate shard.

lb_backend_server-300@1568044800000_7 2 r INITIALIZING 9.28.82.74 1527044744023702309
cvm_device-10@1568044800000_1 2 r INITIALIZING 9.28.82.208 1527044744023702209
vbcgw_broute_tunnel-60@1567872000000_3 1 r INITIALIZING 9.28.82.208 1527044744023702209
disk_iostat-10@1568044800000_1 2 r INITIALIZING 9.28.82.208 1527044744023702209

DaveCTurner · 2019-09-10T15:34:35Z

@DaveCTurner Yes, only few shards run to completion.

Thanks for confirming. In this case, I think it'd be better to contemplate cancelling recoveries even if there are no unallocated shards, rather than relying on the delayed allocation timeout as you propose.

howardhuanghua · 2019-09-11T01:48:18Z

@DaveCTurner Thanks. I will continue to check the cancelling recoveries.

howardhuanghua · 2019-09-12T16:15:14Z

Hi @DaveCTurner, we have sorted the slow recovery issue and provide new propose.

Why some shards allocated to other nodes?

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

Lines 399 to 403 in 67e5ad2

    
           // now allocate all the unassigned to available nodes 
        
           if (allocation.routingNodes().unassigned().size() > 0) { 
        
               removeDelayMarkers(allocation); 
        
               gatewayAllocator.allocateUnassigned(allocation); 
        
           }

Above logic tries to handle unassigned shards where valid copies of the shard already exist.
But following shard checking logic only considers syncId or sizeMatched between primary and replica.

elasticsearch/server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

Lines 369 to 380 in 67e5ad2

    
           if (replicaSyncId != null && replicaSyncId.equals(primarySyncId)) { 
        
               return Long.MAX_VALUE; 
        
           } else { 
        
               long sizeMatched = 0; 
        
               for (StoreFileMetaData storeFileMetaData : storeFilesMetaData) { 
        
                   String metaDataFileName = storeFileMetaData.name(); 
        
                   if (primaryStore.fileExists(metaDataFileName) && primaryStore.file(metaDataFileName).isSame(storeFileMetaData)) { 
        
                       sizeMatched += storeFileMetaData.length(); 
        
                   } 
        
               } 
        
               return sizeMatched; 
        
           }

This would have follow issues:

syncId could be changed at any time in continuous writing case.
Due to segment merge, primaryStore.fileExists(metaDataFileName) would be false and sizeMatched would be 0. This would cause makeAllocationDecision function returns NOT_TAKEN due to matchingNodes.hasAnyData() is true.

Above issues would cause unassigned shard goes to next allocation step:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

Line 405 in 67e5ad2

shardsAllocator.allocate(allocation);

Before allocation delay timeout, these unassigned shards would be allocated to other nodes.

Why allocating cannot be cancelled?
SyncId would not equal between primary and replica as we have already checked before:

elasticsearch/server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

Line 369 in 67e5ad2

if (replicaSyncId != null && replicaSyncId.equals(primarySyncId)) {

This would cause cancelling logic cannot take effect:

elasticsearch/server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

Lines 110 to 112 in 67e5ad2

    
           if (currentNode.equals(nodeWithHighestMatch) == false 
        
                   && Objects.equals(currentSyncId, primaryStore.syncId()) == false 
        
                   && matchingNodes.isNodeMatchBySyncID(nodeWithHighestMatch)) {

Allocating to new node cannot be cancelled cause restarting takes long time.

Our propose:
SeqNo has been introduced since 6.x to speed up peer recovery, so we try to stop allocating unassigned shards to other nodes before allocation delay timeout, this would try to use seqNo during recovery process. This is the above PR propose.

We have another propose to solve the issue, for the key point:

elasticsearch/server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

Lines 369 to 380 in 67e5ad2

    
           if (replicaSyncId != null && replicaSyncId.equals(primarySyncId)) { 
        
               return Long.MAX_VALUE; 
        
           } else { 
        
               long sizeMatched = 0; 
        
               for (StoreFileMetaData storeFileMetaData : storeFilesMetaData) { 
        
                   String metaDataFileName = storeFileMetaData.name(); 
        
                   if (primaryStore.fileExists(metaDataFileName) && primaryStore.file(metaDataFileName).isSame(storeFileMetaData)) { 
        
                       sizeMatched += storeFileMetaData.length(); 
        
                   } 
        
               } 
        
               return sizeMatched; 
        
           }

There are several levels to check the unassigned shards should be relocated or not:

SyncId, already has.
SeqNo. We could introduce minSeqNo of primary shard which represents minimum checkpoint of primary translogs, currently we could get maxSeqNo from MetadataSnapshot.commitUserData directly but no minSeqNo. Based on minSeqNo, we could compare with current replica local checkpoint to see if we could recover from translogs even syncId is not equal.
minSeqNo implementation:
1). Commit minSeqNo into segment_N file. Load minSeqNo from the file and add it to MetadataSnapshot.commitUserData in fetchData process. This would have issue that translog would be async-cleaned and minSeqNo would not be the latest value.
2). Get minSeqNo from IndexShard(real-time update) and put it into MetadataSnapshot.commitUserData in fetchData process.
3). Instead of checking minSeqNo, we could simplely pick up the restarted node directly in findMatchingNodes method.
Segment matching size compare, already has.

Please help to evaluate, if it's ok, we could provide patch. Thanks.

DaveCTurner · 2019-09-12T16:25:13Z

You are right that we could use sequence numbers to make a better allocation decision in the case that there is no sync id too, but we are already working on this in #46318.

howardhuanghua · 2019-09-16T03:40:29Z

Hi @DaveCTurner, we have checked sequence-number-based replica allocation. It could handle the first phase of rerouting unassigned shards. How do you think we still need to avoid allocating shards to other nodes before delay node left timeout in case of phase 1 has any issues? As the PR described in phase 2.

DaveCTurner · 2019-09-16T06:02:41Z

I'm sorry I don't really understand the question. What is phase 2?

howardhuanghua · 2019-09-16T08:12:08Z

Hi @DaveCTurner, in reroute method, the first phase tries to allocate unassigned shards to a node that already has a data copy:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

Lines 399 to 403 in 67e5ad2

    
           // now allocate all the unassigned to available nodes 
        
           if (allocation.routingNodes().unassigned().size() > 0) { 
        
               removeDelayMarkers(allocation); 
        
               gatewayAllocator.allocateUnassigned(allocation); 
        
           }

Sequence-number-based replica allocation should be in phase 1 to check existing data copy, please correct me if something wrong.

And phase 2 tries to allocate unassigned shards to a node as matched as possible, including new node (no data copy):

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

Line 405 in 67e5ad2

shardsAllocator.allocate(allocation);

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

Line 122 in 67e5ad2

balancer.allocateUnassigned();

So I provided the above PR to avoid allocating shards to other nodes before delay node left timeout in phase 2.

DaveCTurner · 2019-09-16T09:11:43Z

Sorry @howardhuanghua I am perhaps misunderstanding the issue you are trying to fix with this PR. Can you add a test case that this change fixes? I think that would make things clearer. You might find it helpful to look at org.elasticsearch.cluster.routing.DelayedAllocationIT since this suite tests related properties.

DaveCTurner

Thanks @howardhuanghua, I think this test is not realistic because of the skipAllocation flag you've added to the DelayedShardsMockGatewayAllocator. I have left a more detailed comment inline.

DaveCTurner · 2019-09-18T15:58:21Z

test/framework/src/main/java/org/elasticsearch/cluster/ESAllocationTestCase.java

@@ -268,6 +271,9 @@ public void applyFailedShards(RoutingAllocation allocation, List<FailedShard> fa

        @Override
        public void allocateUnassigned(RoutingAllocation allocation) {
+            if (this.skipAllocation) {
+                return;


I don't understand this addition. As far as I can tell this makes this mock allocator behave quite differently from the real GatewayAllocator doesn't it? Your test only fails because of this difference in behaviour: if I comment this line out then your test passes without any changes to the production code. Can you provide a test case using the real allocator? I suggest adding to DelayedAllocationIT rather than here to ensure the test matches the production code more closely.

I just want to simulate the case that if unassigned shard cannot be allocated in GatewayAllocator, then it also needs to be delayed in ShardsAllocator. Suppose we skip GatewayAllocator allocation, without this PR, if I remove the node that has replica shard, unassigned shard will be allocated immediately to the other node, with this PR, it will still be delayed until delayed_timeout.

Considering the sequence based allocation decision could select correct node for allocation, so this PR will have side effects on phase 2 allocation that will delay shard allocation until delayed_timeout. For more infomation please reference #46520 (comment).

howardhuanghua · 2019-09-19T15:14:05Z

Sorry @DaveCTurner, I didn't add comments in time after commiting the test case, let me try to explain our idea clearly.

Let’s see we have a cluster with 3 nodes(A/B/C) and some indices, one of the indices named test has 1 primary shard and 1 replica shard. The test index shard layout is node A(shard p0), node B(shard r0), and node C. Set "index.unassigned.node_left.delayed_timeout" to 5 mins.

The following steps create the scenario:

Write some data to test index.
Do a sync flush.
Take node B down.
Do a force merge to merge multiple segments of p0 to 1.
This make sure p0 has different sync-id with r0, also has different segment files.
And now r0 allocation will be delayed before delayed_timeout:

elasticsearch/server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

Line 232 in 67e5ad2

return AllocateUnassignedDecision.delayed(remainingDelayMillis, totalDelayMillis, nodeDecisions);
Startup node B.
It's still before delayed_timeout, since both sync-id and segment files are different between p0 and r0, then makeAllocationDecision method would return NOT_TAKEN:

elasticsearch/server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

Line 235 in 67e5ad2

return AllocateUnassignedDecision.NOT_TAKEN;

This cause gatewayAllocator could not handle unassigned r0 (we called phase 1 before):

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

Line 402 in 67e5ad2

gatewayAllocator.allocateUnassigned(allocation);

And next step, unassigned r0 will be handled in shardsAllocator (we called phase 2 before):

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

Line 122 in 67e5ad2

balancer.allocateUnassigned();

In this step, if node B gets throttled by other shards, r0 could be allocated to node C and cannot be cancelled, all the segment files need to be copied from p0.

Our above PR is going to stop allocating r0 to node C before allocation delay timeout in phase 2.

However, if allocation decision is based on sequence number (#46318), r0 would be allocated to node B in above phase 1 if p0 on node A still has complete history operations. If r0 cannot be allocated in phase 1, that means it must have no reusable data copy or complete history operations, we should relocate it to a new node immediately. In this case, our above PR will still wait until delayed_timeout, this may not be appropriate.

So all in all, we plan to only fix the ongoing recoveries cannot be cancelled issue as you mentioned in #46520 (comment). If you think it's OK then we will provide the patch for this issue only. Please provide some advice if you have, thanks a lot.

DaveCTurner · 2019-09-19T16:32:03Z

Thanks @howardhuanghua for your patient explanations. I think I now understand the issue more clearly, and I am more certain that it will be fixed by the seqno-based replica shard allocator that we're working on.

This make sure p0 has different sync-id with r0, also has different segment files.

This is crucial to the issue you are seeing: the primary and replica must have absolutely nothing in common to hit this issue. If they share even a single segment (or a sync id) then we will hit the following code instead, and this will correctly throttle the allocation in the ReplicaShardAllocator rather than returning NOT_TAKEN:

elasticsearch/server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

Lines 202 to 216 in 49767fc

    
           } else if (matchingNodes.getNodeWithHighestMatch() != null) { 
        
               RoutingNode nodeWithHighestMatch = allocation.routingNodes().node(matchingNodes.getNodeWithHighestMatch().getId()); 
        
               // we only check on THROTTLE since we checked before before on NO 
        
               Decision decision = allocation.deciders().canAllocate(unassignedShard, nodeWithHighestMatch, allocation); 
        
               if (decision.type() == Decision.Type.THROTTLE) { 
        
                   logger.debug("[{}][{}]: throttling allocation [{}] to [{}] in order to reuse its unallocated persistent store", 
        
                       unassignedShard.index(), unassignedShard.id(), unassignedShard, nodeWithHighestMatch.node()); 
        
                   // we are throttling this, as we have enough other shards to allocate to this node, so ignore it for now 
        
                   return AllocateUnassignedDecision.throttle(nodeDecisions); 
        
               } else { 
        
                   logger.debug("[{}][{}]: allocating [{}] to [{}] in order to reuse its unallocated persistent store", 
        
                       unassignedShard.index(), unassignedShard.id(), unassignedShard, nodeWithHighestMatch.node()); 
        
                   // we found a match 
        
                   return AllocateUnassignedDecision.yes(nodeWithHighestMatch.node(), null, nodeDecisions, true); 
        
               }

Once the ReplicaShardAllocator takes sequence numbers into account we will execute this code if the primary and replica have any operations in common, which they always do, so the shard will always be allocated back to the returning node.

So all in all, we plan to only fix the ongoing recoveries cannot be cancelled issue as you mentioned in #46520 (comment). If you think it's OK then we will provide the patch for this issue only. Please provide some advice if you have, thanks a lot.

Yes, we'd very much appreciate a fix for that :)

howardhuanghua · 2019-09-20T00:45:13Z

Hi @DaveCTurner, thanks for the response. I am glad that we are now on the same page :). I will provide the fix soon.

howardhuanghua · 2019-09-20T16:01:33Z

Hi @DaveCTurner, I have updated the commit, added hasInactiveShards checking in both AllocationService.reroute and GatewayAllocator.innerAllocatedUnassigned. This would solve the case that only has initializing shard, and they are peer recovering to the new node rather than the existing data copy node. Please help to check, thank you.

DaveCTurner

Thanks @howardhuanghua, can you also supply some tests for this change? We need at least an ESIntegTestCase showing that it does cancel the last batch of recoveries. I would guess you could add this to org.elasticsearch.indices.recovery.IndexRecoveryIT.

DaveCTurner · 2019-09-23T14:10:51Z

server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

-        // now allocate all the unassigned to available nodes
-        if (allocation.routingNodes().unassigned().size() > 0) {
+        // now allocate all the unassigned to available nodes or cancel existing recoveries if we have a better match
+        if (allocation.routingNodes().unassigned().size() > 0 || allocation.routingNodes().hasInactiveShards()) {


I wonder: why not remove this condition entirely?

howardhuanghua · 2019-09-24T13:06:18Z

Thanks @DaveCTurner, I have updated the commit.

Removed unassigned shard or initialization shard condition.
Added testCancelNewShardRecoveryAndUsesExistingShardCopy in org.elasticsearch.indices.recovery.IndexRecoveryIT. This IT simulates 3 nodes cluster, one of the data nodes down and back, the new shard recovery should be canceled and uses existing shard copy to recovery.

howardhuanghua · 2019-09-26T00:55:01Z

Hi @DaveCTurner, would you please help to check the updated commit again? Thank you.

DaveCTurner

Thanks @howardhuanghua. The test you provided failed when I ran it locally (see below for details). It's normally a good idea to run tests like this repeatedly since they are not fully deterministic and might not fail every time. That said, it looks like it's doing roughly the right things and I left some ideas for small improvements.

DaveCTurner · 2019-09-30T09:22:05Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+        List<RecoveryState> nodeARecoveryStates = findRecoveriesForTargetNode(nodeA, recoveryStates);
+        assertThat(nodeARecoveryStates.size(), equalTo(1));
+        List<RecoveryState> nodeCRecoveryStates = findRecoveriesForTargetNode(nodeC, recoveryStates);
+        assertThat(nodeCRecoveryStates.size(), equalTo(1));


When I ran this test it failed here:

2> REPRODUCE WITH: ./gradlew ':server:integTest' --tests "org.elasticsearch.indices.recovery.IndexRecoveryIT.testCancelNewShardRecoveryAndUsesExistingShardCopy {seed=[ECDF910E1F356F6D:FFC9E32BAD24745B]}" -Dtests.seed=ECDF910E1F356F6D -Dtests.security.manager=true -Dtests.jvms=4 -Dtests.locale=it -Dtests.timezone=America/Rio_Branco -Dcompiler.java=12 -Druntime.java=12 2> java.lang.AssertionError: Expected: <1> but: was <0> at __randomizedtesting.SeedInfo.seed([ECDF910E1F356F6D:FFC9E32BAD24745B]:0) at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at org.junit.Assert.assertThat(Assert.java:956) at org.junit.Assert.assertThat(Assert.java:923)

DaveCTurner · 2019-09-30T09:27:59Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+        assertBusy(() -> assertThat(client().admin().indices().prepareSyncedFlush(INDEX_NAME).get().failedShards(), equalTo(0)));
+
+        logger.info("--> slowing down recoveries");
+        slowDownRecovery(shardSize);


slowDownRecovery is for testing the throttling behaviour and is not sufficient here as there is still a chance that the recovery finishes before it is cancelled and this will cause the test to fail. I think we must completely halt the recovery until it has been cancelled. I would do this by either capture the START_RECOVERY action (see testRecoverLocallyUpToGlobalCheckpoint for instance) or one of the subsidiary requests (e.g. CLEAN_FILES as done in testOngoingRecoveryAndMasterFailOver).

DaveCTurner · 2019-09-30T09:29:01Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+        assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());
+
+        // do sync flush to gen sync id
+        assertBusy(() -> assertThat(client().admin().indices().prepareSyncedFlush(INDEX_NAME).get().failedShards(), equalTo(0)));


Is an assertBusy necessary here? I think a failure of a synced flush is unexpected and should result in a test failure.

DaveCTurner · 2019-09-30T09:31:35Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+        slowDownRecovery(shardSize);
+
+        logger.info("--> stop node B");
+        internalCluster().stopRandomNode(InternalTestCluster.nameFilter(nodeB));


It is better to use internalCluster().restartNode() which takes a RestartCallback whose onNodeStopped method is a good place to do things to the cluster while the node is stopped.

DaveCTurner · 2019-09-30T09:33:59Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+
+        logger.info("--> request recoveries");
+        // peer recovery from nodeA to nodeC should be canceled, replica should be allocated to nodeB that has the data copy
+        assertBusy(() -> {


Could we ensureGreen here instead of this assertBusy? I think all of the assertions in here should hold for sure once the cluster is green again.

DaveCTurner · 2019-09-30T09:38:34Z

Also could you merge the latest master, because there are now some conflicts that need resolving.

howardhuanghua · 2019-10-01T07:15:39Z

Thanks @DaveCTurner, I have updated the test case based on your suggestion. During restarting replica shard node, hold peer recovery from primary shard to new node, and check peer recovery source/target on replica shard node stopped, finally make sure cluster green before releasing the held peer recovery. Please help to review again. Thanks for your help!

DaveCTurner

Thanks @howardhuanghua I left a few more comments but this is looking very good.

DaveCTurner · 2019-10-01T07:26:15Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+            new InternalTestCluster.RestartCallback() {
+                @Override
+                public Settings onNodeStopped(String nodeName) throws Exception {
+                    assertBusy(() -> {


😁 I was just about to note the missing wait here.

I think it'd be neater to wait for node A to send its CLEAN_FILES action instead of using an assertBusy. You can use another CountDownLatch for this.

DaveCTurner · 2019-10-01T07:27:01Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+                        RecoveryResponse response = client().admin().indices().prepareRecoveries(INDEX_NAME).execute().actionGet();
+
+                        List<RecoveryState> recoveryStates = response.shardRecoveryStates().get(INDEX_NAME);
+                        List<RecoveryState> nodeARecoveryStates = findRecoveriesForTargetNode(nodeA, recoveryStates);


I think we do not need to say anything about the recoveries on node A. These assertions are true, but not particularly important for this test.

DaveCTurner · 2019-10-01T07:37:22Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+                }
+            });
+
+        // wait for peer recovering from nodeA to nodeB to be finished


It took me some time to work out why this works - I suggest this comment explaining it:

Suggested change

// wait for peer recovering from nodeA to nodeB to be finished

// wait for peer recovery from nodeA to nodeB which is a no-op recovery so it skips the CLEAN_FILES stage and hence is not blocked

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

DaveCTurner · 2019-10-01T07:38:57Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+        final String nodeA = internalCluster().startNode();
+
+        logger.info("--> create index on node: {}", nodeA);
+        ByteSizeValue shardSize = createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT)


shardSize is unused:

Suggested change

ByteSizeValue shardSize = createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT)

createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT).getShards()[0].getStats().getStore().size();

DaveCTurner · 2019-10-01T07:39:23Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+        logger.info("--> start node B");
+        // force a shard recovery from nodeA to nodeB
+        final String nodeB = internalCluster().startNode();
+        Settings nodeBDataPathSettings = internalCluster().dataPathSettings(nodeB);


nodeBDataPathSettings is unused:

Suggested change

Settings nodeBDataPathSettings = internalCluster().dataPathSettings(nodeB);

DaveCTurner · 2019-10-01T07:41:46Z

server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java

+
+        logger.info("--> start node C");
+        final String nodeC = internalCluster().startNode();
+        assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());


I'd normally recommend the shorthand

Suggested change

assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());

ensureStableCluster(3);

but I don't think this is necessary:

startNode() calls validateClusterFormed()

anyway it doesn't matter if node C takes a bit longer to join the cluster because we have to wait for its recovery to start which only happens after it's joined.

Therefore I think we can drop this:

Suggested change

assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());

howardhuanghua · 2019-10-01T08:42:45Z

Hi @DaveCTurner, appreciate your patient help! I have updated the test case, please help to check again.

DaveCTurner · 2019-10-01T08:54:54Z

@elasticmachine test this please

DaveCTurner

LGTM thanks @howardhuanghua.

We cancel ongoing peer recoveries if a node joins the cluster with a completely up-to-date copy of a shard, because we can use such a copy to recover a replica instantly. However, today we only look for recoveries to cancel while there are unassigned shards in the cluster. This means that we do not contemplate the cancellation of the last few recoveries since recovering shards are not unassigned. It might take much longer for these recoveries to complete than would be necessary if they were cancelled. This commit fixes this by checking for cancellable recoveries even if all shards are assigned.

howardhuanghua marked this pull request as ready for review September 10, 2019 08:31

DaveCTurner added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 10, 2019

DaveCTurner requested changes Sep 18, 2019

View reviewed changes

DaveCTurner requested changes Sep 23, 2019

View reviewed changes

DaveCTurner mentioned this pull request Sep 23, 2019

Sequence number based replica allocation #46959

Merged

DaveCTurner requested changes Sep 30, 2019

View reviewed changes

howardhuanghua added 4 commits October 1, 2019 13:35

optimize rolling restart

89978bc

add test case for delay allocation with no shard data copy

cac38ad

fix initializing shard cancel logic issue

e436afb

add cancel new shard recovery and uses existing shard copy IT

51d1bb3

howardhuanghua and others added 5 commits October 1, 2019 13:35

remove extra space

be2546b

fix long line issue

28d2e7c

optimize rolling restart

c3e1387

add test case for delay allocation with no shard data copy

bf5248f

fix initializing shard cancel logic issue

dc25223

howardhuanghua force-pushed the optimize_rolling_restart branch from 703895a to dc25223 Compare October 1, 2019 05:48

howardhuanghua added 3 commits October 1, 2019 14:12

update test case

45da6a3

fix compile issue

3d90508

update test case to wait for recovering

0cff47a

DaveCTurner requested changes Oct 1, 2019

View reviewed changes

enhance test case

f83dcb5

DaveCTurner approved these changes Oct 1, 2019

View reviewed changes

DaveCTurner added v7.5.0 v8.0.0 >bug labels Oct 1, 2019

DaveCTurner changed the title ~~Optimize rolling restart efficiency.~~ Cancel recoveries even if all shards assigned Oct 1, 2019

DaveCTurner merged commit af930a7 into elastic:master Oct 1, 2019

DaveCTurner added a commit that referenced this pull request Oct 1, 2019

Fix trailing whitespace introduced in #46520

cc520ba

dliappis mentioned this pull request Aug 3, 2020

Only check inactive replicas in cancelling existing recoveries. #60564

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

	// wait for peer recovering from nodeA to nodeB to be finished
	// wait for peer recovery from nodeA to nodeB which is a no-op recovery so it skips the CLEAN_FILES stage and hence is not blocked

	ByteSizeValue shardSize = createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT)
	createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT).getShards()[0].getStats().getStore().size();

	assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());
	ensureStableCluster(3);

Cancel recoveries even if all shards assigned #46520

Cancel recoveries even if all shards assigned #46520

Conversation

howardhuanghua commented Sep 10, 2019 • edited Loading

Issue

Solution

elasticcla commented Sep 10, 2019

elasticmachine commented Sep 10, 2019

DaveCTurner commented Sep 10, 2019

DaveCTurner commented Sep 10, 2019

howardhuanghua commented Sep 10, 2019 • edited Loading

DaveCTurner commented Sep 10, 2019

howardhuanghua commented Sep 11, 2019

howardhuanghua commented Sep 12, 2019

DaveCTurner commented Sep 12, 2019

howardhuanghua commented Sep 16, 2019

DaveCTurner commented Sep 16, 2019

howardhuanghua commented Sep 16, 2019 • edited Loading

DaveCTurner commented Sep 16, 2019

DaveCTurner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

howardhuanghua commented Sep 19, 2019

DaveCTurner commented Sep 19, 2019

howardhuanghua commented Sep 20, 2019

howardhuanghua commented Sep 20, 2019

DaveCTurner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

howardhuanghua commented Sep 24, 2019 • edited Loading

howardhuanghua commented Sep 26, 2019

DaveCTurner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DaveCTurner commented Sep 30, 2019

howardhuanghua commented Oct 1, 2019

DaveCTurner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

howardhuanghua commented Oct 1, 2019

DaveCTurner commented Oct 1, 2019

DaveCTurner left a comment

Choose a reason for hiding this comment

howardhuanghua commented Sep 10, 2019 •

edited

Loading

howardhuanghua commented Sep 10, 2019 •

edited

Loading

howardhuanghua commented Sep 16, 2019 •

edited

Loading

howardhuanghua commented Sep 24, 2019 •

edited

Loading