IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374

bleskes · 2018-07-25T18:16:10Z

In rare cases it is possible that a nodes gets an instruction to replace a replica
shard that's in POST_RECOVERY with a new initializing primary with the same allocation id.
This can happen by batching cluster states that include the starting of the replica, with
closing of the indices, opening it up again and allocating the primary shard to the node in
question. The node should then clean it's initializing replica and replace it with a new
initializing primary.

I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep.

Closes #32308

…it. primary with the same aId In rare cases it is possible that a nodes gets an instruction to replace a replica shard that's in POST_RECOVERY with a new initializing primary with the same allocation id. This can happen by batching cluster states that include the starting of the replica, with closing of the indices, opening it up again and allocating the primary shard to the node in question. The node should then clean it's initializing replica and replace it with a new initializing primary. Closes elastic#32308

elasticmachine · 2018-07-25T18:16:12Z

Pinging @elastic/es-distributed

ywelsch

I've suggested an (imo simpler) alternative.

ywelsch · 2018-07-26T09:39:07Z

server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java

@@ -420,6 +420,12 @@ private void removeShards(final ClusterState state) {
                    // state may result in a new shard being initialized while having the same allocation id as the currently started shard.
                    logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
                    indexService.removeShard(shardId.id(), "removing shard (stale copy)");
+                } else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) {


I think we're over-special-casing this, while also possibly missing some more murky cases. May I suggest the following alternative instead, which handles both this case as well as the previous (odd) one:

diff --git a/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java b/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java index e6a86d47f55..42a88383937 100644 --- a/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java +++ b/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java @@ -219,10 +219,10 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple removeUnallocatedIndices(event); // also removes shards of removed indices - failMissingShards(state); - removeShards(state); // removes any local shards that doesn't match what the master expects + failMissingShards(state); + updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache createIndices(state); @@ -414,17 +414,12 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple logger.debug("{} removing shard (stale allocation id, stale {}, new {})", shardId, currentRoutingEntry, newShardRouting); indexService.removeShard(shardId.id(), "removing shard (stale copy)"); - } else if (newShardRouting.initializing() && currentRoutingEntry.active()) { + } else if (newShardRouting.primary() && + state.metaData().index(shardId.getIndex()).primaryTerm(shardId.id()) != shard.getPrimaryTerm()) { // this can happen if the node was isolated/gc-ed, rejoins the cluster and a new shard with the same allocation id // is assigned to it. Batch cluster state processing or if shard fetching completes before the node gets a new cluster // state may result in a new shard being initialized while having the same allocation id as the currently started shard. - logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting); - indexService.removeShard(shardId.id(), "removing shard (stale copy)"); - } else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) { - assert currentRoutingEntry.initializing() : currentRoutingEntry; // see above if clause - // this can happen when cluster state batching batches activation of the shard, closing an index, reopening it - // and assigning an initializing primary to this node - logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting); + logger.debug("{} removing shard (stale primary term)", shardId); indexService.removeShard(shardId.id(), "removing shard (stale copy)"); } } @@ -728,6 +723,11 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple */ ShardRouting routingEntry(); + /** + * Returns the current primary term of this shard + */ + long getPrimaryTerm(); + /** * Returns the latest internal shard state. */ diff --git a/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java b/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java index 5e9ba653d40..bfcca1946a6 100644 --- a/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java +++ b/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java @@ -383,6 +383,11 @@ public abstract class AbstractIndicesClusterStateServiceTestCase extends ESTestC return shardRouting; } + @Override + public long getPrimaryTerm() { + return term; + } + @Override public IndexShardState state() { return null;

After discussion, we agreed this is not quite right. We decided to go with the initial condition proposed here.

ywelsch

My previous suggestion was too good to be true. The condition you've proposed is the best I can think of right now until we distinguish between allocation ids and shard copy ids. I also think it's ok to keep the test, it's relatively short and succinctly documents an edge case.

ywelsch · 2018-07-30T06:37:34Z

server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java

@@ -420,6 +420,12 @@ private void removeShards(final ClusterState state) {
                    // state may result in a new shard being initialized while having the same allocation id as the currently started shard.
                    logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
                    indexService.removeShard(shardId.id(), "removing shard (stale copy)");
+                } else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) {


After discussion, we agreed this is not quite right. We decided to go with the initial condition proposed here.

ywelsch · 2018-07-30T06:39:39Z

.../test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java

@@ -74,6 +74,10 @@ public void injectRandomFailures() {
        enableRandomFailures = randomBoolean();
    }

+    protected void disalbeRandomFailures() {


bleskes · 2018-07-30T06:47:21Z

Thx @ywelsch . I tend towards pushing this to 6.3 too. Do you agree?

…tate_init_replcia_primary

ywelsch · 2018-07-30T06:50:28Z

yes

bleskes · 2018-07-30T13:24:57Z

thx @ywelsch

…it. primary with the same aId (#32374) In rare cases it is possible that a nodes gets an instruction to replace a replica shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id. This can happen by batching cluster states that include the starting of the replica, with closing of the indices, opening it up again and allocating the primary shard to the node in question. The node should then clean it's initializing replica and replace it with a new initializing primary. I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep. Closes #32308

* master: Tests: Fix convert error tests to use fixed value (#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (#32374) REST high-level client: parse back _ignored meta field (#32362) [CI] Mute DocumentSubsetReaderTests testSearch

…it. primary with the same aId (#32374) In rare cases it is possible that a nodes gets an instruction to replace a replica shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id. This can happen by batching cluster states that include the starting of the replica, with closing of the indices, opening it up again and allocating the primary shard to the node in question. The node should then clean it's initializing replica and replace it with a new initializing primary. I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep. Closes #32308

* elastic/ccr: (57 commits) ShardFollowNodeTask should fetch operation once (elastic#32455) Do not expose hard-deleted docs in Lucene history (elastic#32333) Tests: Fix convert error tests to use fixed value (elastic#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374) REST high-level client: parse back _ignored meta field (elastic#32362) [CI] Mute DocumentSubsetReaderTests testSearch Reject follow request if following setting not enabled on follower (elastic#32448) TEST: testDocStats should always use forceMerge (elastic#32450) TEST: avoid merge in testSegmentMemoryTrackedInBreaker TEST: Avoid deletion in FlushIT AwaitsFix IndexShardTests#testDocStats Painless: Add method type to method. (elastic#32441) Remove reference to non-existent store type (elastic#32418) [TEST] Mute failing FlushIT test Fix ordering of bootstrap checks in docs (elastic#32417) [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints Validate source of an index in LuceneChangesSnapshot (elastic#32288) [TEST] Mute failing testConvertLongHexError bump lucene version after backport Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390) ...

@throws

* 6.x: Fix scriptdocvalues tests with dates Correct minor typo in explain.asciidoc for HLRC Fix painless whitelist and warnings from backporting #31441 Build: Add elastic maven to repos used by BuildPlugin (#32549) Scripting: Conditionally use java time api in scripting (#31441) [ML] Improve error when no available field exists for rule scope (#32550) [ML] Improve error for functions with limited rule condition support (#32548) [ML] Remove multiple_bucket_spans [ML] Fix thread leak when waiting for job flush (#32196) (#32541) Painless: Clean Up PainlessField (#32525) Add @AwaitsFix for #32554 Remove broken @link in Javadoc Add AwaitsFix to failing test - see #32546 SQL: Added support for string manipulating functions with more than one parameter (#32356) [DOCS] Reloadable Secure Settings (#31713) Fix compilation error introduced by #32339 [Rollup] Remove builders from TermsGroupConfig (#32507) Use hostname instead of IP with SPNEGO test (#32514) Switch x-pack rolling restart to new style Requests (#32339) [DOCS] Small fixes in rule configuration page (#32516) Painless: Clean up PainlessMethod (#32476) SQL: Add test for handling of partial results (#32474) Docs: Add missing migration doc for logging change Build: Remove shadowing from benchmarks (#32475) Docs: Add all JDKs to CONTRIBUTING.md Logging: Make node name consistent in logger (#31588) High-level client: fix clusterAlias parsing in SearchHit (#32465) REST high-level client: parse back _ignored meta field (#32362) backport fix of reduceRandom fix (#32508) Add licensing enforcement for FIPS mode (#32437) INGEST: Clean up Java8 Stream Usage (#32059) (#32485) Improve the error message when an index is incompatible with field aliases. (#32482) Mute testFilterCacheStats Scripting: Fix painless compiler loader to know about context classes (#32385) [ML][DOCS] Fix typo applied_to => applies_to Mute SSLTrustRestrictionsTests on JDK 11 Changed ReindexRequest to use Writeable.Reader (#32401) Increase max chunk size to 256Mb for repo-azure (#32101) Mute KerberosAuthenticationIT fix no=>not typo (#32463) HLRC: Add delete watch action (#32337) Fix calculation of orientation of polygons (#27967) [Kerberos] Add missing javadocs (#32469) Fix missing JavaDoc for @throws in several places in KerberosTicketValidator. Make get all app privs requires "*" permission (#32460) Ensure KeyStoreWrapper decryption exceptions are handled (#32472) update rollover to leverage write-alias semantics (#32216) [Kerberos] Remove Kerberos bootstrap checks (#32451) Switch security to new style Requests (#32290) Switch security spi example to new style Requests (#32341) Painless: Add PainlessConstructor (#32447) Update Fuzzy Query docs to clarify default behavior re max_expansions (#30819) Remove > from Javadoc (fatal with Java 11) Tests: Fix convert error tests to use fixed value (#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (#32374) auto-interval date histogram - 6.x backport (#32107) [CI] Mute DocumentSubsetReaderTests testSearch [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints TEST: testDocStats should always use forceMerge (#32450) TEST: Avoid deletion in FlushIT AwaitsFix IndexShardTests#testDocStats Painless: Add method type to method. (#32441) Remove reference to non-existent store type (#32418) [TEST] Mute failing FlushIT test Fix ordering of bootstrap checks in docs (#32417) Wrong discovery.type for azure in breaking changes (#32432) Mute ConvertProcessorTests failing tests TESTS: Move netty leak detection to paranoid level (#32354) (#32425) Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (#32390) [Kerberos] Avoid vagrant update on precommit (#32416) TEST: Avoid triggering merges in FlushIT [DOCS] Fixes formatting of scope object in job resource Switch x-pack/plugin to new style Requests (#32327) Release requests in cors handle (#32410) Remove BouncyCastle dependency from runtime (#32402) Copy missing segment attributes in getSegmentInfo (#32396) Rest HL client: Add put license action (#32214) Docs: Correcting a typo in tophits (#32359) Build: Stop double generating buildSrc pom (#32408) Switch x-pack full restart to new style Requests (#32294) Painless: Clean Up PainlessClass Variables (#32380) [ML] Consistent pattern for strict/lenient parser names (#32399) Add Restore Snapshot High Level REST API Update update-settings.asciidoc (#31378) Introduce index store plugins (#32375) Rank-Eval: Reduce scope of an unchecked supression Make sure _forcemerge respects `max_num_segments`. (#32291)

bleskes added >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.4.0 labels Jul 25, 2018

bleskes requested a review from ywelsch July 25, 2018 18:16

ywelsch reviewed Jul 26, 2018

View reviewed changes

ywelsch approved these changes Jul 30, 2018

View reviewed changes

bleskes added 2 commits July 30, 2018 09:47

Merge remote-tracking branch 'upstream/master' into indices_cluster_s…

ccd4d80

…tate_init_replcia_primary

typo

e3f39f4

bleskes added the v6.3.3 label Jul 30, 2018

bleskes merged commit 0cae19c into elastic:master Jul 30, 2018

bleskes deleted the indices_cluster_state_init_replcia_primary branch July 30, 2018 13:24

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374

IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374

bleskes commented Jul 25, 2018

elasticmachine commented Jul 25, 2018

ywelsch left a comment

ywelsch Jul 26, 2018

ywelsch Jul 30, 2018

ywelsch left a comment

ywelsch Jul 30, 2018

ywelsch Jul 30, 2018

bleskes commented Jul 30, 2018

ywelsch commented Jul 30, 2018

bleskes commented Jul 30, 2018

IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374

IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374

Conversation

bleskes commented Jul 25, 2018

elasticmachine commented Jul 25, 2018

ywelsch left a comment

Choose a reason for hiding this comment

ywelsch Jul 26, 2018

Choose a reason for hiding this comment

ywelsch Jul 30, 2018

Choose a reason for hiding this comment

ywelsch left a comment

Choose a reason for hiding this comment

ywelsch Jul 30, 2018

Choose a reason for hiding this comment

ywelsch Jul 30, 2018

Choose a reason for hiding this comment

bleskes commented Jul 30, 2018

ywelsch commented Jul 30, 2018

bleskes commented Jul 30, 2018