Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374

Merged

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Jul 25, 2018

In rare cases it is possible that a nodes gets an instruction to replace a replica
shard that's in POST_RECOVERY with a new initializing primary with the same allocation id.
This can happen by batching cluster states that include the starting of the replica, with
closing of the indices, opening it up again and allocating the primary shard to the node in
question. The node should then clean it's initializing replica and replace it with a new
initializing primary.

I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep.

Closes #32308

…it. primary with the same aId

In rare cases it is possible that a nodes gets an instruction to replace a replica
shard that's in POST_RECOVERY with a new initializing primary with the same allocation id.
This can happen by batching cluster states that include the starting of the replica, with
closing of the indices, opening it up again and allocating the primary shard to the node in
question. The node should then clean it's initializing replica and replace it with a new
initializing primary.

Closes elastic#32308
@bleskes bleskes added >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.4.0 labels Jul 25, 2018
@bleskes bleskes requested a review from ywelsch July 25, 2018 18:16
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've suggested an (imo simpler) alternative.

@@ -420,6 +420,12 @@ private void removeShards(final ClusterState state) {
// state may result in a new shard being initialized while having the same allocation id as the currently started shard.
logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
indexService.removeShard(shardId.id(), "removing shard (stale copy)");
} else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're over-special-casing this, while also possibly missing some more murky cases. May I suggest the following alternative instead, which handles both this case as well as the previous (odd) one:

diff --git a/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java b/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
index e6a86d47f55..42a88383937 100644
--- a/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
+++ b/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
@@ -219,10 +219,10 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple
 
         removeUnallocatedIndices(event); // also removes shards of removed indices
 
-        failMissingShards(state);
-
         removeShards(state);   // removes any local shards that doesn't match what the master expects
 
+        failMissingShards(state);
+
         updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache
 
         createIndices(state);
@@ -414,17 +414,12 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple
                     logger.debug("{} removing shard (stale allocation id, stale {}, new {})", shardId,
                         currentRoutingEntry, newShardRouting);
                     indexService.removeShard(shardId.id(), "removing shard (stale copy)");
-                } else if (newShardRouting.initializing() && currentRoutingEntry.active()) {
+                } else if (newShardRouting.primary() &&
+                    state.metaData().index(shardId.getIndex()).primaryTerm(shardId.id()) != shard.getPrimaryTerm()) {
                     // this can happen if the node was isolated/gc-ed, rejoins the cluster and a new shard with the same allocation id
                     // is assigned to it. Batch cluster state processing or if shard fetching completes before the node gets a new cluster
                     // state may result in a new shard being initialized while having the same allocation id as the currently started shard.
-                    logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
-                    indexService.removeShard(shardId.id(), "removing shard (stale copy)");
-                } else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) {
-                    assert currentRoutingEntry.initializing() : currentRoutingEntry; // see above if clause
-                    // this can happen when cluster state batching batches activation of the shard, closing an index, reopening it
-                    // and assigning an initializing primary to this node
-                    logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
+                    logger.debug("{} removing shard (stale primary term)", shardId);
                     indexService.removeShard(shardId.id(), "removing shard (stale copy)");
                 }
             }
@@ -728,6 +723,11 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple
          */
         ShardRouting routingEntry();
 
+        /**
+         * Returns the current primary term of this shard
+         */
+        long getPrimaryTerm();
+
         /**
          * Returns the latest internal shard state.
          */
diff --git a/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java b/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java
index 5e9ba653d40..bfcca1946a6 100644
--- a/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java
+++ b/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java
@@ -383,6 +383,11 @@ public abstract class AbstractIndicesClusterStateServiceTestCase extends ESTestC
             return shardRouting;
         }
 
+        @Override
+        public long getPrimaryTerm() {
+            return term;
+        }
+
         @Override
         public IndexShardState state() {
             return null;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, we agreed this is not quite right. We decided to go with the initial condition proposed here.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous suggestion was too good to be true. The condition you've proposed is the best I can think of right now until we distinguish between allocation ids and shard copy ids. I also think it's ok to keep the test, it's relatively short and succinctly documents an edge case.

@@ -420,6 +420,12 @@ private void removeShards(final ClusterState state) {
// state may result in a new shard being initialized while having the same allocation id as the currently started shard.
logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
indexService.removeShard(shardId.id(), "removing shard (stale copy)");
} else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, we agreed this is not quite right. We decided to go with the initial condition proposed here.

@@ -74,6 +74,10 @@ public void injectRandomFailures() {
enableRandomFailures = randomBoolean();
}

protected void disalbeRandomFailures() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling

@bleskes
Copy link
Contributor Author

bleskes commented Jul 30, 2018

Thx @ywelsch . I tend towards pushing this to 6.3 too. Do you agree?

@ywelsch
Copy link
Contributor

ywelsch commented Jul 30, 2018

yes

@bleskes bleskes added the v6.3.3 label Jul 30, 2018
@bleskes bleskes merged commit 0cae19c into elastic:master Jul 30, 2018
@bleskes bleskes deleted the indices_cluster_state_init_replcia_primary branch July 30, 2018 13:24
@bleskes
Copy link
Contributor Author

bleskes commented Jul 30, 2018

thx @ywelsch

bleskes added a commit that referenced this pull request Jul 30, 2018
…it. primary with the same aId (#32374)

In rare cases it is possible that a nodes gets an instruction to replace a replica
shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id.
This can happen by batching cluster states that include the starting of the replica, with
closing of the indices, opening it up again and allocating the primary shard to the node in
question. The node should then clean it's initializing replica and replace it with a new
initializing primary.

I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep.

Closes #32308
bleskes added a commit that referenced this pull request Jul 30, 2018
…it. primary with the same aId (#32374)

In rare cases it is possible that a nodes gets an instruction to replace a replica
shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id.
This can happen by batching cluster states that include the starting of the replica, with
closing of the indices, opening it up again and allocating the primary shard to the node in
question. The node should then clean it's initializing replica and replace it with a new
initializing primary.

I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep.

Closes #32308
dnhatn added a commit that referenced this pull request Jul 30, 2018
* master:
  Tests: Fix convert error tests to use fixed value (#32415)
  IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (#32374)
  REST high-level client: parse back _ignored meta field (#32362)
  [CI] Mute DocumentSubsetReaderTests testSearch
bleskes added a commit that referenced this pull request Jul 31, 2018
…it. primary with the same aId (#32374)

In rare cases it is possible that a nodes gets an instruction to replace a replica
shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id.
This can happen by batching cluster states that include the starting of the replica, with
closing of the indices, opening it up again and allocating the primary shard to the node in
question. The node should then clean it's initializing replica and replace it with a new
initializing primary.

I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep.

Closes #32308
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 31, 2018
* elastic/ccr: (57 commits)
  ShardFollowNodeTask should fetch operation once (elastic#32455)
  Do not expose hard-deleted docs in Lucene history (elastic#32333)
  Tests: Fix convert error tests to use fixed value (elastic#32415)
  IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374)
  REST high-level client: parse back _ignored meta field (elastic#32362)
  [CI] Mute DocumentSubsetReaderTests testSearch
  Reject follow request if following setting not enabled on follower (elastic#32448)
  TEST: testDocStats should always use forceMerge (elastic#32450)
  TEST: avoid merge in testSegmentMemoryTrackedInBreaker
  TEST: Avoid deletion in FlushIT
  AwaitsFix IndexShardTests#testDocStats
  Painless: Add method type to method. (elastic#32441)
  Remove reference to non-existent store type (elastic#32418)
  [TEST] Mute failing FlushIT test
  Fix ordering of bootstrap checks in docs (elastic#32417)
  [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints
  Validate source of an index in LuceneChangesSnapshot (elastic#32288)
  [TEST] Mute failing testConvertLongHexError
  bump lucene version after backport
  Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390)
  ...
dnhatn added a commit that referenced this pull request Aug 2, 2018
* 6.x:
  Fix scriptdocvalues tests with dates
  Correct minor typo in explain.asciidoc for HLRC
  Fix painless whitelist and warnings from backporting #31441
  Build: Add elastic maven to repos used by BuildPlugin (#32549)
  Scripting: Conditionally use java time api in scripting (#31441)
  [ML] Improve error when no available field exists for rule scope (#32550)
  [ML] Improve error for functions with limited rule condition support (#32548)
  [ML] Remove multiple_bucket_spans
  [ML] Fix thread leak when waiting for job flush (#32196) (#32541)
  Painless: Clean Up PainlessField (#32525)
  Add @AwaitsFix for #32554
  Remove broken @link in Javadoc
  Add AwaitsFix to failing test - see #32546
  SQL: Added support for string manipulating functions with more than one parameter (#32356)
  [DOCS] Reloadable Secure Settings (#31713)
  Fix compilation error introduced by #32339
  [Rollup] Remove builders from TermsGroupConfig (#32507)
  Use hostname instead of IP with SPNEGO test (#32514)
  Switch x-pack rolling restart to new style Requests (#32339)
  [DOCS] Small fixes in rule configuration page (#32516)
  Painless: Clean up PainlessMethod (#32476)
  SQL: Add test for handling of partial results (#32474)
  Docs: Add missing migration doc for logging change
  Build: Remove shadowing from benchmarks (#32475)
  Docs: Add all JDKs to CONTRIBUTING.md
  Logging: Make node name consistent in logger (#31588)
  High-level client: fix clusterAlias parsing in SearchHit (#32465)
  REST high-level client: parse back _ignored meta field (#32362)
  backport fix of reduceRandom fix (#32508)
  Add licensing enforcement for FIPS mode (#32437)
  INGEST: Clean up Java8 Stream Usage (#32059) (#32485)
  Improve the error message when an index is incompatible with field aliases. (#32482)
  Mute testFilterCacheStats
  Scripting: Fix painless compiler loader to know about context classes (#32385)
  [ML][DOCS] Fix typo applied_to => applies_to
  Mute SSLTrustRestrictionsTests on JDK 11
  Changed ReindexRequest to use Writeable.Reader (#32401)
  Increase max chunk size to 256Mb for repo-azure (#32101)
  Mute KerberosAuthenticationIT
  fix no=>not typo (#32463)
  HLRC: Add delete watch action (#32337)
  Fix calculation of orientation of polygons (#27967)
  [Kerberos] Add missing javadocs (#32469)
  Fix missing JavaDoc for @throws in several places in KerberosTicketValidator.
  Make get all app privs requires "*" permission (#32460)
  Ensure KeyStoreWrapper decryption exceptions are handled (#32472)
  update rollover to leverage write-alias semantics (#32216)
  [Kerberos] Remove Kerberos bootstrap checks (#32451)
  Switch security to new style Requests (#32290)
  Switch security spi example to new style Requests (#32341)
  Painless: Add PainlessConstructor (#32447)
  Update Fuzzy Query docs to clarify default behavior re max_expansions (#30819)
  Remove > from Javadoc (fatal with Java 11)
  Tests: Fix convert error tests to use fixed value (#32415)
  IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (#32374)
  auto-interval date histogram - 6.x backport (#32107)
  [CI] Mute DocumentSubsetReaderTests testSearch
  [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints
  TEST: testDocStats should always use forceMerge (#32450)
  TEST: Avoid deletion in FlushIT
  AwaitsFix IndexShardTests#testDocStats
  Painless: Add method type to method. (#32441)
  Remove reference to non-existent store type (#32418)
  [TEST] Mute failing FlushIT test
  Fix ordering of bootstrap checks in docs (#32417)
  Wrong discovery.type for azure in breaking changes (#32432)
  Mute ConvertProcessorTests failing tests
  TESTS: Move netty leak detection to paranoid level (#32354) (#32425)
  Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (#32390)
  [Kerberos] Avoid vagrant update on precommit (#32416)
  TEST: Avoid triggering merges in FlushIT
  [DOCS] Fixes formatting of scope object in job resource
  Switch x-pack/plugin to new style Requests (#32327)
  Release requests in cors handle (#32410)
  Remove BouncyCastle dependency from runtime (#32402)
  Copy missing segment attributes in getSegmentInfo (#32396)
  Rest HL client: Add put license action (#32214)
  Docs: Correcting a typo in tophits (#32359)
  Build: Stop double generating buildSrc pom (#32408)
  Switch x-pack full restart to new style Requests (#32294)
  Painless: Clean Up PainlessClass Variables (#32380)
  [ML] Consistent pattern for strict/lenient parser names (#32399)
  Add Restore Snapshot High Level REST API
  Update update-settings.asciidoc (#31378)
  Introduce index store plugins (#32375)
  Rank-Eval: Reduce scope of an unchecked supression
  Make sure _forcemerge respects `max_num_segments`. (#32291)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v6.3.3 v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ci] IndicesClusterStateServiceRandomUpdatesTests.testRandomClusterStateUpdates
4 participants