
A replica can be promoted and started in one cluster state update #32042

Merged
merged 6 commits into master from shard_recovery_promotion
Jul 18, 2018

Conversation

bleskes
Contributor

@bleskes bleskes commented Jul 13, 2018

When a replica is fully recovered (i.e., in POST_RECOVERY state) we send a request to the master to start the shard. The master changes the state of the replica and publishes a cluster state to that effect. In certain cases, that cluster state can be processed on the node hosting the replica together with a cluster state that promotes that (now started) replica to a primary. This can happen due to batched cluster state processing, or if the master died after having committed the cluster state that starts the shard but before publishing it to the node with the replica. If the master also held the primary shard, the new master node will remove the primary (as it failed) and will also immediately promote the replica (thinking it is started).

Sadly our code in IndexShard didn't allow for this, which caused assertions to be tripped in some of our test runs.
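To make the race concrete, here is a minimal, self-contained sketch (not the actual IndexShard code; all class and method names below are simplified stand-ins) of why a routing update must be prepared to handle the start and the promotion together:

```java
public class PromotionSketch {

    enum IndexShardState { POST_RECOVERY, STARTED }

    static final class ShardRouting {
        final boolean primary;
        final boolean started;
        ShardRouting(boolean primary, boolean started) {
            this.primary = primary;
            this.started = started;
        }
    }

    static final class ReplicationTracker {
        private boolean primaryMode;
        boolean isPrimaryMode() { return primaryMode; }
        void activatePrimaryMode(long localCheckpoint) { primaryMode = true; }
    }

    private IndexShardState state = IndexShardState.POST_RECOVERY;
    private ShardRouting currentRouting = new ShardRouting(false, false);
    private final ReplicationTracker replicationTracker = new ReplicationTracker();

    void updateRoutingEntry(ShardRouting newRouting) {
        // the "start" transition: the replica moves from POST_RECOVERY to STARTED ...
        if (newRouting.started && state == IndexShardState.POST_RECOVERY) {
            state = IndexShardState.STARTED;
        }
        // ... and the "promote" transition may arrive in the very same cluster
        // state update, so primary mode must be activated here as well
        if (newRouting.primary && currentRouting.primary == false) {
            replicationTracker.activatePrimaryMode(0L /* local checkpoint placeholder */);
        }
        currentRouting = newRouting;
    }

    public static void main(String[] args) {
        PromotionSketch shard = new PromotionSketch();
        // a single cluster state update both starts the replica and promotes it
        shard.updateRoutingEntry(new ShardRouting(true, true));
        System.out.println("state=" + shard.state + ", primaryMode="
            + shard.replicationTracker.isPrimaryMode());
    }
}
```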

@bleskes bleskes added >bug :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.4.0 labels Jul 13, 2018
@bleskes bleskes requested a review from ywelsch July 13, 2018 15:04
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor Author

bleskes commented Jul 13, 2018

@ywelsch I'm on the fence as to whether this should go into 6.3. Opinions?

Contributor

@ywelsch ywelsch left a comment

I've left some smaller comments, but the main change LGTM. With the improved testing, I think this can go into 6.3.

if (newRouting.primary() && currentRouting.isRelocationTarget() == false) {
    replicationTracker.activatePrimaryMode(getLocalCheckpoint());
}
assert currentRouting.isRelocationTarget() == false || currentRouting.primary() == false ||
    replicationTracker.isPrimaryMode();
Contributor

this assertion confuses me: what is it meant to express?

Contributor Author

Agreed, it's confusing. I tried to clarify by adding an assertion message. If it helps, I can flip the boolean around:

(currentRouting.isRelocationTarget() && currentRouting.primary() && replicationTracker.isPrimaryMode() == false) == false 

but I think that might be just as confusing, if not more so. Let me know.
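For the record, the two forms are logically equivalent by De Morgan's laws. A quick standalone check; the third disjunct of the original assertion (cut off in the excerpt above) is reconstructed from the flipped form:

```java
// Exhaustively verify that the original disjunctive form and the flipped
// conjunctive form agree for all eight input combinations.
public class AssertionFormsEquivalent {
    public static void main(String[] args) {
        boolean[] values = { false, true };
        for (boolean isRelocationTarget : values) {
            for (boolean isPrimary : values) {
                for (boolean isPrimaryMode : values) {
                    boolean original = isRelocationTarget == false || isPrimary == false || isPrimaryMode;
                    boolean flipped = (isRelocationTarget && isPrimary && isPrimaryMode == false) == false;
                    if (original != flipped) {
                        throw new AssertionError("the two forms differ");
                    }
                }
            }
        }
        System.out.println("both forms are equivalent for all 8 input combinations");
    }
}
```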

Contributor

let's keep it as is; with the assertion message I think it's OK. I wonder if we should have an assertion at the end of this method to say something like "if we have an active primary shard that's not relocating, then the replication tracker is in primary mode".

Contributor Author

sounds good. will add.

Contributor Author

added in 3c79d57
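For reference, a minimal sketch of what such an end-of-method invariant could look like; the exact form is the one that landed in 3c79d57, and the accessor names below are assumptions, not the actual code:

```java
// Sketch only (accessor names are assumptions): "an active primary shard that
// is not relocating must have its replication tracker in primary mode",
// expressed as an implication over the new routing entry.
assert newRouting.active() == false || newRouting.primary() == false || newRouting.relocating()
        || replicationTracker.isPrimaryMode()
        : "an active, non-relocating primary must be in primary mode " + newRouting;
```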

/**
 * Re-enables the default behavior of failing tests when shards created by this class fail.
 */
protected void failOnShardFailures() {
Contributor

this method is not used -> remove

Contributor Author

can do.
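For context, this is the kind of toggle pair such a helper would belong to. A hypothetical sketch only (the method was dropped from this PR as unused; all names below are illustrative, not the test framework's actual API):

```java
// Sketch of a test base class that lets individual tests opt in to shard
// failures, with a helper to restore the default strict behavior.
public abstract class ShardTestCaseSketch {

    // when true, any shard failure reported during the test fails the test
    private boolean failOnShardFailure = true;

    /** Allows shard failures without failing the test (for tests that provoke them on purpose). */
    protected void allowShardFailures() {
        failOnShardFailure = false;
    }

    /** Re-enables the default behavior of failing the test when a shard fails. */
    protected void failOnShardFailures() {
        failOnShardFailure = true;
    }

    /** Called by the harness when a shard fails; trips the test unless failures are allowed. */
    protected void onShardFailure(String shardId, Throwable cause) {
        if (failOnShardFailure) {
            throw new AssertionError("unexpected shard failure on " + shardId, cause);
        }
    }
}
```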

 * @param listeners an optional set of listeners to add to the shard
 */
protected IndexShard newShard(ShardRouting routing, ShardPath shardPath, IndexMetaData indexMetaData,
                              @Nullable IndexSearcherWrapper indexSearcherWrapper,
                              @Nullable EngineFactory engineFactory,
                              Runnable globalCheckpointSyncer,
                              IndexEventListener indexEventListener, boolean ignoreShardFailures,
                              IndexingOperationListener... listeners) throws IOException {
Contributor

is this parameter used anywhere?

Contributor Author

Leftover from a failed attempt, will remove. Good catch.

@bleskes bleskes merged commit 5856c39 into elastic:master Jul 18, 2018
@bleskes bleskes deleted the shard_recovery_promotion branch July 18, 2018 09:30
bleskes added a commit that referenced this pull request Jul 19, 2018
…2042)

When a replica is fully recovered (i.e., in `POST_RECOVERY` state) we send a request to the master
to start the shard. The master changes the state of the replica and publishes a cluster state to that
effect. In certain cases, that cluster state can be processed on the node hosting the replica
*together* with a cluster state that promotes that (now started) replica to a primary. This can
happen due to batched cluster state processing, or if the master died after having committed the
cluster state that starts the shard but before publishing it to the node with the replica. If the master
also held the primary shard, the new master node will remove the primary (as it failed) and will also
immediately promote the replica (thinking it is started).

Sadly our code in IndexShard didn't allow for this, which caused [assertions](https://github.com/elastic/elasticsearch/blob/13917162ad5c59a96ccb4d6a81a5044546c45c22/server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java#L482) to be tripped in some of our test runs.
bleskes added a commit that referenced this pull request Jul 19, 2018
…2042)

dnhatn added a commit that referenced this pull request Jul 20, 2018
* master:
  Painless: Simplify Naming in Lookup Package (#32177)
  Handle missing values in painless (#32207)
  add support for write index resolution when creating/updating documents (#31520)
  ECS Task IAM profile credentials ignored in repository-s3 plugin (#31864)
  Remove indication of future multi-homing support (#32187)
  Rest test - allow for snapshots to take 0 milliseconds
  Make x-pack-core generate a pom file
  Rest HL client: Add put watch action (#32026)
  Build: Remove pom generation for plugin zip files (#32180)
  Fix comments causing errors with Java 11
  Fix rollup on date fields that don't support epoch_millis (#31890)
  Detect and prevent configuration that triggers a Gradle bug (#31912)
  [test] port linux package packaging tests (#31943)
  Revert "Introduce a Hashing Processor (#31087)" (#32178)
  Remove empty @return from JavaDoc
  Adjust SSLDriver behavior for JDK11 changes (#32145)
  [test] use randomized runner in packaging tests (#32109)
  Add support for field aliases. (#32172)
  Painless: Fix caching bug and clean up addPainlessClass. (#32142)
  Call setReferences() on custom referring tokenfilters in _analyze (#32157)
  Fix BwC Tests looking for UUID Pre 6.4 (#32158)
  Improve docs for search preferences (#32159)
  use before instead of onOrBefore
  Add more contexts to painless execute api (#30511)
  Add EC2 credential test for repository-s3 (#31918)
  A replica can be promoted and started in one cluster state update (#32042)
  Fix Java 11 javadoc compile problem
  Fix CP for namingConventions when gradle home has spaces (#31914)
  Fix `range` queries on `_type` field for singe type indices (#31756)
  [DOCS] Update TLS on Docker for 6.3 (#32114)
  ESIndexLevelReplicationTestCase doesn't support replicated failures but it's good to know what they are
  Remove versionType from translog (#31945)
  Switch distribution to new style Requests (#30595)
  Build: Skip jar tests if jar disabled
  Painless: Add PainlessClassBuilder (#32141)
  Build: Make additional test deps of check (#32015)
  Disable C2 from using AVX-512 on JDK 10 (#32138)
  Build: Move shadow customizations into common code (#32014)
  Painless: Fix Bug with Duplicate PainlessClasses (#32110)
  Remove empty @param from Javadoc
  Re-disable packaging tests on suse boxes
  Docs: Fix missing example script quote (#32010)
  [ML] Wait for aliases in multi-node tests (#32086)
  [ML] Move analyzer dependencies out of categorization config (#32123)
  Ensure to release translog snapshot in primary-replica resync (#32045)
  Handle TokenizerFactory  TODOs (#32063)
  Relax TermVectors API to work with textual fields other than TextFieldType (#31915)
  Updates the build to gradle 4.9 (#32087)
  Mute :qa:mixed-cluster indices.stats/10_index/Index - all'
  Check that client methods match API defined in the REST spec (#31825)
  Enable testing in FIPS140 JVM (#31666)
  Fix put mappings java API documentation (#31955)
  Add exclusion option to `keep_types` token filter (#32012)
  [Test] Modify assert statement for ssl handshake (#32072)
martijnvg added a commit that referenced this pull request Jul 21, 2018
* es/6.x: (24 commits)
  Fix broken backport
  Switch full-cluster-restart to new style Requests (#32140)
  Fix multi level nested sort (#32204)
  MINOR: Remove unused `IndexDynamicSettings` (#32237) (#32248)
  [Tests] Remove QueryStringQueryBuilderTests#toQuery class assertions (#32236)
  Switch rolling restart to new style Requests (#32147)
  Enhance Parent circuit breaker error message (#32056)
  [ML] Use default request durability for .ml-state index (#32233)
  Enable testing in FIPS140 JVM (#31666) (#32231)
  Remove indices stats timeout from monitoring docs
  TESTS: Check for Netty resource leaks (#31861) (#32225)
  Rename ranking evaluation response section (#32166)
  Dependencies: Upgrade to joda time 2.10 (#32160)
  Backport SSL context names (#30953) to 6.x (#32223)
  Require Gradle 4.9  as minimum version (#32200)
  Detect old trial licenses and mimic behaviour (#32209)
  Painless: Simplify Naming in Lookup Package (#32177)
  add support for write index resolution when creating/updating documents (#31520)
  A replica can be promoted and started in one cluster state update (#32042)
  Rest test - allow for snapshots to take 0 milliseconds
  ...
@jpountz jpountz removed the :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. label Jan 29, 2019
Labels
>bug :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.3.2 v6.4.0 v7.0.0-beta1