QA: Switch rolling upgrade to 3 nodes #30728
Conversation
Switches the rolling upgrade tests from upgrading two nodes to upgrading three nodes, which is much more realistic and much better able to find unexpected bugs. It upgrades the nodes one at a time and runs tests between each upgrade. As such this now has four test runs:

1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded

It sets system properties so the tests can figure out which stage they are in. It reuses the same yml tests for the "one third" and "two thirds" cases because they are *almost* the same case.

This rewrites the yml-based indexing tests to be Java based because the yml-based tests can't handle different expected values for the counts, and the indexing tests need that when they are run twice.
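The stage plumbing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the property name `tests.upgrade_stage` and the enum are hypothetical stand-ins for whatever the PR really uses.

```java
import java.util.Locale;

public class UpgradeStageSketch {
    // The four runs the PR describes: old, one third, two thirds, upgraded.
    enum Stage { OLD, ONE_THIRD, TWO_THIRDS, UPGRADED }

    // Reads the stage from a system property the Gradle runner would set,
    // e.g. -Dtests.upgrade_stage=two_thirds (hypothetical property name).
    static Stage currentStage() {
        String raw = System.getProperty("tests.upgrade_stage", "old");
        return Stage.valueOf(raw.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.setProperty("tests.upgrade_stage", "two_thirds");
        System.out.println(currentStage()); // TWO_THIRDS
    }
}
```

With a scheme like this, the "one third" and "two thirds" runs can share the same test sources and branch on the stage where their expectations differ.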
Pinging @elastic/es-core-infra
@@ -258,7 +217,7 @@ public void testRelocationWithConcurrentIndexing() throws Exception {
                break;
            case UPGRADED:
I'm aware that during the `MIXED` branch of this switch we index the same documents twice and that might not be great. But it is simple. If we don't like it we can change it though.
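Indexing the same documents twice is benign for count assertions as long as the documents carry explicit IDs, because the second write overwrites rather than duplicates. A minimal sketch of that overwrite semantics (not the PR's code; the in-memory map stands in for an index):

```java
import java.util.HashMap;
import java.util.Map;

public class DoubleIndexSketch {
    static final Map<String, String> index = new HashMap<>();

    // Overwrite semantics, like indexing a document with an explicit _id.
    static void indexDoc(String id, String source) {
        index.put(id, source);
    }

    public static void main(String[] args) {
        for (int round = 0; round < 2; round++) {   // the branch runs twice
            for (int i = 0; i < 5; i++) {
                indexDoc("doc-" + i, "{\"n\":" + i + "}");
            }
        }
        System.out.println(index.size()); // still 5, not 10
    }
}
```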
dependsOn lastRunner, "${baseName}#oldClusterTestCluster#node${stopNode}.stop"
clusterName = 'rolling-upgrade'
unicastTransportUri = { seedNode, node, ant -> unicastSeed() }
minimumMasterNodes = { 3 }
Would it work with 2 as well? That would be more realistic for a rolling-upgrade test.
Let me experiment with that. I suspect it'll work, but I think `minimumMasterNodes = 3` was there for the wait condition. Checking.
I think this is a bit tricky actually. The wait condition looks like it relies on minimum master nodes to not start the tests before the node has actually joined the cluster. I suspect I could switch it to wait for the right number of nodes, but that might have side effects that I don't understand yet.
So I'd like to do it in a follow-up change that I don't backport to 6.3.
The waitCondition knows nothing about min master nodes. It only checks that there are numNodes nodes in the cluster.
I think you need to change the setting, though, on every cluster (including old cluster). Maybe that caused problems (the older nodes would still think min master nodes was 3?).
Right, I'd need to change it everywhere.
The wait condition doesn't use minimum master nodes, but it does use the cluster not responding to HTTP requests as a signal. I believe we don't respond to the request that it uses until we get all the master nodes. Let me recheck that assumption.
It'll reply, but with a 503, so we retry:

```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_block_exception",
        "reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
      }
    ],
    "type" : "cluster_block_exception",
    "reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
  },
  "status" : 503
}
```
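The retry behaviour described here amounts to polling until the cluster stops answering 503 and returns 200. A minimal sketch of that loop, with the HTTP call replaced by a simulated sequence of status codes (the real wait condition does an HTTP GET against the node):

```java
import java.util.Iterator;
import java.util.List;

public class WaitFor200Sketch {
    // Polls until a 200 arrives; returns the attempt number that succeeded.
    static int waitUntilReady(Iterator<Integer> statusCodes, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int status = statusCodes.next();
            if (status == 200) {
                return attempt; // cluster state recovered, stop retrying
            }
            // 503 means "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered]"
        }
        throw new IllegalStateException("cluster never became ready");
    }

    public static void main(String[] args) {
        Iterator<Integer> responses = List.of(503, 503, 200).iterator();
        System.out.println(waitUntilReady(responses, 10)); // 3
    }
}
```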
Gradle side looks....ok
I had a few more thoughts.
Task oneThirdUpgradedTest = tasks.create(name: "${baseName}#oneThirdUpgradedTest", type: RestIntegTestTask)

configureUpgradeCluster("oneThirdUpgradedTestCluster", oldClusterTestRunner,
    0, { oldClusterTest.nodes.get(1).transportUri() })
I don't think we should kill node 0 of the old cluster to start. Instead, work backwards from node 2. It is most likely node 0 is master (node 1 and 2 used it as their unicast host).
Hmm - triggering a new master election is kind of useful for testing but I'm fine either way.
By doing it last, we don't have to go through 2 master elections.
I'd prefer to have as many master elections as I can get to be honest. More master elections is more chances to catch the kind of weird stuff that this test should be catching.
It looks like, because we set minimum master nodes to 3, the first node doesn't really have any more chance of winning the master election. If we'd set it to 1 or 2 it'd have more of a chance, but with it set to 3 the cluster has to wait for all the nodes to show up before holding the election, and all nodes have an equal shot.
It is not most likely that node 0 is the master. With minimum master nodes set to three here, an election will not happen until all three nodes are present. When a node is voting for a master, it chooses the one with the minimal node ID that it sees among all the master-eligible nodes. As all three nodes have to be present, and will see the same three nodes, they will all select the same minimal node ID, and that will be the elected master. As node IDs are random, the master will be random amongst node 0, node 1, and node 2.
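The election rule described here can be sketched in a few lines: voting waits until `minimum_master_nodes` master-eligible nodes are visible, and every voter picks the minimal node ID, so the winner is whichever node happened to draw the smallest random ID. A simplified illustration, not Elasticsearch's actual discovery code:

```java
import java.util.Collections;
import java.util.List;

public class ElectionSketch {
    // Returns the elected master's node ID, or null if there are not yet
    // enough visible master-eligible nodes to hold an election.
    static String electMaster(List<String> visibleNodeIds, int minimumMasterNodes) {
        if (visibleNodeIds.size() < minimumMasterNodes) {
            return null; // election deferred until enough nodes join
        }
        return Collections.min(visibleNodeIds); // all voters agree on the minimal ID
    }

    public static void main(String[] args) {
        // Random-looking node IDs; startup order (node 0 first) is irrelevant.
        List<String> ids = List.of("q9XanLcH", "A3fKlm21", "Zt8uWq0e");
        System.out.println(electMaster(ids.subList(0, 2), 3)); // null: still waiting
        System.out.println(electMaster(ids, 3));               // A3fKlm21
    }
}
```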
 * so they are alive during the test.
 */
finalizedBy "${baseName}#oneThirdUpgradedTestCluster#stop"
finalizedBy "${baseName}#twoThirdsUpgradedTestCluster#stop"
This doesn't actually force them to be stopped, it only means they will be run after this task at some point (like dependsOn only ensures a task is run before, but not right before, or in any order). In addition to finalizedBy here, the task upgradedClusterTest needs to depend on these stop tasks.
I think this handles that already. It runs after the project evaluates and makes the integ test task depend on everything that finalizes the runner and these lines make the "stop" tasks finalize the runner.
Ok, that should work.
* master:
  * Reduce CLI scripts to one-liners (#30759)
  * SQL: Preserve scoring in bool queries (#30730)
  * QA: Switch rolling upgrade to 3 nodes (#30728)
  * [TEST] Enable DEBUG logging on testAutoQueueSizingWithMax
  * [ML] Don't install empty ML metadata on startup (#30751)
  * Add assertion on removing copy_settings (#30748)
  * bump lucene version for 6_3_0
  * [DOCS] Mark painless execute api as experimental (#30710)
  * disable annotation processor for docs (#30610)
  * Add more script contexts (#30721)
  * Fix default shards count in create index docs (#30747)
  * Mute testCorruptFileThenSnapshotAndRestore
* es/ccr: (50 commits)
  * Reduce CLI scripts to one-liners (elastic#30759)
  * SQL: Preserve scoring in bool queries (elastic#30730)
  * QA: Switch rolling upgrade to 3 nodes (elastic#30728)
  * [TEST] Enable DEBUG logging on testAutoQueueSizingWithMax
  * [ML] Don't install empty ML metadata on startup (elastic#30751)
  * Add assertion on removing copy_settings (elastic#30748)
  * bump lucene version for 6_3_0
  * [DOCS] Mark painless execute api as experimental (elastic#30710)
  * disable annotation processor for docs (elastic#30610)
  * Add more script contexts (elastic#30721)
  * Fix default shards count in create index docs (elastic#30747)
  * Mute testCorruptFileThenSnapshotAndRestore
  * Scripting: Remove getDate methods from ScriptDocValues (elastic#30690)
  * Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (elastic#30726)
  * [Docs] Fix single page :docs:check invocation (elastic#30725)
  * Docs: Add uptasticsearch to list of clients (elastic#30738)
  * [DOCS] Removes out-dated x-pack/docs/en/index.asciidoc
  * [DOCS] Removes redundant index.asciidoc files (elastic#30707)
  * [TEST] Reduce forecast overflow to disk test memory limit (elastic#30727)
  * Plugins: Remove meta plugins (elastic#30670)
  * ...
A rolling upgrade from oss Elasticsearch to the default distribution of Elasticsearch is significantly different than a full cluster restart to install a plugin and is again different from starting a new cluster with xpack installed. So this adds some basic tests to make sure that the rolling upgrade that enables xpack works at all. This also removes some unused imports from the tests that I modified in PR elastic#30728. I didn't mean to leave them.
Switches the rolling upgrade tests from upgrading two nodes to upgrading
three nodes, which is much more realistic and much better able to find
unexpected bugs. It upgrades the nodes one at a time and runs tests
between each upgrade. As such this now has four test runs:

1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded

It sets system properties so the tests can figure out which stage they
are in. It reuses the same yml tests for the "one third" and "two
thirds" cases because they are almost the same case.

This rewrites the yml-based indexing tests to be Java based because the
yml-based tests can't handle different expected values for the counts,
and the indexing tests need that when they are run twice.
Closes #25336