
QA: Switch rolling upgrade to 3 nodes #30728

Merged: 3 commits merged into elastic:master on May 21, 2018

Conversation

@nik9000 (Member) commented on May 18, 2018:

Switches the rolling upgrade tests from upgrading two nodes to upgrading
three nodes, which is much more realistic and much better able to find
unexpected bugs. It upgrades the nodes one at a time and runs tests
between each upgrade. As such, this now has four test runs:

  1. Old
  2. One third upgraded
  3. Two thirds upgraded
  4. Upgraded

It sets system properties so the tests can figure out which stage they
are in. It reuses the same yml tests for the "one third" and "two
thirds" cases because they are almost the same case.

This rewrites the yml-based indexing tests to be Java-based because the
yml-based tests can't handle different expected values for the counts,
and the indexing tests need that when they are run twice.

Closes #25336
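As a rough illustration of the stage system properties mentioned above, a Java test could branch on a property set by the build. This is a minimal sketch only: the property name "tests.upgrade_stage" and the stage value strings are assumptions for illustration, not necessarily the names this PR wires up.

// Minimal sketch, assuming a hypothetical "tests.upgrade_stage" system property;
// the real property and value names come from the Gradle configuration.
public abstract class AbstractStageAwareTestCase {

    protected enum ClusterStage { OLD, ONE_THIRD_UPGRADED, TWO_THIRDS_UPGRADED, UPGRADED }

    protected static ClusterStage currentStage() {
        String stage = System.getProperty("tests.upgrade_stage", "old");
        switch (stage) {
            case "old":
                return ClusterStage.OLD;
            case "one_third_upgraded":
                return ClusterStage.ONE_THIRD_UPGRADED;
            case "two_thirds_upgraded":
                return ClusterStage.TWO_THIRDS_UPGRADED;
            case "upgraded":
                return ClusterStage.UPGRADED;
            default:
                throw new IllegalArgumentException("unknown stage [" + stage + "]");
        }
    }
}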

@nik9000 added the >test (Issues or PRs that are addressing/adding tests), review, :Delivery/Build (Build or test infrastructure), v7.0.0, v6.3.0, and v6.4.0 labels on May 18, 2018
@elasticmachine (Collaborator):

Pinging @elastic/es-core-infra

@@ -258,7 +217,7 @@ public void testRelocationWithConcurrentIndexing() throws Exception {
break;
case UPGRADED:
nik9000 (Member, Author):

I'm aware that during the MIXED branch of this switch we index the same documents twice and that might not be great. But it is simple. If we don't like it we can change it though.
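For context, the switch in question has roughly this shape. This is a hedged sketch with a hypothetical indexDocs(index, startId, count) helper, not the exact test code; the point is that the MIXED branch runs in both mixed-cluster rounds, so whatever it indexes is indexed twice across the upgrade.

// Illustrative sketch only: the MIXED branch executes once per mixed round.
private void indexByStage(ClusterType clusterType, String index) throws Exception {
    switch (clusterType) {
        case OLD:
            indexDocs(index, 0, 50);
            break;
        case MIXED:
            indexDocs(index, 50, 50);   // runs in both the one-third and two-thirds rounds
            break;
        case UPGRADED:
            indexDocs(index, 100, 50);
            break;
        default:
            throw new UnsupportedOperationException("unknown cluster type [" + clusterType + "]");
    }
}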

dependsOn lastRunner, "${baseName}#oldClusterTestCluster#node${stopNode}.stop"
clusterName = 'rolling-upgrade'
unicastTransportUri = { seedNode, node, ant -> unicastSeed() }
minimumMasterNodes = { 3 }
Contributor:

Would it work with 2 as well? That would be more realistic for a rolling-upgrade test.

nik9000 (Member, Author):

Let me experiment with that. I suspect it'll work but I think minimumMasterNodes=3 was there for the wait condition. Checking.

nik9000 (Member, Author):

I think this is a bit tricky, actually. The wait condition looks like it relies on minimum master nodes to not start the tests before the node has actually joined the cluster. I suspect I could switch it to wait for the right number of nodes, but that might have side effects that I don't understand yet.

nik9000 (Member, Author):

So I'd like to do it in a follow-up change that I don't backport to 6.3.

Member:

The waitCondition knows nothing about min master nodes. It only checks there are numNodes nodes in the cluster.

Member:

I think you need to change the setting, though, on every cluster (including old cluster). Maybe that caused problems (the older nodes would still think min master nodes was 3?).

nik9000 (Member, Author):

Right, I'd need to change it everywhere.

The wait condition doesn't use minimum master nodes but it does use the cluster not responding to http requests as a signal. I believe we don't respond to the request that it uses until we get all the master nodes. Let me recheck that assumption.

nik9000 (Member, Author):

It'll reply but with a 503 so we retry:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_block_exception",
        "reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
      }
    ],
    "type" : "cluster_block_exception",
    "reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
  },
  "status" : 503
}
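In other words, the wait condition can treat anything other than a 200 as "not ready yet" and keep retrying. A generic sketch of that kind of polling loop is below; it is not the build's actual ant-based wait condition, just the idea behind it.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Generic illustration: poll the node's HTTP port until it stops answering with
// 503 (cluster state not recovered) and returns 200, or give up after N attempts.
final class WaitForNodeSketch {
    static boolean waitForNode(String httpUri, int attempts) throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            try {
                HttpURLConnection connection =
                    (HttpURLConnection) URI.create(httpUri + "/").toURL().openConnection();
                connection.setConnectTimeout(1000);
                connection.setReadTimeout(1000);
                if (connection.getResponseCode() == 200) {
                    return true;
                }
            } catch (IOException e) {
                // node not listening yet, or request failed; retry below
            }
            Thread.sleep(500);
        }
        return false;
    }
}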

@rjernst (Member) left a comment:

Gradle side looks....ok

@rjernst (Member) left a comment:

I had a few more thoughts.

Task oneThirdUpgradedTest = tasks.create(name: "${baseName}#oneThirdUpgradedTest", type: RestIntegTestTask)

configureUpgradeCluster("oneThirdUpgradedTestCluster", oldClusterTestRunner,
0, { oldClusterTest.nodes.get(1).transportUri() })
Member:

I don't think we should kill node 0 of the old cluster to start. Instead, work backwards from node 2. It is most likely node 0 is the master (nodes 1 and 2 used it as their unicast host).

nik9000 (Member, Author):

Hmm - triggering a new master election is kind of useful for testing but I'm fine either way.

Member:

By doing it last, we don't have to go through 2 master elections.

nik9000 (Member, Author):

I'd prefer to have as many master elections as I can get to be honest. More master elections is more chances to catch the kind of weird stuff that this test should be catching.

nik9000 (Member, Author):

It looks like because we set minimum master nodes to 3, the first node doesn't really have any better chance of winning the master election. If we'd set it to 1 or 2 it'd have more of a chance, but with it set to 3 the cluster has to wait for all the nodes to show up before holding the election, and all nodes have an equal shot.

Member:

It is not most likely that node 0 is the master. With minimum master nodes set to three here, an election will not happen until all three nodes are present. When a node is voting for a master, it chooses the one with the minimal node ID that it sees among all the master-eligible nodes. As all three nodes have to be present, and will see the same three nodes, they will all select the same minimal node ID, and that node will be the elected master. As node IDs are random, the master will be random amongst node 0, node 1, and node 2.
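A simplified illustration of that tie-break is below; the names and structure are illustrative, not taken from the actual Zen discovery code. Once a quorum of master-eligible nodes is visible, every voter picks the candidate with the smallest node ID, and since IDs are random, so is the winner.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Simplified sketch of the election tie-break described above.
final class MasterElectionSketch {
    record Candidate(String nodeId, String nodeName) {}

    static Optional<Candidate> electMaster(List<Candidate> visibleMasterNodes, int minimumMasterNodes) {
        if (visibleMasterNodes.size() < minimumMasterNodes) {
            return Optional.empty(); // not enough nodes visible yet: no election happens
        }
        // every voter sees the same candidates, so they all agree on the minimal node ID
        return visibleMasterNodes.stream().min(Comparator.comparing(Candidate::nodeId));
    }
}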

* so they are alive during the test.
*/
finalizedBy "${baseName}#oneThirdUpgradedTestCluster#stop"
finalizedBy "${baseName}#twoThirdsUpgradedTestCluster#stop"
Member:

This doesn't actually force them to be stopped; it only means they will be run after this task at some point (just as dependsOn only ensures a task is run before, but not right before, nor in any particular order). In addition to finalizedBy here, the task upgradedClusterTest needs to depend on these stop tasks.

nik9000 (Member, Author):

I think this handles that already. It runs after the project evaluates and makes the integ test task depend on everything that finalizes the runner, and these lines make the "stop" tasks finalize the runner.

Member:

Ok, that should work.

@nik9000 merged commit be8c0f2 into elastic:master on May 21, 2018
dnhatn added a commit that referenced this pull request May 21, 2018
* master:
  Reduce CLI scripts to one-liners (#30759)
  SQL: Preserve scoring in bool queries (#30730)
  QA: Switch rolling upgrade to 3 nodes (#30728)
  [TEST] Enable DEBUG logging on testAutoQueueSizingWithMax
  [ML] Don't install empty ML metadata on startup (#30751)
  Add assertion on removing copy_settings (#30748)
  bump lucene version for 6_3_0
  [DOCS] Mark painless execute api as experimental (#30710)
  disable annotation processor for docs (#30610)
  Add more script contexts (#30721)
  Fix default shards count in create index docs (#30747)
  Mute testCorruptFileThenSnapshotAndRestore
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request May 22, 2018
* es/ccr: (50 commits)
  Reduce CLI scripts to one-liners (elastic#30759)
  SQL: Preserve scoring in bool queries (elastic#30730)
  QA: Switch rolling upgrade to 3 nodes (elastic#30728)
  [TEST] Enable DEBUG logging on testAutoQueueSizingWithMax
  [ML] Don't install empty ML metadata on startup (elastic#30751)
  Add assertion on removing copy_settings (elastic#30748)
  bump lucene version for 6_3_0
  [DOCS] Mark painless execute api as experimental (elastic#30710)
  disable annotation processor for docs (elastic#30610)
  Add more script contexts (elastic#30721)
  Fix default shards count in create index docs (elastic#30747)
  Mute testCorruptFileThenSnapshotAndRestore
  Scripting: Remove getDate methods from ScriptDocValues (elastic#30690)
  Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (elastic#30726)
  [Docs] Fix single page :docs:check invocation (elastic#30725)
  Docs: Add uptasticsearch to list of clients (elastic#30738)
  [DOCS] Removes out-dated x-pack/docs/en/index.asciidoc
  [DOCS] Removes redundant index.asciidoc files (elastic#30707)
  [TEST] Reduce forecast overflow to disk test memory limit (elastic#30727)
  Plugins: Remove meta plugins (elastic#30670)
  ...
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 22, 2018
A rolling upgrade from oss Elasticsearch to the default distribution of
Elasticsearch is significantly different than a full cluster restart to
install a plugin and is again different from starting a new cluster with
xpack installed. So this adds some basic tests to make sure that the
rolling upgrade that enables xpack works at all.

This also removes some unused imports from the tests that I modified in
PR elastic#30728. I didn't mean to leave them.
nik9000 added a commit that referenced this pull request May 22, 2018
A rolling upgrade from oss Elasticsearch to the default distribution of
Elasticsearch is significantly different than a full cluster restart to
install a plugin and is again different from starting a new cluster with
xpack installed. So this adds some basic tests to make sure that the
rolling upgrade that enables xpack works at all.

This also removes some unused imports from the tests that I modified in
PR #30728. I didn't mean to leave them.
nik9000 added a commit that referenced this pull request May 22, 2018
A rolling upgrade from oss Elasticsearch to the default distribution of
Elasticsearch is significantly different than a full cluster restart to
install a plugin and is again different from starting a new cluster with
xpack installed. So this adds some basic tests to make sure that the
rolling upgrade that enables xpack works at all.

This also removes some unused imports from the tests that I modified in
PR #30728. I didn't mean to leave them.
nik9000 added a commit that referenced this pull request May 22, 2018
A rolling upgrade from oss Elasticsearch to the default distribution of
Elasticsearch is significantly different than a full cluster restart to
install a plugin and is again different from starting a new cluster with
xpack installed. So this adds some basic tests to make sure that the
rolling upgrade that enables xpack works at all.

This also removes some unused imports from the tests that I modified in
PR #30728. I didn't mean to leave them.
@mark-vieira added the Team:Delivery label (Meta label for Delivery team) on Nov 11, 2020
Labels
:Delivery/Build (Build or test infrastructure), Team:Delivery (Meta label for Delivery team), >test (Issues or PRs that are addressing/adding tests), v6.3.0, v6.4.0, v7.0.0-beta1
Development

Successfully merging this pull request may close these issues.

Rolling upgrade tests with 3 nodes
7 participants