QA: Switch rolling upgrade to 3 nodes #30728
Conversation
Switches the rolling upgrade tests from upgrading two nodes to upgrading three nodes, which is much more realistic and much better able to find unexpected bugs. It upgrades the nodes one at a time and runs tests between each upgrade. As such this now has four test runs:

1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded

It sets system properties so the tests can figure out which stage they are in. It reuses the same yml tests for the "one third" and "two thirds" cases because they are *almost* the same case.

This rewrites the yml-based indexing tests to be Java based because the yml-based tests can't handle different expected values for the counts, and the indexing tests need that when they are run twice.
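The stage plumbing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the property name `tests.upgrade_stage` and the enum are hypothetical stand-ins for whatever the PR really uses.

```java
import java.util.Locale;

public class UpgradeStageSketch {
    // The four runs the PR describes: old, one third, two thirds, upgraded.
    enum Stage { OLD, ONE_THIRD, TWO_THIRDS, UPGRADED }

    // Reads the stage from a system property the Gradle runner would set,
    // e.g. -Dtests.upgrade_stage=two_thirds (hypothetical property name).
    static Stage currentStage() {
        String raw = System.getProperty("tests.upgrade_stage", "old");
        return Stage.valueOf(raw.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.setProperty("tests.upgrade_stage", "two_thirds");
        System.out.println(currentStage()); // TWO_THIRDS
    }
}
```

With a scheme like this, the "one third" and "two thirds" runs can share the same test sources and branch on the stage where their expectations differ.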
Pinging @elastic/es-core-infra
@@ -258,7 +217,7 @@ public void testRelocationWithConcurrentIndexing() throws Exception {
                break;
            case UPGRADED:
I'm aware that during the `MIXED` branch of this switch we index the same documents twice and that might not be great. But it is simple. If we don't like it we can change it though.
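Indexing the same documents twice is benign for count assertions as long as the documents carry explicit IDs, because the second write overwrites rather than duplicates. A minimal sketch of that overwrite semantics (not the PR's code; the in-memory map stands in for an index):

```java
import java.util.HashMap;
import java.util.Map;

public class DoubleIndexSketch {
    static final Map<String, String> index = new HashMap<>();

    // Overwrite semantics, like indexing a document with an explicit _id.
    static void indexDoc(String id, String source) {
        index.put(id, source);
    }

    public static void main(String[] args) {
        for (int round = 0; round < 2; round++) {   // the branch runs twice
            for (int i = 0; i < 5; i++) {
                indexDoc("doc-" + i, "{\"n\":" + i + "}");
            }
        }
        System.out.println(index.size()); // still 5, not 10
    }
}
```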
dependsOn lastRunner, "${baseName}#oldClusterTestCluster#node${stopNode}.stop"
clusterName = 'rolling-upgrade'
unicastTransportUri = { seedNode, node, ant -> unicastSeed() }
minimumMasterNodes = { 3 }
Would it work with 2 as well? That would be more realistic for a rolling-upgrade test.
Let me experiment with that. I suspect it'll work, but I think `minimumMasterNodes = 3` was there for the wait condition. Checking.
I think this is a bit tricky actually. The wait condition looks like it relies on minimum master nodes to not start the tests before the node has actually joined the cluster. I suspect I could switch it to wait for the right number of nodes, but that might have side effects that I don't understand yet.
So I'd like to do it in a follow-up change that I don't backport to 6.3.
The waitCondition knows nothing about min master nodes. It only checks that there are numNodes nodes in the cluster.
I think you need to change the setting, though, on every cluster (including old cluster). Maybe that caused problems (the older nodes would still think min master nodes was 3?).
Right, I'd need to change it everywhere.
The wait condition doesn't use minimum master nodes, but it does use the cluster not responding to HTTP requests as a signal. I believe we don't respond to the request that it uses until we get all the master nodes. Let me recheck that assumption.
It'll reply, but with a 503, so we retry:

```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_block_exception",
        "reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
      }
    ],
    "type" : "cluster_block_exception",
    "reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
  },
  "status" : 503
}
```
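The retry behaviour described here amounts to polling until the cluster stops answering 503 and returns 200. A minimal sketch of that loop, with the HTTP call replaced by a simulated sequence of status codes (the real wait condition does an HTTP GET against the node):

```java
import java.util.Iterator;
import java.util.List;

public class WaitFor200Sketch {
    // Polls until a 200 arrives; returns the attempt number that succeeded.
    static int waitUntilReady(Iterator<Integer> statusCodes, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int status = statusCodes.next();
            if (status == 200) {
                return attempt; // cluster state recovered, stop retrying
            }
            // 503 means "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered]"
        }
        throw new IllegalStateException("cluster never became ready");
    }

    public static void main(String[] args) {
        Iterator<Integer> responses = List.of(503, 503, 200).iterator();
        System.out.println(waitUntilReady(responses, 10)); // 3
    }
}
```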
Gradle side looks....ok
I had a few more thoughts.
Task oneThirdUpgradedTest = tasks.create(name: "${baseName}#oneThirdUpgradedTest", type: RestIntegTestTask)

configureUpgradeCluster("oneThirdUpgradedTestCluster", oldClusterTestRunner,
    0, { oldClusterTest.nodes.get(1).transportUri() })
I don't think we should kill node 0 of the old cluster to start. Instead, work backwards from node 2. It is most likely node 0 is master (node 1 and 2 used it as their unicast host).
Hmm - triggering a new master election is kind of useful for testing but I'm fine either way.
By doing it last, we don't have to go through 2 master elections.
I'd prefer to have as many master elections as I can get to be honest. More master elections is more chances to catch the kind of weird stuff that this test should be catching.
It looks like, because we set minimum master nodes to 3, the first node doesn't really have any more chance of winning the master election. If we'd set it to 1 or 2 it'd have more of a chance, but with it set to 3 the cluster has to wait for all the nodes to show up before holding the election, and all nodes have an equal shot.
It is not most likely that node 0 is the master. With minimum master nodes set to three here, an election will not happen until all three nodes are present. When a node is voting for a master, it chooses the one with the minimal node ID that it sees among all the master-eligible nodes. As all three nodes have to be present, and will see the same three nodes, they will all select the same minimal node ID, and that will be the elected master. As node IDs are random, the master will be random amongst node 0, node 1, and node 2.
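The election rule described here can be sketched in a few lines: voting waits until `minimum_master_nodes` master-eligible nodes are visible, and every voter picks the minimal node ID, so the winner is whichever node happened to draw the smallest random ID. A simplified illustration, not Elasticsearch's actual discovery code:

```java
import java.util.Collections;
import java.util.List;

public class ElectionSketch {
    // Returns the elected master's node ID, or null if there are not yet
    // enough visible master-eligible nodes to hold an election.
    static String electMaster(List<String> visibleNodeIds, int minimumMasterNodes) {
        if (visibleNodeIds.size() < minimumMasterNodes) {
            return null; // election deferred until enough nodes join
        }
        return Collections.min(visibleNodeIds); // all voters agree on the minimal ID
    }

    public static void main(String[] args) {
        // Random-looking node IDs; startup order (node 0 first) is irrelevant.
        List<String> ids = List.of("q9XanLcH", "A3fKlm21", "Zt8uWq0e");
        System.out.println(electMaster(ids.subList(0, 2), 3)); // null: still waiting
        System.out.println(electMaster(ids, 3));               // A3fKlm21
    }
}
```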
 * so they are alive during the test.
 */
finalizedBy "${baseName}#oneThirdUpgradedTestCluster#stop"
finalizedBy "${baseName}#twoThirdsUpgradedTestCluster#stop"
This doesn't actually force them to be stopped, it only means they will be run after this task at some point (like dependsOn only ensures a task is run before, but not right before, or in any order). In addition to finalizedBy here, the task upgradedClusterTest needs to depend on these stop tasks.
I think this handles that already. It runs after the project evaluates and makes the integ test task depend on everything that finalizes the runner and these lines make the "stop" tasks finalize the runner.
Ok, that should work.
* master:
  * Reduce CLI scripts to one-liners (#30759)
  * SQL: Preserve scoring in bool queries (#30730)
  * QA: Switch rolling upgrade to 3 nodes (#30728)
  * [TEST] Enable DEBUG logging on testAutoQueueSizingWithMax
  * [ML] Don't install empty ML metadata on startup (#30751)
  * Add assertion on removing copy_settings (#30748)
  * bump lucene version for 6_3_0
  * [DOCS] Mark painless execute api as experimental (#30710)
  * disable annotation processor for docs (#30610)
  * Add more script contexts (#30721)
  * Fix default shards count in create index docs (#30747)
  * Mute testCorruptFileThenSnapshotAndRestore
* es/ccr: (50 commits)
  * Reduce CLI scripts to one-liners (elastic#30759)
  * SQL: Preserve scoring in bool queries (elastic#30730)
  * QA: Switch rolling upgrade to 3 nodes (elastic#30728)
  * [TEST] Enable DEBUG logging on testAutoQueueSizingWithMax
  * [ML] Don't install empty ML metadata on startup (elastic#30751)
  * Add assertion on removing copy_settings (elastic#30748)
  * bump lucene version for 6_3_0
  * [DOCS] Mark painless execute api as experimental (elastic#30710)
  * disable annotation processor for docs (elastic#30610)
  * Add more script contexts (elastic#30721)
  * Fix default shards count in create index docs (elastic#30747)
  * Mute testCorruptFileThenSnapshotAndRestore
  * Scripting: Remove getDate methods from ScriptDocValues (elastic#30690)
  * Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (elastic#30726)
  * [Docs] Fix single page :docs:check invocation (elastic#30725)
  * Docs: Add uptasticsearch to list of clients (elastic#30738)
  * [DOCS] Removes out-dated x-pack/docs/en/index.asciidoc
  * [DOCS] Removes redundant index.asciidoc files (elastic#30707)
  * [TEST] Reduce forecast overflow to disk test memory limit (elastic#30727)
  * Plugins: Remove meta plugins (elastic#30670)
  * ...
A rolling upgrade from oss Elasticsearch to the default distribution of Elasticsearch is significantly different than a full cluster restart to install a plugin and is again different from starting a new cluster with xpack installed. So this adds some basic tests to make sure that the rolling upgrade that enables xpack works at all. This also removes some unused imports from the tests that I modified in PR elastic#30728. I didn't mean to leave them.
Switches the rolling upgrade tests from upgrading two nodes to upgrading
three nodes, which is much more realistic and much better able to find
unexpected bugs. It upgrades the nodes one at a time and runs tests
between each upgrade. As such this now has four test runs:

1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded

It sets system properties so the tests can figure out which stage they
are in. It reuses the same yml tests for the "one third" and "two
thirds" cases because they are almost the same case.

This rewrites the yml-based indexing tests to be Java based because the
yml-based tests can't handle different expected values for the counts,
and the indexing tests need that when they are run twice.
Closes #25336