Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure ILM policies run safely on leader indices #38140

Merged
merged 6 commits into from
Feb 2, 2019

Conversation

gwbrown
Copy link
Contributor

@gwbrown gwbrown commented Jan 31, 2019

Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.

Relates to #34648


There will likely be a change to the integration test, as I'm still seeing occasional failures in it, which don't appear to be "real" failures (just hitting timeouts sometimes), but the code is otherwise ready to review.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features

@gwbrown
Copy link
Contributor Author

gwbrown commented Jan 31, 2019

This will also need to be updated after #37951 is merged, as that PR contains changes which will cause this one to fail to build. The changes should be pretty minor, though.

@@ -368,6 +326,70 @@ public void testUnfollowInjectedBeforeShrink() throws Exception {
}
}

public void testCannotShrinkLeaderIndex() throws Exception {
String indexName = "shrink-leader-test";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chatted about this test @gwbrown. The stopping of ilm in this test can be problematic for other tests that run after it. First all tests are executed as part of the leader cluster test task and then as part of the follower cluster test task (this while leader test cluster is running). The other which the test methods are executed is randomized by the test framework.

In order to avoid this problem the entire test can be ran just against the follow cluster test task, like is done below.
Also what also is possible is to remove the stop and start ilm calls and in instead setup the leader index first in leader cluster, execute put follow in follower cluster and then create the leader ilm policy.

if ("leader".equals(targetCluster)) {
            return;
        } else if ("follow".equals(targetCluster) == false) {
            throw  new AssertionError("unexpected target cluster [" + targetCluster + "]");
        }

        String indexName = "shrink-leader-test";
        String shrunkenIndexName = "shrink-" + indexName;
        String policyName = "shrink-leader-test-policy";
        try (RestClient leaderClient = buildLeaderClient()) {

            putShrinkOnlyPolicy(leaderClient, policyName);

            // Stop ILM to let us set up the follower before the policy swings into action and shrinks the index
            stopILM(client());
            assertBusy(() -> assertEquals("STOPPED", ilmStatus(client())));
            Settings indexSettings = Settings.builder()
                .put("index.soft_deletes.enabled", true)
                .put("index.number_of_shards", 2)
                .put("index.number_of_replicas", 0)
                .put("index.lifecycle.name", policyName)
                .build();
            createIndex(indexName, indexSettings, "", "", leaderClient); // <-- method does not yet exist
            ensureGreen(indexName, leaderClient); // <-- method does not yet exist

            // Policy with the same name must exist in follower cluster too:
            putUnfollowOnlyPolicy(client(), policyName);
            followIndex(indexName, indexName);
            ensureGreen(indexName);

            startILM(leaderClient);
            index(leaderClient, indexName, "1");
            assertDocumentExists(leaderClient, indexName, "1");

            assertBusy(() -> {
                assertDocumentExists(client(), indexName, "1");
                // Sanity check that following_index setting has been set, so that we can verify later that this setting has been unset:
                assertThat(getIndexSetting(client(), indexName, "index.xpack.ccr.following_index"), equalTo("true"));

                // We should get into a state with these policies where both leader and followers are waiting on each other
                assertILMPolicy(leaderClient, indexName, policyName, "warm", "shrink", "wait-for-shard-history-leases");
                assertILMPolicy(client(), indexName, policyName, "hot", "unfollow", "wait-for-indexing-complete");
            });

            // Manually set this to kick the process
            updateIndexSettings(leaderClient, indexName, Settings.builder()
                .put("index.lifecycle.indexing_complete", true)
                .build()
            );

            assertBusy(() -> {
                // The shrunken index should now be created on the leader...
                Response shrunkenIndexExistsResponse = leaderClient.performRequest(new Request("HEAD", "/" + shrunkenIndexName));
                assertEquals(RestStatus.OK.getStatus(), shrunkenIndexExistsResponse.getStatusLine().getStatusCode());

                // And both of these should now finish their policies
                assertILMPolicy(leaderClient, shrunkenIndexName, policyName, "completed");
                assertILMPolicy(client(), indexName, policyName, "completed");
            });
        }

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left three really minor comments, otherwise LGTM

import java.util.Arrays;
import java.util.Objects;

public class WaitForNoFollowersStep extends AsyncWaitStep {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadoc please?

private final String message;

Info() {
this.message = "this index is a leader index; waiting for all following indices to cease following before proceeding";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can probably be a static final field since it's hardcoded?

public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
Info info = (Info) o;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be return true; since the messages are never different?

@gwbrown
Copy link
Contributor Author

gwbrown commented Feb 1, 2019

@elasticmachine run elasticsearch-ci/default-distro

* elastic/master: (54 commits)
  Introduce retention leases versioning (elastic#37951)
  Correctly disable tests for FIPS JVMs (elastic#38214)
  AwaitsFix testAbortedSnapshotDuringInitDoesNotStart (elastic#38227)
  Preserve ILM operation mode when creating new lifecycles (elastic#38134)
  Enable trace log in FollowerFailOverIT (elastic#38148)
  SnapshotShardsService Simplifications (elastic#38025)
  Default include_type_name to false in the yml test harness. (elastic#38058)
  Disable bwc preparing to backport of#37977, elastic#37857 and elastic#37872 (elastic#38126)
  Adding ml_settings entry to HLRC and Docs for deprecation_info (elastic#38118)
  Replace awaitBusy with assertBusy in atLeastDocsIndexed (elastic#38190)
  Adjust SearchRequest version checks (elastic#38181)
  AwaitsFix testClientSucceedsWithVerificationDisabled (elastic#38213)
  Zen2ify RareClusterStateIT (elastic#38184)
  ML: Fix error race condition on stop _all datafeeds and close _all jobs (elastic#38113)
  AwaitsFix PUT mapping with _doc on an index that has types (elastic#38204)
  Allow built-in monitoring_user role to call GET _xpack API (elastic#38060)
  Update geo_shape docs to include unsupported features (elastic#38138)
  [ML] Remove "8" prefixes from file structure finder timestamp formats (elastic#38016)
  Disable bwc tests while backporting elastic#38104 (elastic#38182)
  Enable TLSv1.3 by default for JDKs with support (elastic#38103)
  ...
@gwbrown
Copy link
Contributor Author

gwbrown commented Feb 2, 2019

Thanks for the reviews and thanks for handling the merge conflict @jasontedor!

@gwbrown gwbrown merged commit 7a1e89c into elastic:master Feb 2, 2019
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Feb 2, 2019
Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.
gwbrown added a commit that referenced this pull request Feb 4, 2019
Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 4, 2019
…round-sync-6.x

* elastic/6.x:
  Fix testRestoreIncreasesPrimaryTerms on 6.x (elastic#38314)
  SQL: Remove exceptions from Analyzer (elastic#38260) (elastic#38287)
  SQL: Move metrics tracking inside PlanExecutor (elastic#38259) (elastic#38288)
  Backport of elastic#38311: Move TokenService to seqno powered cas
  Handle scheduler exceptions (elastic#38183)
  Mute MlMigrationFullClusterRestartIT#testMigration (elastic#38316)
  6.x Backport of elastic#38278: Move ML Optimistic Concurrency Control to Seq No
  Cleanup construction of interceptors (elastic#38296)
  Throw if two inner_hits have the same name (elastic#37645) (elastic#38194)
  AsyncTwoPhaseIndexerTests race condition fixed elastic#38195 Backport#37830
  Enable SSL in reindex with security QA tests (elastic#38293)
  Ensure ILM policies run safely on leader indices  (elastic#38140)
  Introduce ssl settings to reindex from remote (elastic#38292)
  Fix ordering problem in add or renew lease test (elastic#38281)
  Mute ReplicationTrackerRetentionLeaseTests#testAddOrRenewRetentionLease (elastic#38276)
  Fix NPE in Logfile Audit Filter (elastic#38120) (elastic#38271)
  Enable trace log in FollowerFailOverIT (elastic#38148)
  SQL: Generate relevant error message when grouping functions are not used in GROUP BY (elastic#38017)
@jasontedor jasontedor mentioned this pull request Feb 4, 2019
24 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants