ILM: Make the check-rollover-ready step retryable #48256
Conversation
The open and close follower steps didn't check whether the index was already open or closed, respectively, before executing the open/close request. This changes the steps to check the index state and only perform the open/close operation if the index is not already in the desired state.
This adds the infrastructure to be able to retry the execution of retryable steps up to a configurable number of times (controlled via the setting `index.lifecycle.max_failed_step_retries_count`) and makes the `check-rollover-ready` step retryable as an initial step to make the `rollover` action more resilient to transitive errors.
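For reference, a setting like the one named above would typically be declared as an index-scoped, dynamically updatable `Setting`. A minimal sketch, assuming the default of 15 mentioned later in the thread; the class name and bounds are illustrative, not the PR's actual code:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;

public final class LifecycleSettingsSketch {
    // Illustrative declaration: dynamic so operators can change it on a live
    // index, index-scoped so it can vary per index.
    public static final Setting<Integer> MAX_FAILED_STEP_RETRIES_COUNT_SETTING =
        Setting.intSetting("index.lifecycle.max_failed_step_retries_count",
            15 /* default */, 0 /* minimum */, Property.Dynamic, Property.IndexScope);
}
```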
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
Created this as a draft to get some feedback on the progress. What's left is adding the retry count and the flag to signal the transitive error to the …
What is the reasoning behind adding a setting for this? If we do add this, why is the default finite (currently, in the implementation, it's 15)?
@jasontedor we have integration tests that expect the …

Regarding being finite or not, I don't have a strong opinion on this; I'm just genuinely scared of unbounded loops. The retries do react to cluster state changes, though, so there is a bit of backoff involved there.
Thanks for working on this, Andrei. I have some comments about it.
With regard to the number of retries, I think if we drop the retries to be on the periodic interval (so every 10 minutes by default), we should change the setting's default to -1, meaning try forever. This lets us keep the ability for a user to set retries not to run infinitely, but still have the default keep trying forever (albeit spaced out by 10 minutes between retries).
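For concreteness, the proposed "-1 means retry forever" semantics could look like this sketch (names are hypothetical, not the PR's code):

```java
// Hypothetical helper: -1 means unbounded retries (one attempt per periodic
// ILM run, every 10 minutes by default); a non-negative value is a hard cap.
final class RetryPolicySketch {
    static boolean shouldRetry(int configuredMaxRetries, int failedStepRetryCount) {
        if (configuredMaxRetries == -1) {
            return true;
        }
        return failedStepRetryCount < configuredMaxRetries;
    }
}
```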
Resolved review threads (outdated):
- x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/CloseFollowerIndexStep.java
- ...lugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/IndexLifecycleExplainResponse.java (four threads)
- x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java (three threads)
```java
    return moveClusterStateToFailedStep(currentState, index, true);
}

private ClusterState moveClusterStateToFailedStep(ClusterState currentState, String index, boolean isRetry) {
```
I think it's getting confusing having `moveClusterStateToErrorStep` and `moveClusterStateToFailedStep`. The names are very similar and it's difficult to tell at a glance how the two differ. How about we rename this to something like `moveClusterStateOutOfErrorStep`, so it's clearer that one is entering the error state and the other is exiting it?
Also, `isRetry` is an overloaded term since we have a "retry" API; maybe it should be `isAutomaticRetry`?
Thinking about this rename some more, it seems to me that `moveClusterStateToFailedStep` is actually more descriptive than `moveClusterStateOutOfErrorStep` (seeing a method like the latter, I'd immediately ask myself "moving where?" and I'd have to look at what it does to answer that question).

I appreciate the number of `moveClusterStateToX` methods increased with this PR. The methods we currently have are `moveClusterStateTo`:

- `FailedStep`
- `ErrorStep`
- `NextStep`
- `Step`
- `RetryFailedStep` - added in this PR, but it's just an overload which I'll drop as there's no need for an extra term in the vocabulary

Unfortunately, I don't have a better suggestion to clean up the naming (probably because I've spent a lot of time recently in this code and it makes sense to me), so if `moveClusterStateOutOfErrorStep` makes more sense to you I'll rename it; I just wanted to express my concerns before doing so. @dakrone @gwbrown
Resolved review thread (outdated): x-pack/plugin/ilm/src/test/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunnerTests.java
Left some comments - I tried to deduplicate my comments that overlapped with Lee's, but may have missed some.

Another couple of thoughts:

- I'm not sure `transitive` is the right adjective here, as it doesn't clearly communicate what the intent is in the code. I think `autoretryable` or something may be a better choice (see the sketch after this list).
- While I'm with you on infinite retries being a little scary, it does limit the usefulness of auto-retry in some cases. For example, if an index has a misconfigured alias (a common issue in the field), it would be nice if ILM could figure itself out automatically once the configuration issue is resolved. But the automatic retries will run out in 2.5 hours, which is a fairly short window for an administrator to notice and correct the issue. IMO it's worth thinking about whether we need to cap the retries, and even if we do in general, whether some errors can ignore the cap.
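To ground the "auto-retryable" discussion, per-step opt-in retryability could plausibly be expressed as a flag on the step base class, roughly as below. The class names mirror the real ones but the hierarchy is heavily simplified, and `isRetryable()` is an assumed name rather than a confirmed API:

```java
// Hypothetical, simplified hierarchy: steps opt in to automatic retries by
// overriding isRetryable(); the ILM runner consults it when a step fails.
abstract class Step {
    boolean isRetryable() {
        return false; // most steps are not automatically retried
    }
}

// check-rollover-ready opts in, making the rollover action resilient to
// transient failures such as a temporarily misconfigured alias.
class WaitForRolloverReadyStep extends Step {
    @Override
    boolean isRetryable() {
        return true;
    }
}
```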
```java
if (indexMetaData.getState() == IndexMetaData.State.OPEN) {
    CloseIndexRequest closeIndexRequest = new CloseIndexRequest(followerIndex);
    getClient().admin().indices().close(closeIndexRequest, ActionListener.wrap(
        r -> {
            assert r.isAcknowledged() : "close index response is not acknowledged";
            listener.onResponse(true);
        },
        listener::onFailure)
    );
} else {
    listener.onResponse(true);
}
```
This change can (and IMO should) be broken out into a separate PR that can be merged before this one.
@@ -104,6 +112,12 @@ static LifecycleExecutionState fromCustomMetadata(Map<String, String> customData

```java
    if (customData.containsKey(FAILED_STEP)) {
        builder.setFailedStep(customData.get(FAILED_STEP));
    }
    if (customData.containsKey(IS_TRANSITIVE_ERROR)) {
        builder.setIsTransitiveError(Boolean.parseBoolean(customData.get(IS_TRANSITIVE_ERROR)));
```
This (and the one below it) should probably be wrapped in a try/catch to clarify the exception if the value isn't parseable, like the other fields in this method (Lines 152 to 156 in 016d1c9):

```java
try {
    builder.setStepTime(Long.parseLong(customData.get(STEP_TIME)));
} catch (NumberFormatException e) {
    throw new ElasticsearchException("Custom metadata field [{}] does not contain a valid long. Actual value: [{}]",
        e, STEP_TIME, customData.get(STEP_TIME));
}
```
```java
if (indexMetaData.getState() == IndexMetaData.State.CLOSE) {
    OpenIndexRequest request = new OpenIndexRequest(indexMetaData.getIndex().getName());
    getClient().admin().indices().open(request, ActionListener.wrap(
        r -> {
            assert r.isAcknowledged() : "open index response is not acknowledged";
            listener.onResponse(true);
        },
        listener::onFailure
    ));
} else {
    listener.onResponse(true);
}
```
Again, I think this should be broken out into another PR along with the change to `CloseFollowerIndexStep`.
```java
public void testCloseFollowerIndexIsNoopForAlreadyClosedIndex() {
    IndexMetaData indexMetadata = IndexMetaData.builder("follower-index")
        .settings(settings(Version.CURRENT).put(LifecycleSettings.LIFECYCLE_INDEXING_COMPLETE, "true"))
        .putCustom(CCR_METADATA_KEY, Collections.emptyMap())
        .state(IndexMetaData.State.CLOSE)
        .numberOfShards(1)
        .numberOfReplicas(0)
        .build();
    Client client = Mockito.mock(Client.class);
    CloseFollowerIndexStep step = new CloseFollowerIndexStep(randomStepKey(), randomStepKey(), client);
    step.performAction(indexMetadata, null, null, new AsyncActionStep.Listener() {
        @Override
        public void onResponse(boolean complete) {
            assertThat(complete, is(true));
        }

        @Override
        public void onFailure(Exception e) {
        }
    });

    Mockito.verifyZeroInteractions(client);
}
```
Same deal - break this out into a separate PR.
@@ -52,6 +52,8 @@ private static IndexLifecycleExplainResponse randomManagedIndexExplainResponse()

```java
    stepNull ? null : randomAlphaOfLength(10),
    stepNull ? null : randomAlphaOfLength(10),
    randomBoolean() ? null : randomAlphaOfLength(10),
    stepNull ? null : randomBoolean(),
    stepNull ? null : randomInt(15),
```
Because this isn't arbitrary (as opposed to, for example, the `10` in `randomAlphaOfLength(10)` above), this should be made a constant instead of a magic number.
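The suggestion amounts to something like this fragment (the constant name is hypothetical):

```java
// Hypothetical named constant replacing the magic number in the test.
private static final int MAX_RANDOM_RETRY_COUNT = 15;
// ...
stepNull ? null : randomInt(MAX_RANDOM_RETRY_COUNT),
```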
This is somewhat arbitrary (I set it to the previous, rather arbitrary, retry count I configured). I'll change it to 10 so it doesn't stand out from the bunch (especially as the retry count default will be infinite).
```java
Request updateLifecyclePollSetting = new Request("PUT", "_cluster/settings");
updateLifecyclePollSetting.setJsonEntity("{" +
    " \"transient\": {\n" +
    "  \"indices.lifecycle.poll_interval\" : \"1s\" \n" +
    " }\n" +
    "}");
client().performRequest(updateLifecyclePollSetting);
```
This (and the later one) shouldn't be necessary, as we set this via Gradle for this test suite:

```groovy
setting 'indices.lifecycle.poll_interval', '1000ms'
```
Resolved review thread (outdated): x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java
```java
clusterService.submitStateUpdateTask("ilm-retry-failed-step", new ClusterStateUpdateTask() {
    @Override
    public ClusterState execute(ClusterState currentState) {
        return moveClusterStateToRetryFailedStep(currentState, index);
    }

    @Override
    public void onFailure(String source, Exception e) {
        logger.error("retry execution of step [{}] failed due to [{}]", failedStep.getKey().getName(), e);
    }
});
```
Doesn't this update task need a `clusterStateProcessed()` method that calls `maybeRunAsyncAction`? Like here (Lines 69 to 80 in 96b4f3d):

```java
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
    for (String index : request.indices()) {
        IndexMetaData idxMeta = newState.metaData().index(index);
        LifecycleExecutionState lifecycleState = LifecycleExecutionState.fromIndexMetadata(idxMeta);
        StepKey retryStep = new StepKey(lifecycleState.getPhase(), lifecycleState.getAction(), lifecycleState.getStep());
        if (idxMeta == null) {
            // The index has somehow been deleted - there shouldn't be any opportunity for this to happen, but just in case.
            logger.debug("index [" + index + "] has been deleted after moving to step [" +
                lifecycleState.getStep() + "], skipping async action check");
            return;
        }
        indexLifecycleService.maybeRunAsyncAction(newState, idxMeta, retryStep);
```

Or am I missing where we call `maybeRunAsyncAction`?
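What the reviewer is asking for would look roughly like the following addition to the retry update task quoted earlier. This is a sketch under the assumption that the retry task mirrors the quoted move-to-step task; the exact names and how `maybeRunAsyncAction` is reached are assumptions:

```java
// Sketch: after the retried step has been written to the cluster state,
// kick off its async action (if any), mirroring the quoted snippet above.
@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
    IndexMetaData newIndexMeta = newState.metaData().index(index); // `index` from the enclosing task
    if (newIndexMeta == null) {
        return; // the index was deleted between submitting and processing the task
    }
    LifecycleExecutionState lifecycleState = LifecycleExecutionState.fromIndexMetadata(newIndexMeta);
    StepKey retryStep = new StepKey(lifecycleState.getPhase(), lifecycleState.getAction(), lifecycleState.getStep());
    // Or indexLifecycleService.maybeRunAsyncAction(...), depending on where the task lives.
    maybeRunAsyncAction(newState, newIndexMeta, retryStep);
}
```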
We'll follow up on running async steps (#48010 (comment)).
```java
@Override
public void onFailure(String source, Exception e) {
    logger.error("retry execution of step [{}] failed due to [{}]", failedStep.getKey().getName(), e);
}
```
This log message should specify the index in question.
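A possible shape for the fix, using the log4j `ParameterizedMessage` pattern that is common in this codebase (the exact wording and the `index` variable are illustrative):

```java
// Include the index name in the message and pass the exception as the
// throwable argument so its stack trace is logged, instead of letting it
// fill a "{}" placeholder as the current code does.
logger.error(new ParameterizedMessage("retry execution of step [{}] for index [{}] failed",
    failedStep.getKey().getName(), index), e);
```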
```java
Step failedStep = stepRegistry.getStep(indexMetaData, new StepKey(lifecycleState.getPhase(), lifecycleState.getAction(),
    lifecycleState.getFailedStep()));
if (failedStep == null) {
    logger.warn("failed step [{}] is not part of policy [{}] anymore, or it is invalid. skipping execution",
```
This log message should specify the index in question.
@gwbrown @dakrone thanks for the review. I'll move on to implementing your suggestions. I'll create a separate PR for the open/close follower step changes (I made them to make the steps a bit more resilient: the API calls would fail when executed against a read-only index, for example, and we executed them regardless of whether the index was already in the desired state, i.e. open or closed).
@elasticmachine run elasticsearch-ci/bwc
@elasticmachine update branch
@elasticmachine retest this please
This fixes the version checks to use 7.6.0 instead of 8.0.0 since the changes in #48256 were backported to the 7.x branch.

This test used an index without an alias to simulate a failure in the `check-rollover-ready` step. However, with #48256 that step automatically retries, meaning that the index may not always be in the ERROR step. This commit changes the test to use a shrink action with an invalid number of shards so that it stays in the ERROR step. Resolves #48767

The rollover action is now a retryable step (see #48256) so ILM will keep retrying until it succeeds, as opposed to stopping and moving the execution to the ERROR step. Fixes #49073 (cherry picked from commit 3ae9089)

Relates: #4341, elastic/elasticsearch#48256. This commit adds FailedStepRetryCount and IsAutoRetryableError properties to ILM explain, applicable if a step fails. (cherry picked from commit 4dc0e83)
This adds the infrastructure to be able to retry the execution of retryable steps and makes the `check-rollover-ready` step retryable as an initial step to make the `rollover` action more resilient to transient errors. This is the initial effort to tackle #44135.