
ILM Make the check-rollover-ready step retryable #48256

Merged (32 commits) on Oct 31, 2019

Conversation

andreidan (Contributor) commented Oct 18, 2019

This adds the infrastructure to be able to retry the execution of retryable
steps and makes the `check-rollover-ready` step retryable as an initial step to
make the rollover action more resilient to transient errors.

This is the initial effort to tackle #44135

The open and close follower steps didn't check whether the index was already closed
or open, respectively, before executing the open/close request.
This changes the steps to check the index state and only perform the
open/close operation if the index is not already open/closed.
This adds the infrastructure to be able to retry the execution of retryable
steps up to a configurable number of times (controlled via the setting
`index.lifecycle.max_failed_step_retries_count`) and makes the
`check-rollover-ready` step retryable as an initial step to make the `rollover`
action more resilient to transient errors.
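
As a rough illustration, the per-index retry cap described above might be adjusted like the sketch below, in the same low-level REST client style used by the test snippets later in this thread. The setting name is taken from the description above; the index name "my-index" and the value 5 are illustrative, and the setting's final name and default may differ from what ends up merged.

// Hypothetical example: cap the automatic ILM step retries for a single index.
Request capRetries = new Request("PUT", "/my-index/_settings");
capRetries.setJsonEntity("{\n" +
    "  \"index.lifecycle.max_failed_step_retries_count\": 5\n" +
    "}");
client().performRequest(capRetries);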
@andreidan andreidan added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Oct 18, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@andreidan andreidan mentioned this pull request Oct 18, 2019
andreidan (Contributor, Author)

Created this as a draft to get some feedback on the progress.

What's left is adding the retry count and the flag signalling the transitive error to the ILM explain API and to the documentation (either as part of this PR or as a follow-up).

jasontedor (Member)

controlled via the setting index.lifecycle.max_failed_step_retries_count

What is the reasoning behind adding a setting for this? If we do add this, why is the default finite (currently, in the implementation, it's 15)?

andreidan (Contributor, Author) commented Oct 21, 2019

controlled via the setting index.lifecycle.max_failed_step_retries_count

What is the reasoning behind adding a setting for this? If we do add this, why is the default finite (currently, in the implementation, it's 15)?

@jasontedor we have integration tests that expect the hot phase to end up in the error state. With the rollover action being retryable, we'll flip the lifecycle execution back to the failed step when an error occurs, so the tests might not catch the error state. Controlling the number of retries is mostly for this particular use case, where we'll allow 1-2 retries in tests before the error state is reached. I don't think there's any production use case where a user would want to modify this setting.

Regarding whether it should be finite or not, I don't have a strong opinion; I'm just genuinely scared of unbounded loops. That said, the retries react to cluster state changes, so there is a bit of backoff involved there.

dakrone (Member) left a comment

Thanks for working on this Andrei, I have some comments about it.

With regard to the number of retries, I think if we move the retries onto the periodic interval (so every 10 minutes by default), we should change the setting's default to -1, meaning retry forever. That keeps the ability for a user to cap the number of retries, while the default keeps trying forever (albeit spaced out by 10 minutes between each retry).
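
As a rough illustration of that suggestion, a registration with a -1 "retry forever" default might look like the sketch below (the constant name, the Setting properties, and the interpretation of -1 are assumptions, not the merged implementation):

import org.elasticsearch.common.settings.Setting;

// Sketch: index-scoped, dynamically updatable setting whose default of -1 means "retry forever".
public static final Setting<Integer> MAX_FAILED_STEP_RETRIES_COUNT_SETTING =
    Setting.intSetting("index.lifecycle.max_failed_step_retries_count", -1, -1,
        Setting.Property.Dynamic, Setting.Property.IndexScope);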

return moveClusterStateToFailedStep(currentState, index, true);
}

private ClusterState moveClusterStateToFailedStep(ClusterState currentState, String index, boolean isRetry) {
Member

I think it's getting confusing having moveClusterStateToErrorStep and moveClusterStateToFailedStep. The names are very similar and it's difficult to tell at a glance how the two differ.

How about we rename this to something like moveClusterStateOutOfErrorStep, to make it clearer that one is entering the error state and the other is exiting the error state?

Member

Also, isRetry is an overloaded term since we have a "retry" API; maybe it should be isAutomaticRetry?

Contributor (Author)

Thinking about this rename some more, it seems to me that moveClusterStateToFailedStep is actually more descriptive than moveClusterStateOutOfErrorStep (seeing a method like this, I'd immediately ask myself "moving where?" and I'd have to look at what it does to answer that question).

I appreciate the number of moveClusterStateToX methods increased with this PR. The methods we currently have are moveClusterStateTo:

  • FailedStep
  • ErrorStep
  • NextStep
  • Step
  • RetryFailedStep - added in this PR but it's just an overload which I'll drop as there's no need for an extra term in the vocabulary

Unfortunately, I don't have a better suggestion to clean up the naming (probably because I've spent a lot of time in this code recently and it makes sense to me), so if moveClusterStateOutOfErrorStep makes more sense to you I'll rename it; I just wanted to express my concerns before doing so. @dakrone @gwbrown

gwbrown (Contributor) left a comment

Left some comments - I tried to deduplicate my comments that overlapped with Lee's, but may have missed some.

Another couple thoughts:

  1. I'm not sure transitive is the right adjective here as it doesn't clearly communicate what the intent is in the code. I think autoretryable or something may be a better choice.
  2. While I'm with you on infinite retries being a little scary, it does limit the usefulness of auto-retry in some cases. For example, if an index has a misconfigured alias (a common issue in the field), it would be nice if ILM could figure itself out automatically once the configuration issue is resolved. But the automatic retries will run out in 2.5 hours, which is a fairly short window for an administrator to notice and correct the issue. IMO it's worth thinking about whether we need to cap the retries at all and, even if we do in general, whether some errors can ignore the cap.

Comment on lines 35 to 46
if (indexMetaData.getState() == IndexMetaData.State.OPEN) {
    CloseIndexRequest closeIndexRequest = new CloseIndexRequest(followerIndex);
    getClient().admin().indices().close(closeIndexRequest, ActionListener.wrap(
        r -> {
            assert r.isAcknowledged() : "close index response is not acknowledged";
            listener.onResponse(true);
        },
        listener::onFailure)
    );
} else {
    listener.onResponse(true);
}
Contributor

This change can (and IMO should) be broken out into a separate PR that can be merged before this one.

@@ -104,6 +112,12 @@ static LifecycleExecutionState fromCustomMetadata(Map<String, String> customData
if (customData.containsKey(FAILED_STEP)) {
    builder.setFailedStep(customData.get(FAILED_STEP));
}
if (customData.containsKey(IS_TRANSITIVE_ERROR)) {
    builder.setIsTransitiveError(Boolean.parseBoolean(customData.get(IS_TRANSITIVE_ERROR)));
Contributor

This (and the one below it) should probably be wrapped in a try/catch to clarify the exception if the value isn't parseable, like the other fields in this method:

try {
    builder.setStepTime(Long.parseLong(customData.get(STEP_TIME)));
} catch (NumberFormatException e) {
    throw new ElasticsearchException("Custom metadata field [{}] does not contain a valid long. Actual value: [{}]",
        e, STEP_TIME, customData.get(STEP_TIME));
}
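
Applying the same pattern to the new integer retry-count field might look like the sketch below. The FAILED_STEP_RETRY_COUNT key and setFailedStepRetryCount setter are hypothetical names used here for illustration; note that Boolean.parseBoolean never throws, so the boolean flag itself wouldn't benefit from a catch block.

// Sketch only: FAILED_STEP_RETRY_COUNT and setFailedStepRetryCount are hypothetical names.
if (customData.containsKey(FAILED_STEP_RETRY_COUNT)) {
    try {
        builder.setFailedStepRetryCount(Integer.parseInt(customData.get(FAILED_STEP_RETRY_COUNT)));
    } catch (NumberFormatException e) {
        throw new ElasticsearchException("Custom metadata field [{}] does not contain a valid integer. Actual value: [{}]",
            e, FAILED_STEP_RETRY_COUNT, customData.get(FAILED_STEP_RETRY_COUNT));
    }
}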

Comment on lines 26 to 37
if (indexMetaData.getState() == IndexMetaData.State.CLOSE) {
    OpenIndexRequest request = new OpenIndexRequest(indexMetaData.getIndex().getName());
    getClient().admin().indices().open(request, ActionListener.wrap(
        r -> {
            assert r.isAcknowledged() : "open index response is not acknowledged";
            listener.onResponse(true);
        },
        listener::onFailure
    ));
} else {
    listener.onResponse(true);
}
Contributor

Again, I think this should be broken out into another PR along with the change to CloseFollowerIndexStep.

Comment on lines 113 to 136
public void testCloseFollowerIndexIsNoopForAlreadyClosedIndex() {
    IndexMetaData indexMetadata = IndexMetaData.builder("follower-index")
        .settings(settings(Version.CURRENT).put(LifecycleSettings.LIFECYCLE_INDEXING_COMPLETE, "true"))
        .putCustom(CCR_METADATA_KEY, Collections.emptyMap())
        .state(IndexMetaData.State.CLOSE)
        .numberOfShards(1)
        .numberOfReplicas(0)
        .build();
    Client client = Mockito.mock(Client.class);
    CloseFollowerIndexStep step = new CloseFollowerIndexStep(randomStepKey(), randomStepKey(), client);
    step.performAction(indexMetadata, null, null, new AsyncActionStep.Listener() {
        @Override
        public void onResponse(boolean complete) {
            assertThat(complete, is(true));
        }

        @Override
        public void onFailure(Exception e) {
        }
    });

    Mockito.verifyZeroInteractions(client);
}

Contributor

Same deal - break this out into a separate PR.

@@ -52,6 +52,8 @@ private static IndexLifecycleExplainResponse randomManagedIndexExplainResponse()
stepNull ? null : randomAlphaOfLength(10),
stepNull ? null : randomAlphaOfLength(10),
randomBoolean() ? null : randomAlphaOfLength(10),
stepNull ? null : randomBoolean(),
stepNull ? null : randomInt(15),
Contributor

Because this isn't arbitrary (as opposed to, for example, the 10 in randomAlphaOfLength(10) above) this should be made a constant instead of a magic number.
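
A minimal sketch of what that might look like (the constant name is illustrative):

// Illustrative sketch: give the upper bound a name instead of a bare 15.
private static final int MAX_FAILED_STEP_RETRY_COUNT_BOUND = 15;

// ... later, in randomManagedIndexExplainResponse():
stepNull ? null : randomInt(MAX_FAILED_STEP_RETRY_COUNT_BOUND),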

Contributor (Author)

This is somewhat arbitrary (I set it to the previous, rather arbitrary, retry count I had configured). I'll change it to 10 so it doesn't stand out from the bunch (especially as the retry count default will be infinite).

Comment on lines 300 to 306
Request updateLifecylePollSetting = new Request("PUT", "_cluster/settings");
updateLifecylePollSetting.setJsonEntity("{" +
" \"transient\": {\n" +
" \"indices.lifecycle.poll_interval\" : \"1s\" \n" +
" }\n" +
"}");
client().performRequest(updateLifecylePollSetting);
Contributor

This (and the later one) shouldn't be necessary as we set this via Gradle for this test suite:

setting 'indices.lifecycle.poll_interval', '1000ms'

Comment on lines 181 to 191
clusterService.submitStateUpdateTask("ilm-retry-failed-step", new ClusterStateUpdateTask() {
    @Override
    public ClusterState execute(ClusterState currentState) {
        return moveClusterStateToRetryFailedStep(currentState, index);
    }

    @Override
    public void onFailure(String source, Exception e) {
        logger.error("retry execution of step [{}] failed due to [{}]", failedStep.getKey().getName(), e);
    }
});
Contributor

Doesn't this update task need a clusterStateProcessed() method that calls maybeRunAsyncAction? Like here:

public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
    for (String index : request.indices()) {
        IndexMetaData idxMeta = newState.metaData().index(index);
        LifecycleExecutionState lifecycleState = LifecycleExecutionState.fromIndexMetadata(idxMeta);
        StepKey retryStep = new StepKey(lifecycleState.getPhase(), lifecycleState.getAction(), lifecycleState.getStep());
        if (idxMeta == null) {
            // The index has somehow been deleted - there shouldn't be any opportunity for this to happen, but just in case.
            logger.debug("index [" + index + "] has been deleted after moving to step [" +
                lifecycleState.getStep() + "], skipping async action check");
            return;
        }
        indexLifecycleService.maybeRunAsyncAction(newState, idxMeta, retryStep);

Or am I missing where we call maybeRunAsyncAction?

Contributor (Author)

We'll follow up on running async steps (#48010 (comment)).


@Override
public void onFailure(String source, Exception e) {
logger.error("retry execution of step [{}] failed due to [{}]", failedStep.getKey().getName(), e);
Contributor

This log message should specify the index in question.
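
For instance, an index-aware message could look like the sketch below (assuming the index name is in scope as index, and using Log4j2's ParameterizedMessage so the exception is logged as a throwable with its stack trace rather than formatted into the message):

import org.apache.logging.log4j.message.ParameterizedMessage;

// Sketch: include the index name and pass the exception as the throwable argument.
logger.error(new ParameterizedMessage("retry execution of step [{}] for index [{}] failed",
    failedStep.getKey().getName(), index), e);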

Step failedStep = stepRegistry.getStep(indexMetaData, new StepKey(lifecycleState.getPhase(), lifecycleState.getAction(),
    lifecycleState.getFailedStep()));
if (failedStep == null) {
    logger.warn("failed step [{}] is not part of policy [{}] anymore, or it is invalid. skipping execution",
Contributor

This log message should specify the index in question.

@andreidan andreidan added the WIP label Oct 28, 2019
andreidan (Contributor, Author)

@gwbrown @dakrone thanks for the review. I'll move on to implementing your suggestions. I'll create a separate PR for the open/close follower steps changes (I made them to make the steps a bit more resilient, as the API calls would fail when executed against a read-only index, for example, and we currently execute them regardless of whether the index is already in the desired state, i.e. open or closed).

andreidan (Contributor, Author)

@elasticmachine run elasticsearch-ci/bwc

andreidan (Contributor, Author)

@elasticmachine update branch

andreidan (Contributor, Author)

@elasticmachine retest this please

dakrone added a commit to dakrone/elasticsearch that referenced this pull request Oct 31, 2019
This fixes the version checks to use 7.6.0 instead of 8.0.0 since the
changes in elastic#48256 were backported to the 7.x branch.
dakrone added a commit that referenced this pull request Oct 31, 2019
This fixes the version checks to use 7.6.0 instead of 8.0.0 since the
changes in #48256 were backported to the 7.x branch.
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Oct 31, 2019
This test used an index without an alias to simulate a failure in the
`check-rollover-ready` step. However, with elastic#48256 that step
automatically retries, meaning that the index may not always be in
the ERROR step.

This commit changes the test to use a shrink action with an invalid
number of shards so that it stays in the ERROR step.

Resolves elastic#48767
dakrone added a commit that referenced this pull request Oct 31, 2019
This test used an index without an alias to simulate a failure in the
`check-rollover-ready` step. However, with #48256 that step
automatically retries, meaning that the index may not always be in
the ERROR step.

This commit changes the test to use a shrink action with an invalid
number of shards so that it stays in the ERROR step.

Resolves #48767
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Nov 14, 2019
The rollover action is now a retryable step (see elastic#48256)
so ILM will keep retrying until it succeeds as opposed to stopping and
moving the execution in the ERROR step.

Fixes elastic#49073
andreidan added a commit that referenced this pull request Nov 15, 2019
The rollover action is now a retryable step (see #48256)
so ILM will keep retrying until it succeeds as opposed to stopping and
moving the execution in the ERROR step.

Fixes #49073
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Nov 15, 2019
The rollover action is now a retryable step (see elastic#48256)
so ILM will keep retrying until it succeeds as opposed to stopping and
moving the execution in the ERROR step.

Fixes elastic#49073

(cherry picked from commit 3ae9089)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

# Conflicts:
#	x-pack/plugin/ilm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java
andreidan added a commit that referenced this pull request Nov 15, 2019
The rollover action is now a retryable step (see #48256)
so ILM will keep retrying until it succeeds as opposed to stopping and
moving the execution in the ERROR step.

Fixes #49073

(cherry picked from commit 3ae9089)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

# Conflicts:
#	x-pack/plugin/ilm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Feb 13, 2020
Relates: #4341, elastic/elasticsearch#48256

This commit adds FailedStepRetryCount and IsAutoRetryableError
properties to ILM explain, applicable if a step fails.
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Feb 19, 2020
Relates: #4341, elastic/elasticsearch#48256

This commit adds FailedStepRetryCount and IsAutoRetryableError
properties to ILM explain, applicable if a step fails.
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Feb 19, 2020
Relates: #4341, elastic/elasticsearch#48256

This commit adds FailedStepRetryCount and IsAutoRetryableError
properties to ILM explain, applicable if a step fails.

(cherry picked from commit 4dc0e83)
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Feb 23, 2020
Relates: #4341, elastic/elasticsearch#48256

This commit adds FailedStepRetryCount and IsAutoRetryableError
properties to ILM explain, applicable if a step fails.

(cherry picked from commit 4dc0e83)