[Transform] Transform fails with task encountered irrecoverable failure: org.elasticsearch.index.IndexNotFoundException #81252

Closed
hendrikmuhs opened this issue Dec 2, 2021 · 4 comments · Fixed by #81368
hendrikmuhs commented Dec 2, 2021

Affected version: 7.15-
Fixed with: 7.16.1

A transform can fail with an IndexNotFoundException if its source is a data stream, index pattern, or alias that covers multiple indices and index lifecycle management (ILM) deletes old indices. The failure can occur when ILM deletes an index while the transform is updating documents.

This error is a regression introduced in 7.15; details are below. Running a transform on source indices that are managed by ILM is a supported setup.

Mitigation

Until a fix is available you have 3 options:

  • stay on a version older than 7.15 until a fix is available
  • don't use ILM to delete old data, or ensure the transform isn't running when ILM deletes an index
  • watch the transform and restart it after a failure:
POST _transform/{job_id}/_stop?force=true&wait_for_completion=true
POST _transform/{job_id}/_start

force=true is required for transforms that are in the failed state.
wait_for_completion=true makes the call wait until the transform has actually stopped; without it, the transform might not yet be stopped if a script calls _start immediately afterwards.
Use a _stats call to find out whether a transform is in the failed state.
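
The watch-and-restart option above could be scripted roughly as follows. This is a minimal sketch, not an official tool: it assumes an unsecured cluster at http://localhost:9200 (ES_URL is a placeholder), and the helper names http_request, is_failed and restart_if_failed are made up for illustration. It relies only on the _stats, _stop and _start calls mentioned above and on the state field of the _stats response.

```python
import json
import urllib.request

# Assumption: a local cluster without authentication; adjust for your setup.
ES_URL = "http://localhost:9200"


def http_request(method, path):
    """Send a request to Elasticsearch and return the decoded JSON body."""
    req = urllib.request.Request(ES_URL + path, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_failed(stats):
    """Return True if the first transform in a _stats response is failed."""
    transforms = stats.get("transforms", [])
    return bool(transforms) and transforms[0].get("state") == "failed"


def restart_if_failed(transform_id, request=http_request):
    """Force-stop and restart the transform if _stats reports it as failed.

    `request` is injectable so the logic can be exercised without a cluster.
    Returns True if a restart was issued, False otherwise.
    """
    stats = request("GET", f"/_transform/{transform_id}/_stats")
    if not is_failed(stats):
        return False
    # force=true is needed for failed transforms; wait_for_completion=true
    # blocks until the transform has really stopped before _start is called.
    request(
        "POST",
        f"/_transform/{transform_id}/_stop?force=true&wait_for_completion=true",
    )
    request("POST", f"/_transform/{transform_id}/_start")
    return True
```

Calling restart_if_failed("my-transform") periodically, for example from cron, would cover the third option; the transform id is a placeholder.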

hendrikmuhs added the >bug, needs:triage, and :ml/Transform labels Dec 2, 2021
elasticmachine added the Team:ML label Dec 2, 2021

elasticmachine (Collaborator) commented

Pinging @elastic/ml-core (Team:ML)

hendrikmuhs removed the Team:ML and needs:triage labels Dec 2, 2021
hendrikmuhs commented Dec 2, 2021

Bug

The use of a point in time (PIT) reader was introduced in 7.15 in #74984 as a performance optimization. A point in time reader is a search context that is created once and then reused. The error happens when an index gets deleted between calls that use the point in time reader. Transform treats this as a permanent error and sets the transform to failed.

Before it used a point in time reader, transform executed ordinary search requests, where the indices were resolved on every call. It seems, however, that ordinary search somehow handles an index that gets deleted in between internally.

A point in time reader is meant to be a "lightweight view into the state of the data as it existed when initiated".

This works fine if single documents are added, updated, or deleted, and a new index can be created as well, but it does not work if an index is deleted.

In the transform use case this is a benign problem: most likely the transform does not care about the index being deleted, because the deleted index likely does not contain data that the transform queries for. For example, for date_histogram groupings transform only updates buckets that have changed. The deleted index likely contains data from buckets that are not of interest, so the index gets pruned away early (can_match).

hendrikmuhs commented Dec 2, 2021

Solution

If point in time search is used, handle index not found exceptions and retry without a pit. On the next iteration a new pit context gets created.

In case further issues with pit are found, pit search can be disabled.
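
The retry described in this comment can be sketched as below. This is an illustrative simulation, not the actual transform code (which is Java): IndexNotFound is a stand-in for IndexNotFoundException, and search_with_pit_fallback and the search callable are hypothetical names introduced here.

```python
class IndexNotFound(Exception):
    """Stand-in for Elasticsearch's IndexNotFoundException (illustration only)."""


def search_with_pit_fallback(search, query, pit_id):
    """Run `search` with a PIT; on IndexNotFound retry once without it.

    `search(query, pit_id)` is a hypothetical callable; pit_id=None means an
    ordinary search where indices are resolved for this call.

    Returns (response, pit_id_to_keep). After a fallback the returned pit id
    is None, so the caller creates a fresh PIT context on the next iteration.
    """
    if pit_id is None:
        return search(query, None), None
    try:
        return search(query, pit_id), pit_id
    except IndexNotFound:
        # The PIT still references an index that has been deleted.
        # Drop the PIT and retry; if this second search also raises,
        # the error propagates to the caller.
        return search(query, None), None
```

The second search is deliberately not wrapped in another try block, so a failure that persists without a PIT still surfaces.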

hendrikmuhs commented Dec 2, 2021

Note: in addition to the issue with pit, transform does some index resolving itself in 7.16. This suffers from the same underlying issue. Until a better solution has been found, this should be disabled/reverted, too.

I am not quite sure how search handles this case. If an index gets deleted after index names are resolved but before shard search requests are sent, how does search handle this? It seems to use an IndexEventListener, which removes deleted indices from search contexts. However, it's not clear to me how this differs between explicit index names and index names resolved via wildcard. The former should create a failure; the latter can be ignored.
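
To make the distinction between explicit and wildcard sources concrete, here is a small sketch of the expected behaviour. It is a simplification and an assumption about the desired outcome, not how Elasticsearch resolves indices: it only looks at the `*` wildcard and ignores aliases, data streams, exclusions, and date math. The function name missing_index_is_ignorable is hypothetical.

```python
from fnmatch import fnmatchcase


def missing_index_is_ignorable(source_expressions, missing_index):
    """Decide whether a deleted index can be ignored for a given source.

    An index named explicitly in the source should cause a failure when it
    is missing; an index that was only selected through a wildcard pattern
    (e.g. logs-*) can be ignored.
    """
    for expr in source_expressions:
        if "*" not in expr and expr == missing_index:
            return False  # explicitly requested, so missing is an error
    return any(
        "*" in expr and fnmatchcase(missing_index, expr)
        for expr in source_expressions
    )
```

For example, a source of ["logs-*"] would tolerate the deletion of logs-2021.11.01, whereas a source that lists logs-2021.11.01 by name would not.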

hendrikmuhs pushed a commit that referenced this issue Dec 8, 2021
Do not fail the transform if pit search fails with index not found as a result of an index that got deleted via ILM, 
if that index is part of a search that selects indices using a wildcard, e.g. logs-*. If pit search fails, the search 
is retried using search without a pit context. The 2nd search might fail if the source targets an explicit index. 
In addition the usage of the pit API can now be disabled by transform.

fixes #81252
relates #81256
elasticsearchmachine pushed a commit that referenced this issue Dec 8, 2021
* [Transform] handle pit index not found error (#81368)

Do not fail the transform if pit search fails with index not found as a result of an index that got deleted via ILM, 
if that index is part of a search that selects indices using a wildcard, e.g. logs-*. If pit search fails, the search 
is retried using search without a pit context. The 2nd search might fail if the source targets an explicit index. 
In addition the usage of the pit API can now be disabled by transform.

fixes #81252
relates #81256

* Update SettingsConfig.java

adapt version

* Update build.gradle

re-enable BWC tests

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>