[ML] Continuous data frame should be more robust to new and deleted indices #43992

Closed
sophiec20 opened this issue Jul 4, 2019 · 6 comments · Fixed by #44344
@sophiec20
Contributor

sophiec20 commented Jul 4, 2019

Found in 7.3.0 "build_hash" : "f8fd432", "build_date" : "2019-07-03T15:05:06.452272Z",

3 node cluster.
Index template for temp-* has 3 shards and 1 replica.
A new index matching temp-100? is created every 12 seconds, each with a bulk upload of 4000 documents.

Polling GET _data_frame/transforms/blah*/_stats intermittently produces checkpoint exceptions. The UI transform list shows these as generic "server error 500" toast messages, provided the page refresh cycle coincides with a failure.

Index temp_1013 has just been created. There is a short window during which this index's health is yellow. It might also be that the replica is not yet ready (I am not sure whether health is considered yellow in that case).

  "node_failures": [
    {
      "type": "failed_node_exception",
      "reason": "Failed to retrieve checkpointing info",
      "node_id": "qMS4vptxxxkr7baDqqqq",
      "caused_by": {
        "type": "checkpoint_exception",
        "reason": "checkpoint_exception: Failure during source checkpoint info retrieval",
        "caused_by": {
          "type": "index_not_found_exception",
          "reason": "no such index [temp_1013]",
          "index_uuid": "_na_",
          "resource.type": "index_or_alias",
          "resource.id": "temp_1013",
          "index": "temp_1013"
        }
      }
    },
    {
      "type": "failed_node_exception",
      "reason": "Failed to retrieve checkpointing info",
      "node_id": "qMS4vptxxxkr7baDqqqq",
      "caused_by": {
        "type": "checkpoint_exception",
        "reason": "checkpoint_exception: Failure during source checkpoint info retrieval",
        "caused_by": {
          "type": "null_pointer_exception",
          "reason": null
        }
      }
    }

The Elasticsearch logs contained repeated messages:

[2019-07-04T17:32:36,894][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check
[2019-07-04T17:33:07,026][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check
[2019-07-04T17:35:59,204][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check
[2019-07-04T17:38:01,976][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check

Expected behavior
New source index creation is likely for continuous data frames. Continuous data frames should be tolerant of this.

sophiec20 added the >bug, :ml (Machine learning), :ml/Transform (Transform), and v7.3.0 labels on Jul 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@sophiec20
Contributor Author

Full exception from "failure in update check"

[2019-07-05T00:00:58,252][ERROR][o.e.x.d.t.DataFrameTransformTask] [node1] failure in update check
java.lang.NullPointerException: null
        at org.elasticsearch.xpack.dataframe.checkpoint.DataFrameTransformsCheckpointService.extractIndexCheckPoints(DataFrameTransformsCheckpointService.java:236) ~[?:?]
        at org.elasticsearch.xpack.dataframe.checkpoint.DataFrameTransformsCheckpointService.lambda$getCheckpoint$0(DataFrameTransformsCheckpointService.java:115) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:383) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:352) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:324) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:314) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:216) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:141) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]
        at java.lang.Thread.run(Thread.java:835) [?:?]

@sophiec20
Contributor Author

Furthermore, if the setup also periodically deletes trailing indices that match the pattern, checkpoint progress fails to move forward. "operations_behind" : -1 is reported, which seems to be what stops progress.

In the most recent test run, progress stalled in the high 99.x% range.

@sophiec20
Contributor Author

Additional exception snippet. This occurs less frequently.

[2019-07-05T07:31:08,381][ERROR][o.e.x.d.t.DataFrameTransformTask] [node1] failure in update check
org.elasticsearch.transport.RemoteTransportException: [node2][127.0.0.1:9352][indices:admin/get]
Caused by: org.elasticsearch.index.IndexNotFoundException: no such index [gallery-temp_4639]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.indexNotFoundException(IndexNameExpressionResolver.java:761) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.innerResolve(IndexNameExpressionResolver.java:713) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.resolve(IndexNameExpressionResolver.java:669) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:163) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:142) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:75) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.get.TransportGetIndexAction.checkBlock(TransportGetIndexAction.java:77) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.get.TransportGetIndexAction.checkBlock(TransportGetIndexAction.java:50) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:170) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:161) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:138) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:58) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:145) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:86) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:172) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$15(AuthorizationService.java:341) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:117) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.StepListener.onResponse(StepListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.interceptor.FieldAndDocumentLevelSecurityRequestInterceptor.intercept(FieldAndDocumentLevelSecurityRequestInterceptor.java:61) ~[?:?]
        at org.elasticsearch.xpack.security.authz.interceptor.SearchRequestInterceptor.intercept(SearchRequestInterceptor.java:19) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:336) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:117) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.StepListener.onResponse(StepListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.interceptor.BulkShardRequestInterceptor.intercept(BulkShardRequestInterceptor.java:71) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:336) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:117) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.StepListener.onResponse(StepListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.interceptor.FieldAndDocumentLevelSecurityRequestInterceptor.intercept(FieldAndDocumentLevelSecurityRequestInterceptor.java:61) ~[?:?]
        at org.elasticsearch.xpack.security.authz.interceptor.UpdateRequestInterceptor.intercept(UpdateRequestInterceptor.java:23) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:336) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
...

sophiec20 changed the title from "[ML] Data frame failed to retrieve checkpointing info exceptions on newly created indices" to "[ML] Continuous data frame should be more robust to new and deleted indices" on Jul 5, 2019
jpountz removed the v7.3.0 and :ml (Machine learning) labels on Jul 5, 2019
@droberts195
Contributor

droberts195 commented Jul 5, 2019

At a code level I think the problems observed are:

  1. DataFrameTransformsCheckpointService.getCheckpoint makes a GetIndexRequest, then for each index found it makes an IndicesStatsRequest. Between the two calls indices can be deleted. Therefore an IndexNotFoundException in response to any single IndicesStatsRequest should just be treated as though that index didn't exist in the original GetIndexRequest result.
  2. In DataFrameTransformsCheckpointService.extractIndexCheckPoints calls are made to shard.getSeqNoStats(). This can return null, and that is the cause of the NPE. A check for this situation has recently been added, so the NPE will no longer occur. However, a different exception is now thrown instead, and it aborts the entire checkpoint. This is also unfriendly to transforms whose input index pattern resolves to a dynamically changing set of indices, for example one managed by ILM.

We need to find a way to make checkpoints robust to indices entering or leaving the set of source indices. When an index enters or leaves the set it's reasonable to treat this as meaning there's been a change since the previous checkpoint. But for the indices that do still exist and are still open it's still possible to calculate checkpoint stats.
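
To make the two points above concrete, here is a minimal, self-contained sketch of a checkpoint extraction that tolerates both situations. It is not the Elasticsearch code: `TolerantCheckpointSketch`, `ShardStats`, `JUST_CREATED` and the method signature are hypothetical stand-ins, and using 0 for a shard without sequence number stats is an assumption. Indices that resolved earlier but returned no stats are simply left out of the result rather than failing the whole checkpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative only: hypothetical types, not the Elasticsearch API. */
final class TolerantCheckpointSketch {

    /** Stats for one shard copy; maxSeqNo is null when seq-no stats are not available yet. */
    static final class ShardStats {
        final String index;
        final int shardId;
        final Long maxSeqNo;

        ShardStats(String index, int shardId, Long maxSeqNo) {
            this.index = index;
            this.shardId = shardId;
            this.maxSeqNo = maxSeqNo;
        }
    }

    /** Assumed placeholder checkpoint for a shard that has no seq-no stats yet. */
    static final long JUST_CREATED = 0L;

    /**
     * Builds per-index checkpoints (one entry per shard, ordered by shard id).
     * - An index in resolvedIndices with no stats (e.g. deleted between the two
     *   requests) is omitted instead of aborting the checkpoint.
     * - A shard with null maxSeqNo is recorded as JUST_CREATED instead of throwing.
     */
    static Map<String, List<Long>> extractIndexCheckpoints(Set<String> resolvedIndices,
                                                           List<ShardStats> shardStats) {
        Map<String, TreeMap<Integer, Long>> byIndex = new TreeMap<>();
        for (ShardStats s : shardStats) {
            if (!resolvedIndices.contains(s.index)) {
                continue; // stats for an index that is not a source of this transform
            }
            long checkpoint = (s.maxSeqNo == null) ? JUST_CREATED : s.maxSeqNo;
            // Several copies of the same shard may report; keep the highest value (assumption).
            byIndex.computeIfAbsent(s.index, k -> new TreeMap<>())
                   .merge(s.shardId, checkpoint, Long::max);
        }
        Map<String, List<Long>> result = new TreeMap<>();
        for (Map.Entry<String, TreeMap<Integer, Long>> e : byIndex.entrySet()) {
            result.put(e.getKey(), new ArrayList<>(e.getValue().values()));
        }
        return result;
    }
}
```

With resolved indices temp_1011, temp_1012 and temp_1013, where temp_1012 returns no stats and temp_1013 has a shard without sequence number stats, the result contains temp_1011 with its real values, temp_1013 with the placeholder, and no entry for temp_1012. Comparing the key sets of two consecutive results would then be one way to detect that an index entered or left the source set.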

@droberts195
Contributor

droberts195 commented Jul 5, 2019

I discussed this with @hendrikmuhs. For 7.3 some simple bug fixes we could do are:

  1. Don't spam the log with huge stack traces when likely problems occur during checkpoint calculation, such as indices being created, closed or deleted. A single-line debug message would suffice when checkpoint calculation is complicated by these events.
  2. Alter the "has anything changed" check from "changes > 0" to "changes != 0" so that it treats "couldn't calculate the checkpoint" as a change. Or alternatively "changes > 0" could be altered to "changes > 0 for the indices that are currently searchable".
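
A rough sketch of what these two fixes could look like, using only the JDK. Everything here is hypothetical (`UpdateCheckSketch`, `ChangeCounter`, `countChangesQuietly`); in particular, mapping a failed calculation to -1 is an assumption based on the "operations_behind" : -1 value reported earlier in this issue, not a statement about how the transform task computes it.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative only: hypothetical names, not the Elasticsearch implementation. */
final class UpdateCheckSketch {
    private static final Logger LOGGER = Logger.getLogger(UpdateCheckSketch.class.getName());

    /** Assumed sentinel for "could not calculate", mirroring "operations_behind" : -1. */
    static final long UNKNOWN = -1L;

    /** Supplies the number of changes since the last checkpoint; may fail. */
    interface ChangeCounter {
        long count() throws Exception;
    }

    /** The check as described in the thread: a negative value is ignored. */
    static boolean changedStrict(long changes) {
        return changes > 0;
    }

    /** The proposed check: "couldn't calculate" (-1) also counts as a change. */
    static boolean changedTolerant(long changes) {
        return changes != 0;
    }

    /**
     * Runs the counter and, on failure, logs one debug-level line without a
     * stack trace and returns UNKNOWN instead of propagating the exception.
     */
    static long countChangesQuietly(ChangeCounter counter) {
        try {
            return counter.count();
        } catch (Exception e) {
            LOGGER.log(Level.FINE, "checkpoint calculation incomplete: {0}", e.getMessage());
            return UNKNOWN;
        }
    }
}
```

Under these assumptions, `changedStrict(-1)` is false while `changedTolerant(-1)` is true, so a transform whose source indices were created or deleted mid-check would still be scheduled to run rather than stalling.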

However, this also interacts quite heavily with solving the 65000 terms problem. So the timeline and mechanism for fixing that affects the decision of what to do about this problem.

hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Jul 15, 2019
- do not let checkpointing fail if indexes got deleted
- treat missing seqNoStats as just created indices (checkpoint 0)
- loglevel: do not treat failed updated checks as error

fixes elastic#43992
hendrikmuhs pushed a commit that referenced this issue Jul 16, 2019
make checkpointing more robust:

- do not let checkpointing fail if indexes got deleted
- treat missing seqNoStats as just created indices (checkpoint 0)
- loglevel: do not treat failed updated checks as error

fixes #43992