Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] onShardResult and onShardFailure are executed on one shard causes opensearch jvm crashed #11881

Closed
kkewwei opened this issue Jan 14, 2024 · 9 comments · Fixed by #12158
Closed
Assignees
Labels
bug Something isn't working Search Search query, autocomplete ...etc Severity-Major v2.13.0 Issues and PRs related to version 2.13.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@kkewwei
Copy link
Contributor

kkewwei commented Jan 14, 2024

Describe the bug

In #3626, we add onPhaseFailure trying to catch the exception throwing from onPhaseDone, it can avoid jvm crash in some case.

But onPhaseFailure may throw exception, we encountered the new case in os2.9.0, and see from the stack that task cancellation in multisearch throws TaskCancelledException in the onPhaseFailure, which causes onShardResult and onShardFailure are executed on one shard, then lead to jvm crashed.

[2024-01-05T17:06:29,624+0800][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [node0] fatal error in thread [Thread-17309], exiting
java.lang.AssertionError: unexpected higher total ops [21] compared to expected [20]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:480) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:295) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:104) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:755) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.TransportService$6.handleException(TransportService.java:884) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:379) ~[?:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1504) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundHandler.handleException(InboundHandler.java:418) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:395) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:385) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:160) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:114) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:769) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:94) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1383) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1246) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1295) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.action.search.SearchPhaseExecutionException: Shard failures
	... 48 more
Caused by: org.opensearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:666) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:345) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.support.TransportAction.registerChildNode(TransportAction.java:76) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.support.TransportAction.execute(TransportAction.java:95) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:110) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:97) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:476) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.support.AbstractClient.search(AbstractClient.java:607) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.TransportMultiSearchAction.executeSearch(TransportMultiSearchAction.java:180) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:203) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.TransportMultiSearchAction$1.onFailure(TransportMultiSearchAction.java:188) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:122) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener$5.onFailure(ActionListener.java:275) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:691) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:665) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:584) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResultConsumed(AbstractSearchAsyncAction.java:569) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$onShardResult$9(AbstractSearchAsyncAction.java:552) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.QueryPhaseResultConsumer$PendingMerges.consume(QueryPhaseResultConsumer.java:373) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:132) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:552) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:159) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:286) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:59) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:44) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:99) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:52) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:746) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:878) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:373) ~[?:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1494) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.9.0.jar:2.9.0]
	... 36 more

Expected behavior

We can see the root case is that TransportAction.execute unexpectedly threw an TaskCancelledException, if we should do as follows.

 public final Task execute(Request request, ActionListener<Response> listener) {
        try {
               .....
               return task;
        } catch(Exception e) {
              listener.onFailure(e)
        }
        return null;
}

I'm pleasure to fix it.

Related component

Search

To Reproduce

If needed or unclear, I will try to reproduce it.

Additional Details

Host/Environment (please complete the following information):

  • OS: Linux version 5.15.43,GNU ld (GNU Binutils for Debian) 2.28)
  • Version os2.9.0
@kkewwei kkewwei added bug Something isn't working untriaged labels Jan 14, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Jan 14, 2024
@kkewwei kkewwei changed the title [BUG] onShardResult and onShardFailure are executed on one shard causes opensearch jvm crash [BUG] onShardResult and onShardFailure are executed on one shard causes opensearch jvm crashed Jan 14, 2024
@peternied
Copy link
Member

[Triage - attendees 1 2 3 4]
Thanks for filing

@peternied
Copy link
Member

@reta Sounds like you have experience with this kind of issue, can you take a look?

@reta reta self-assigned this Jan 17, 2024
@reta
Copy link
Collaborator

reta commented Jan 17, 2024

@kkewwei by any chance, do you have a reproducer for this issue? or at least some hints how to reproduce it? thank you.

@kkewwei
Copy link
Contributor Author

kkewwei commented Jan 18, 2024

@reta , I will try to reproducer for this issue recently, if possible, can you assign it to me, I'm pleasure try to fix it.

@reta reta assigned kkewwei and unassigned reta Jan 18, 2024
@kkewwei
Copy link
Contributor Author

kkewwei commented Feb 3, 2024

@reta It's can be reproduced changing the test code:

               if (failExecutePhaseOnShard) {
                    listener.onFailure(new ShardNotFoundException(shardIt.shardId()));
                } else {
                    try {
                        listener.onResponse(new QuerySearchResult());
                    } catch (Exception e) {
                        listener.onFailure(e);
                    }
                }

then you run the unit test AbstractSearchAsyncActionTests.testExecutePhaseOnShardFailure(), it will throw the fatal exception.

@reta
Copy link
Collaborator

reta commented Feb 3, 2024

@reta It's can be reproduced changing the test code:

               if (failExecutePhaseOnShard) {
                    listener.onFailure(new ShardNotFoundException(shardIt.shardId()));
                } else {
                    try {
                        listener.onResponse(new QuerySearchResult());
                    } catch (Exception e) {
                        listener.onFailure(e);
                    }
                }

then you run the unit test AbstractSearchAsyncActionTests.testExecutePhaseOnShardFailure(), it will throw the fatal exception.

Thanks @kkewwei , I am wondering if that what happens in the wild: the listener usually being called only once onResponse or onFailure, that may happen we have a loop somewhere in the code, would be great to pin point it.

@kkewwei
Copy link
Contributor Author

kkewwei commented Feb 4, 2024

@reta I think the core code is InboundHandler.doHandleResponse:

    private <T extends TransportResponse> void doHandleResponse(TransportResponseHandler<T> handler, T response) {
        try {
            // will call onResponse
            handler.handleResponse(response);
        } catch (Exception e) {
           // will call onFailure 
            handleException(handler, new ResponseHandlerFailureTransportException(e));
        }
    }

Here are the simplified steps, which also can be seen from the exception about:

Phase1:
InboundHandler.doHandleResponse->AbstractSearchAsyncAction.onShardResult
->AbstractSearchAsyncAction.successfulShardExecution->AbstractSearchAsyncAction.onPhaseDone, and it throws exception for unknown reason, which means that onResponse has been called.

Phase2:
Then come into AbstractSearchAsyncAction.onPhaseFailure->......->TransportMultiSearchAction.executeSearch, it throws TaskCancelledException:

Caused by: org.opensearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:666) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:345) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.support.TransportAction.registerChildNode(TransportAction.java:76) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.support.TransportAction.execute(TransportAction.java:95) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:110) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:97) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:476) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.client.support.AbstractClient.search(AbstractClient.java:607) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.TransportMultiSearchAction.executeSearch(TransportMultiSearchAction.java:180) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:203) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.TransportMultiSearchAction$1.onFailure(TransportMultiSearchAction.java:188) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:122) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener$5.onFailure(ActionListener.java:275) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:691) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:665) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:584) ~[opensearch-2.9.0.jar:2.9.0]

Phase3:
InboundHandler.doHandleResponse->InboundHandler.handleException->......->
AbstractSearchAsyncAction.onShardFailure

@kkewwei
Copy link
Contributor Author

kkewwei commented Feb 4, 2024

@reta, because it is most likely a bug, so I create pr in advance, pleas help review.

@reta
Copy link
Collaborator

reta commented Feb 5, 2024

@reta, because it is most likely a bug, so I create pr in advance, pleas help review.

Thank you, @kkewwei , I think the fix make sense (considering your findings) but we would need to add a test case that show cases and reproduces the problem, could you please add one?

@reta reta added v2.13.0 Issues and PRs related to version 2.13.0 v3.0.0 Issues and PRs related to version 3.0.0 labels Mar 8, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search Search query, autocomplete ...etc Severity-Major v2.13.0 Issues and PRs related to version 2.13.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: Done
3 participants