-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Search pipeline seen executing on transport_worker thread #10248
Comments
transport_worker thread example full trace:
|
search thread example:
|
@austintlee the difference between these two flows is that the first one (#10248 (comment)) is run against remote node (by coordinator) but the latter one (#10248 (comment)) uses local transports (the coordinator node itself) |
Hmm.... I had been saying that we will eventually need to support async execution in search pipelines. It sounds like that time may be now. For response processors, it's arguably easier than for request processors (since they already get executed on a callback.) Ingest pipelines assume async execution, but that also leads to #6338. |
Unless other processors in the search pipeline make async call to a transport action, the search processor should be run in search thread I guess? At least that is what I witnessed in ingest pipeline. Do you have other search processors in the search pipeline which are called before your processor? |
Okay. Seems like we need to see where the thread is converted from search thread to transport thread in the search pipeline implementation. I agree with @austintlee that search pipeline should run only in search thread which is similar to what ingest pipeline behave. Also, support of async won't do much good here because we lose search back pressure mechanism by releasing search thread earlier than search is completed. |
@heemin32 Does the ingest pipeline allow invocation of any transport actions? If so, can you provide an example that I can take a look at? |
The first code calls GET index API and the second code calls Query index API synchronously. |
I cobbled together an implementation of async response processors: msfroh@6b4279c It's mostly untested (besides unit tests), but it might let you avoid the blocking |
@heemin32 in both of those examples, I see that you are using |
That is strange that it does not get timed out with timeout value but got stuck without it.. |
How often does this occur? I'm thinking that this may be worth a patch release. |
@macohen you need a search processor that does something that causes the thread executing it to block where the thread is a transport thread. At the moment, the new search response processor that we added to the ml-commons plugin is the only one (that we know of) that does something like that and it is an experimental feature and is disabled by default. Once you enable it, it is fairly easy to get it to hang. You would need a cluster with at least two nodes and a query that runs on an index with two shards across two nodes. This set-up leads to a condition where you are almost guaranteed to have the search pipeline execution to occur on a transport thread and when you have a search processor that blocks the transport thread, it seems to cause the whole cluster to become unstable. |
@msfroh I just built OS 2.10 + your fix and tested it on my cluster and it seems to work! I think we should do a patch release for 2.10. 2.10 has the blocking call without a timeout. This can cause I already switched to a blocking call with a timeout which will be in 2.11. I have not been able to reproduce the hang after I introduced a timeout, but if we are doing a patch release, we should go with the async approach that Michael put forth. By the way, @heemin32 I think the geo processor will also fail on gradle run because of this: OpenSearch/server/src/main/java/org/opensearch/common/util/concurrent/BaseFuture.java Lines 80 to 81 in 8807d7a
|
If we're going to support async execution for response processors, I should probably take care of request processors in the same commit. I might also try to avoid the problem we see with stack overflows on long chains by adding an "isAsync" method to the interface -- a chain of non-async processors should just execute in a loop. Only async processors should need to do the callback dance. |
Describe the bug
I have a search processor that uses a
org.opensearch.client.Client
object to execute aTransportAction
. This invocation returns aBaseFuture
and the processor does.get()
, blocking until the client returns a response. Occasionally, the get() call blocks indefinitely and brings the whole cluster into bad state.A thread dump on the node that hung revealed that the search processor was executing on a
transport_worker
thread.I do see in a few other thread dumps I captured the same processor running on a
search
thread.From:
OpenSearch/server/src/main/java/org/opensearch/common/util/concurrent/BaseFuture.java
Line 109 in 8807d7a
And:
OpenSearch/modules/transport-netty4/src/main/java/org/opensearch/transport/netty4/Netty4MessageChannelHandler.java
Lines 86 to 89 in 8807d7a
I can see that we don't expect a blocking call to happen on transport threads and there is code specifically in
BaseFuture.get
to disallow invocations on transport threads, althoughassert
won't trigger unless the OpenSearch process is run with theea
JVM flag.Can we ensure that search processors run on
search
threads? Or are they really allowed to run on transport threads which means that I should not have any blocking calls in my search processor?To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Plugins
Please list all plugins currently enabled.
Screenshots
If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: