Node is not responsive after the end of a big merge for close to 10 minutes #8905
We had another node also hanging in index refresh (2014-12-11 17:57:57).
Hmm, the first hot threads output only shows 3 (throttled) merges running; I don't see any other threads blocked or anything, so I can't see from this why you'd see a 10-minute non-responsive node... From talking to @s1monw, it looks like the 2nd hot threads output has one thread doing phase 3 of recovery (replaying the xlog), which holds a write lock on the engine and then blocks all other threads trying to get a read lock. Three threads are stuck in segmentsStats, but that really shouldn't need a readLock... I'll open a PR to maybe fix that. But I'm not sure why phase 3 recovery would take so long here. Do you use any index-time scripts?
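To illustrate the blocking pattern described above, here is a minimal, self-contained Java sketch (not Elasticsearch code — the class and thread names are hypothetical) showing how one thread holding a `ReentrantReadWriteLock` write lock for a long time, like phase 3 translog replay holding the engine's write lock, starves every thread that only needs a read lock, like the stats calls:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockBlocksReaders {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();

        // "Recovery" thread: grabs the write lock and holds it,
        // a stand-in for a long-running phase 3 translog replay.
        Thread recovery = new Thread(() -> {
            engineLock.writeLock().lock();
            try {
                TimeUnit.SECONDS.sleep(2); // simulated long replay
            } catch (InterruptedException ignored) {
            } finally {
                engineLock.writeLock().unlock();
            }
        });
        recovery.start();
        TimeUnit.MILLISECONDS.sleep(100); // let recovery acquire the write lock

        // "Stats" thread's view: even a read lock cannot be acquired
        // while the write lock is held, so the stats call appears to hang.
        boolean acquired = engineLock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
        System.out.println("stats thread got read lock: " + acquired);
        if (acquired) {
            engineLock.readLock().unlock();
        }
        recovery.join();
    }
}
```

While the write lock is held, the `tryLock` times out and prints `false`; the fix referenced below (#8910) is the same idea — make the stats path stop requiring that read lock.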
We don't use any index-time scripts. I also copied over the entire jstack just to be sure that it's not waiting somewhere it shouldn't (regarding #8908). We also had some issues with two replicas never recovering (#8911). I assume that for the second jstack (I'm not sure anymore if it was the first or second), the node also had many smaller segments before and issued a merge, which then resulted in that blocking mode. Please let me know what else I can provide to help next time the issue happens.
I opened #8910 so index stats shouldn't block when recovery phase 3 is taking a long time. But I'm not sure why phase 3 takes so long in your case. This is a function of how large the xlog is that needs to be moved over (and your network speed), which in turn is a function of 1) how long it took to replicate the shard (how big was it?) and 2) how quickly indexing was happening into the source shard. Can you run the diagnostics plugin (https://github.com/elasticsearch/elasticsearch-support-diagnostics) and post the results?
We will add this to our next deployment and run it when the issue reoccurs.
We have upgraded to 1.4.2 and haven't seen any hanging nodes so far, maybe also because #8911 hasn't happened again (but we are also watching this carefully). I'm closing this issue.
On one of our nodes a merge was triggered by ES.
At the end of the merge of "index1", the following call hung on that node:
curl -XGET 'http://localhost:9200/_nodes/_local/stats?pretty'
Our node recovered after 10 minutes and was available again.
2014-12-11 16:50:18
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode):