org.apache.lucene.util.MergedIterator.pullTop() Crash #31425
Comments
@elastic/es-core-infra could you please have a look? |
@WUMUXIAN can you share more information on the machine type that you're using, in particular the output of |
@yevhen Here you go:
Some more information:
I am not sure why it behaves this way; it seems very strange to me. I hope this helps. Let me know if you need more information. |
Thank you @WUMUXIAN. Can you also share the content of |
@jpountz Sure, see it below:
|
Thanks. For the record, I asked on the Lucene lists as well in case someone already saw this segfault: https://markmail.org/message/n7rjurxzinyrhrwb. |
@jpountz Thanks for the info. |
I'm also encountering this error in a similar environment: official docker image 6.3.0, m5.large AWS EC2 instance, Ubuntu 16.04. The previous version I was running, 6.2.4, was not affected. I'm happy to provide more info if required. |
I am also seeing this error on an m5.4xlarge in AWS. Linux 4.9. Debian stretch on the host. Using the |
I am also experiencing this issue with the elasticsearch:6.3.0 docker image. |
I started experiencing this crash today: I've been running the 6.3.0 Docker image on AWS i3 instances for weeks without any crashes. Today I launched some c5d instances to help with some hot indexes, and every single one of those has been crashing every couple minutes when I started indexing on them. I have since moved all indexes back to the i3s but I'll try to fetch some logs, although they were nearly identical to the ones already provided here. Based on the comments above, this might be an issue with 6.3.0 and the M5/C5 instance family on AWS. Please let me know if I can help debug this in some way. OS: Amazon Linux ECS 2018.03.a |
I've been doing some tests, and I believe I found the issue. The 6.3.0 Docker image ships with Java 10.0.1, more specifically the generic binary targz from Oracle. I modified the Docker image to use the OpenJDK 8 binary provided by the base CentOS image package manager (1.8.0.171). It's now been running for ~1h without a single crash. So the issue seems to be with Java 10 on AWS C5/M5 instances. |
Thanks for digging @xose ! @dliappis and I tried to reproduce this one on a c5d.large instance with Amazon Linux ECS 2018.03.a, Elasticsearch 6.3.0 and OpenJDK 10.0.1 by running various rally tracks, but this didn't reproduce. There might be other conditions that are needed for this bug to reproduce. Given that the error occurs when merging the list of field names that are indexed, it might depend on the actual list of field names that exist in the index. |
Sharing my experience: I am using the 6.3 Docker image on an m5.large. |
I've been experiencing a similar problem, running Elasticsearch on an m5 instance. However, unlike most people here, I am only experiencing it with |
For the users that are reporting this issue, would you please try the following as a workaround? Add the options When Elasticsearch starts you should see the log lines
as an indication that the options were successfully applied. If this eliminates the issue, this would point to a compilation bug in JDK 10. Meanwhile, I am taking this issue upstream. |
Also, another option to try as this appears to be manifesting on server-grade Skylake chips: start with the JVM option |
I note that the disassembly for the instructions given above from the hserr log
is
which shows that AVX-512 instructions are indeed in play here. In particular, the instruction that we crashed on is:
I really hope that someone that can reliably reproduce this issue can report whether or not the issue goes away if AVX-512 is disabled (via |
Currently setting -XX:CompileCommand=exclude,org/apache/lucene/util/MergedIterator.pullTop and -XX:CompileCommand=exclude,org/apache/lucene/util/MergedIterator.pushTop seems to help. I can test AVX-512 if this remains stable. I'm running on an Intel(R) Xeon(R) Silver 4114 Proc |
Update: tested with -XX:UseAVX=2 in ES_JAVA_OPTS and it seems stable. It has been up for 10 minutes now; it would typically crash in less than 2. I will continue to monitor. |
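For anyone wanting to try the same workaround, here is a minimal sketch of passing the flag through ES_JAVA_OPTS. The helper name add_useavx_cap is hypothetical and not part of Elasticsearch; the sketch assumes a POSIX shell and appends -XX:UseAVX=2 only when no UseAVX setting is already present.

```shell
# Hypothetical helper (not shipped with Elasticsearch): make sure
# ES_JAVA_OPTS carries -XX:UseAVX=2 so the JVM does not use AVX-512.
add_useavx_cap() {
  case " ${ES_JAVA_OPTS:-} " in
    *" -XX:UseAVX="*)
      # A UseAVX level was already chosen explicitly; leave it untouched.
      ;;
    *)
      # Append the flag, adding a separating space only if needed.
      ES_JAVA_OPTS="${ES_JAVA_OPTS:+$ES_JAVA_OPTS }-XX:UseAVX=2"
      ;;
  esac
  export ES_JAVA_OPTS
}
```

After calling add_useavx_cap, running docker run -e ES_JAVA_OPTS docker.elastic.co/elasticsearch/elasticsearch:6.3.0 should hand the exported value to the container. Other launch methods (compose files, orchestrators) would need the equivalent environment setting.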
@dsmitty166 Thanks for checking and this is welcome news. I feel fairly comfortable asserting that the issue is due to a bug in the C2 compiler on JDK 10 when using AVX-512. Here are some more data points:
I have taken this upstream and will link to the mail thread once it receives moderator approval. For now, I think we can say that a workaround is to use |
We have integrated a change (#32138) into 6.3.2 that will disable AVX-512 on JDK 10. We will keep this disabled until there is an upstream patch to the JDK to address this issue. |
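To check whether a given host falls into the affected class (a CPU that advertises AVX-512), something like the following sketch may help. It assumes Linux, where CPU feature flags are listed in /proc/cpuinfo; the function name avx512_flags is hypothetical.

```shell
# Print the distinct AVX-512 feature flags found in a cpuinfo-style file.
# Assumption: Linux; defaults to /proc/cpuinfo when no file is given.
avx512_flags() {
  grep -o 'avx512[a-z0-9_]*' "${1:-/proc/cpuinfo}" | sort -u
}

if [ -n "$(avx512_flags 2>/dev/null)" ]; then
  echo "AVX-512 flags present on this host"
else
  echo "No AVX-512 flags found"
fi
```

An empty result suggests the CPU does not expose AVX-512, in which case this particular crash would not be expected on that host.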
This is now tracked upstream via JDK-8207746. |
Is there a way to reproduce the crash? Unfortunately, the log doesn't contain enough info to spot the root cause. I'll be grateful for instructions on how to trigger it. Thanks in advance! |
I'm running on an Intel(R) Xeon(R) Silver 4116 Proc (in Docker for Windows) and was having this issue. It appears to stop happening when -XX:UseAVX=2 is specified. |
@iwanowww Thanks for your support. You can reproduce this issue as follows:
The Elasticsearch container |
@iwanowww I have also tested this on JDK 11 (11-ea+24) and the crash continues to reproduce when AVX-512 is enabled and does not reproduce when AVX-512 is disabled ( |
Thanks a lot for the info, @jasontedor! |
@jasontedor Hi Jason, I am getting stuck at step 3/7 for npm install pm2 -g. Is there any way we can bypass that? |
@vivdesh What is the issue you’re having? |
@jasontedor I have set the right proxies and registry, but am still having the issue |
@jasontedor Hi Jason, I have it running on a Skylake server after setting the proxies inside the server. |
@vivdesh It can be tricky. The best option would be to create a new image based on JDK 11. I have done this in my own testing when I reported above that the issue still persists on JDK 11. Is this something that you would repeatedly need to do, or are you looking for a single build off of the latest JDK 11-ea build? If the latter, I think that the most straightforward approach would be that I build a new image for you and show you how to use it in the reproduction. If the former, we can come up with something. |
@jasontedor I was able to reproduce the problem with jdk11 by mounting the local jdk image and have the fix. You can check the progress: https://bugs.openjdk.java.net/browse/JDK-8207746. Thanks. |
@vivdesh That is great news, thank you! If I understand correctly, this will be fixed in JDK 11 but remain a problem in JDK 10? |
@jasontedor Yes, this is fixed in JDK 11. http://hg.openjdk.java.net/jdk/jdk11/rev/7339b9e38182 |
Thanks for your support @vivdesh. It’s no problem for us if this problem remains in JDK 10; we have disabled AVX-512 by default there and will re-enable it for JDK 11. |
Suggesting an update to a higher version of Elasticsearch (one that's not running on JDK 10) due to an issue related to the JVM inside JDK 10. ref elastic/elasticsearch#31425 (comment)
So where do I put -XX:UseAVX=2 to make this work? Do I put it in the jvm options config file? |
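One way to persist the flag, as a sketch: append it as its own line to the jvm.options file. The path config/jvm.options used in the comment below is an assumption (package installs and containers may keep the file elsewhere), and add_jvm_option is a hypothetical helper, not an Elasticsearch tool.

```shell
# Hypothetical helper: add a JVM flag to a jvm.options file only if that
# exact line is not already there.
#   $1 = path to jvm.options (location is an assumption, see above)
#   $2 = JVM flag to add, e.g. -XX:UseAVX=2
add_jvm_option() {
  grep -qxF -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example call (adjust the path to your install):
#   add_jvm_option config/jvm.options -XX:UseAVX=2
```

Setting the same flag through the ES_JAVA_OPTS environment variable, as reported earlier in this thread, is an alternative that avoids editing files inside a container.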
@LiamKarlMitchell Are you using the official 6.8.3 Docker image? If so, I don't think you can be experiencing this issue because that container is based on JDK 12 where this issue is mitigated. If not, please use the official image and we will support you. Either way, please use the forums for additional help. |
I did a rebuild and deleted volumes.
After destroying the volume and rebuilding it looks like it works :). |
Elasticsearch version (bin/elasticsearch --version): 6.3.0
Plugins installed: [ingest-geoip, ingest-user-agent]
JVM version (java -version): openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10)
OpenJDK 64-Bit Server VM (build 10.0.1+10, mixed mode)
OS version (uname -a if on a Unix-like system): Linux es-data-us-east-1c-0 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
I am running an Elasticsearch data node using this Docker image:
docker.elastic.co/elasticsearch/elasticsearch:6.3.0
Somehow it always crashes after running for ~5 mins with an error like this:
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.