[CI] Gradle test executor failing frequently in :server:test project #103004
Comments
Looks like the core issue here might actually be a test causing the JVM to crash.
@ChrisHegarty looks like you are investigating this? |
Pinging @elastic/es-delivery (Team:Delivery) |
JDK issue tracking this SIGSEGV - https://bugs.openjdk.org/browse/JDK-8321370 |
The failures started with the Lucene upgrade to 9.9.0. |
We want to capture compiler replay data so we can better troubleshoot the root cause of #103004.
@ChrisHegarty with #103007 merged we should have the replay file on the next failure. |
Thanks @mark-vieira. I grabbed a replay from the latest crash, and managed to reproduce this locally. FTR - I added the following comment to the JDK JIRA issue too.
|
Failing running
Appears to be compiling: |
Reproducer: minor patch to the elasticsearch repo:
|
JDK 19.0.2 - fails (crashes) [Corretto-19.0.2.7.1]
Planned release of JDK 21.0.2 is 2024-01-16.
|
Excluding the following methods from compilation avoids the issue (since the issue is a direct result of compiling these methods):
|
This is not an entirely new issue. Lucene ran into something similar before; see https://bugs.openjdk.org/browse/JDK-8285835, which was resolved. The code patterns in the above two Lucene methods tickle this area again. I've not been able to reproduce with just Lucene, but it seems reasonable to assume that Lucene is susceptible. |
What's our short term plan here before JDK 21.0.2 is released? |
We disabled some optimizations in the past for a similar problem (#32138). I wonder if excluding these methods from compilation is a good short-term option; the performance impact looks like it could be higher than on that other issue. |
Agreed. I think excluding the two methods from compilation, as above, should relieve the problem. I can do this tomorrow, if someone doesn’t get to it first. I’d like to see if we can rework the code pattern in Lucene to avoid the crash. The potential to crash will still be there, as it always has been. But we’d need a Lucene 9.9.1 for that. |
I've opened a PR to exclude the two offending methods from JIT compilation. Local testing and the CI all pass successfully. I would like to merge this PR as an interim solution in order to restore stability to Elasticsearch. Separately, and orthogonally, I've written a small test that reproduces the SEGV with just Lucene code. Work will continue in Lucene in an effort to avoid the crash. The outcome of this is still unclear, but if successful then a Lucene 9.9.1 release may be warranted. More in this Lucene issue apache/lucene#12887 |
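For readers unfamiliar with the mechanism: HotSpot can be told to skip JIT compilation of individual methods with the `-XX:CompileCommand=exclude,<method>` option. The sketch below only shows how such option strings could be assembled. The class and method names in it are hypothetical placeholders, not the two Lucene methods discussed in this thread, and this is not necessarily how the PR wires the options into the Elasticsearch JVM settings.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CompileExcludes {

    // Builds one HotSpot "exclude" CompileCommand option per method pattern.
    // Patterns use the Class::method form accepted by -XX:CompileCommand.
    static List<String> excludeOptions(List<String> methodPatterns) {
        return methodPatterns.stream()
            .map(p -> "-XX:CompileCommand=exclude," + p)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical placeholders; substitute the real method patterns.
        List<String> methods = List.of(
            "org.example.codec.HypotheticalReader::decode",
            "org.example.codec.HypotheticalWriter::encode");

        // Print the options so they can be appended to the JVM command line.
        excludeOptions(methods).forEach(System.out::println);
    }
}
```

Excluded methods still run, but only in the interpreter, which is why the performance impact of this kind of workaround depends on how hot the excluded methods are.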
A Lucene 9.9.1 is currently in the works. It includes a fix that avoids the JVM crash. Once Elasticsearch is upgraded to use 9.9.1, then the compiler excludes can be removed. |
I think we can close this issue since it has been resolved for now by #103112. We can track the upgrade to Lucene 9.9.1 separately. |
Agreed. |
We've seen a large uptick in Gradle test work executors exiting, specifically in the :server:test project. https://gradle-enterprise.elastic.co/scans/failures?failures.failureClassification=all_failures&failures.failureMessage=Execution%20failed%20for%20task%20%27:server:test%27.%0A%3E%20Process%20%27Gradle%20Test%20Executor%20*%20finished%20with%20non-zero%20exit%20value%20134%0A%20%20This%20problem%20might%20be%20caused%20by%20incorrect%20test%20process%20configuration.%0A%20%20For%20more%20on%20test%20execution%2C%20please%20refer%20to%20https:%2F%2Fdocs.gradle.org%2F8.5%2Fuserguide%2Fjava_testing.html%23sec:test_execution%20in%20the%20Gradle%20documentation.&search.timeZoneId=America%2FLos_Angeles#
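The exit value 134 in the failure message above is itself a hint. On Unix-like systems an exit status above 128 conventionally means the process was terminated by signal number (status - 128), and 134 - 128 = 6 is SIGABRT on Linux. That is consistent with a JVM that aborts after hitting a fatal error such as a SIGSEGV. A minimal sketch of the arithmetic, with a hypothetical helper name:

```java
public class ExitStatus {

    // Returns the signal number encoded in a shell-style exit status,
    // or -1 if the status does not follow the 128 + signal convention.
    static int signalFrom(int exitValue) {
        return (exitValue > 128 && exitValue < 256) ? exitValue - 128 : -1;
    }

    public static void main(String[] args) {
        // 134 is the exit value reported for the Gradle Test Executor.
        System.out.println(signalFrom(134)); // prints 6 (SIGABRT on Linux)
    }
}
```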