[CI] Gradle test executor failing frequently in :server:test project #103004

Closed
mark-vieira opened this issue Dec 5, 2023 · 18 comments
Labels: :Delivery/Build, Team:Delivery

Comments

elasticsearchmachine added the needs:triage label Dec 5, 2023
@mark-vieira
Contributor Author

Looks like the core issue here might actually be a test causing the JVM to crash.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f57de17ef0e, pid=9600, tid=48393
#
# JRE version: OpenJDK Runtime Environment (21.0.1+12) (build 21.0.1+12-29)
# Java VM: OpenJDK 64-Bit Server VM (21.0.1+12-29, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xb93f0e]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xce
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /opt/local-ssd/buildkite/builds/bk-agent-prod-gcp-1701781307069242257/elastic/elasticsearch-periodic-platform-support/server/build/testrun/test/core.9600)
#
# An error report file with more information is saved as:
# /opt/local-ssd/buildkite/builds/bk-agent-prod-gcp-1701781307069242257/elastic/elasticsearch-periodic-platform-support/server/build/testrun/test/hs_err_pid9600.log
[thread 48457 also had an error]
Dec 05, 2023 1:10:11 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release
#
# Compiler replay data is saved as:
# /opt/local-ssd/buildkite/builds/bk-agent-prod-gcp-1701781307069242257/elastic/elasticsearch-periodic-platform-support/server/build/testrun/test/replay_pid9600.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

@ChrisHegarty looks like you are investigating this?

mark-vieira added the :Delivery/Build label Dec 5, 2023
elasticsearchmachine added the Team:Delivery label Dec 5, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

elasticsearchmachine removed the needs:triage label Dec 5, 2023
@ChrisHegarty
Contributor

JDK issue tracking this SIGSEGV - https://bugs.openjdk.org/browse/JDK-8321370

@ChrisHegarty
Contributor

ChrisHegarty commented Dec 5, 2023

The failures started with the Lucene upgrade to 9.9.0.

elasticsearchmachine pushed a commit that referenced this issue Dec 5, 2023
We want to capture compiler replay data so we can better troubleshoot
the root cause of
#103004.
@mark-vieira
Contributor Author

@ChrisHegarty with #103007 merged we should have the replay file on the next failure.

@ChrisHegarty
Contributor

Thanks @mark-vieira. I grabbed a replay from the latest crash, and managed to reproduce this locally.

FTR - I added the following comment to the JDK JIRA issue too.

I ran this on my Mac/AArch64 (which reproduces the issue), even though
the crash and replay were initially captured on Linux/x64.

$ mkdir -p /tmp/crash
$ cd /tmp/crash
$ git clone https://github.com/elastic/elasticsearch.git
$ cd elasticsearch
$ git rev-parse HEAD
b88df64f03c11e525ebae24e96fe094b99a2d86c

# This will compile the necessary classes/jar that we need on the
# classpath - let's just allow gradle to do the necessary work!
$ ./gradlew localDistro
$ ./gradlew :server:compileTestJava
$ export CLASSPATH="/tmp/crash/elasticsearch/build/distribution/local/elasticsearch-8.12.0-SNAPSHOT/lib/*:/tmp/crash/elasticsearch/server/out/test/classes"

# Reproduce with the replay file. I added ignore init errors simply to
# avoid having to put too many dependencies on the classpath, since they
# don't seem to be necessary
$ cp replay_pid78630.log /tmp/crash/elasticsearch/
$ /Users/chegar/binaries/jdk-21.0.1.jdk/Contents/Home/bin/java \
-XX:+UnlockDiagnosticVMOptions \
-XX:+ReplayCompiles \
-XX:+ReplayIgnoreInitErrors \
-XX:ReplayDataFile=replay_pid78630.log \
-cp "$CLASSPATH"

...
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000103b3f5a4, pid=10087, tid=24323
#
# JRE version: OpenJDK Runtime Environment (21.0.1+12) (build 21.0.1+12-29)
# Java VM: OpenJDK 64-Bit Server VM (21.0.1+12-29, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-aarch64)
# Problematic frame:
# V [libjvm.dylib+0x74f5a4] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x30c
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /private/tmp/crash/elasticsearch/hs_err_pid10087.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
#
Abort trap: 6
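
For anyone replaying several crash files, the invocation above can be assembled programmatically. This is only a sketch: it uses exactly the flags shown in the command above, and the class and method names (`ReplayCommand`, `build`) are hypothetical, not part of the Elasticsearch build.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of any Elasticsearch or JDK tooling.
final class ReplayCommand {

    // Builds the argument list for a compiler replay run.
    // Only the flags quoted in the comment above are used; nothing is executed here.
    static List<String> build(Path javaBinary, Path replayFile, String classpath) {
        return List.of(
            javaBinary.toString(),
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:+ReplayCompiles",
            "-XX:+ReplayIgnoreInitErrors",
            "-XX:ReplayDataFile=" + replayFile,
            "-cp",
            classpath
        );
    }

    public static void main(String[] args) {
        // The binary name, replay file name, and classpath are placeholders.
        List<String> cmd = build(
            Path.of("java"),
            Path.of("replay_pid78630.log"),
            System.getProperty("java.class.path"));
        System.out.println(String.join(" ", cmd));
        // To actually run it:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the classpath as a single list element avoids any shell expansion of the `lib/*` wildcard, which the JVM expands itself.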

@ChrisHegarty
Contributor

ChrisHegarty commented Dec 6, 2023

The failure occurs when running RangeAggregatorTests.testRuntimeFieldRangesNotOptimized.

I see this in the hs_err log, which matches your analysis:
  0x00007efe2caacca0 JavaThread "SUITE-RangeAggregatorTests-seed#[74BBD5155B19065C]"        [_thread_blocked, id=115596, stack(0x00007efdad3e4000,0x00007efdad4e5000) (1028K)]
  0x00007efc78015ea0 JavaThread "TEST-RangeAggregatorTests.testRuntimeFieldRangesNotOptimized-seed#[74BBD5155B19065C]"        [_thread_in_Java, id=115598, stack(0x00007efe0029d000,0x00007efe0039e000) (1028K)]
  0x00007efca01d62a0 JavaThread "elasticsearch[org.elasticsearch.search.aggregations.AggregatorTestCase][[timer]]" daemon [_thread_blocked, id=115655, stack(0x00007efdaebfc000,0x00007efdaecfd000) (1028K)]

Appears to be compiling:
org.apache.lucene.util.RadixSelector::computeCommonPrefixLengthAndBuildHistogram
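
The thread list above is what points at the failing test: the thread in the `_thread_in_Java` state carries the test name. A minimal sketch of pulling that out of an hs_err log follows. The regular expression is an assumption derived only from the three lines quoted above, not from a documented hs_err format, and the class name `HsErrThreads` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any Elasticsearch or JDK tooling.
final class HsErrThreads {

    // ASSUMPTION: thread lines look like the ones quoted above:
    //   <address> JavaThread "<name>" [daemon] [<state>, id=..., stack(...)]
    private static final Pattern THREAD_LINE =
        Pattern.compile("JavaThread \"(.*)\"\\s+(?:daemon\\s+)?\\[(_thread_\\w+),");

    // Returns the names of all JavaThreads that were in the given state.
    static List<String> threadsInState(List<String> hsErrLines, String state) {
        List<String> names = new ArrayList<>();
        for (String line : hsErrLines) {
            Matcher m = THREAD_LINE.matcher(line);
            if (m.find() && m.group(2).equals(state)) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Two of the lines from the log excerpt above.
        List<String> sample = List.of(
            "0x00007efe2caacca0 JavaThread \"SUITE-RangeAggregatorTests-seed#[74BBD5155B19065C]\""
                + "        [_thread_blocked, id=115596, stack(0x00007efdad3e4000,0x00007efdad4e5000) (1028K)]",
            "0x00007efc78015ea0 JavaThread \"TEST-RangeAggregatorTests.testRuntimeFieldRangesNotOptimized-seed#[74BBD5155B19065C]\""
                + "        [_thread_in_Java, id=115598, stack(0x00007efe0029d000,0x00007efe0039e000) (1028K)]");
        System.out.println(threadsInState(sample, "_thread_in_Java"));
    }
}
```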

@ChrisHegarty
Contributor

Reproducer:

A minor patch to the elasticsearch repo:

$ git diff
diff --git a/server/src/test/java/org/elasticsearch/search/aggregations/bucket/range/RangeAggregatorTests.java b/server/src/test/java/org/elasticsearch/search/aggregations/bucket/range/RangeAggregatorTests.java
index 148be0aec58..86e3638abfa 100644
--- a/server/src/test/java/org/elasticsearch/search/aggregations/bucket/range/RangeAggregatorTests.java
+++ b/server/src/test/java/org/elasticsearch/search/aggregations/bucket/range/RangeAggregatorTests.java
@@ -632,6 +632,12 @@ public class RangeAggregatorTests extends AggregatorTestCase {
         }, new NumberFieldMapper.NumberFieldType(NUMBER_FIELD_NAME, NumberFieldMapper.NumberType.INTEGER));
     }
 
+    public void testRuntimeFieldRangesNotOptimizedTIMES() throws IOException {
+        for (int i = 0; i < 20_000; i++) {
+            testRuntimeFieldRangesNotOptimized();
+        }
+    }
+
     /**
      * If the field we're getting the range of is a runtime field it'd be super
      * slow to run a bunch of range queries on it so we disable the optimization.
$ export RUNTIME_JAVA_HOME=/Users/chegar/binaries/jdk-21.0.1.jdk/Contents/Home/
$ ./gradlew :server:test --tests "org.elasticsearch.search.aggregations.bucket.range.RangeAggregatorTests.testRuntimeFieldRangesNotOptimizedTIMES"

> Configure project :x-pack:plugin:searchable-snapshots:qa:hdfs
hdfsFixture unsupported, please set HADOOP_HOME and put HADOOP_HOME\bin in PATH
=======================================
Elasticsearch Build Hamster says Hello!
  Gradle Version        : 8.5
  OS Info               : Mac OS X 13.6 (aarch64)
  Runtime JDK Version   : 21.0.1+12-29 (Oracle, 21.0.1+12-29)
  Runtime java.home     : /Users/chegar/binaries/jdk-21.0.1.jdk/Contents/Home
  Gradle JDK Version    : 19.0.2+7-FR (Amazon Corretto)
  Gradle java.home      : /Users/chegar/Library/Java/JavaVirtualMachines/corretto-19.0.2/Contents/Home
  Random Testing Seed   : 479A323FF6FA6758
  In FIPS 140 mode      : false
=======================================

> Task :server:test
Dec 06, 2023 12:17:24 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001047eb5a4, pid=18426, tid=24579
#
# JRE version: OpenJDK Runtime Environment (21.0.1+12) (build 21.0.1+12-29)
# Java VM: OpenJDK 64-Bit Server VM (21.0.1+12-29, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-aarch64)
# Problematic frame:
# V  [libjvm.dylib+0x74f5a4]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x30c
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /private/tmp/crash/elasticsearch/server/build/testrun/test/hs_err_pid18426.log
#
# Compiler replay data is saved as:
# /private/tmp/crash/elasticsearch/server/build/testrun/test/replay_pid18426.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

> Task :server:test FAILED

@ChrisHegarty
Contributor

ChrisHegarty commented Dec 6, 2023

JDK 19.0.2 - fails (crashes) [Corretto-19.0.2.7.1]
JDK 20.0.2 - fails (crashes)
JDK 21.0.1 - fails (crashes)
JDK 21.0.2 - success (*)
JDK 22 - success (**)

Planned release of JDK 21.0.2 is 2024-01-16.

(*) with a local build of the latest jdk21u, which is accumulating changes for 21.0.2
(**) with a local build of the latest jdk/master, which is accumulating changes for 22
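
The matrix above could be turned into a runtime check, for example to decide whether a workaround is needed on the current JVM. The sketch below is hedged accordingly: only 19.0.2, 20.0.2 and 21.0.1 were seen crashing, and only local 21.0.2 and 22 builds were seen passing, so treating whole feature releases the same way is an assumption. The names `AffectedJdk`, `Status` and `classify` are hypothetical.

```java
// Hypothetical helper, not part of any Elasticsearch or JDK tooling.
final class AffectedJdk {

    enum Status { CRASHES, PASSES, UNTESTED }

    // ASSUMPTION: results for the specific builds listed above are generalized
    // to their whole release line. Versions older than 19 were not tested.
    static Status classify(Runtime.Version v) {
        int feature = v.feature();
        if (feature < 19) {
            return Status.UNTESTED;
        }
        if (feature == 19 || feature == 20) {
            return Status.CRASHES; // 19.0.2 and 20.0.2 crashed
        }
        if (feature == 21) {
            // 21.0.1 crashed; a local build targeting 21.0.2 passed.
            return v.update() < 2 ? Status.CRASHES : Status.PASSES;
        }
        return Status.PASSES; // a local JDK 22 build passed
    }

    public static void main(String[] args) {
        Runtime.Version current = Runtime.version();
        System.out.println(current + " -> " + classify(current));
    }
}
```

Comparing on `feature()` and `update()` rather than on the full version string keeps pre-release and vendor suffixes (such as `-FR`) from affecting the result.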

@ChrisHegarty
Contributor

Excluding the following methods from compilation avoids the issue, since the crash is a direct result of compiling these methods:

-XX:CompileCommand=exclude,org.apache.lucene.util.MSBRadixSorter::computeCommonPrefixLengthAndBuildHistogram \
-XX:CompileCommand=exclude,org.apache.lucene.util.RadixSelector::computeCommonPrefixLengthAndBuildHistogram
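
To confirm that a given JVM was actually started with both excludes, one option is to inspect its input arguments. This is a sketch under assumptions: it matches the two flags exactly as written above, so excludes supplied in another syntax or through a compile command file would not be detected. The class name `CompileExcludeCheck` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any Elasticsearch or JDK tooling.
final class CompileExcludeCheck {

    // The two methods named in the flags above.
    static final List<String> METHODS = List.of(
        "org.apache.lucene.util.MSBRadixSorter::computeCommonPrefixLengthAndBuildHistogram",
        "org.apache.lucene.util.RadixSelector::computeCommonPrefixLengthAndBuildHistogram");

    // Returns the methods for which no matching exclude flag is present.
    // ASSUMPTION: flags are spelled exactly as in the comment above.
    static List<String> missingExcludes(List<String> jvmArgs) {
        return METHODS.stream()
            .filter(m -> !jvmArgs.contains("-XX:CompileCommand=exclude," + m))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (excluding main-class args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing excludes: " + missingExcludes(jvmArgs));
    }
}
```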

@ChrisHegarty
Contributor

This is not an entirely new issue. Lucene ran into a similar(ish) problem before; see https://bugs.openjdk.org/browse/JDK-8285835, which has since been resolved but looks fairly similar.

The code patterns in the above two Lucene methods tickle this area again. Although I've not been able to reproduce the crash with just Lucene, it seems reasonable to assume that Lucene is susceptible.

@mark-vieira
Contributor Author

What's our short-term plan here before JDK 21.0.2 is released?

@jpountz
Contributor

jpountz commented Dec 6, 2023

We disabled some optimizations in the past for a similar problem (#32138). I wonder if excluding these methods from compilation is a good short-term option; the performance impact looks like it could be higher than on this other issue.

@ChrisHegarty
Contributor

> We disabled some optimizations in the past for a similar problem. #32138 I wonder if excluding these methods from compilation is a good short-term option, the performance impact looks like it could be higher than on this other issue.

Agreed. I think excluding the two methods from compilation, as above, should relieve the problem. I can do this tomorrow, if someone doesn’t get to it first.

I’d like to see if we can rework the code pattern in Lucene to avoid the crash. The potential to crash will still be there, as it always has been. But we’d need a Lucene 9.9.1 for that.

@ChrisHegarty
Contributor

I've opened a PR to exclude the two offending methods from JIT compilation. Local testing and CI both pass. I would like to merge this PR as an interim solution in order to restore stability to Elasticsearch.

Separately, and orthogonally, I've written a small test that reproduces the SEGV with just Lucene code. Work will continue in Lucene in an effort to avoid the crash. The outcome of this is still unclear, but if successful then a Lucene 9.9.1 release may be warranted. More in this Lucene issue: apache/lucene#12887

@ChrisHegarty
Contributor

A Lucene 9.9.1 release is currently in the works. It includes a fix that avoids the JVM crash. Once Elasticsearch is upgraded to use 9.9.1, the compiler excludes can be removed.

@mark-vieira
Contributor Author

I think we can close this issue since it has been resolved for now by #103112. We can track the upgrade to Lucene 9.9.1 separately.

@ChrisHegarty
Contributor

> I think we can close this issue since it has been resolved for now by #103112. We can track the upgrade to Lucene 9.9.1 separately.

Agreed.
