LongGCDisruptionTests.testNotBlockingUnsafeStackTraces failed #50047
Comments
Pinging @elastic/es-distributed (:Distributed/Distributed)
I can reproduce this locally: running this test in a loop eventually gets the JVM into a state where it completely locks up (not even a thread dump can be taken), and then the test eventually fails.
The problem here is that we are randomly running into safe points that take multiple seconds when calling
This looks a lot like the following JVM bug: https://bugs.openjdk.java.net/browse/JDK-8212933. It is supposed to be fixed in JDK 12 b13, but I can still reproduce it in b33.
See discussion in elastic#50047 (comment). There are reproducible issues with `Thread#suspend` in `Jdk11` and `Jdk12` for me locally, and we have one failure for each on CI. `Jdk8` and `Jdk13` are stable, though, both on CI and in my testing, so I'd selectively disable this test here to keep the coverage. We aren't using `suspend` in production code, so the JDK bug behind this does not affect us.
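As a rough illustration of what "selectively disable" could look like, here is a minimal sketch that decides from the runtime's specification version whether to skip. The class and method names (`SuspendBugVersions`, `featureVersion`, `shouldSkip`) are hypothetical and are not taken from the Elasticsearch code base; the affected versions (11 and 12) are simply the ones named in the message above.

```java
// Hypothetical helper, not the actual Elasticsearch change.
final class SuspendBugVersions {

    private SuspendBugVersions() {}

    // Maps a "java.specification.version" value to a feature version:
    // "1.8" -> 8, "11" -> 11, "13" -> 13.
    static int featureVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    // Assumption from the message above: only JDK 11 and JDK 12 are affected.
    static boolean shouldSkip(String specVersion) {
        int version = featureVersion(specVersion);
        return version == 11 || version == 12;
    }
}
```

In a JUnit 4 style test, the result of `SuspendBugVersions.shouldSkip(System.getProperty("java.specification.version"))` could be passed to `org.junit.Assume.assumeFalse` so that the test is reported as skipped rather than failed on the affected runtimes.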
Reopening... @original-brownbear the test fails on Java 8 (runtime; Java 13 build):
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk13,ES_RUNTIME_JAVA=java8,nodes=general-purpose/433/console And a few more found on build stats:
It seems this is actually the issue https://bugs.openjdk.java.net/browse/JDK-8218446, for which the fix has been backed out again.
The above failure was on JDK8, not JDK13?
failed again
I wonder if the fix mentioned in #50047 (comment) is aimed at JDK13 only. Should we try to disable the test for JDK8?
There is a JVM bug causing `Thread#suspend` calls to randomly take multiple seconds, breaking these tests, which call the method numerous times in a loop. Increasing the timeout will not work, since we may call `suspend` tens if not hundreds of times, and even a small number of them experiencing the blocking will lead to multiple minutes of waiting. This PR detects the specific issue by timing the `Thread#suspend` calls and skips the remainder of the test if it timed out because of the JVM bug. Closes elastic#50047
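The timing-based detection described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code from the PR: the class `SlowCallDetector`, its methods, and the threshold are hypothetical names and values. The helper takes any `Runnable`, so the sketch itself does not call `Thread#suspend`, which is deprecated.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: times each call and remembers whether any call
// exceeded a threshold, so the caller can skip the rest of a test.
final class SlowCallDetector {

    private final long thresholdNanos;
    private volatile boolean slowCallSeen;

    SlowCallDetector(long threshold, TimeUnit unit) {
        this.thresholdNanos = unit.toNanos(threshold);
    }

    // Runs the action and returns the elapsed time in nanoseconds.
    // If the call took longer than the threshold, the detector is flagged.
    long timed(Runnable action) {
        long start = System.nanoTime();
        action.run();
        long elapsed = System.nanoTime() - start;
        if (elapsed > thresholdNanos) {
            slowCallSeen = true;
        }
        return elapsed;
    }

    // True once at least one timed call exceeded the threshold.
    boolean slowCallSeen() {
        return slowCallSeen;
    }
}
```

A test loop could wrap each suspend call in `detector.timed(...)` and, if the test later times out while `detector.slowCallSeen()` is true, treat the run as skipped instead of failed. The threshold would need to be well above the normal cost of the call and well below the "multiple seconds" stall described above.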
Right, it's also happening on JDK 8 and 13 now ... 13 I'd expect because they backed out the fix for the linked issue, but I opened #50731 to not fail the test if we run into blocked
I suspect that's the same thing as #93707
This also happened a month ago, on 2 February: https://gradle-enterprise.elastic.co/s/azrns37543ppo/tests/:test:framework:test/org.elasticsearch.test.disruption.LongGCDisruptionTests/testNotBlockingUnsafeStackTraces?top-execution=1
Over the last month or more, there have been periodic failures of org.elasticsearch.test.disruption.LongGCDisruptionTests.testNotBlockingUnsafeStackTraces
A recent failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=adoptopenjdk11,nodes=general-purpose/379/console
Reproduce Line:
The internal stack trace:
The threadstacks: