
LongGCDisruptionTests.testNotBlockingUnsafeStackTraces failed #50047

Closed
benwtrent opened this issue Dec 10, 2019 · 11 comments · Fixed by #50097 or #50731
Labels: :Distributed Indexing/Distributed, >test-failure

Comments

@benwtrent
Member

For over a month now there have been periodic failures of org.elasticsearch.test.disruption.LongGCDisruptionTests.testNotBlockingUnsafeStackTraces

A recent failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=adoptopenjdk11,nodes=general-purpose/379/console

Reproduce Line:

./gradlew ':test:framework:test' --tests "org.elasticsearch.test.disruption.LongGCDisruptionTests.testNotBlockingUnsafeStackTraces" \
  -Dtests.seed=60FAC2799A7B2796 \
  -Dtests.security.manager=true \
  -Dtests.locale=hu-HU \
  -Dtests.timezone=US/Central \
  -Dcompiler.java=12

The internal stack trace:

11:07:35   2> java.lang.RuntimeException: suspending node threads took too long
11:07:35         at __randomizedtesting.SeedInfo.seed([60FAC2799A7B2796:62E98AC6BE559FFF]:0)
11:07:35         at org.elasticsearch.test.disruption.LongGCDisruption.startDisrupting(LongGCDisruption.java:124)
11:07:35         at org.elasticsearch.test.disruption.LongGCDisruptionTests.testNotBlockingUnsafeStackTraces(LongGCDisruptionTests.java:149)

The thread stacks:

11:07:35   1> [2019-12-10T10:07:32,452][WARN ][o.e.t.d.LongGCDisruptionTests$2] [testNotBlockingUnsafeStackTraces] failed to suspend node [test_node]'s threads within [30000] millis. Suspending thread stack trace:
11:07:35   1>  java.base@11.0.5/java.lang.Thread.suspend0(Native Method)
11:07:35   1> java.base@11.0.5/java.lang.Thread.suspend(Thread.java:1085)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruption.suspendThreads(LongGCDisruption.java:254)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruption$1.doRun(LongGCDisruption.java:87)
11:07:35   1> app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
11:07:35   1> java.base@11.0.5/java.lang.Thread.run(Thread.java:834)
11:07:35   1> Threads that weren't suspended:
11:07:35   1>  [test_node][0]
11:07:35   1> ----
11:07:35   1> java.base@11.0.5/java.lang.Thread.yield(Native Method)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$2(LongGCDisruptionTests.java:131)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$260/0x000000010033e840.run(Unknown Source)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$3(LongGCDisruptionTests.java:141)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$261/0x000000010033ec40.run(Unknown Source)
11:07:35   1> java.base@11.0.5/java.lang.Thread.run(Thread.java:834)
11:07:35   1> [test_node][4]
11:07:35   1> ----
11:07:35   1> java.base@11.0.5/jdk.internal.misc.Unsafe.park(Native Method)
11:07:35   1> java.base@11.0.5/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
11:07:35   1> java.base@11.0.5/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
11:07:35   1> java.base@11.0.5/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
11:07:35   1> java.base@11.0.5/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
11:07:35   1> java.base@11.0.5/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$LockedExecutor.executeLocked(LongGCDisruptionTests.java:45)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$3(LongGCDisruptionTests.java:139)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$261/0x000000010033ec40.run(Unknown Source)
11:07:35   1> java.base@11.0.5/java.lang.Thread.run(Thread.java:834)
11:07:35   1> [test_node][3]
11:07:35   1> ----
11:07:35   1> java.base@11.0.5/java.lang.Thread.yield(Native Method)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$2(LongGCDisruptionTests.java:131)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$260/0x000000010033e840.run(Unknown Source)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$3(LongGCDisruptionTests.java:141)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$261/0x000000010033ec40.run(Unknown Source)
11:07:35   1> java.base@11.0.5/java.lang.Thread.run(Thread.java:834)
11:07:35   1> [test_node][1]
11:07:35   1> ----
11:07:35   1> java.base@11.0.5/java.lang.Thread.yield(Native Method)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$2(LongGCDisruptionTests.java:131)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$260/0x000000010033e840.run(Unknown Source)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$3(LongGCDisruptionTests.java:141)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$261/0x000000010033ec40.run(Unknown Source)
11:07:35   1> java.base@11.0.5/java.lang.Thread.run(Thread.java:834)
11:07:35   1> [test_node][2]
11:07:35   1> ----
11:07:35   1> java.base@11.0.5/java.lang.Thread.yield(Native Method)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$2(LongGCDisruptionTests.java:131)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$260/0x000000010033e840.run(Unknown Source)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests.lambda$testNotBlockingUnsafeStackTraces$3(LongGCDisruptionTests.java:141)
11:07:35   1> app//org.elasticsearch.test.disruption.LongGCDisruptionTests$$Lambda$261/0x000000010033ec40.run(Unknown Source)
11:07:35   1> java.base@11.0.5/java.lang.Thread.run(Thread.java:834)
benwtrent added the >test-failure and :Distributed Indexing/Distributed labels on Dec 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

original-brownbear self-assigned this on Dec 10, 2019
@original-brownbear
Member

I can reproduce this locally: running this test in a loop eventually gets the JVM into a state where it completely locks up (not even a thread dump can be taken), and the test then fails.

@original-brownbear
Member

The problem here is that we are randomly running into safepoints that take multiple seconds when calling `suspend` on the thread:

[136.264s][debug  ][vmoperation             ] begin VM_Operation (0x00007f4aeaa576c0): ThreadDump, mode: safepoint, requested by thread 0x00007f4abc0e3800
[136.264s][debug  ][vmoperation             ] end VM_Operation (0x00007f4aeaa576c0): ThreadDump, mode: safepoint, requested by thread 0x00007f4abc0e3800
[136.264s][debug  ][safepoint,stats         ]  136.264: ThreadDump                    [                18                 0             0 ][             0       0       0       0       0 ]               0 
[136.264s][info   ][safepoint               ] Leaving safepoint region
[136.264s][info   ][safepoint               ] Total time for which application threads were stopped: 0.0002377 seconds, Stopping threads took: 0.0000023 seconds
[136.264s][debug  ][thread,smr              ] tid=6164: SafeThreadsListPtr::acquire_stable_list: add nested list pointer to ThreadsList=0x00007f4abc004730
[136.264s][debug  ][thread,smr              ] tid=6164: SafeThreadsListPtr::release_stable_list: delete nested list pointer to ThreadsList=0x00007f4abc004730
[136.264s][debug  ][thread,smr              ] tid=6164: SafeThreadsListPtr::acquire_stable_list: add nested list pointer to ThreadsList=0x00007f4abc004730
[142.645s][debug  ][vmthread                ] Adding VM operation: ThreadSuspend
[142.645s][debug  ][vmthread                ] Evaluating safepoint VM operation: ThreadSuspend
[142.645s][debug  ][safepoint               ] Safepoint synchronization initiated. (18 threads)
[142.645s][info   ][safepoint               ] Application time: 6.3813087 seconds
[142.645s][info   ][safepoint               ] Entering safepoint region: ThreadSuspend
[142.645s][info   ][safepoint,cleanup       ] updating inline caches, 0.0000016 secs
[142.645s][info   ][safepoint,cleanup       ] compilation policy safepoint handler, 0.0000006 secs
[142.645s][info   ][safepoint,cleanup       ] deflating global idle monitors, 0.0000004 secs
[142.645s][info   ][safepoint,cleanup       ] purging class loader data graph, 0.0000001 secs
[142.645s][info   ][safepoint,cleanup       ] resizing system dictionaries, 0.0000008 secs
[142.645s][info   ][safepoint,cleanup       ] deflating per-thread idle monitors, 0.0000044 secs
[142.646s][info   ][safepoint,cleanup       ] safepoint cleanup tasks, 0.0002165 secs
[142.646s][debug  ][vmoperation             ] begin VM_Operation (0x00007f4aeaa57710): ThreadSuspend, mode: safepoint, requested by thread 0x00007f4abc0e3800
[142.646s][debug  ][vmoperation             ] end VM_Operation (0x00007f4aeaa57710): ThreadSuspend, mode: safepoint, requested by thread 0x00007f4abc0e3800
[142.646s][debug  ][safepoint,stats         ]  142.645: ThreadSuspend                 [                18                 0             0 ][             0       0       0       0       0 ]               0 
[142.646s][info   ][safepoint               ] Leaving safepoint region

This looks a lot like the following JVM bug: https://bugs.openjdk.java.net/browse/JDK-8212933, which is supposed to be fixed in JDK 12 b13, but I can still reproduce it in b33.
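
To make the failure mode concrete, the pattern the test exercises boils down to a worker thread busy-yielding (as in the captured thread stacks above) while another thread repeatedly suspends and resumes it. The following standalone sketch is illustrative only (it is not the test code; class and variable names are made up) and simply times each suspend() call; on an affected JDK an individual call occasionally stalls for several seconds at the ThreadSuspend safepoint:

public class SuspendTimingRepro {
    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                Thread.yield(); // mirrors the yield loop seen in the thread stacks above
            }
        });
        worker.setDaemon(true);
        worker.start();

        for (int i = 0; i < 1000; i++) {
            long start = System.nanoTime();
            worker.suspend(); // deprecated API; this is the call that triggers the ThreadSuspend VM operation
            long millis = (System.nanoTime() - start) / 1_000_000;
            worker.resume();
            if (millis > 1000) { // a healthy JVM completes this in well under a millisecond
                System.out.println("suspend() took " + millis + " ms on iteration " + i);
            }
        }
        worker.interrupt();
    }
}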

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 11, 2019
See discussion in elastic#50047 (comment).
There are reproducible issues with `Thread#suspend` in `Jdk11` and `Jdk12` for me locally
and we have one failure for each on CI.
`Jdk8` and `Jdk13` are stable though on CI and in my testing so I'd selectively disable
this test here to keep the coverage. We aren't using `suspend` in production code so the
JDK bug behind this does not affect us.
original-brownbear added a commit that referenced this issue Dec 12, 2019
See discussion in #50047 (comment).
There are reproducible issues with Thread#suspend in Jdk11 and Jdk12 for me locally
and we have one failure for each on CI.
Jdk8 and Jdk13 are stable though on CI and in my testing so I'd selectively disable
this test here to keep the coverage. We aren't using suspend in production code so the
JDK bug behind this does not affect us.

Closes #50047
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 12, 2019
See discussion in elastic#50047 (comment).
There are reproducible issues with Thread#suspend in Jdk11 and Jdk12 for me locally
and we have one failure for each on CI.
Jdk8 and Jdk13 are stable though on CI and in my testing so I'd selectively disable
this test here to keep the coverage. We aren't using suspend in production code so the
JDK bug behind this does not affect us.

Closes elastic#50047
original-brownbear added a commit that referenced this issue Dec 12, 2019
See discussion in #50047 (comment).
There are reproducible issues with Thread#suspend in Jdk11 and Jdk12 for me locally
and we have one failure for each on CI.
Jdk8 and Jdk13 are stable though on CI and in my testing so I'd selectively disable
this test here to keep the coverage. We aren't using suspend in production code so the
JDK bug behind this does not affect us.

Closes #50047
@davidkyle
Member

Reopening...

@original-brownbear the test fails on Java 8 (runtime; Java 13 build)

java.lang.RuntimeException: suspending node threads took too long
	at __randomizedtesting.SeedInfo.seed([59768ACBFC336E0F:5B65C274D81DD666]:0)
	at org.elasticsearch.test.disruption.LongGCDisruption.startDisrupting(LongGCDisruption.java:124)
	at org.elasticsearch.test.disruption.LongGCDisruptionTests.testNotBlockingUnsafeStackTraces(LongGCDisruptionTests.java:152)

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk13,ES_RUNTIME_JAVA=java8,nodes=general-purpose/433/console
https://gradle-enterprise.elastic.co/s/mz54zhh4o5wfq

And a few more found on build stats:
https://gradle-enterprise.elastic.co/s/zpncey4ukuwgk
https://gradle-enterprise.elastic.co/s/ag4cebu2jraju

davidkyle reopened this on Jan 6, 2020
@original-brownbear
Member

This seems to actually be https://bugs.openjdk.java.net/browse/JDK-8218446, for which the fix has been backed out again.
That issue mentions the exact ~6s waits I observed here as well: https://bugs.openjdk.java.net/browse/JDK-8218446?focusedCommentId=14251478&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14251478

@ywelsch
Contributor

ywelsch commented Jan 6, 2020

The above failure was on JDK8, not JDK13?

@pgomulka
Contributor

pgomulka commented Jan 7, 2020

failed again
https://gradle-enterprise.elastic.co/s/p2c4a73pwq3js

./gradlew ':test:framework:test' --tests "org.elasticsearch.test.disruption.LongGCDisruptionTests.testNotBlockingUnsafeStackTraces" \
  -Dtests.seed=E32A05972B9EA548 \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-KW \
  -Dtests.timezone=America/Argentina/Rio_Gallegos \
  -Dcompiler.java=13 \
  -Druntime.java=8

I wonder if the fix mentioned in #50047 (comment) is aimed at JDK 13 only; should we try to disable the test for JDK 8?
@original-brownbear what do you think?

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 8, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
@original-brownbear
Member

@ywelsch

Right, it's also happening on JDK 8 and 13 now. 13 I'd expect, because they backed out the fix for the linked issue, but 8 is unexpected.

I opened #50731 as a suggested fix: it avoids failing the test when we run into a blocked Thread#suspend but otherwise leaves the test enabled. If JDK 8 fails for a different reason it will still fail and we can look into that then, so we retain most of the coverage.
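
In code terms, the approach is roughly the following. This is only a sketch of the idea, not the change merged in #50731; the helper name and the 3-second threshold are assumptions made for illustration. It times the suspend() call and converts a stall into a skipped test via a JUnit assumption instead of a failure:

import static org.junit.Assume.assumeTrue;

import java.util.concurrent.TimeUnit;

final class SuspendOrSkip {
    // Hypothetical threshold; the value used in the real change may differ.
    private static final long MAX_SUSPEND_MILLIS = TimeUnit.SECONDS.toMillis(3);

    // Suspends the given thread, but skips the calling test (via an assumption)
    // if suspend() itself stalls, which we attribute to the JVM bug discussed above.
    static void suspendOrSkip(Thread thread) {
        long start = System.nanoTime();
        thread.suspend(); // deprecated API, deliberately exercised by the disruption tests
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        assumeTrue("Thread#suspend blocked for " + elapsedMillis + " ms, assuming JVM bug",
                elapsedMillis < MAX_SUSPEND_MILLIS);
    }
}

This keeps the test enabled and meaningful for other failure causes while ignoring the known JVM-level stall.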

original-brownbear added a commit that referenced this issue Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes #50047
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
original-brownbear added a commit that referenced this issue Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes #50047
original-brownbear added a commit that referenced this issue Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes #50047
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
See discussion in elastic#50047 (comment).
There are reproducible issues with Thread#suspend in Jdk11 and Jdk12 for me locally
and we have one failure for each on CI.
Jdk8 and Jdk13 are stable though on CI and in my testing so I'd selectively disable
this test here to keep the coverage. We aren't using suspend in production code so the
JDK bug behind this does not affect us.

Closes elastic#50047
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
@DaveCTurner
Contributor

I suspect that's the same thing as #93707
