Why do tests sometimes deadlock on 2.12 in Travis? #774

Closed · rossabaker opened this issue Dec 1, 2016 · 22 comments
Labels: bug (Determined to be a bug in http4s), testing (Issues related to tests)

@rossabaker (Member)

We are frequently seeing tests run for about 49 minutes and get killed on Travis.

We have already been slain by SI-10064 in tut. But this only happens in Travis, and only intermittently. I have not managed to reproduce this locally.

rossabaker added the bug and testing labels on Dec 1, 2016
@bfritz (Member) commented Jan 19, 2017

@rossabaker Looks like a good place for me to get my feet wet. Is this a recent example of the failure?

@rossabaker (Member, Author)

Yes. I haven't been able to make this happen locally, which is frustrating -- it's a lot of iterating through Travis and hoping to get unlucky.

One way to do it might be to look at the list of specs emitted by the failing job and compare it to a successful job. The specs we don't see are probably the ones that hang. Then we could run those in a loop on a branch in Travis so we get unlucky every time, and we might get red-to-green confidence in an eventual fix. (A sketch of the log comparison is below.)
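A hedged sketch of that comparison, assuming the console logs contain specs2's "Total for specification <Name>" line for every spec that finished; the helper and file handling are hypothetical:

import scala.io.Source

// Hypothetical log-diff helper: print the specs that completed in a green
// job's console log but never completed in a hung job's log.
object MissingSpecs {
  // specs2 prints "Total for specification FooSpec" when a spec finishes.
  private def finished(logFile: String): Set[String] = {
    val src = Source.fromFile(logFile)
    try src.getLines()
      .filter(_.contains("Total for specification"))
      .map(_.split("Total for specification ").last.trim)
      .toSet
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val Array(greenLog, hungLog) = args
    (finished(greenLog) -- finished(hungLog)).toList.sorted.foreach(println)
  }
}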

@bfritz (Member) commented Jan 20, 2017

Currently trying to reproduce without hijacking the project's real Travis CI account. Running in Docker using the (dated) travis-jvm image on a 4GB Ubuntu 16.04 DigitalOcean droplet.

In the 6 or 7 runs so far, I've noticed one short (~2 min) hang and captured a thread dump. It's very possible it's unrelated, but to help track my progress I'm attaching console output and a thread dump.

@bfritz (Member) commented Jan 20, 2017

The http4s tests can definitely hang for long periods during TomcatServerSpec if the build container is low on entropy. Pretty sure that's what happened during the hang above:

"localhost-startStop-1" #1472 daemon prio=5 os_prio=0 tid=0x00007ff865263000 nid=0x764 runnable [0x00007ff81dd52000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:246)
        at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedBytes(SeedGenerator.java:539)
        at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:144)
        at sun.security.provider.SecureRandom$SeederHolder.<clinit>(SecureRandom.java:203)
        at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:221)
        - locked <0x00000000f910dc48> (a sun.security.provider.SecureRandom)
        at java.security.SecureRandom.nextBytes(SecureRandom.java:457)
        - locked <0x00000000f910dc80> (a java.security.SecureRandom)
        at java.security.SecureRandom.next(SecureRandom.java:480)
        at java.util.Random.nextInt(Random.java:329)
        at org.apache.catalina.util.SessionIdGeneratorBase.createSecureRandom(SessionIdGeneratorBase.java:237)
        at org.apache.catalina.util.SessionIdGeneratorBase.getRandomBytes(SessionIdGeneratorBase.java:174)
        at org.apache.catalina.util.StandardSessionIdGenerator.generateSessionId(StandardSessionIdGenerator.java:34)
        at org.apache.catalina.util.SessionIdGeneratorBase.generateSessionId(SessionIdGeneratorBase.java:167)
        at org.apache.catalina.util.SessionIdGeneratorBase.startInternal(SessionIdGeneratorBase.java:256)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8d30790> (a org.apache.catalina.util.StandardSessionIdGenerator)
        at org.apache.catalina.session.ManagerBase.startInternal(ManagerBase.java:719)
[..]

"pool-103-thread-1" #1457 daemon prio=5 os_prio=0 tid=0x00007ff87c4ca000 nid=0x756 waiting on condition [0x00007ff823960000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f8bc3168> (a java.util.concurrent.FutureTask)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:943)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.core.StandardEngine.startInternal(StandardEngine.java:262)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.core.StandardService.startInternal(StandardService.java:441)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8c0bb90> (a org.apache.catalina.core.StandardService)
        at org.apache.catalina.core.StandardServer.startInternal(StandardServer.java:789)
        - locked <0x00000000f8c0c338> (a java.lang.Object)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8c0bd60> (a org.apache.catalina.core.StandardServer)
        at org.apache.catalina.startup.Tomcat.start(Tomcat.java:346)
        at org.http4s.server.tomcat.TomcatBuilder.$anonfun$start$1(TomcatBuilder.scala:150)
        at org.http4s.server.tomcat.TomcatBuilder$$Lambda$19006/1810047281.apply(Unknown Source)

(There are also some interesting specs2.fixed.env-* threads in the dump.)

Low entropy could definitely hang the tests... and it has happened to others on Travis. It seems very unlikely that it would be limited to Scala 2.12 (or 2.12 with scalaz-7.2), though, so this might be separate from the hangs that raised this issue. (The usual JVM-level workaround is sketched below.)
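For reference, the standard workaround is to point SecureRandom's seed source at the non-blocking /dev/urandom. A minimal sketch as an sbt setting, for illustration only (the actual change below patches the Travis build script instead, and javaOptions only applies to forked test JVMs):

// build.sbt -- sketch only; the "/dev/./urandom" spelling works around an
// old JDK special case that maps the literal "/dev/urandom" value back to
// the blocking device.
fork in Test := true
javaOptions in Test += "-Djava.security.egd=file:/dev/./urandom"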

@rossabaker (Member, Author)

Lots of interesting observations here.

  • We've been plagued by client timeouts in Travis in the past. These aren't "job gets killed after 10 minutes"; these are "a client spec fails and everything else passes." Those could be explained by hangs of a couple minutes. This run only shows TomcatServerSpec being held up, so I doubt that's connected. Maybe we use SecureRandom elsewhere? (Probably not, or it would have hung elsewhere.)
  • The last comment looks like an interesting fix for the Tomcat hang.
  • I agree that there is nothing that points to Scala 2.12 or Scalaz-7.2 here.
  • I'm trying to recall if we've ever seen this on Scala 2.12 and Scalaz-7.1. I assumed this was one of those issues with Scala 2.12 and lambda lifting, but I can only remember the Scala-2.12+Scalaz-7.2 combo failing this way. If that's true, that's a clue, but I don't know what it means.

bfritz added a commit to bfritz/http4s that referenced this issue Jan 20, 2017
Starting Tomcat in `TomcatServerSpec` seeds a `SecureRandom`
instance.  By default `/dev/random` is used on Linux and, in
low entropy environments like Travis VMs, reading from
`/dev/random` may block for long periods of time.

Reconfigure the Travis build script to point the JVM at
`/dev/urandom` to speed up the CI builds and hopefully prevent
cases of the failed builds described in http4s#774.

References:
  http4s#774 (comment)
  travis-ci/travis-ci#1494 (comment)
  https://en.wikipedia.org/wiki//dev/random
@bfritz (Member) commented Jan 20, 2017

I agree that there is nothing that points to Scala 2.12 or Scalaz-7.2 here.

Possible explanation...but I'm reaching a bit...is that $SBT_COMMAND differs based on Scala version:

if [[ $TRAVIS_SCALA_VERSION == 2.11* ]] && [[ $SCALAZ_VERSION = 7.2* ]]; then

(And differs differently after the very recent docs change in 2784c3d.)

It was hard to say with confidence just by watching /proc/sys/kernel/random/entropy_avail manually in my test environment, but it's possible the coverage task was adding some entropy to the process.

Published PR #882 for consideration. In hindsight, I should have forked from the v0.15.3 tag, before the docs were moved to 2.12. It's hard to compare build times before and after #882 because of the docs change.

@rossabaker (Member, Author)

I can cherry-pick it back to 0.15.x.

rossabaker pushed a commit that referenced this issue Jan 20, 2017
Starting Tomcat in `TomcatServerSpec` seeds a `SecureRandom`
instance.  By default `/dev/random` is used on Linux and, in
low entropy environments like Travis VMs, reading from
`/dev/random` may block for long periods of time.

Reconfigure the Travis build script to point the JVM at
`/dev/urandom` to speed up the CI builds and hopefully prevent
cases of the failed builds described in #774.

References:
  #774 (comment)
  travis-ci/travis-ci#1494 (comment)
  https://en.wikipedia.org/wiki//dev/random
@bfritz (Member) commented Jan 20, 2017

Unfortunately the /dev/urandom change did not help. The change is active, but the build still hung for 10 minutes. Back to the drawing board.

@rossabaker (Member, Author)

Well, it likely does help the previously undetected two-minute hang in Tomcat. :)

@bfritz (Member) commented Jan 20, 2017

No promises, but I'm toying with the idea of writing an SBT plugin or JVM agent that watches stdout for an absence of output and triggers a thread dump. It might even be possible to hack something together with a shell script and jstack (a rough sketch of the idea follows).
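A rough, hypothetical sketch of that watchdog idea, in Scala rather than shell: pipe sbt's console output through a process that shells out to jstack whenever the output goes quiet. The PID argument and the thresholds are assumptions:

import java.util.concurrent.atomic.AtomicLong
import scala.sys.process._

// Hypothetical usage: sbt test | StdoutWatchdog <jvm-pid>
// If no output arrives for two minutes, dump the given JVM's threads.
object StdoutWatchdog {
  def main(args: Array[String]): Unit = {
    val pid      = args(0)
    val quietMs  = 2 * 60 * 1000L
    val lastSeen = new AtomicLong(System.currentTimeMillis)

    val watcher = new Thread(new Runnable {
      def run(): Unit = while (true) {
        Thread.sleep(5000)
        if (System.currentTimeMillis - lastSeen.get > quietMs) {
          s"jstack $pid".!  // print a thread dump to our own stdout
          lastSeen.set(System.currentTimeMillis)
        }
      }
    })
    watcher.setDaemon(true)
    watcher.start()

    // forward sbt's output, recording when we last saw a line
    for (line <- scala.io.Source.stdin.getLines()) {
      lastSeen.set(System.currentTimeMillis)
      println(line)
    }
  }
}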

@rossabaker (Member, Author)

https://etorreborre.github.io/specs2/guide/SPECS2-3.5/org.specs2.guide.TimeoutExamples.html might give the missing hook, but you'd have to intercept the failures caused by however that times out. I did something similar with ScalaTest once, and it got me pointed in the right direction.
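The hook from that guide looks roughly like this (a sketch adapted from the specs2 3.x documentation, not http4s code; the 60-second default is an assumption):

import org.specs2.execute.{AsResult, Result}
import org.specs2.matcher.{MustThrownMatchers, TerminationMatchers}
import org.specs2.specification.AroundEach
import scala.concurrent.duration._

// Wrap every example: poll it with the `terminate` matcher and skip it
// (rather than fail it) if it never finishes within the timeout.
trait ExamplesTimeout extends AroundEach with MustThrownMatchers with TerminationMatchers {
  def timeout: FiniteDuration = 60.seconds

  def around[T: AsResult](t: => T): Result = {
    lazy val result = t
    // up to 10 retries, sleeping timeout / 10 between checks
    val termination = result must terminate(retries = 10, sleep = timeout / 10)
      .orSkip(_ => "TIMEOUT: " + timeout)
    termination.toResult and AsResult(result)
  }
}

Note the skip-on-timeout behavior: that matches the "skipped tests instead of failed tests" complaint later in this thread.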

The other thing I did to narrow down a Travis-only deadlock at work was to compare the list of spec names in a successful run to the ones that didn't appear in the hung job. It pointed me right at the offending spec, though not at the cause.

@bfritz (Member) commented Jan 26, 2017

Using the specs2 hooks makes sense. The downside is that it means I'm diving in at the deep end of specs2. I haven't drowned yet, but it might be several more days before I have anything PR-worthy.

@rossabaker (Member, Author)

53a9c16 sort of works, but:

  • it results in skipped tests instead of failed tests on timeout
  • now it runs out of memory on my machine

@bfritz (Member) commented Jan 26, 2017

Cool. I'll take a look tonight. This is my proof-of-concept (outside of http4s) to grab a thread dump:
https://gist.github.com/bfritz/544a56352f5fcdaf7522c09a24a791cc
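The gist isn't reproduced here, but the core of such a proof-of-concept can be sketched in-process with ThreadMXBean (names hypothetical; note that ThreadInfo.toString truncates deep stacks):

import java.lang.management.ManagementFactory

// Hypothetical in-process thread dump, similar in spirit to the PoC above.
object ThreadDump {
  def dump(): String = {
    val mx = ManagementFactory.getThreadMXBean
    // include locked monitors and synchronizers, like `jstack -l`
    mx.dumpAllThreads(true, true).map(_.toString).mkString("\n")
  }

  def main(args: Array[String]): Unit = println(dump())
}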

bfritz added a commit to bfritz/http4s that referenced this issue Jan 27, 2017
Mix new `ThreadDumpOnTimeout` into `AsyncHttpClientSpec` and print
thread dump to console 500ms before the spec times out.  Hoping to
better understand:
http4s#858 (comment)
and also apply `ThreadDumpOnTimeout` to understand http4s#774.
@bfritz (Member) commented Jan 27, 2017

@rossabaker Pretty far into the weeds, but 53a9c16 will cause every assertion in the test suite to take 6+ seconds (the 60-second timeout / 10).

TimeoutExamples takes the global timeout argument (60 seconds), divides it by 10, and then calls Thread.sleep(6000) for every assertion. If the assertion is done when sleep() returns, it moves on. If not, Thread.sleep(6000) is called up to 9 more times.
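Roughly, the polling behaves like this simplified model (a sketch, not specs2's actual implementation):

// Sleep first, then check: even an instantly-finished example pays one
// full sleep, which is where the 6-second floor per assertion comes from.
def awaitTermination(done: () => Boolean, retries: Int, sleepMs: Long): Boolean = {
  var remaining = retries
  while (remaining > 0) {
    Thread.sleep(sleepMs)  // 6000ms when the timeout is 60s and retries = 10
    if (done()) return true
    remaining -= 1
  }
  false
}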

A snippet with showtimes turned on:

[info] AuthenticationSpec
[info] 
[info] Failure to authenticate should
[info]   + not run unauthorized routes (6 seconds, 73 ms)
[info] BasicAuthentication should
[info]   + Respond to a request without authentication with 401 (6 seconds, 72 ms)
[info]   + Respond to a request with unknown username with 401 (6 seconds, 71 ms)
[info]   + Respond to a request with wrong password with 401 (6 seconds, 71 ms)
[info]   + Respond to a request with correct credentials (6 seconds, 71 ms)
[info] DigestAuthentication should
[info]   + Respond to a request without authentication with 401 (6 seconds, 71 ms)
[info]   + Respond to a request with correct credentials (6 seconds, 71 ms)
[info]   + Respond to many concurrent requests while cleaning up nonces (6 seconds, 70 ms)
[info]   + Avoid many concurrent replay attacks (6 seconds, 691 ms)
[info]   + Respond to invalid requests with 401 (6 seconds, 691 ms)
[info] 
[info] 
[info] Total for specification AuthenticationSpec
[info] Finished in 30 seconds, 200 ms
[info] 10 examples, 0 failure, 0 error

Strongly suspect the bump in memory usage is tied to the extra blocking (and parallelism).

Still experimenting, but I hope to have a PR this weekend that is more targeted at the specs that are making Travis unhappy.

bfritz added a commit to bfritz/http4s that referenced this issue Jan 28, 2017
In several of the recent Travis CI build failures:

 * [3444.6](https://travis-ci.org/http4s/http4s/jobs/193191028)
 * [3458.6](https://travis-ci.org/http4s/http4s/jobs/193838431)
 * [3534.6](https://travis-ci.org/http4s/http4s/jobs/195974285)

the test that did not complete was `CharsetRangeSpec`.  Mix in
`ThreadDumpOnTimeout` and set a 2 second timeout to help identify
the source of the deadlock causing http4s#774.
bfritz added a commit to bfritz/http4s that referenced this issue Jan 28, 2017
Test output will now include all specs2 expectations and how long each took to complete.
rossabaker added a commit that referenced this issue Jan 28, 2017
…ming

turn on `showtimes` for specs2 to help diagnose #774
bfritz added a commit to bfritz/http4s that referenced this issue Feb 6, 2017
Wait for up to 20 x 200ms intervals (4 seconds) before triggering
a thread dump.  Spec frequently runs for 1-2 seconds on Travis CI
infrastructure and the 2 second timeout was firing for tests that
were likely not deadlocked.

Would not have been a big deal, but the thread dumps put Travis
over its 4MB log limit and cause the job to fail.  Longer term,
if we keep the dumps, they should be written to a log file and
uploaded to S3 instead of polluting the console output.

Refs http4s#774
@bfritz (Member) commented Mar 1, 2017

Have not seen a deadlock for a long time, possibly not since ThreadDumpOnTimeout was added in #897... possibly even earlier.

I opened a more specific #978 for the AsyncHttpClientSpec timeout error we are still seeing. Going to close this one and we can re-open if we see more deadlocks.

@bfritz (Member) commented Mar 2, 2017

After further consideration, reopening. The builds Travis kills after > 4MB of output, due to thread dumps on the console, are likely the new version of the old deadlocks.

There is also a pattern emerging when they happen. Juicy bits from the thread dumps at:
https://gist.github.com/bfritz/73d85d6189e1216f791a003a64a90063

bfritz reopened this on Mar 2, 2017
@bfritz (Member) commented Mar 3, 2017

sbt test deadlocked on me while running b54a8c3 (topic/rig). The thread dump shows a deadlock in CharsetRangeSpec related to lazy initialization of scalacheck bits:

Found one Java-level deadlock:
=============================
"specs2.fixed.env-595028191-8":
  waiting to lock monitor 0x00007ff940034518 (object 0x00000000af499340, a org.http4s.CharsetRangeSpec),
  which is held by "specs2.fixed.env-595028191-3"
"specs2.fixed.env-595028191-3":
  waiting to lock monitor 0x00007ff98c26df88 (object 0x00000000af69e088, a org.scalacheck.ArbitraryLowPriority$$anon$1),
  which is held by "specs2.fixed.env-595028191-8"

Full jstack output at:
https://gist.github.com/bfritz/3f466f6f75427437e5f0a4f0fd3a906c#file-thread_dump_with_deadlock-txt-L5125

Best I can tell, that dump is consistent with this deadlock. Unfortunately, I have no idea how to fix it.
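For intuition, the pattern the dump points at reduces to something like this contrived sketch (not the actual http4s/scalacheck code). In Scala 2, each lazy val initializer synchronizes on its enclosing object, so two threads forcing values in opposite order can deadlock even without a true cycle:

object A {
  lazy val a0: Int = B.b   // thread 1: locks A's monitor, then needs B's
  lazy val a1: Int = 17
}

object B {
  lazy val b: Int = A.a1   // thread 2: locks B's monitor, then needs A's
}

// Forcing A.a0 and B.b concurrently can deadlock even though neither value
// depends on the other's result -- the "non-circular dependency" deadlock
// described in SIP-20.
object Repro extends App {
  new Thread(new Runnable { def run(): Unit = { A.a0; () } }).start()
  new Thread(new Runnable { def run(): Unit = { B.b; () } }).start()
}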

@bfritz (Member) commented Mar 3, 2017

Just had a deadlock in the release-0.15.x branch:
https://travis-ci.org/http4s/http4s/jobs/207383661#L1585

CharsetRangeSpec is notably absent from the list of completed specs.

@rossabaker (Member, Author)

CharsetRangeSpec??!?

@bfritz (Member) commented Mar 3, 2017

CharsetRangeSpec??!?

Yep, same as in the jstack output above. It looks like a deadlock involving lazy initialization and scalacheck.

For a while, I thought it might be typelevel/scalacheck#290, but it's different, and we already have the scalacheck release with that change in it.

bfritz added a commit to bfritz/http4s that referenced this issue Mar 5, 2017
In response to this build:
https://travis-ci.org/http4s/http4s/jobs/207818583#L3278
on the cats branch that hung despite CharsetRangeSpec completing.

Hope this will identify the source of the deadlock if http4s#774 shows
itself again.
rossabaker pushed a commit to http4s/http4s-servlet that referenced this issue Apr 3, 2022
Mix new `ThreadDumpOnTimeout` into `AsyncHttpClientSpec` and print
thread dump to console 500ms before the spec times out.  Hoping to
better understand:
https://github.com/http4s/http4s/issues/858#issuecomment-274133582
and also apply `ThreadDumpOnTimeout` to understand http4s/http4s#774.
rossabaker added a commit to http4s/http4s-servlet that referenced this issue Apr 3, 2022
* `genCharsetRangeNoQuality` -> `arbitraryCharset.arbitrary`
* `arbitraryCharset.arbitrary` -> `arbitraryNioCharset`

Each needs the lock of the other.  This is a "non-circular dependency"
as described in SIP-20.

A better solution based on defs or vals will be designed for 0.16.
This is a binary-compatible mitigation for 0.15.

Fixes http4s/http4s#774
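A hedged sketch of that def/val direction (names and generators hypothetical, not the actual http4s arbitraries): a strict val is initialized once under the JVM's class-initialization lock, and a def holds no lock at all, so neither takes the per-access monitor that a Scala 2 lazy val does:

import java.nio.charset.Charset
import org.scalacheck.Gen

// Hypothetical shape of the 0.16-style fix: no lazy vals, so there is no
// per-access monitor to entangle with scalacheck's own lazy initializers.
object CharsetGenerators {
  // strict val: evaluated exactly once, when the object is first loaded
  val nioCharsets: List[Charset] =
    List("UTF-8", "ISO-8859-1", "US-ASCII").map(Charset.forName)

  // def: re-evaluated per call, never holds a monitor
  def genNioCharset: Gen[Charset] = Gen.oneOf(nioCharsets)
}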
armanbilge pushed a commit to http4s/http4s-async-http-client that referenced this issue May 20, 2022
Mix new `ThreadDumpOnTimeout` into `AsyncHttpClientSpec` and print
thread dump to console 500ms before the spec times out.  Hoping to
better understand:
https://github.com/http4s/http4s/issues/858#issuecomment-274133582
and also apply `ThreadDumpOnTimeout` to understand http4s/http4s#774.