Why do tests sometimes deadlock on 2.12 in Travis? #774

Closed · rossabaker opened this issue Dec 1, 2016 · 22 comments
Labels: bug (Determined to be a bug in http4s), testing (Issues related to tests)

@rossabaker (Member)

We are frequently seeing tests run for about 49 minutes and get killed on Travis.

We have already been slain by SI-10064 in tut. But this only happens in Travis, and only intermittently. I have not managed to reproduce this locally.

rossabaker added the bug and testing labels on Dec 1, 2016
@bfritz (Member) commented Jan 19, 2017

@rossabaker Looks like a good place for me to get my feet wet. Is this a recent example of the failure?

@rossabaker (Member, Author)

Yes. I haven't been able to make this happen locally, which is frustrating -- it's a lot of iterating through Travis and hoping to get unlucky.

One way to do it might be to look at the list of specs emitted by the failing job and compare it to a successful job. The specs we don't see are probably the ones that hang. Then we could run those in a loop on a branch in Travis so we get unlucky every time, and we might get red-to-green confidence in an eventual fix. (A sketch of the log comparison is below.)
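A hedged sketch of that comparison, assuming the console logs contain specs2's "Total for specification <Name>" line for every spec that finished; the helper and file handling are hypothetical:

import scala.io.Source

// Hypothetical log-diff helper: print the specs that completed in a green
// job's console log but never completed in a hung job's log.
object MissingSpecs {
  // specs2 prints "Total for specification FooSpec" when a spec finishes.
  private def finished(logFile: String): Set[String] = {
    val src = Source.fromFile(logFile)
    try src.getLines()
      .filter(_.contains("Total for specification"))
      .map(_.split("Total for specification ").last.trim)
      .toSet
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val Array(greenLog, hungLog) = args
    (finished(greenLog) -- finished(hungLog)).toList.sorted.foreach(println)
  }
}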

@bfritz (Member) commented Jan 20, 2017

Currently trying to reproduce without hijacking the project's real Travis CI account. Running in Docker using the (dated) travis-jvm image on a 4GB Ubuntu 16.04 DigitalOcean droplet.

In the 6 or 7 runs so far, I've noticed one short (~2 min) hang and captured a thread dump. It's very possible it's unrelated, but to help track my progress I'm attaching console output and a thread dump.

@bfritz (Member) commented Jan 20, 2017

The http4s tests can definitely hang for long periods during TomcatServerSpec if the build container is low on entropy. Pretty sure that's what happened during the hang above:

"localhost-startStop-1" #1472 daemon prio=5 os_prio=0 tid=0x00007ff865263000 nid=0x764 runnable [0x00007ff81dd52000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:246)
        at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedBytes(SeedGenerator.java:539)
        at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:144)
        at sun.security.provider.SecureRandom$SeederHolder.<clinit>(SecureRandom.java:203)
        at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:221)
        - locked <0x00000000f910dc48> (a sun.security.provider.SecureRandom)
        at java.security.SecureRandom.nextBytes(SecureRandom.java:457)
        - locked <0x00000000f910dc80> (a java.security.SecureRandom)
        at java.security.SecureRandom.next(SecureRandom.java:480)
        at java.util.Random.nextInt(Random.java:329)
        at org.apache.catalina.util.SessionIdGeneratorBase.createSecureRandom(SessionIdGeneratorBase.java:237)
        at org.apache.catalina.util.SessionIdGeneratorBase.getRandomBytes(SessionIdGeneratorBase.java:174)
        at org.apache.catalina.util.StandardSessionIdGenerator.generateSessionId(StandardSessionIdGenerator.java:34)
        at org.apache.catalina.util.SessionIdGeneratorBase.generateSessionId(SessionIdGeneratorBase.java:167)
        at org.apache.catalina.util.SessionIdGeneratorBase.startInternal(SessionIdGeneratorBase.java:256)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8d30790> (a org.apache.catalina.util.StandardSessionIdGenerator)
        at org.apache.catalina.session.ManagerBase.startInternal(ManagerBase.java:719)
[..]

"pool-103-thread-1" #1457 daemon prio=5 os_prio=0 tid=0x00007ff87c4ca000 nid=0x756 waiting on condition [0x00007ff823960000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f8bc3168> (a java.util.concurrent.FutureTask)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:943)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.core.StandardEngine.startInternal(StandardEngine.java:262)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.core.StandardService.startInternal(StandardService.java:441)
        - locked <0x00000000f8c0e738> (a org.apache.catalina.core.StandardEngine)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8c0bb90> (a org.apache.catalina.core.StandardService)
        at org.apache.catalina.core.StandardServer.startInternal(StandardServer.java:789)
        - locked <0x00000000f8c0c338> (a java.lang.Object)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        - locked <0x00000000f8c0bd60> (a org.apache.catalina.core.StandardServer)
        at org.apache.catalina.startup.Tomcat.start(Tomcat.java:346)
        at org.http4s.server.tomcat.TomcatBuilder.$anonfun$start$1(TomcatBuilder.scala:150)
        at org.http4s.server.tomcat.TomcatBuilder$$Lambda$19006/1810047281.apply(Unknown Source)

(There are also some interesting specs2.fixed.env-* threads in the dump.)

Low entropy could definitely hang the tests... and it has happened to others on Travis. It seems very unlikely that it would be limited to Scala 2.12 (or 2.12 with scalaz-7.2), though, so this might be separate from the hangs that raised this issue. (The usual JVM-level workaround is sketched below.)
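For reference, the standard workaround is to point SecureRandom's seed source at the non-blocking /dev/urandom. A minimal sketch as an sbt setting, for illustration only (the actual change below patches the Travis build script instead, and javaOptions only applies to forked test JVMs):

// build.sbt -- sketch only; the "/dev/./urandom" spelling works around an
// old JDK special case that maps the literal "/dev/urandom" value back to
// the blocking device.
fork in Test := true
javaOptions in Test += "-Djava.security.egd=file:/dev/./urandom"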

@rossabaker (Member, Author)

Lots of interesting observations here.

  • We've been plagued by client timeouts in Travis in the past. These aren't "job gets killed after 10 minutes"; these are "a client spec fails and everything else passes." Those could be explained by hangs of a couple minutes. This run only shows TomcatServerSpec being held up, so I doubt that's connected. Maybe we use SecureRandom elsewhere? (Probably not, or it would have hung elsewhere.)
  • The last comment looks like an interesting fix for the Tomcat hang.
  • I agree that there is nothing that points to Scala 2.12 or Scalaz-7.2 here.
  • I'm trying to recall if we've ever seen this on Scala 2.12 and Scalaz-7.1. I assumed this was one of those issues with Scala 2.12 and lambda lifting, but I can only remember the Scala-2.12+Scalaz-7.2 combo failing this way. If that's true, that's a clue, but I don't know what it means.

bfritz added a commit to bfritz/http4s that referenced this issue Jan 20, 2017
Starting Tomcat in `TomcatServerSpec` seeds a `SecureRandom`
instance.  By default `/dev/random` is used on Linux and, in
low entropy environments like Travis VMs, reading from
`/dev/random` may block for long periods of time.

Reconfigure the Travis build script to point the JVM at
`/dev/urandom` to speed up the CI builds and hopefully prevent
cases of the failed builds described in http4s#774.

References:
  http4s#774 (comment)
  travis-ci/travis-ci#1494 (comment)
  https://en.wikipedia.org/wiki//dev/random
@bfritz (Member) commented Jan 20, 2017

I agree that there is nothing that points to Scala 2.12 or Scalaz-7.2 here.

Possible explanation...but I'm reaching a bit...is that $SBT_COMMAND differs based on Scala version:

if [[ $TRAVIS_SCALA_VERSION == 2.11* ]] && [[ $SCALAZ_VERSION = 7.2* ]]; then

(And differs differently after the very recent docs change in 2784c3d.)

It was hard to say with confidence just by watching /proc/sys/kernel/random/entropy_avail manually in my test environment, but it's possible the coverage task was adding some entropy to the process.

Published PR #882 for consideration. In hindsight, I should have forked from the v0.15.3 tag, before the docs were moved to 2.12. It's hard to compare build times before and after #882 because of the docs change.

@rossabaker (Member, Author)

I can cherry-pick it back to 0.15.x.

rossabaker pushed a commit that referenced this issue Jan 20, 2017
Starting Tomcat in `TomcatServerSpec` seeds a `SecureRandom`
instance.  By default `/dev/random` is used on Linux and, in
low entropy environments like Travis VMs, reading from
`/dev/random` may block for long periods of time.

Reconfigure the Travis build script to point the JVM at
`/dev/urandom` to speed up the CI builds and hopefully prevent
cases of the failed builds described in #774.

References:
  #774 (comment)
  travis-ci/travis-ci#1494 (comment)
  https://en.wikipedia.org/wiki//dev/random
@bfritz (Member) commented Jan 20, 2017

Unfortunately the /dev/urandom change did not help. The change is active, but the build still hung for 10 minutes. Back to the drawing board.

@rossabaker (Member, Author)

Well, it likely does help the previously undetected two-minute hang in Tomcat. :)

@bfritz (Member) commented Jan 20, 2017

No promises, but I'm toying with the idea of writing an SBT plugin or JVM agent that watches stdout for an absence of output and triggers a thread dump. It might even be possible to hack something together with a shell script and jstack (a rough sketch of the idea follows).
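A rough, hypothetical sketch of that watchdog idea, in Scala rather than shell: pipe sbt's console output through a process that shells out to jstack whenever the output goes quiet. The PID argument and the thresholds are assumptions:

import java.util.concurrent.atomic.AtomicLong
import scala.sys.process._

// Hypothetical usage: sbt test | StdoutWatchdog <jvm-pid>
// If no output arrives for two minutes, dump the given JVM's threads.
object StdoutWatchdog {
  def main(args: Array[String]): Unit = {
    val pid      = args(0)
    val quietMs  = 2 * 60 * 1000L
    val lastSeen = new AtomicLong(System.currentTimeMillis)

    val watcher = new Thread(new Runnable {
      def run(): Unit = while (true) {
        Thread.sleep(5000)
        if (System.currentTimeMillis - lastSeen.get > quietMs) {
          s"jstack $pid".!  // print a thread dump to our own stdout
          lastSeen.set(System.currentTimeMillis)
        }
      }
    })
    watcher.setDaemon(true)
    watcher.start()

    // forward sbt's output, recording when we last saw a line
    for (line <- scala.io.Source.stdin.getLines()) {
      lastSeen.set(System.currentTimeMillis)
      println(line)
    }
  }
}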

@rossabaker (Member, Author)

https://etorreborre.github.io/specs2/guide/SPECS2-3.5/org.specs2.guide.TimeoutExamples.html might give the missing hook, but you'd have to intercept the failures caused by however that times out. I did something similar with ScalaTest once, and it got me pointed in the right direction.
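The hook from that guide looks roughly like this (a sketch adapted from the specs2 3.x documentation, not http4s code; the 60-second default is an assumption):

import org.specs2.execute.{AsResult, Result}
import org.specs2.matcher.{MustThrownMatchers, TerminationMatchers}
import org.specs2.specification.AroundEach
import scala.concurrent.duration._

// Wrap every example: poll it with the `terminate` matcher and skip it
// (rather than fail it) if it never finishes within the timeout.
trait ExamplesTimeout extends AroundEach with MustThrownMatchers with TerminationMatchers {
  def timeout: FiniteDuration = 60.seconds

  def around[T: AsResult](t: => T): Result = {
    lazy val result = t
    // up to 10 retries, sleeping timeout / 10 between checks
    val termination = result must terminate(retries = 10, sleep = timeout / 10)
      .orSkip(_ => "TIMEOUT: " + timeout)
    termination.toResult and AsResult(result)
  }
}

Note the skip-on-timeout behavior: that matches the "skipped tests instead of failed tests" complaint later in this thread.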

The other thing I did to narrow down a Travis-only deadlock at work was to compare the list of spec names in a successful run to the ones that didn't appear in the hung job. It pointed me right at the offending spec, though not at the cause.

@bfritz (Member) commented Jan 26, 2017

Using the specs2 hooks makes sense. The downside is that it means I'm diving in at the deep end of specs2. I haven't drowned yet, but it might be several more days before I have anything PR-worthy.

@rossabaker (Member, Author)

53a9c16 sort of works, but:

  • it results in skipped tests instead of failed tests on timeout
  • now it runs out of memory on my machine

@bfritz (Member) commented Jan 26, 2017

Cool. I'll take a look tonight. This is my proof-of-concept (outside of http4s) to grab a thread dump:
https://gist.github.com/bfritz/544a56352f5fcdaf7522c09a24a791cc
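The gist isn't reproduced here, but the core of such a proof-of-concept can be sketched in-process with ThreadMXBean (names hypothetical; note that ThreadInfo.toString truncates deep stacks):

import java.lang.management.ManagementFactory

// Hypothetical in-process thread dump, similar in spirit to the PoC above.
object ThreadDump {
  def dump(): String = {
    val mx = ManagementFactory.getThreadMXBean
    // include locked monitors and synchronizers, like `jstack -l`
    mx.dumpAllThreads(true, true).map(_.toString).mkString("\n")
  }

  def main(args: Array[String]): Unit = println(dump())
}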

bfritz added a commit to bfritz/http4s that referenced this issue Jan 27, 2017
Mix new `ThreadDumpOnTimeout` into `AsyncHttpClientSpec` and print
thread dump to console 500ms before the spec times out.  Hoping to
better understand:
http4s#858 (comment)
and also apply `ThreadDumpOnTimeout` to understand http4s#774.
@bfritz (Member) commented Jan 27, 2017

@rossabaker Pretty far into the weeds, but 53a9c16 will cause every assertion in the test suite to take 6+ seconds (the 60-second timeout / 10).

TimeoutExamples takes the global timeout argument (60 seconds), divides it by 10, and then calls Thread.sleep(6000) for every assertion. If the assertion is done when sleep() returns, it moves on. If not, Thread.sleep(6000) is called up to 9 more times.
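Roughly, the polling behaves like this simplified model (a sketch, not specs2's actual implementation):

// Sleep first, then check: even an instantly-finished example pays one
// full sleep, which is where the 6-second floor per assertion comes from.
def awaitTermination(done: () => Boolean, retries: Int, sleepMs: Long): Boolean = {
  var remaining = retries
  while (remaining > 0) {
    Thread.sleep(sleepMs)  // 6000ms when the timeout is 60s and retries = 10
    if (done()) return true
    remaining -= 1
  }
  false
}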

A snippet with showtimes turned on:

[info] AuthenticationSpec
[info] 
[info] Failure to authenticate should
[info]   + not run unauthorized routes (6 seconds, 73 ms)
[info] BasicAuthentication should
[info]   + Respond to a request without authentication with 401 (6 seconds, 72 ms)
[info]   + Respond to a request with unknown username with 401 (6 seconds, 71 ms)
[info]   + Respond to a request with wrong password with 401 (6 seconds, 71 ms)
[info]   + Respond to a request with correct credentials (6 seconds, 71 ms)
[info] DigestAuthentication should
[info]   + Respond to a request without authentication with 401 (6 seconds, 71 ms)
[info]   + Respond to a request with correct credentials (6 seconds, 71 ms)
[info]   + Respond to many concurrent requests while cleaning up nonces (6 seconds, 70 ms)
[info]   + Avoid many concurrent replay attacks (6 seconds, 691 ms)
[info]   + Respond to invalid requests with 401 (6 seconds, 691 ms)
[info] 
[info] 
[info] Total for specification AuthenticationSpec
[info] Finished in 30 seconds, 200 ms
[info] 10 examples, 0 failure, 0 error

Strongly suspect the bump in memory usage is tied to the extra blocking (and parallelism).

Still experimenting, but I hope to have a PR this weekend that is more targeted at the specs that are making Travis unhappy.

bfritz added a commit to bfritz/http4s that referenced this issue Jan 28, 2017
In several of the recent Travis CI build failures:

 * [3444.6](https://travis-ci.org/http4s/http4s/jobs/193191028)
 * [3458.6](https://travis-ci.org/http4s/http4s/jobs/193838431)
 * [3534.6](https://travis-ci.org/http4s/http4s/jobs/195974285)

the test that did not complete was `CharsetRangeSpec`.  Mix in
`ThreadDumpOnTimeout` and set a 2 second timeout to help identify
the source of the deadlock causing http4s#774.
bfritz added a commit to bfritz/http4s that referenced this issue Jan 28, 2017
Test output will now include all specs2 expectations and how long each took to complete.
rossabaker added a commit that referenced this issue Jan 28, 2017
…ming

turn on `showtimes` for specs2 to help diagnose #774
bfritz added a commit to bfritz/http4s that referenced this issue Feb 6, 2017
Wait for up to 20 x 200ms intervals (4 seconds) before triggering
a thread dump.  Spec frequently runs for 1-2 seconds on Travis CI
infrastructure and the 2 second timeout was firing for tests that
were likely not deadlocked.

Would not have been a big deal, but the thread dumps put Travis
over its 4MB log limit and cause the job to fail.  Longer term,
if we keep the dumps, they should be written to a log file and
uploaded to S3 instead of polluting the console output.

Refs http4s#774
@bfritz (Member) commented Mar 1, 2017

Have not seen a deadlock for a long time, possibly not since ThreadDumpOnTimeout was added in #897... possibly even earlier.

I opened a more specific #978 for the AsyncHttpClientSpec timeout error we are still seeing. Going to close this one and we can re-open if we see more deadlocks.

@bfritz (Member) commented Mar 2, 2017

After further consideration, reopening. The builds Travis kills after > 4MB of output, due to thread dumps on the console, are likely the new version of the old deadlocks.

There is also a pattern emerging when they happen. Juicy bits from the thread dumps at:
https://gist.github.com/bfritz/73d85d6189e1216f791a003a64a90063

bfritz reopened this on Mar 2, 2017
@bfritz (Member) commented Mar 3, 2017

sbt test deadlocked on me while running b54a8c3 (topic/rig). The thread dump shows a deadlock in CharsetRangeSpec related to lazy initialization of scalacheck bits:

Found one Java-level deadlock:
=============================
"specs2.fixed.env-595028191-8":
  waiting to lock monitor 0x00007ff940034518 (object 0x00000000af499340, a org.http4s.CharsetRangeSpec),
  which is held by "specs2.fixed.env-595028191-3"
"specs2.fixed.env-595028191-3":
  waiting to lock monitor 0x00007ff98c26df88 (object 0x00000000af69e088, a org.scalacheck.ArbitraryLowPriority$$anon$1),
  which is held by "specs2.fixed.env-595028191-8"

Full jstack output at:
https://gist.github.com/bfritz/3f466f6f75427437e5f0a4f0fd3a906c#file-thread_dump_with_deadlock-txt-L5125

Best I can tell, that dump is consistent with this deadlock. Unfortunately, I have no idea how to fix it.
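For intuition, the pattern the dump points at reduces to something like this contrived sketch (not the actual http4s/scalacheck code). In Scala 2, each lazy val initializer synchronizes on its enclosing object, so two threads forcing values in opposite order can deadlock even without a true cycle:

object A {
  lazy val a0: Int = B.b   // thread 1: locks A's monitor, then needs B's
  lazy val a1: Int = 17
}

object B {
  lazy val b: Int = A.a1   // thread 2: locks B's monitor, then needs A's
}

// Forcing A.a0 and B.b concurrently can deadlock even though neither value
// depends on the other's result -- the "non-circular dependency" deadlock
// described in SIP-20.
object Repro extends App {
  new Thread(new Runnable { def run(): Unit = { A.a0; () } }).start()
  new Thread(new Runnable { def run(): Unit = { B.b; () } }).start()
}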

@bfritz (Member) commented Mar 3, 2017

Just had a deadlock in the release-0.15.x branch:
https://travis-ci.org/http4s/http4s/jobs/207383661#L1585

CharsetRangeSpec is notably absent from the list of completed specs.

@rossabaker (Member, Author)

CharsetRangeSpec??!?

@bfritz (Member) commented Mar 3, 2017

CharsetRangeSpec??!?

Yep, same as in the jstack output above. It looks like a deadlock involving lazy initialization and scalacheck.

For a while, I thought it might be typelevel/scalacheck#290, but it's different, and we already have the scalacheck release with that change in it.

bfritz added a commit to bfritz/http4s that referenced this issue Mar 5, 2017
In response to this build:
https://travis-ci.org/http4s/http4s/jobs/207818583#L3278
on the cats branch that hung despite CharsetRangeSpec completing.

Hope this will identify the source of the deadlock if http4s#774 shows
itself again.
rossabaker pushed a commit to http4s/http4s-servlet that referenced this issue Apr 3, 2022
Mix new `ThreadDumpOnTimeout` into `AsyncHttpClientSpec` and print
thread dump to console 500ms before the spec times out.  Hoping to
better understand:
https://github.com/http4s/http4s/issues/858#issuecomment-274133582
and also apply `ThreadDumpOnTimeout` to understand http4s/http4s#774.
rossabaker added a commit to http4s/http4s-servlet that referenced this issue Apr 3, 2022
* `genCharsetRangeNoQuality` -> `arbitraryCharset.arbitrary`
* `arbitraryCharset.arbitrary` -> `arbitraryNioCharset`

Each needs the lock of the other.  This is a "non-circular dependency"
as described in SIP-20.

A better solution based on defs or vals will be designed for 0.16.
This is a binary-compatible mitigation for 0.15.

Fixes http4s/http4s#774
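A hedged sketch of that def/val direction (names and generators hypothetical, not the actual http4s arbitraries): a strict val is initialized once under the JVM's class-initialization lock, and a def holds no lock at all, so neither takes the per-access monitor that a Scala 2 lazy val does:

import java.nio.charset.Charset
import org.scalacheck.Gen

// Hypothetical shape of the 0.16-style fix: no lazy vals, so there is no
// per-access monitor to entangle with scalacheck's own lazy initializers.
object CharsetGenerators {
  // strict val: evaluated exactly once, when the object is first loaded
  val nioCharsets: List[Charset] =
    List("UTF-8", "ISO-8859-1", "US-ASCII").map(Charset.forName)

  // def: re-evaluated per call, never holds a monitor
  def genNioCharset: Gen[Charset] = Gen.oneOf(nioCharsets)
}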
armanbilge pushed a commit to http4s/http4s-async-http-client that referenced this issue May 20, 2022
Mix new `ThreadDumpOnTimeout` into `AsyncHttpClientSpec` and print
thread dump to console 500ms before the spec times out.  Hoping to
better understand:
https://github.com/http4s/http4s/issues/858#issuecomment-274133582
and also apply `ThreadDumpOnTimeout` to understand http4s/http4s#774.