Why do tests sometimes deadlock on 2.12 in Travis? #774
Comments
@rossabaker Looks like a good place for me to get my feet wet. Is this a recent example of the failure?
Yes. I haven't been able to make this happen locally, which is frustrating -- it's a lot of iterating through Travis and hoping to get unlucky. One way to do it might be to look at the list of specs emitted by the failing job and compare it to a successful job. Specs we don't see are probably the ones that hang. Then run those in a loop on a branch in Travis so we get unlucky every time, and we might get the red-to-green confidence in an eventual fix.
Currently trying to reproduce without hijacking the project's real Travis CI account. Running in Docker using the (dated) travis-jvm image on a 4GB Ubuntu 16.04 DigitalOcean droplet. In the 6 or 7 runs so far, I've noticed one short (~2 min) hang and captured a thread dump. Very possible it's unrelated, but to help track my progress I'm attaching console output and a thread dump.
The http4s tests can definitely hang for long periods during
Low entropy could definitely hang the tests...and it has happened to others on Travis. Seems very unlikely it would be limited to Scala 2.12 (or 2.12 with scalaz-7.2), though. Might be separate from the hangs that raised this issue.
Lots of interesting observations here.
Starting Tomcat in `TomcatServerSpec` seeds a `SecureRandom` instance. By default `/dev/random` is used on Linux and, in low entropy environments like Travis VMs, reading from `/dev/random` may block for long periods of time. Reconfigure the Travis build script to point the JVM at `/dev/urandom` to speed up the CI builds and hopefully prevent cases of the failed builds described in http4s#774. References: http4s#774 (comment) travis-ci/travis-ci#1494 (comment) https://en.wikipedia.org/wiki//dev/random
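For context, the blocking described in that commit happens when the JVM seeds a `SecureRandom` from `/dev/random` on a machine with little entropy; the usual mitigation is to start the JVM with the standard flag `-Djava.security.egd=file:/dev/./urandom`. The snippet below is only an illustrative diagnostic (the `EntropyCheck` name and the timing output are made up, not part of the http4s build) showing which properties control the source and where the hang would surface:

```scala
import java.security.{SecureRandom, Security}

object EntropyCheck extends App {
  // Which entropy source the JVM is configured to use.
  println(s"securerandom.source = ${Security.getProperty("securerandom.source")}")
  println(s"java.security.egd   = ${Option(System.getProperty("java.security.egd")).getOrElse("<unset>")}")

  // Time the first use of SecureRandom; seeding is where a low-entropy
  // machine configured for /dev/random can block for a long time.
  val start  = System.nanoTime()
  val random = new SecureRandom()
  random.nextBytes(new Array[Byte](16))
  println(f"first SecureRandom use took ${(System.nanoTime() - start) / 1e6}%.1f ms")
}
```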
Possible explanation...but I'm reaching a bit...is that Line 5 in c6df29a
(And differs differently after the very recent docs change in 2784c3d.) Was hard to say with confidence by just watching.
Published PR #882 for consideration. In hindsight, I should have forked from the
I can cherry-pick it back to 0.15.x.
Unfortunately the `/dev/urandom` change doesn't appear to fix the original hangs.
Well, it likely does help the previously undetected two-minute hang in Tomcat. :)
No promises, but I'm toying with the idea of writing an SBT plugin or JVM agent that watches stdout for periods of silence and triggers a thread dump. It might even be possible to hack something together with a shell script and jstack.
https://etorreborre.github.io/specs2/guide/SPECS2-3.5/org.specs2.guide.TimeoutExamples.html might give the missing hook, but you'd have to intercept failures caused by however that fails. I did something similar with Scalatest once, and it got me pointed in the right direction. The other thing I did to narrow down the Travis-only deadlock at work was to compare the list of spec names in a successful run to the ones that didn't appear in the hung job. It pointed me right to the offending spec, though not the cause.
Using the specs2 hooks makes sense. The downside is that it means diving in at the deep end of specs2. I haven't drowned yet, but it might be several more days before I have anything PR-worthy.
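For reference, a per-example timeout along the lines of that specs2 guide could look roughly like the sketch below. This is a hypothetical illustration (the trait name, timeout value, and skip-on-timeout strategy are all assumptions), not the hook that eventually landed in http4s:

```scala
import java.util.concurrent.TimeoutException

import org.specs2.execute.{AsResult, Result, Skipped}
import org.specs2.specification.AroundEach

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

trait ExamplesTimeout extends AroundEach {
  // How long a single example may run before we give up on it.
  def exampleTimeout: FiniteDuration = 60.seconds

  // Run the example body on another thread and bound how long we wait for it.
  def around[R: AsResult](r: => R): Result =
    try Await.result(Future(AsResult(r)), exampleTimeout)
    catch {
      case _: TimeoutException =>
        // A real version could capture a thread dump here before giving up.
        Skipped(s"example timed out after $exampleTimeout")
    }
}
```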
53a9c16 sort of works, but:
Cool. I'll take a look tonight. This is my proof-of-concept (outside of http4s) to grab a thread dump:
Mix new `ThreadDumpOnTimeout` into `AsyncHttpClientSpec` and print thread dump to console 500ms before the spec times out. Hoping to better understand: http4s#858 (comment) and also apply `ThreadDumpOnTimeout` to understand http4s#774.
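As a rough idea of what grabbing a thread dump from inside the JVM involves (this is a generic JDK sketch, not the `ThreadDumpOnTimeout` trait or the proof-of-concept linked above), the `ThreadMXBean` API exposes the same information `jstack` prints:

```scala
import java.lang.management.ManagementFactory

object ThreadDump {
  // Dump all live threads, including the monitors and ownable synchronizers
  // they hold, which is what you need in order to spot a deadlock.
  def dump(): String =
    ManagementFactory.getThreadMXBean
      .dumpAllThreads(true, true)
      .map(_.toString)
      .mkString("\n")
}
```

One caveat: `ThreadInfo.toString` truncates stack traces (to eight frames in the stock JDK), so a serious version would format the `StackTraceElement`s itself rather than rely on `toString`.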
@rossabaker Pretty far into the weeds, but 53a9c16 will cause every assertion in the test suite to take 6+ seconds (60 second timeout / 10).
A snippet with
Strongly suspect the bump in memory usage is tied to the extra blocking (and parallelism). Still experimenting, but I hope to have a PR this weekend that is more targeted to the specs that are making Travis unhappy.
In several of the recent Travis CI build failures:
* [3444.6](https://travis-ci.org/http4s/http4s/jobs/193191028)
* [3458.6](https://travis-ci.org/http4s/http4s/jobs/193838431)
* [3534.6](https://travis-ci.org/http4s/http4s/jobs/195974285)

the test that did not complete was `CharsetRangeSpec`. Mix in `ThreadDumpOnTimeout` and set a 2 second timeout to help identify the source of the deadlock causing http4s#774.
Test output will now include all specs2 expectations and how long each took to complete.
…ming turn on `showtimes` for specs2 to help diagnose #774
Wait for up to 20 x 200ms intervals (4 seconds) before triggering a thread dump. Spec frequently runs for 1-2 seconds on Travis CI infrastructure and the 2 second timeout was firing for tests that were likely not deadlocked. Would not have been a big deal, but the thread dumps put Travis over its 4MB log limit and cause the job to fail. Longer term, if we keep the dumps, they should be written to a log file and uploaded to S3 instead of polluting the console output. Refs http4s#774
After further consideration, reopening. The builds Travis kills after > 4MB of output, due to the thread dumps on the console, are likely the new version of the old deadlocks. There is also a pattern emerging when they happen. Juicy bits from the thread dumps at:
Full jstack output at:
Best I can tell, that dump is consistent with this deadlock. Unfortunately, I have no idea how to fix it.
Just had a deadlock in the
CharsetRangeSpec??!?
Yep, same as in the jstack output above. Looks like a deadlock involving lazy initialization and scalacheck. For a while, I thought it might be typelevel/scalacheck#290, but it's different, and we already have the scalacheck release with that change in it.
In response to this build: https://travis-ci.org/http4s/http4s/jobs/207818583#L3278 on the cats branch, which hung despite CharsetRangeSpec completing. Hope this will identify the source of the deadlock if http4s#774 shows itself again.
* `genCharsetRangeNoQuality` -> `arbitraryCharset.arbitrary`
* `arbitraryCharset.arbitrary` -> `arbitraryNioCharset`

Each needs the lock of the other. This is a "non-circular dependency" as described in SIP-20. A better solution based on defs or vals will be designed for 0.16. This is a binary-compatible mitigation for 0.15. Fixes http4s/http4s#774
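To make that failure mode concrete, here is a minimal, hypothetical reproduction of the SIP-20 "non-circular dependency" deadlock (the object and field names are invented; in http4s the same lock cycle ran through the ScalaCheck arbitraries listed above):

```scala
object A {
  lazy val a0: Int = B.b  // forcing a0 takes A's init lock, then needs B's
  lazy val a1: Int = 17
}

object B {
  lazy val b: Int = A.a1  // forcing b takes B's init lock, then needs A's
}

object LazyValDeadlockDemo extends App {
  // There is no cycle between the *values*, but if two threads start from
  // different sides, each ends up waiting for the lock the other holds.
  val t1 = new Thread(() => println(A.a0)) // locks A, waits for B
  val t2 = new Thread(() => println(B.b))  // locks B, waits for A
  t1.start(); t2.start()
  t1.join(); t2.join()                     // may never return
}
```

Replacing one of the lazy vals with a def or an eager val removes the deadlock, since no initialization lock is then held while forcing the other object; per the commit message, that is the plan for 0.16, while 0.15 gets a binary-compatible mitigation.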
We are frequently seeing tests run for about 49 minutes and get killed on Travis.
We have already been slain by SI-10064 in tut. But this only happens in Travis, and only intermittently. I have not managed to reproduce this locally.