
Contended cache access causes deadlock #2976

Closed
slovdahl opened this issue Oct 24, 2017 · 1 comment
@slovdahl

slovdahl commented Oct 24, 2017

I'm also observing some strange behaviour with Guava caches under contention. According to @ben-manes, this is unrelated to #2863. Motivation for opening this issue: #2863 (comment)

While I was doing some JMH benchmarks, I noticed that benchmarks using Guava caches stalled completely and/or had very erratic performance under thread contention. Benchmark iterations very often stalled for seconds, minutes or sometimes seemingly indefinitely. It turned out that the cache was the culprit in every case.

The issue is very easy to trigger. I published a sample project at https://github.com/hibox-systems/guava-cache-test that demonstrates the issue. In my environment it works fine with 1, 2 and 3 threads; using 4 threads or more always stalls the benchmark at some point. I have tried the benchmark with all published Guava versions that are compatible with the sample code, i.e. Guava 11.0 and newer. Tested with these JVM versions on Ubuntu 16.04 with an Intel Core i7-3770 CPU:

  • OpenJDK 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
  • Oracle 1.8.0_151-b12
  • Oracle 9.0.1+11

FWIW, I tried replacing the Guava caches in the benchmarks with Caffeine caches, and all of the stalling and erratic performance went away.
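
For context, the benchmark in the sample project is roughly of this shape (simplified here; the class name, key range and cache settings are illustrative rather than copied verbatim from the repo):

```java
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

@State(Scope.Benchmark)
public class GuavaCacheContentionBenchmark {

    // One cache shared by all benchmark threads; the stalls appear under contention.
    private final LoadingCache<Integer, String> cache = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .build(new CacheLoader<Integer, String>() {
                @Override
                public String load(Integer key) {
                    return "value-" + key; // cheap loader; the cache itself is the hot spot
                }
            });

    @Benchmark
    @Threads(4) // 4+ threads is where the stalls show up in my environment
    public String get() {
        // Key range larger than maximumSize so some loads and evictions occur.
        return cache.getUnchecked(ThreadLocalRandom.current().nextInt(20_000));
    }
}
```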

@kluever added the package=cache and type=defect (Bug, not working as expected) labels on Oct 24, 2017
@ben-manes
Contributor

I ran your benchmark and the problem is due to GC / allocation rate. With the default settings on JDK 8, a small maximum heap is used with the parallel collector. That defers collection in favor of overall throughput, resulting in a long pause during an iteration. Because the allocation rate vastly exceeds the drain rate, the entries are promoted to the old generation and may cause a GC death spiral.

When using G1 and an 8 GB heap instead, the pauses mostly disappeared. The run did still hang, but jstack showed no benchmark threads running; instead JMH was blocked waiting for the runner to complete, indicating a missed signal. My guess is that the constant GC pressure exposed a race in JMH, which is not typically run under memory pressure.
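
For concreteness, this is one way to rerun the benchmark with G1 and an 8 GB heap through JMH's Java API (the include pattern and fork count are placeholders, not the exact command used):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunWithG1 {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("GuavaCacheContentionBenchmark") // placeholder include pattern
                // Use G1 and a large heap so the parallel collector's long pauses
                // don't dominate the iteration times.
                .jvmArgsAppend("-XX:+UseG1GC", "-Xmx8g")
                .forks(1)
                .build();
        new Runner(options).run();
    }
}
```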

In an LRU cache every read is effectively a write, because it reorders a doubly-linked list. If that list is guarded by a lock, this creates significant contention. Guava solves this by delaying the LRU reorder and batching it, using a read buffer as an intermediary. For simplicity, this was a ConcurrentLinkedQueue and an AtomicInteger counter. This creates an allocation per read, but overall is cheaper than a lock. The work is then amortized across callers, which drain the read buffer and catch up the LRU.
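
In rough pseudo-Java, the buffering idea looks something like this (heavily simplified; the real implementation is considerably more involved):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of the "buffer reads, drain later" idea.
class ReadBufferedLru<K> {
    private static final int DRAIN_THRESHOLD = 64;

    private final Queue<K> readBuffer = new ConcurrentLinkedQueue<>(); // allocates a node per read
    private final AtomicInteger pendingReads = new AtomicInteger();
    private final ReentrantLock evictionLock = new ReentrantLock();

    void recordRead(K key) {
        readBuffer.add(key);
        if (pendingReads.incrementAndGet() >= DRAIN_THRESHOLD) {
            tryDrain();
        }
    }

    private void tryDrain() {
        // Only one caller drains; the others continue without blocking.
        if (evictionLock.tryLock()) {
            try {
                K key;
                while ((key = readBuffer.poll()) != null) {
                    pendingReads.decrementAndGet();
                    moveToTailOfLru(key); // catch up the LRU ordering in one batch
                }
            } finally {
                evictionLock.unlock();
            }
        }
    }

    private void moveToTailOfLru(K key) {
        // Reorder the doubly-linked LRU list; omitted for brevity.
    }
}
```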

This was intended to eventually be optimized with a ring buffer to remove allocations, do less work, and favor performance over strict LRU order. Unfortunately this improvement never occurred, though it was proven out in the predecessor (CLHM) and further optimized in its successor (Caffeine). After Caffeine began getting some traction, I tried once again to float the idea and provided a patch as a proof-of-concept.
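
For illustration only (this is not the actual patch), the ring-buffer idea is to record reads into a fixed-size array and simply drop them when the buffer is full, so the hot path never allocates and strict LRU order is traded for throughput:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

// Lossy, fixed-size read buffer: no allocation per read, and reads are
// silently dropped when the buffer is full (sacrificing strict LRU order).
class LossyReadBuffer<E> {
    private static final int SIZE = 128; // must be a power of two
    private static final int MASK = SIZE - 1;

    private final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
    private final AtomicLong writeCounter = new AtomicLong();
    private final AtomicLong readCounter = new AtomicLong();

    /** Records a read; returns false if the buffer is full and the read was dropped. */
    boolean offer(E element) {
        long head = readCounter.get();
        long tail = writeCounter.get();
        if (tail - head >= SIZE) {
            return false; // full: drop the read instead of blocking
        }
        if (writeCounter.compareAndSet(tail, tail + 1)) {
            buffer.lazySet((int) (tail & MASK), element);
            return true;
        }
        return false; // lost the race with another reader: also treated as a drop
    }

    /** Drains buffered reads; expected to be called under the cache's eviction lock. */
    void drainTo(Consumer<E> lruReorder) {
        long head = readCounter.get();
        long tail = writeCounter.get();
        for (; head < tail; head++) {
            int index = (int) (head & MASK);
            E element = buffer.get(index);
            if (element == null) {
                break; // writer published the slot lazily and it is not visible yet
            }
            buffer.lazySet(index, null);
            lruReorder.accept(element);
        }
        readCounter.lazySet(head);
    }
}
```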

Using that patch and your original JMH command, no issues occur during benchmarking. The ns/op is relatively tight at 5-50ns, with the variance due to whether JMH is observing the amortized drain penalty. After 10 forks with 20 warmup iterations and 20 measurement iterations each, there is no stalling.

So while Guava's cache degrades badly under synthetic pressure, it does not deadlock.

lucamilanesio pushed a commit to GerritCodeReview/gerrit that referenced this issue Feb 26, 2018
This deadlock is not the typical deadlock where two threads each hold a resource and wait to lock a second resource already held by the other thread. The thread owning the account cache lock is parked, which tells us that the lock was never released. I could not determine the exact sequence of events leading to this deadlock, which makes it really hard to report/fix the problem.

While investigating, I realized that there are quite a few reported issues
in Guava that could be major for Gerrit:

  Our deadlock happening in account cache
  https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

  Other deadlock
  google/guava#2976
  google/guava#2863

  Performance
  google/guava#2063

  Race condition
  google/guava#2971

Because I could not reproduce the deadlock in a dev environment or in
a unit test, which makes it almost impossible to fix, I considered other
options, such as replacing Guava with something else.

The maintainer of the Caffeine[1] cache claims that Caffeine is a
high-performance[2], near-optimal caching library designed and implemented
based on the experience of building Guava's cache and ConcurrentLinkedHashMap.
I also wrote some benchmarks that spawn a lot of threads reading/writing
values from the caches. I ran those benchmarks against both Guava and
Caffeine, and Guava always took at least twice as long as Caffeine to
complete all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine's
interfaces are very similar to Guava's cache interfaces, and there is an
adapter that allows migrating to Caffeine while keeping Guava's
interfaces. After migrating to Caffeine, we saw the deadlock occurrence
drop from once a day to once every two weeks on our production server.
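
For illustration, the construction site changes roughly like this while callers keep using Guava's LoadingCache interface (sketch only; the adapter class and its exact signatures should be checked against the Caffeine guava module version in use, and the loader here is a placeholder):

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.guava.CaffeinatedGuava;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

class AccountCacheFactory {
    // Callers keep depending on Guava's LoadingCache interface; only the
    // construction site goes through the Caffeine adapter.
    static LoadingCache<String, String> create() {
        return CaffeinatedGuava.build(
                Caffeine.newBuilder()
                        .maximumSize(10_000)
                        .expireAfterWrite(10, TimeUnit.MINUTES),
                new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        return loadAccount(key); // hypothetical loader
                    }
                });
    }

    private static String loadAccount(String key) {
        return "account-" + key; // placeholder for the real account lookup
    }
}
```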

The maintainer of Caffeine, Ben Manes, pointed us to the possible
cause[3] of this issue: a bug[4] in the kernel, and its fix[5]. Our
kernel version is supposed to include the fix, but we will try another
OS and kernel version.

Replacing the Guava caches with Caffeine brings two things: it reduces
the chances of hitting the deadlock, which is most likely caused by a
kernel bug, and it improves cache performance.

[1]https://github.com/ben-manes/caffeine
[2]https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3]https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4]torvalds/linux@b0c29f7
[5]torvalds/linux@76835b0

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae