
Contended cache access causes deadlock #2976

Closed
slovdahl opened this issue Oct 24, 2017 · 1 comment
@slovdahl

slovdahl commented Oct 24, 2017

I'm also observing some strange behaviour with Guava caches under contention. According to @ben-manes, this is unrelated to #2863. Motivation for opening this issue: #2863 (comment)

While I was doing some JMH benchmarks, I noticed that benchmarks using Guava caches stalled completely and/or had very erratic performance under thread contention. Benchmark iterations very often stalled for seconds, minutes or sometimes seemingly indefinitely. It turned out that the cache was the culprit in every case.

The issue is very easy to trigger. I published a sample project at https://github.com/hibox-systems/guava-cache-test that demonstrates the issue. In my environment it works fine with 1, 2 and 3 threads; using 4 threads or more always stalls the benchmark at some point. I have tried the benchmark with all published Guava versions that are compatible with the sample code, i.e. Guava 11.0 and newer. Tested with these JVM versions on Ubuntu 16.04 with an Intel Core i7-3770 CPU:

  • OpenJDK 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
  • Oracle 1.8.0_151-b12
  • Oracle 9.0.1+11

FWIW, I tried replacing the Guava caches in the benchmarks with Caffeine caches, and all of the stalling and erratic performance went away.
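
For context, the benchmark in the sample project is roughly of this shape (simplified here; the class name, key range and cache settings are illustrative rather than copied verbatim from the repo):

```java
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

@State(Scope.Benchmark)
public class GuavaCacheContentionBenchmark {

    // One cache shared by all benchmark threads; the stalls appear under contention.
    private final LoadingCache<Integer, String> cache = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .build(new CacheLoader<Integer, String>() {
                @Override
                public String load(Integer key) {
                    return "value-" + key; // cheap loader; the cache itself is the hot spot
                }
            });

    @Benchmark
    @Threads(4) // 4+ threads is where the stalls show up in my environment
    public String get() {
        // Key range larger than maximumSize so some loads and evictions occur.
        return cache.getUnchecked(ThreadLocalRandom.current().nextInt(20_000));
    }
}
```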

@kluever added the package=cache and type=defect (Bug, not working as expected) labels on Oct 24, 2017
@ben-manes
Contributor

I ran your benchmark and the problem is due to GC / allocation rate. With the default settings on JDK 8, a small maximum heap is used with the parallel collector. That defers collection in favor of overall throughput, resulting in a long pause during an iteration. Because the allocation rate vastly exceeds the drain rate, the entries are promoted to the old generation and may cause a GC death spiral.

When using G1 and an 8 GB heap instead, the pauses mostly disappeared. The run did still hang, but jstack showed no benchmark threads running; instead JMH was blocked waiting for the runner to complete, indicating a missed signal. My guess is that the constant GC pressure exposed a race in JMH, which is not typically run under memory pressure.
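
For concreteness, this is one way to rerun the benchmark with G1 and an 8 GB heap through JMH's Java API (the include pattern and fork count are placeholders, not the exact command used):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunWithG1 {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("GuavaCacheContentionBenchmark") // placeholder include pattern
                // Use G1 and a large heap so the parallel collector's long pauses
                // don't dominate the iteration times.
                .jvmArgsAppend("-XX:+UseG1GC", "-Xmx8g")
                .forks(1)
                .build();
        new Runner(options).run();
    }
}
```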

In an LRU cache every read is effectively a write, because it reorders a doubly-linked list. If that list is guarded by a lock, this creates significant contention. Guava solves this by delaying the LRU reorder and batching it, using a read buffer as an intermediary. For simplicity, this was a ConcurrentLinkedQueue and an AtomicInteger counter. This creates an allocation per read, but overall is cheaper than a lock. The work is then amortized across callers, which drain the read buffer and catch up the LRU.
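
In rough pseudo-Java, the buffering idea looks something like this (heavily simplified; the real implementation is considerably more involved):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of the "buffer reads, drain later" idea.
class ReadBufferedLru<K> {
    private static final int DRAIN_THRESHOLD = 64;

    private final Queue<K> readBuffer = new ConcurrentLinkedQueue<>(); // allocates a node per read
    private final AtomicInteger pendingReads = new AtomicInteger();
    private final ReentrantLock evictionLock = new ReentrantLock();

    void recordRead(K key) {
        readBuffer.add(key);
        if (pendingReads.incrementAndGet() >= DRAIN_THRESHOLD) {
            tryDrain();
        }
    }

    private void tryDrain() {
        // Only one caller drains; the others continue without blocking.
        if (evictionLock.tryLock()) {
            try {
                K key;
                while ((key = readBuffer.poll()) != null) {
                    pendingReads.decrementAndGet();
                    moveToTailOfLru(key); // catch up the LRU ordering in one batch
                }
            } finally {
                evictionLock.unlock();
            }
        }
    }

    private void moveToTailOfLru(K key) {
        // Reorder the doubly-linked LRU list; omitted for brevity.
    }
}
```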

This was intended to eventually be optimized with a ring buffer to remove allocations, do less work, and favor performance over strict LRU order. Unfortunately this improvement never occurred, though it was proven out in the predecessor (CLHM) and further optimized in its successor (Caffeine). After Caffeine began getting some traction, I tried once again to float the idea and provided a patch as a proof-of-concept.
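
For illustration only (this is not the actual patch), the ring-buffer idea is to record reads into a fixed-size array and simply drop them when the buffer is full, so the hot path never allocates and strict LRU order is traded for throughput:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

// Lossy, fixed-size read buffer: no allocation per read, and reads are
// silently dropped when the buffer is full (sacrificing strict LRU order).
class LossyReadBuffer<E> {
    private static final int SIZE = 128; // must be a power of two
    private static final int MASK = SIZE - 1;

    private final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
    private final AtomicLong writeCounter = new AtomicLong();
    private final AtomicLong readCounter = new AtomicLong();

    /** Records a read; returns false if the buffer is full and the read was dropped. */
    boolean offer(E element) {
        long head = readCounter.get();
        long tail = writeCounter.get();
        if (tail - head >= SIZE) {
            return false; // full: drop the read instead of blocking
        }
        if (writeCounter.compareAndSet(tail, tail + 1)) {
            buffer.lazySet((int) (tail & MASK), element);
            return true;
        }
        return false; // lost the race with another reader: also treated as a drop
    }

    /** Drains buffered reads; expected to be called under the cache's eviction lock. */
    void drainTo(Consumer<E> lruReorder) {
        long head = readCounter.get();
        long tail = writeCounter.get();
        for (; head < tail; head++) {
            int index = (int) (head & MASK);
            E element = buffer.get(index);
            if (element == null) {
                break; // writer published the slot lazily and it is not visible yet
            }
            buffer.lazySet(index, null);
            lruReorder.accept(element);
        }
        readCounter.lazySet(head);
    }
}
```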

Using that patch and your original JMH command, no issues occur during benchmarking. The ns/op is relatively tight at 5-50ns, with the variance due to whether JMH is observing the amortized drain penalty. After 10 forks with 20 warmup iterations and 20 measurement iterations each, there is no stalling.

So while Guava's cache degrades badly under synthetic pressure, it does not deadlock.

lucamilanesio pushed a commit to GerritCodeReview/gerrit that referenced this issue Feb 26, 2018
This deadlock is not the typical deadlock where two threads each hold a resource and wait to lock a second resource already held by the other thread. The thread owning the account cache lock is parked, which tells us that the lock was never released. I could not determine the exact sequence of events leading to this deadlock, which makes it really hard to report/fix the problem.

While investigating, I realized that there are quite a few reported issues
in Guava that could be major for Gerrit:

  Our deadlock happening in account cache
  https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

  Other deadlock
  google/guava#2976
  google/guava#2863

  Performance
  google/guava#2063

  Race condition
  google/guava#2971

Because I could not reproduce the deadlock in a dev environment or in
a unit test, which makes it almost impossible to fix, I considered other
options, such as replacing Guava with something else.

The maintainer of the Caffeine[1] cache claims that Caffeine is a
high-performance[2], near-optimal caching library designed and implemented
based on the experience of building Guava's cache and ConcurrentLinkedHashMap.
I also wrote some benchmarks that spawn a lot of threads reading/writing
values from the caches. I ran those benchmarks against both Guava and
Caffeine, and Guava always took at least twice as long as Caffeine to
complete all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine's
interfaces are very similar to Guava's cache interfaces, and there is an
adapter that allows migrating to Caffeine while keeping Guava's
interfaces. After migrating to Caffeine, we saw the deadlock occurrence
drop from once a day to once every two weeks on our production server.
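
For illustration, the construction site changes roughly like this while callers keep using Guava's LoadingCache interface (sketch only; the adapter class and its exact signatures should be checked against the Caffeine guava module version in use, and the loader here is a placeholder):

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.guava.CaffeinatedGuava;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

class AccountCacheFactory {
    // Callers keep depending on Guava's LoadingCache interface; only the
    // construction site goes through the Caffeine adapter.
    static LoadingCache<String, String> create() {
        return CaffeinatedGuava.build(
                Caffeine.newBuilder()
                        .maximumSize(10_000)
                        .expireAfterWrite(10, TimeUnit.MINUTES),
                new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        return loadAccount(key); // hypothetical loader
                    }
                });
    }

    private static String loadAccount(String key) {
        return "account-" + key; // placeholder for the real account lookup
    }
}
```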

The maintainer of Caffeine, Ben Manes, pointed us to the possible
cause[3] of this issue: a bug[4] in the kernel, and its fix[5]. Our
kernel version is supposed to include the fix, but we will try another
OS and kernel version.

Replacing the Guava caches with Caffeine brings two things: it reduces
the chances of hitting the deadlock, which is most likely caused by a
kernel bug, and it improves cache performance.

[1]https://github.com/ben-manes/caffeine
[2]https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3]https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4]torvalds/linux@b0c29f7
[5]torvalds/linux@76835b0

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae