Contended cache access causes deadlock #2976
I ran your benchmark and the problem is due to the GC / allocation rate. With the default settings on JDK 8, a small maximum heap is used with the parallel collector. That defers collection for overall throughput, resulting in a long pause during an iteration. Because the allocation rate vastly exceeds the drain rate, the entries are promoted to the old generation and may cause a GC death spiral. When using G1 and an 8gb heap instead, the pauses mostly disappeared. It did hang, but […]

In an LRU cache every read is a write, because it mutates a doubly-linked list. If that list is guarded by a lock, concurrent access creates significant contention. Guava avoids this by delaying the LRU reordering and batching it, using a read buffer as an intermediary. For simplicity, this was a […]. It was intended to eventually be optimized with a ring buffer to remove allocations, do less work, and favor performance over strict LRU order. Unfortunately that improvement never happened, though it was proven out in the predecessor (ConcurrentLinkedHashMap) and further optimized in the successor (Caffeine).

After Caffeine began getting some traction, I tried once again to float the idea and provided a patch as a proof of concept. Using that patch and your original JMH command, no issues occur during benchmarking. The ns/op is relatively tight at 5-50ns, with the variance due to whether JMH is observing the amortized drain penalty. After 10 forks with 20 warmup iterations and 20 measurement iterations each, there is no stalling. Thus, while Guava's cache degrades poorly under synthetic pressure, it does not deadlock.
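For illustration, here is a minimal sketch of the read-buffer batching described above. This is not Guava's actual internals; the class, method names, and threshold are made up. Reads are recorded into an intermediary buffer, and the LRU reordering is applied in batches by whichever thread wins a `tryLock`, so readers never block on the eviction lock. The ring-buffer optimization mentioned above would replace the allocating `ConcurrentLinkedQueue` with a fixed-size array.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

final class ReadBufferedLru<K> {
  private static final int DRAIN_THRESHOLD = 64;

  private final Queue<K> readBuffer = new ConcurrentLinkedQueue<>();
  private final AtomicInteger pending = new AtomicInteger();
  private final ReentrantLock evictionLock = new ReentrantLock();

  /** Called on every cache read; never blocks on the eviction lock. */
  void afterRead(K key) {
    readBuffer.offer(key);
    // Only one thread drains at a time; the rest simply move on.
    if (pending.incrementAndGet() >= DRAIN_THRESHOLD && evictionLock.tryLock()) {
      try {
        drain();
      } finally {
        evictionLock.unlock();
      }
    }
  }

  /** Applies the buffered reorderings in one batch while holding the lock. */
  private void drain() {
    K key;
    while ((key = readBuffer.poll()) != null) {
      pending.decrementAndGet();
      moveToHeadOfLruList(key);
    }
  }

  private void moveToHeadOfLruList(K key) {
    // elided: unlink the node and relink it at the MRU position
  }
}
```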
This deadlock is not the typical deadlock where two threads each lock a resource and wait for the resource already locked by the other. The thread owning the account cache lock is parked, which tells us that the lock was never released. I could not determine the exact sequence of events leading to this deadlock, making it really hard to report or fix the problem.

While investigating, I realized that there are quite a few reported issues in Guava that could be major for Gerrit:

- Our deadlock happening in the account cache: https://bugs.chromium.org/p/gerrit/issues/detail?id=7645
- Other deadlocks: google/guava#2976, google/guava#2863
- Performance: google/guava#2063
- Race condition: google/guava#2971

Because I could not reproduce the deadlock in a dev environment or in a unit test, making it almost impossible to fix, I considered other options, such as replacing Guava with something else. The maintainer of the Caffeine[1] cache claims that Caffeine is a high-performance[2], near-optimal caching library designed and implemented based on the experience of building Guava's cache and ConcurrentLinkedHashMap.

I also ran some benchmarks that spawn many threads reading and writing values through the caches. I ran them against both Guava and Caffeine, and Guava always took at least twice as long as Caffeine to complete all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine's interfaces are very similar to Guava's cache, and there is an adapter that lets you migrate to Caffeine while keeping Guava's interfaces.

After migrating to Caffeine, the deadlock occurrence in our production server dropped from once a day to once every two weeks. The maintainer of Caffeine, Ben Manes, pointed us to the possible cause[3] of this issue: a bug[4] in the kernel and its fix[5]. Our kernel version is supposed to have the fix, but we will try another OS and kernel version.

Replacing Guava's caches with Caffeine brings two things: it reduces the chance of hitting the deadlock, which is most likely caused by a kernel bug, and it improves cache performance.

[1] https://github.com/ben-manes/caffeine
[2] https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3] https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4] torvalds/linux@b0c29f7
[5] torvalds/linux@76835b0

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae
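For reference, the adapter-based migration mentioned above can look roughly like the following. This is a hedged sketch: it assumes the optional `caffeine-guava` adapter artifact is on the classpath, and `Account`/`loadAccount` are placeholder names, not Gerrit's actual code.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.guava.CaffeinatedGuava;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

final class AccountCacheMigration {
  // Before: Guava's own builder, e.g.
  //   CacheBuilder.newBuilder().maximumSize(10_000).build(loader);

  // After: Caffeine's implementation exposed behind Guava's LoadingCache
  // interface, so call sites keep compiling unchanged.
  static LoadingCache<String, Account> create() {
    CacheLoader<String, Account> loader =
        CacheLoader.from(AccountCacheMigration::loadAccount);
    return CaffeinatedGuava.build(
        Caffeine.newBuilder().maximumSize(10_000), loader);
  }

  private static Account loadAccount(String id) {
    // elided: load the account from the backing store
    return new Account();
  }

  static final class Account {}
}
```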
I'm also observing some strange behaviour with Guava caches under contention. According to @ben-manes, this is unrelated to #2863. Motivation for opening this issue: #2863 (comment)
While doing some JMH benchmarks, I noticed that benchmarks using Guava caches stalled completely and/or showed very erratic performance under thread contention. Benchmark iterations very often stalled for seconds, minutes, or sometimes seemingly indefinitely. It turned out that the cache was the culprit in all cases.
The issue is very easy to trigger. I published a sample project at https://github.com/hibox-systems/guava-cache-test that demonstrates it (a condensed sketch of the benchmark shape follows below). In my environment it works fine with 1, 2, and 3 threads; using 4 threads or more always stalls the benchmark at some point. I have tried the benchmark with all published Guava versions that are compatible with the sample code, i.e. Guava 11.0 and newer, tested with several JVM versions on Ubuntu 16.04 with an Intel Core i7-3770 CPU.
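The shape of the benchmark is roughly the following. This is an illustrative sketch, not the actual code from the linked project; the cache size, key range, and names are made up. A shared `LoadingCache` is hammered by several benchmark threads with random keys that force constant misses, loads, and evictions.

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@Threads(4) // 1-3 threads run fine; 4 or more eventually stall
public class GuavaCacheContentionBenchmark {
  private LoadingCache<Integer, Integer> cache;

  @Setup
  public void setup() {
    CacheLoader<Integer, Integer> loader = CacheLoader.from(key -> key);
    cache = CacheBuilder.newBuilder()
        .maximumSize(1_000)
        .build(loader);
  }

  @Benchmark
  public Integer get() {
    // A key range larger than the maximum size forces continuous
    // misses and evictions, contending on the cache's internal locks.
    return cache.getUnchecked(ThreadLocalRandom.current().nextInt(10_000));
  }
}
```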
FWIW, I tried replacing the Guava caches in the benchmarks with Caffeine caches, and all of the stalling and erratic performance went away.