Cache performance #2063
Is the "op" a read, update or insert? The latest stuff I did in cache2k only have a bottleneck for the insert operation, which implies structural changes and therefor needs locking. All other operations run fully concurrently. Can you maybe rerun your benchmarks with a cache2k setup I give you? |
You can send a pull request to add a new cache type to the GetPutBenchmark. I looked at cache2k a while back and am not confident that it is thread-safe, so it may perform well but corrupt itself.
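For readers unfamiliar with that harness, a minimal JMH read/write benchmark in the same spirit might look like the sketch below. The class name, cache configuration, and workload mix are illustrative assumptions, not the actual GetPutBenchmark from the caffeine repository.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Benchmark)
public class GetPutSketch {
  private static final int SIZE = 1 << 16;

  private Cache<Integer, Integer> cache;

  @Setup
  public void setup() {
    cache = CacheBuilder.newBuilder().maximumSize(SIZE).build();
    for (int i = 0; i < SIZE; i++) {
      cache.put(i, i); // pre-populate so reads mostly hit
    }
  }

  @Benchmark @Threads(8)
  public Integer read() {
    // Random keys within the populated range; measures read-path throughput
    return cache.getIfPresent(ThreadLocalRandom.current().nextInt(SIZE));
  }

  @Benchmark @Threads(8)
  public void write() {
    // Overwrites existing keys; measures write-path throughput
    int key = ThreadLocalRandom.current().nextInt(SIZE);
    cache.put(key, key);
  }
}
```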
Yes, that's quite a normal reaction. I regularly have these concerns when I look at a code area that I haven't touched for some months. I walk through the design and the JMM again and finally find good sleep. So far we have had no corruption in production, even for releases marked as experimental.
I know adding to an unfamiliar code base is confusing, so I made the addition myself and will send you a pull request to review. Stepping through, I see where some of my initial concerns regarding races are addressed for the LruCache. On my laptop I get the following results, which match the Desktop-class benchmarks section.
The performance matches what I'd expect from a top-level lock, which appears to be how the LRU is implemented. I haven't looked into why writes are slower, since they should take the same amount of time; it may be due to synchronization or degradation in the custom hash table.
A quick and dirty hack to Guava resulted in up to a 25x read performance gain at the default concurrency level (4).
We can increase the concurrency level (64) for better write throughput, but unsurprisingly this decreases the read throughput, because CPU caching effects work against us. ConcurrentHashMap reads are faster with fewer segments (576M ops/s), and Guava's recencyQueue is associated with the segment rather than the processor. The queue fills up more slowly, so it is contended more often and resident in the L1/L2 cache less often. We still see a substantial gain, though.
Because we are no longer thrashing on readCount, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.
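For context, the concurrency level being varied above is the one configured on the builder. A minimal example, with illustrative sizes and types:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class ConcurrencyLevelExample {
  public static void main(String[] args) {
    // The default concurrencyLevel is 4; raising it adds segments (and memory overhead)
    // in exchange for more write parallelism. Sizes and types here are illustrative.
    Cache<String, byte[]> cache = CacheBuilder.newBuilder()
        .concurrencyLevel(64)  // more segments: higher write throughput, some loss of read locality
        .maximumSize(100_000)
        .build();

    cache.put("key", new byte[128]);
    System.out.println(cache.getIfPresent("key").length);
  }
}
```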
I have been advocating this improvement since we first began, and originally assumed that we'd get to it in a performance iteration. After leaving I didn't have a good threaded benchmark to demonstrate the improvements, so my hand-wavy assertions were ignored. There are still significant gains from moving to the Java 8 rewrite, but these changes are platform compatible and non-invasive. I don't know why I even bother anymore...
I'm going to try to find time to take a crack at this, though I'm not 100% confident when I'll be able to.
Updated the patch so that the unit tests compile and pass.
This deadlock is not the typical deadlock where two threads each lock a resource and then wait on the resource locked by the other. The thread owning the account cache lock is parked, which tells us that the lock was not released. I could not determine the exact sequence of events leading to this deadlock, making it really hard to report or fix the problem.

While investigating, I realized that there are quite a few reported issues in Guava that could be major for Gerrit:

Our deadlock happening in the account cache https://bugs.chromium.org/p/gerrit/issues/detail?id=7645
Other deadlocks google/guava#2976 google/guava#2863
Performance google/guava#2063
Race condition google/guava#2971

Because I could not reproduce the deadlock in a dev environment or in a unit test, making it almost impossible to fix, I considered other options such as replacing Guava with something else. The maintainer of the Caffeine[1] cache claims that Caffeine is a high performance[2], near optimal caching library designed and implemented based on the experience of building Guava's cache and ConcurrentLinkedHashMap.

I also did some benchmarks that spawn a lot of threads reading and writing values from the caches. I ran those benchmarks against both Guava and Caffeine, and Guava always took at least twice as long as Caffeine to complete all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine's interfaces are very similar to Guava's cache, and there is an adapter for migrating to Caffeine while keeping Guava's interfaces.

After migrating to Caffeine, we saw the deadlock occurrence drop from once a day to once every two weeks on our production server. The maintainer of Caffeine, Ben Manes, pointed us to the possible cause[3] of this issue, a bug[4] in the kernel and its fix[5]. Our kernel version is supposed to have the fix, but we will try another OS and kernel version.

Replacing Guava caches with Caffeine brings two things: it reduces the chance of hitting the deadlock, which is most likely caused by a kernel bug, and it improves cache performance.

[1] https://github.com/ben-manes/caffeine
[2] https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3] https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4] torvalds/linux@b0c29f7
[5] torvalds/linux@76835b0

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae
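For illustration, a sketch of what the migration might look like when keeping Guava's LoadingCache interface. This assumes the caffeine-guava adapter module and its CaffeinatedGuava entry point; the key/value types, sizes, and loader are placeholders, not Gerrit's actual cache wiring.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.guava.CaffeinatedGuava;
import com.google.common.cache.LoadingCache;

public class CaffeineMigrationSketch {
  public static void main(String[] args) throws Exception {
    // Before: CacheBuilder.newBuilder().maximumSize(1024).build(loader)
    // After: a Caffeine builder wrapped so callers keep using Guava's LoadingCache interface.
    LoadingCache<String, String> accounts = CaffeinatedGuava.build(
        Caffeine.newBuilder().maximumSize(1024),
        key -> "account:" + key);  // placeholder loader

    System.out.println(accounts.get("1000000"));
  }
}
```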
Closing since it seems unlikely to be resolved. This would be simple and would be a large speed-up, but it requires an internal champion.
Guava's cache performance has a lot of potential for improvement, at the cost of a higher initial memory footprint. The benchmark, with the raw numbers below, shows Guava's performance on a 16-core Xeon E5-2698B v3 @ 2.00GHz (hyperthreading disabled) running Ubuntu 15.04 with JDK 1.8.0_45.
Of note is that Guava's default concurrency level (4) results in performance similar to a synchronized LinkedHashMap. This is due to the cache's overhead, in particular the GC thrashing from the ConcurrentLinkedQueue and a hot read counter. It should also be noted that synchronization was slow in JDK 5 at the time of development, but improved significantly in JDK 6u22 and beyond. This and other improvements help LinkedHashMap's performance compared to what was observed when Guava's cache was in development.
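For reference, the synchronized LinkedHashMap baseline is essentially the classic access-ordered LRU behind a single map-wide lock. A minimal sketch, with an illustrative capacity and types:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class LruBaseline {
  /** Access-ordered LinkedHashMap that evicts its eldest entry once capacity is exceeded. */
  static <K, V> Map<K, V> boundedLru(int capacity) {
    LinkedHashMap<K, V> lru = new LinkedHashMap<K, V>(capacity, 0.75f, /* accessOrder */ true) {
      @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
      }
    };
    // Every get and put takes the single map-wide lock; this is the baseline being compared against.
    return Collections.synchronizedMap(lru);
  }

  public static void main(String[] args) {
    Map<Integer, String> cache = boundedLru(2);
    cache.put(1, "a");
    cache.put(2, "b");
    cache.get(1);       // touch 1 so 2 becomes the eldest entry
    cache.put(3, "c");  // evicts 2
    System.out.println(cache.keySet());  // prints [1, 3]
  }
}
```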
The easiest way to double Guava's performance would be to replace the recencyQueue and readCount with a ring buffer. This would replace an unbounded linked queue with a 16-element array. When the buffer is full, a clean-up is triggered and subsequent additions are skipped (no CAS) until slots become available. This could be adapted from this implementation.

Given the increase in server memory, it may also be time to reconsider the default concurrency level. An early design concern was the memory overhead, as most caches were not highly contended and the value could be adjusted as needed. As the number of cores and the amount of system memory have increased, it may be worth revisiting the default setting.
Benchmark workloads:
- Read (100%)
- Read (75%) / Write (25%)
- Write (100%)