Cache performance #2063
Is the "op" a read, update or insert? The latest stuff I did in cache2k only have a bottleneck for the insert operation, which implies structural changes and therefor needs locking. All other operations run fully concurrently. Can you maybe rerun your benchmarks with a cache2k setup I give you? |
You can send a pull request to add a new cache type to the GetPutBenchmark. I looked at cache2k a while back and am not confident that it is thread-safe, so it may perform well but corrupt itself.
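For readers unfamiliar with that harness, a minimal JMH read/write benchmark in the same spirit might look like the sketch below. The class name, cache configuration, and workload mix are illustrative assumptions, not the actual GetPutBenchmark from the caffeine repository.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Benchmark)
public class GetPutSketch {
  private static final int SIZE = 1 << 16;

  private Cache<Integer, Integer> cache;

  @Setup
  public void setup() {
    cache = CacheBuilder.newBuilder().maximumSize(SIZE).build();
    for (int i = 0; i < SIZE; i++) {
      cache.put(i, i); // pre-populate so reads mostly hit
    }
  }

  @Benchmark @Threads(8)
  public Integer read() {
    // Random keys within the populated range; measures read-path throughput
    return cache.getIfPresent(ThreadLocalRandom.current().nextInt(SIZE));
  }

  @Benchmark @Threads(8)
  public void write() {
    // Overwrites existing keys; measures write-path throughput
    int key = ThreadLocalRandom.current().nextInt(SIZE);
    cache.put(key, key);
  }
}
```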
Yes, that's quite a normal reaction. I regularly have these concerns when I look at a code area that I haven't touched for some months. I walk through the design and the JMM again and finally find good sleep. So far we have had no corruption in production, even for releases marked as experimental.
I know adding to an unfamiliar code base is confusing, so I made the addition myself and will send you a pull request to review. Stepping through, I see where some of my initial concerns regarding races are addressed for the LruCache. On my laptop I get the following results, which match the Desktop-class benchmarks section.
The performance matches what I'd expect from a top-level lock, which appears to be how the LRU is implemented. I haven't looked into why writes are slower, since they should take the same amount of time; it may be due to synchronization or degradation in the custom hash table.
A quick and dirty hack to Guava resulted in up to a 25x read performance gain at the default concurrency level (4).
We can increase the concurrency level (64) for better write throughput, but unsurprisingly this decreases the read throughput, because CPU caching effects work against us. ConcurrentHashMap reads are faster with fewer segments (576M ops/s), and Guava's recencyQueue is associated with the segment rather than the processor. The queue fills up more slowly, so it is contended more often and resident in the L1/L2 cache less often. We still see a substantial gain, though.
Because we are no longer thrashing on readCount, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.
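For context, the concurrency level being varied above is the one configured on the builder. A minimal example, with illustrative sizes and types:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class ConcurrencyLevelExample {
  public static void main(String[] args) {
    // The default concurrencyLevel is 4; raising it adds segments (and memory overhead)
    // in exchange for more write parallelism. Sizes and types here are illustrative.
    Cache<String, byte[]> cache = CacheBuilder.newBuilder()
        .concurrencyLevel(64)  // more segments: higher write throughput, some loss of read locality
        .maximumSize(100_000)
        .build();

    cache.put("key", new byte[128]);
    System.out.println(cache.getIfPresent("key").length);
  }
}
```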
I have been advocating this improvement since we first began, and originally assumed that we'd get to it in a performance iteration. After leaving I didn't have a good threaded benchmark to demonstrate the improvements, so my hand-wavy assertions were ignored. There are still significant gains from moving to the Java 8 rewrite, but these changes are platform compatible and non-invasive. I don't know why I even bother anymore...
I'm going to try to find time to take a crack at this, though I'm not 100% confident when I'll be able to.
Updated the patch so that the unit tests compile and pass.
This deadlock is not the typical deadlock where two threads each lock a resource and then wait on the resource locked by the other. The thread owning the account cache lock is parked, which tells us that the lock was not released. I could not determine the exact sequence of events leading to this deadlock, making it really hard to report or fix the problem.

While investigating, I realized that there are quite a few reported issues in Guava that could be major for Gerrit:

Our deadlock happening in the account cache https://bugs.chromium.org/p/gerrit/issues/detail?id=7645
Other deadlocks google/guava#2976 google/guava#2863
Performance google/guava#2063
Race condition google/guava#2971

Because I could not reproduce the deadlock in a dev environment or in a unit test, making it almost impossible to fix, I considered other options such as replacing Guava with something else. The maintainer of the Caffeine[1] cache claims that Caffeine is a high performance[2], near optimal caching library designed and implemented based on the experience of building Guava's cache and ConcurrentLinkedHashMap.

I also did some benchmarks that spawn a lot of threads reading and writing values from the caches. I ran those benchmarks against both Guava and Caffeine, and Guava always took at least twice as long as Caffeine to complete all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine's interfaces are very similar to Guava's cache, and there is an adapter for migrating to Caffeine while keeping Guava's interfaces.

After migrating to Caffeine, we saw the deadlock occurrence drop from once a day to once every two weeks on our production server. The maintainer of Caffeine, Ben Manes, pointed us to the possible cause[3] of this issue, a bug[4] in the kernel and its fix[5]. Our kernel version is supposed to have the fix, but we will try another OS and kernel version.

Replacing Guava caches with Caffeine brings two things: it reduces the chance of hitting the deadlock, which is most likely caused by a kernel bug, and it improves cache performance.

[1] https://github.com/ben-manes/caffeine
[2] https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3] https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4] torvalds/linux@b0c29f7
[5] torvalds/linux@76835b0

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae
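For illustration, a sketch of what the migration might look like when keeping Guava's LoadingCache interface. This assumes the caffeine-guava adapter module and its CaffeinatedGuava entry point; the key/value types, sizes, and loader are placeholders, not Gerrit's actual cache wiring.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.guava.CaffeinatedGuava;
import com.google.common.cache.LoadingCache;

public class CaffeineMigrationSketch {
  public static void main(String[] args) throws Exception {
    // Before: CacheBuilder.newBuilder().maximumSize(1024).build(loader)
    // After: a Caffeine builder wrapped so callers keep using Guava's LoadingCache interface.
    LoadingCache<String, String> accounts = CaffeinatedGuava.build(
        Caffeine.newBuilder().maximumSize(1024),
        key -> "account:" + key);  // placeholder loader

    System.out.println(accounts.get("1000000"));
  }
}
```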
Closing since it seems unlikely to be resolved. This would be simple and would be a large speed-up, but it requires an internal champion.
Guava's cache performance has a lot of potential for improvement, at the cost of a higher initial memory footprint. The benchmark, with the raw numbers below, shows Guava's performance on a 16-core Xeon E5-2698B v3 @ 2.00GHz (hyperthreading disabled) running Ubuntu 15.04 with JDK 1.8.0_45.
Of note is that Guava's default concurrency level (4) results in performance similar to a synchronized LinkedHashMap. This is due to the cache's overhead, in particular the GC thrashing from the ConcurrentLinkedQueue and a hot read counter. It should also be noted that synchronization was slow in JDK 5 at the time of development, but improved significantly in JDK 6u22 and beyond. This and other improvements help LinkedHashMap's performance compared to what was observed when Guava's cache was in development.
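For reference, the synchronized LinkedHashMap baseline is essentially the classic access-ordered LRU behind a single map-wide lock. A minimal sketch, with an illustrative capacity and types:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class LruBaseline {
  /** Access-ordered LinkedHashMap that evicts its eldest entry once capacity is exceeded. */
  static <K, V> Map<K, V> boundedLru(int capacity) {
    LinkedHashMap<K, V> lru = new LinkedHashMap<K, V>(capacity, 0.75f, /* accessOrder */ true) {
      @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
      }
    };
    // Every get and put takes the single map-wide lock; this is the baseline being compared against.
    return Collections.synchronizedMap(lru);
  }

  public static void main(String[] args) {
    Map<Integer, String> cache = boundedLru(2);
    cache.put(1, "a");
    cache.put(2, "b");
    cache.get(1);       // touch 1 so 2 becomes the eldest entry
    cache.put(3, "c");  // evicts 2
    System.out.println(cache.keySet());  // prints [1, 3]
  }
}
```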
The easiest way to double Guava's performance would be to replace the recencyQueue and readCount with a ring buffer. This would replace an unbounded linked queue with a 16-element array. When the buffer is full, a clean-up is triggered and subsequent additions are skipped (no CAS) until slots become available. This could be adapted from this implementation.

Given the increase in server memory, it may also be time to reconsider the default concurrency level. An early design concern was the memory overhead, as most caches were not highly contended and the value could be adjusted as needed. As the number of cores and the amount of system memory have increased, it may be worth revisiting the default setting.
Benchmark workloads:
- Read (100%)
- Read (75%) / Write (25%)
- Write (100%)