Explore TinyLFU cache #16802

Closed
ben-manes opened this issue Feb 25, 2016 · 6 comments
Labels: :Core/Infra/Core, discuss

@ben-manes

When removing Guava as a dependency, @jasontedor wrote a 256-way LRU cache using a Guava-like API. The design uses a read/write lock per table segment and a lock around the global linked list that maintains the LRU ordering.

There are a few concerns that might be worth addressing. First is that a r/w lock is expensive, often much more than an exclusive lock, and its usage per read is probably best replaced with a ConcurrentHashMap. Second is that the LRU lock is a throttling point, as all accesses contend on it. Third is that computeIfAbsent is racy by allowing redundant computations with last insert wins. Fourth is that LRU is sub-optimal for search workloads, where frequency is a more insightful metric.
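
For illustration, a minimal sketch of the third point (the helper names here are hypothetical, not the Elasticsearch code): a check-then-act lookup can invoke the loader redundantly, whereas `ConcurrentHashMap.computeIfAbsent` invokes it at most once per key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class RacyVersusAtomic {
  // Racy check-then-act: two threads can both miss, both invoke the loader,
  // and the later put() silently overwrites the earlier value (last insert wins).
  static <K, V> V racyGet(ConcurrentHashMap<K, V> map, K key, Function<K, V> loader) {
    V value = map.get(key);
    if (value == null) {
      value = loader.apply(key); // may execute redundantly under contention
      map.put(key, value);       // last writer wins
    }
    return value;
  }

  // ConcurrentHashMap.computeIfAbsent invokes the loader at most once per key,
  // so every caller observes the single published value.
  static <K, V> V atomicGet(ConcurrentHashMap<K, V> map, K key, Function<K, V> loader) {
    return map.computeIfAbsent(key, loader);
  }
}
```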

The easiest (and biased) solution is to adopt Caffeine. Alternatively a big win would come from integrating TinyLFU into the existing cache. This would increase the hit rate, thereby reducing I/O and GC allocations, resulting in improved latencies.
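
For reference, a minimal sketch of what adopting Caffeine could look like (the types, size bound, and loader below are placeholders, not the actual Elasticsearch configuration):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class CaffeineExample {
  static Cache<String, byte[]> newQueryCache() {
    // Size-bounded cache; eviction is handled by Caffeine's Window TinyLFU policy.
    return Caffeine.newBuilder()
        .maximumSize(10_000)
        .recordStats() // optional hit/miss statistics
        .build();
  }

  static byte[] lookup(Cache<String, byte[]> cache, String key) {
    // get() computes the value at most once per key when absent
    return cache.get(key, k -> loadFromDisk(k));
  }

  private static byte[] loadFromDisk(String key) {
    return new byte[0]; // stand-in for the expensive miss penalty
  }
}
```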

@jasontedor self-assigned this on Feb 25, 2016
@jasontedor
Member

@ben-manes Thanks for the post.

> First is that a r/w lock is expensive, often much more than an exclusive lock, and its usage per read is probably best replaced with a ConcurrentHashMap. Second is that the LRU lock is a throttling point, as all accesses contend on it.

It's difficult to say how impactful these are without search latency benchmarks here (to be clear, since I don't want this to read like a dodge: I suspect the LRU lock is impactful under heavy search loads with large search thread pools). Unfortunately, we do not yet have benchmarks that would stress the lock contention issues; these sorts of benchmarks are a work in progress.

> Third is that computeIfAbsent is racy by allowing redundant computations with last insert wins.

This was the case for the initial commit that you linked to, but this has been addressed and (unless I'm mistaken) is no longer the case.

> Fourth is that LRU is sub-optimal for search workloads, where frequency is a more insightful metric.

Do you have any literature on this specific point that you can share?

> The easiest (and biased) solution is to adopt Caffeine. Alternatively a big win would come from integrating TinyLFU into the existing cache.

I'll take a look. I have some ideas regarding the LRU lock contention as well, but I really don't want to go stabbing in the dark until we can get valid multi-threaded search latency benchmarks in place. One goal we had with the cache implementation was to keep it simple.

@ben-manes
Author

> It's difficult to say how impactful these are without search latency benchmarks here

Yep, I'd much rather have data to work from than a hypothesis too. I know from micro-benchmarks and past experiences that it can be a problem, but it is too application dependent to draw a conclusion. The hit rate generally has a bigger impact than concurrency, given how expensive the miss penalty often is. I'm more interested in seeing if we can improve that for ES.

> This was the case for the initial commit that you linked to, but this has been addressed

Oh, nice. Sorry that I missed that when skimming over the code.

> Do you have any literature on this specific point that you can share?

Zipf-like distributions are commonly discussed in the information retrieval literature (1, 2, 3, 4, 5). Similarly, the distribution holds as a model for caching (6, 7, 8). This is fairly intuitive because the popularity of items is a good signal, but some aging based on recency is logically necessary to avoid holding stale data for too long. The TinyLFU paper includes charts with web search traces from IBM and UMass. The policy uses recency and historic frequency (of both present and absent entries) to predict whether an item should be retained.
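
To make the admission idea concrete, here is a toy sketch of TinyLFU-style admission, simplified from the approximate counting described in the paper (real implementations use much more compact counters): a frequency sketch records accesses, periodic halving ages out stale popularity, and a candidate is admitted only if it is historically hotter than the eviction victim.

```java
import java.util.Random;

// Toy illustration of TinyLFU admission, not the paper's exact implementation.
class TinyLfuAdmissionSketch {
  private static final int DEPTH = 4;
  private final int[][] counters; // small count-min style frequency table
  private final int[] seeds;
  private final int mask;
  private final int sampleSize;
  private int additions;

  TinyLfuAdmissionSketch(int width, int sampleSize) {
    // width must be a power of two so the bit mask works as an index
    this.counters = new int[DEPTH][width];
    this.mask = width - 1;
    this.sampleSize = sampleSize;
    this.seeds = new int[DEPTH];
    Random random = new Random(0);
    for (int i = 0; i < DEPTH; i++) {
      seeds[i] = random.nextInt();
    }
  }

  void recordAccess(Object key) {
    for (int i = 0; i < DEPTH; i++) {
      counters[i][index(key, i)]++;
    }
    if (++additions >= sampleSize) {
      reset(); // aging: halve every counter so stale popularity decays
    }
  }

  int frequency(Object key) {
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < DEPTH; i++) {
      min = Math.min(min, counters[i][index(key, i)]);
    }
    return min;
  }

  // Admission policy: keep whichever of (candidate, victim) is historically hotter.
  boolean admit(Object candidate, Object victim) {
    return frequency(candidate) > frequency(victim);
  }

  private void reset() {
    for (int[] row : counters) {
      for (int i = 0; i < row.length; i++) {
        row[i] >>>= 1;
      }
    }
    additions /= 2;
  }

  private int index(Object key, int row) {
    int h = key.hashCode() * seeds[row];
    return (h ^ (h >>> 16)) & mask;
  }
}
```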

> I have some ideas regarding the LRU lock contention as well

This article on HighScalability describes the approach that I took. The concurrency model was proven out in my prior libraries (CLHM, Guava). The former is used by Cassandra (among others) and the latter is of course popular but underperforms due to important optimizations not being ported over. This has very predictable latencies by being O(1) and borrows from database / filesystem theory where similar approaches are used to scale writes.
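
As a rough illustration of that approach (a simplified sketch, not Caffeine's internals): reads are recorded into a bounded, lossy buffer and replayed against the eviction policy under a tryLock, so the read path never blocks on the policy lock.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of buffering reads and replaying them against the policy.
class BufferedPolicy<K> {
  private final ArrayBlockingQueue<K> readBuffer = new ArrayBlockingQueue<>(128);
  private final ReentrantLock evictionLock = new ReentrantLock();

  void recordRead(K key) {
    // offer() drops the event if the buffer is full; losing a few accesses only
    // slightly degrades the policy but keeps the read path non-blocking
    readBuffer.offer(key);
    tryDrain();
  }

  private void tryDrain() {
    if (evictionLock.tryLock()) {
      try {
        K key;
        while ((key = readBuffer.poll()) != null) {
          applyRead(key); // update the recency / frequency ordering
        }
      } finally {
        evictionLock.unlock();
      }
    }
  }

  private void applyRead(K key) {
    // policy bookkeeping would go here (e.g., move-to-tail in an LRU list)
  }
}
```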

ben-manes added a commit to ben-manes/caffeine that referenced this issue Apr 29, 2016
The results show that their cache performs very poorly under concurrent
workloads. The striped LRU matches the theoretical LRU hit rate, but search
workloads are far more biased towards frequency.

For the results see elastic/elasticsearch#16802
@ben-manes
Author

ben-manes commented Apr 29, 2016

@jasontedor
I added ElasticSearch's cache to Caffeine's benchmark suite. The results were not positive.

The micro-benchmarks were run on my laptop. The Elasticsearch cache is slower than a synchronized LinkedHashMap in LRU mode. I ran Caffeine a few times since this isn't a good (isolated) testbed machine and the benchmark consumes all of the available CPU resources.

| Benchmark  | ElasticSearch | LinkedHashMap | Caffeine      |
| ---------- | ------------- | ------------- | ------------- |
| read-only  | 2.5 M/s       | 7.4 M/s       | 130 - 150 M/s |
| read-write | 2.3 M/s       | 7.7 M/s       | 100 - 115 M/s |
| write-only | 1.6 M/s       | 7.7 M/s       | 50 - 65 M/s   |
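
For context, the rough shape of such a read-only throughput measurement as a standalone JMH sketch (the real suite lives in the Caffeine repository; the cache size and thread count below are arbitrary):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

@State(Scope.Benchmark)
public class ReadOnlyBenchmark {
  Cache<Integer, Integer> cache;

  @Setup
  public void setup() {
    // pre-populate so the read-only run measures lookups, not misses
    cache = Caffeine.newBuilder().maximumSize(100_000).build();
    for (int i = 0; i < 100_000; i++) {
      cache.put(i, i);
    }
  }

  @Benchmark
  @Threads(8)
  public Integer readOnly() {
    return cache.getIfPresent(ThreadLocalRandom.current().nextInt(100_000));
  }
}
```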

The simulator compares the hit rate of the caching policies. The current cache stays very close to a pure LRU, which validates the design. Unfortunately LRU is poor for search workloads, which are biased towards frequency instead of recency. The efficiency using the Univ. of Massachusetts WebSearch1 (WS1) workload was:

| Benchmark | ElasticSearch | Caffeine | Optimal |
| --------- | ------------- | -------- | ------- |
| WS1 @ 2M  | 6.09 %        | 23.06 %  | 41.28 % |
| WS1 @ 4M  | 21.60 %       | 41.32 %  | 57.80 % |
| WS1 @ 6M  | 45.74 %       | 55.02 %  | 65.85 % |
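
For anyone wanting to reproduce the flavor of these numbers without the trace file, a minimal hit-rate simulation against a bounded LRU map could look like the following (the synthetic skewed key generator is only a crude stand-in for the WS1 trace):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Minimal hit-rate simulation: replay a skewed key stream against a bounded LRU map.
class HitRateSimulator {
  public static void main(String[] args) {
    int capacity = 2_000;
    // access-ordered LinkedHashMap evicting the eldest entry acts as an LRU cache
    Map<Integer, Integer> lru = new LinkedHashMap<>(capacity, 0.75f, true) {
      @Override protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
        return size() > capacity;
      }
    };

    Random random = new Random(0);
    long hits = 0, requests = 200_000;
    for (long i = 0; i < requests; i++) {
      // crude skewed distribution; a real run would replay a trace or a Zipf generator
      int key = (int) Math.floor(Math.pow(random.nextDouble(), 3) * 50_000);
      if (lru.get(key) != null) {
        hits++;
      } else {
        lru.put(key, key);
      }
    }
    System.out.printf("hit rate: %.2f%%%n", 100.0 * hits / requests);
  }
}
```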

It may very well be that the cache is not a critical bottleneck for ElasticSearch. Unfortunately it appears to leave a lot of performance on the table, which I bet a little profiling would dramatically improve. If the size of the cache is large, there is an opportunity to either achieve the same hit rate at a reduced capacity or to increase the hit rate at the current capacity. In either case there should be reduced latencies (less GC / fewer misses).

@ben-manes
Author

Congrats on 5.0! It's an impressive release and very timely for me. I was about to start migrating from Algolia to Elastic, and was considering Druid for analytical functions. The new features in Elastic are a wonderful alternative, so I'm looking forward to using them instead.

@colings86 added the :Core/Infra/Core label on Apr 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@ben-manes
Author

As ES8 is moving to Java 17, I'll have to remove the cache from my benchmark suite (since Caffeine is Java 11-based). Unfortunately the issues observed with its low hit rate in search workloads and poor throughput in concurrency benchmarks remain unfixed (and are reported as a problem by users, e.g. #69646). I noticed that @original-brownbear recently did some nice work on the cache to reduce its memory overhead, but these problems are algorithmic. If you would later like to fix these issues, we can revive the integration for an analysis. Obviously switching (as Solr did) would be welcome.
