Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Caffeine default cache size to 16 #218

Merged
merged 2 commits into from
Feb 5, 2018

Conversation

j-baker
Copy link
Contributor

@j-baker j-baker commented Feb 2, 2018

Caffeine uses a ConcurrentHashMap under the hood. The way this works is that there are a number of bins, and the bins contain nodes (which contain 0 or more k/v pairs).

When I do a compute or computeIfAbsent, I look for the bin my node should be in, and if it does not exist, I put a placeholder bin there, compute, and then put the value there. If it does exist, I lock the bin, do whatever checks I need to do, put the value if necessary and unlock.

When I am done, I grow the map if necessary (generally increasing it by powers of 2).

Caffeine's default map size is 0, which cases the ConcurrentHashMap to be created with no table. This is cheap, but has the drawback of Caffeine starting off with only 2 buckets.

The issue here is that when I do a computeIfAbsent call, it synchronizes with every other computeIfAbsent call currently in progress on the same bucket (a bigger problem if the compute function is expensive).

For a pathological case, if all of my keys are ending up in one bucket, they will linearize (and end up in either a linked list or a red-black tree). This is expected, and one would anticipate the ConcurrentHashMap to compensate by growing, and indeed it immediately tries to.

However, in order to transfer the node to a new table, (so far as I can read) it joins the synchronization bandwagon which is currently trying to add data to the node. Eventually the transfer should be able to happen (although I suppose synchronization is unfair?), but until this is the case, the transfer cannot complete - if I have 1000 threads which are all trying to synchronize and block for a long time, the one which is trying to move the node is going to have to wait behind expected at least some of them.

So, this means that you can get some pretty brutal throughput characteristics if you dump a load of contention with slow compute functions (which extend the synchronization overhead in a way that ConcurrentHashMap probably does not expect) onto a Caffeine cache with no warmup period.

For example, if I have 1000 threads try and concurrently hammer the same Caffeine loading cache, sleeping for 100ms in their loading function, I get throughput of roughly 10 writes a second. This eventually will grow, but it will take a very long time.

In our case, this ended up with something approaching a deadlock; a low-capacity cache (we think almost empty, with only a few historical inputs) received concurrently some small number of unexpectedly expensive requests between 0 and 60, which each took (say) a minute. In general the system is easily able to deal with such requests - the cache deduplicates such requests, and there are only a few queries that could take this long. However, in this case they caused us to enter a death spiral; they locked all the buckets and stopped any of the other (far cheaper) requests from being served, filling up the thread pool and killing the server.

To compensate for this, this PR presizes the cache to 64 elements. It uses something like 256 additional bytes of memory for each cache created, but pretty much removes this pathological case.

Hope this makes sense! I may have misunderstood some stuff about ConcurrentHashMap here though.

Caffeine uses a ConcurrentHashMap under the hood. The way this works is that there are a number of bins, and the bins contain nodes (which contain 0 or more k/v pairs).

When I do a compute or computeIfAbsent, I look for the bin my node should be in, and if it does not exist, I put a placeholder bin there, compute, and then put the value there. If it does exist, I lock the bin, do whatever checks I need to do, put the value if necessary and unlock.

When I am done, I grow the map if necessary (generally increasing it by powers of 2).

Caffeine's default map size is 0, which cases the ConcurrentHashMap to be created with no table. This is cheap, but has the drawback of Caffeine starting off with only 2 buckets.

The issue here is that when I do a computeIfAbsent call, it synchronizes with every other computeIfAbsent call currently in progress (a bigger problem if the compute function is expensive).

For a pathological case, if all of my keys are ending up in one bucket, they will linearize (and end up in either a linked list or a red-black tree). This is expected, and one would anticipate the ConcurrentHashMap to compensate, and indeed it immediately tries to.

However, in order to transfer the node to a new table, (so far as I can read) it joins the synchronization bandwagon which is currently trying to add data to the node. Eventually the transfer should be able to happen (although I suppose synchronization is unfair?), but until this is the case, the transfer cannot complete.

So, this means that you can get some pretty brutal throughput characteristics if you dump a load of contention onto a Caffeine cache with no warmup period.

For example, if I have 1000 threads try and concurrently hammer the same Caffeine loading cache, sleeping for 100ms in their loading function, I get throughput of roughly 10 writes a second. This eventually will grow, but it will take a very long time.

In our case, this ended up with something approaching a deadlock; a low-capacity cache (we think almost empty, with only a few historical inputs) received concurrently some small number of unexpectedly expensive requests between 0 and 60, which each took (say) a minute. In general the system is easily able to deal with such requests - the cache deduplicates such requests, and there are only a few queries that could take this long. However, in this case they caused us to enter a death spiral; they locked all the buckets and stopped any of the other (far cheaper) requests from being served.

To compensate for this, this PR presizes the cache to 64 elements. It uses something like 256 additional bytes of memory for each cache created, but pretty much removes this pathological case.

Hope this makes sense! I may have misunderstood some stuff about ConcurrentHashMap here though.
@j-baker
Copy link
Contributor Author

j-baker commented Feb 2, 2018

Even a default size of 16 (the same as the default ConcurrentHashMap constructor would be considerably better than at present).

Obviously users can tune this - the issue is that the behavior is very pathological for pretty trivial cases, which is why I wanted to fix this upstream.

@ben-manes
Copy link
Owner

Thanks for the PR!

I think using 16 makes the most sense. I must have misread the default constructor of ConcurrentHashMap and thought it would default to zero when unspecified and lazily update. My intent was to have the same defaults as the underlying map, not to reduce it. Can you confirm that 16 resolves your scenario? Is this something we can add a unit test for?

@j-baker
Copy link
Contributor Author

j-baker commented Feb 2, 2018

Yeah, I can do a unit test. I reckon i can convert production slowdown into unit test deadlock with a cyclicbarrier.

@j-baker
Copy link
Contributor Author

j-baker commented Feb 2, 2018

That said, the unit test behaviour is going to be trivially repeatable for 'some size' of use cases - e.g. with 16 threads i can deadlock 16, with 64 I can deadlock 64, so maybe the test is too like, 'a fancy way to essentially test a constant value'?

Honestly my preference is for more than 64. I've changed it to 16 here, but thinking about it the compute() workflows suffer from this a lot more because they lock the hashmap around external code - so you'd expect to want a larger constant, no?

Happy to defer to whatever you'd like. Thoughts?

@ben-manes
Copy link
Owner

I have heard of usages creating thousands of caches (per-request), e.g. Gatling. That's not the expected use-case, but I do try to be friendly to whatever users throw at it. So for example the internal structures are lazily allocated when possible.

My expectation would be that the cache would resize as it fills up, thereby increasing to an acceptable concurrency. Most caches will be reasonably sized, so it should increase upwards. This would then follow into the behavior as documented of,

Lock contention probability for two threads accessing distinct elements is roughly 1 / (8 * #elements) under random hashes.

Is my understanding that the concurrency increases as the capacity increases incorrect? So this would only impact the warm-up, assuming most operations are reads?

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 93.765% when pulling e84866f on j-baker:patch-1 into 7c0e235 on ben-manes:master.

@j-baker j-baker closed this Feb 3, 2018
@j-baker j-baker reopened this Feb 3, 2018
@j-baker
Copy link
Contributor Author

j-baker commented Feb 3, 2018

Ok, makes sense! The use case I occasionally see of an expected very small cache of potentially very expensive to compute entries we can just increase the min size manually for this.

To be clear, it's not just warm-up, you'd also see this in expected-small-cache-but-bursty-accesses workloads. If a cache always stays around 20 elements, but elements are added in groups of 10, you'll likely see something queueing due to lock contention around this.

Not sure why coverage decreased; the test is currently failing but passes for me? Looks like a flake? Not sure how to run the builds again. On the previous commit it passed but then failed while trying to publish some jars some place, which makes sense.

@ben-manes
Copy link
Owner

That's true. But as you said, any setting is arbitrary. Do you think we should add something to the JavaDoc (perhaps on Caffeine.initialCapacity) or the FAQ? If the FAQ, can you provide suggested wording and I'll update it?

It is a flaky test. The coverage might fluctuate due to multi-threaded tests.

@j-baker
Copy link
Contributor Author

j-baker commented Feb 5, 2018

The FAQ seems better - I checked that, but I didn't read the full javadoc. How about

Contention

A case where Caffeine might suffer from contention is when the number of entries currently being computed is similar to or greater than the max number of entries the map has ever contained, and the loading function is very expensive. This corresponds to the currently computing entries being close to the total capacity of the underlying ConcurrentHashMap, which blocks resizing the map on the loading functions completing.

This is expected to happen while the cache is warming up (although likely not at all). If you are seeing contention issues due to an issue like this (manifesting as threads making different requests blocking on the same monitor in ConcurrentHashMap), consider increasing your initial capacity to your expected max concurrency to compensate, or use an async cache.

@ben-manes ben-manes merged commit e84866f into ben-manes:master Feb 5, 2018
@ben-manes
Copy link
Owner

Thanks a lot! I really appreciate that you debugged this and provided a patch, rather than updating the settings on your side and moving on. I updated the FAQ with your wording.

@j-baker j-baker deleted the patch-1 branch February 6, 2018 14:57
@j-baker j-baker changed the title Change Caffeine default cache size to 64 Change Caffeine default cache size to 16 Feb 6, 2018
@ben-manes
Copy link
Owner

Released. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants