Change Caffeine default cache size to 16 #218
Conversation
Caffeine uses a ConcurrentHashMap under the hood. The way this works is that there are a number of bins, and the bins contain nodes (which hold zero or more k/v pairs). When I do a compute or computeIfAbsent, I look for the bin my node should be in; if it does not exist, I put a placeholder bin there, compute, and then put the value there. If it does exist, I lock the bin, do whatever checks I need to do, put the value if necessary, and unlock. When I am done, I grow the map if necessary (generally increasing it by powers of 2).

Caffeine's default map size is 0, which causes the ConcurrentHashMap to be created with no table. This is cheap, but has the drawback of Caffeine starting off with only 2 buckets. The issue is that a computeIfAbsent call then synchronizes with every other computeIfAbsent call currently in progress on the same bucket (a bigger problem if the compute function is expensive). In a pathological case, if all of my keys end up in one bucket, they will linearize (and end up in either a linked list or a red-black tree). This is expected, and one would anticipate the ConcurrentHashMap to compensate by growing, and indeed it immediately tries to. However, in order to transfer the node to a new table, (so far as I can read) it joins the synchronization bandwagon that is currently trying to add data to the node. Eventually the transfer should be able to happen (although I suppose the synchronization is unfair?), but until then it cannot complete: if I have 1000 threads which are all trying to synchronize and block for a long time, the one trying to move the node is going to have to wait behind at least some of them.

So you can get some pretty brutal throughput characteristics if you dump a load of contention, with slow compute functions (which extend the synchronization in a way that ConcurrentHashMap probably does not expect), onto a Caffeine cache with no warmup period. For example, if I have 1000 threads concurrently hammer the same Caffeine loading cache, sleeping for 100ms in their loading function, I get throughput of roughly 10 writes a second. This eventually grows, but it takes a very long time.

In our case, this ended up with something approaching a deadlock: a low-capacity cache (we think almost empty, with only a few historical inputs) concurrently received a small number of unexpectedly expensive requests (somewhere between 0 and 60), each of which took (say) a minute. In general the system is easily able to deal with such requests - the cache deduplicates them, and there are only a few queries that could take this long. In this case, though, they caused us to enter a death spiral; they locked all the buckets and stopped any of the other (far cheaper) requests from being served, filling up the thread pool and killing the server.

To compensate for this, this PR presizes the cache to 64 elements. It uses something like 256 additional bytes of memory for each cache created, but pretty much removes this pathological case.

Hope this makes sense! I may have misunderstood some of the ConcurrentHashMap details here, though.
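For anyone who wants to see the effect locally, here is a minimal sketch of the kind of reproduction described above. It is not the code from this PR; the class name, thread count, 100ms sleep, and distinct integer keys are just the illustrative numbers mentioned in the description.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public final class ColdCacheContention {

  public static void main(String[] args) throws InterruptedException {
    // Default builder: no initialCapacity, so the backing ConcurrentHashMap starts
    // with a tiny table and distinct keys initially collide onto very few bins.
    LoadingCache<Integer, Integer> cache = Caffeine.newBuilder()
        .build(key -> {
          Thread.sleep(100); // stand-in for an expensive load
          return key;
        });

    int threads = 1_000;
    AtomicLong loads = new AtomicLong();
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);

    for (int i = 0; i < threads; i++) {
      int key = i; // every thread loads a distinct key
      new Thread(() -> {
        try {
          start.await();
          cache.get(key); // computeIfAbsent under the hood; the bin stays locked during the load
          loads.incrementAndGet();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      }).start();
    }

    long begin = System.nanoTime();
    start.countDown();
    done.await();
    double seconds = (System.nanoTime() - begin) / 1_000_000_000.0;
    System.out.printf("%d loads in %.1fs (%.1f loads/sec)%n",
        loads.get(), seconds, loads.get() / seconds);
  }
}
```

With enough spare threads the loads could overlap almost entirely, but on a cold cache they largely queue up behind each other's bin locks, which is the roughly-ten-writes-per-second behaviour described above.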
Even a default size of 16 (the same as the default ConcurrentHashMap constructor) would be considerably better than at present. Obviously users can tune this - the issue is that the behavior is very pathological for pretty trivial cases, which is why I wanted to fix this upstream.
Thanks for the PR! I think using …
Yeah, I can do a unit test. I reckon I can convert the production slowdown into a unit-test deadlock with a CyclicBarrier.
That said, the unit test behaviour is going to be trivially repeatable for 'some size' of use case - e.g. with 16 threads I can deadlock a cache sized to 16, with 64 one sized to 64 - so maybe the test is just 'a fancy way to essentially test a constant value'? Honestly my preference is for more than 64. I've changed it to 16 here, but thinking about it, the compute() workflows suffer from this a lot more because they lock the hashmap around external code - so you'd expect to want a larger constant, no? Happy to defer to whatever you'd like. Thoughts?
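A rough shape for the CyclicBarrier idea mentioned above might look like the following. This is purely illustrative and not the test from this PR; the class name, party count, and timeouts are assumptions. Every loader parks on a barrier that only trips if all loads are in flight at the same time, so a too-small table shows up as a timeout rather than a hang.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public final class ColdCacheBarrierCheck {

  public static void main(String[] args) throws InterruptedException {
    int parties = 16;
    CyclicBarrier barrier = new CyclicBarrier(parties);
    AtomicBoolean serialized = new AtomicBoolean();

    // The barrier only trips if all 16 loads run concurrently. If the map's bin
    // locking forces the loads to run one after another, the awaits time out.
    LoadingCache<Integer, Integer> cache = Caffeine.newBuilder()
        .build(key -> {
          try {
            barrier.await(5, TimeUnit.SECONDS);
          } catch (TimeoutException | BrokenBarrierException e) {
            serialized.set(true);
          }
          return key;
        });

    ExecutorService executor = Executors.newFixedThreadPool(parties);
    for (int i = 0; i < parties; i++) {
      int key = i; // distinct keys, so each get() runs its own load
      executor.execute(() -> cache.get(key));
    }
    executor.shutdown();
    executor.awaitTermination(30, TimeUnit.SECONDS);

    System.out.println(serialized.get()
        ? "loads were serialized on the map's bins"
        : "all loads ran concurrently");
  }
}
```

With a presized table (e.g. an initialCapacity of at least the party count) the barrier should trip almost immediately, which is roughly how the slowdown can be turned into a pass/fail signal, though, as noted, that only exercises one particular constant.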
I have heard of usages creating thousands of caches (per-request), e.g. Gatling. That's not the expected use case, but I do try to be friendly to whatever users throw at it, so for example the internal structures are lazily allocated when possible. My expectation would be that the cache resizes as it fills up, thereby increasing to an acceptable concurrency. Most caches will be reasonably sized, so it should grow upwards. This would then follow into the behavior as documented: …
Is my understanding incorrect that the concurrency increases as the capacity increases? So this would only impact the warm-up, assuming most operations are reads?
Ok, makes sense! For the use case I occasionally see - an expected very small cache of potentially very expensive to compute entries - we can just increase the min size manually. To be clear, it's not just warm-up; you'd also see this in expected-small-cache-but-bursty-accesses workloads. If a cache always stays around 20 elements, but elements are added in groups of 10, you'll likely see some queueing due to lock contention around this. Not sure why coverage decreased; the test is currently failing but passes for me? Looks like a flake? Not sure how to run the builds again. On the previous commit it passed but then failed while trying to publish some jars some place, which makes sense.
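For reference, the manual knob alluded to here is the builder's `initialCapacity`. A hedged sketch of presizing such a small-but-expensive cache might look like this; the class name, key/value types, the slow lookup, and the value 64 are illustrative, not taken from this thread.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public final class PresizedCacheExample {

  public static void main(String[] args) {
    // A cache that stays small but is hit by bursts of concurrent loads for expensive
    // entries: presize it so the underlying hash table starts with enough bins.
    // (64 is illustrative; pick a value based on expected concurrency, not entry count.)
    LoadingCache<String, String> results = Caffeine.newBuilder()
        .initialCapacity(64)
        .build(key -> expensiveLookup(key));

    System.out.println(results.get("example-key"));
  }

  // Hypothetical stand-in for a slow query or computation.
  private static String expensiveLookup(String key) throws InterruptedException {
    Thread.sleep(100);
    return "result-for-" + key;
  }
}
```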
That's true. But as you said, any setting is arbitrary. Do you think we should add something to the JavaDoc (perhaps on …)?

It is a flaky test. The coverage might fluctuate due to the multi-threaded tests.
The FAQ seems better - I checked that, but I didn't read the full javadoc. How about …
Thanks a lot! I really appreciate that you debugged this and provided a patch, rather than updating the settings on your side and moving on. I updated the FAQ with your wording. |
Released. Thanks! |