[SOLR-16489] CaffeineCache puts thread into infinite loop #1118
Conversation
```diff
@@ -275,7 +275,7 @@ public V computeIfAbsent(K key, IOFunction<? super K, ? extends V> mappingFunction
   @Override
   public V put(K key, V val) {
     inserts.increment();
-    V old = cache.asMap().put(key, val);
+    V old = cache.asMap().compute(key, (k, v) -> val);
```
does this need to use `.asMap()`? can we just use `cache.get(key, mapping_function)`? This looks to be the same concept w/o trying to access the underlying map. If using `asMap()`, `compute` looks correct based on https://www.javadoc.io/static/com.github.ben-manes.caffeine/caffeine/3.1.1/com.github.benmanes.caffeine/com/github/benmanes/caffeine/cache/Cache.html#asMap(), but it looks like we don't even need `asMap()`.
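For context, a minimal standalone sketch of the variants discussed here, against the public Caffeine 3.x API (the class and key/value types are made up for illustration):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class PutVariants {
  public static void main(String[] args) {
    Cache<String, String> cache = Caffeine.newBuilder().maximumSize(1_000).build();

    // Original code path: delegates to ConcurrentHashMap#put via the map view;
    // returns the previous value.
    String prev = cache.asMap().put("key", "v1");

    // This PR's variant: routed through compute; note that compute returns
    // the new value rather than the previous one.
    String cur = cache.asMap().compute("key", (k, v) -> "v2");

    // Reviewer's suggestion: get-or-load via the cache API. This only
    // computes on absence and never overwrites an existing value.
    String v = cache.get("key", k -> "v3"); // returns "v2", loader skipped

    System.out.println(prev + " " + cur + " " + v);
  }
}
```

Since `Cache#get(key, mappingFunction)` does not overwrite an existing value, it cannot replace `put`; presumably that is why the `asMap()` view is needed in this method.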
I wanted to stay in line with the other methods, which also call `cache.asMap()`.
Is this perhaps related to the JIT loop unrolling bug (ben-manes/caffeine#797, SOLR-16463)? (/cc @uschindler) If you think this is instead a bug in Caffeine, please try to provide a reproducer so that we can isolate a fix. When I run the stress and Lincheck tests it is fine, but these can be tricky.
This could be a livelock if Solr is performing recursive writes (e.g. a write within a computation). That is not supported and should fail fast with an error. The last case that I know of that did not error early was fixed in JDK 12, but prior versions could spin forever. That last case was related to modifying the map's size counters.
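To make the recursive-write case concrete, a minimal standalone sketch (not Solr code) of the unsupported pattern:

```java
import java.util.concurrent.ConcurrentHashMap;

public class RecursiveWriteDemo {
  public static void main(String[] args) {
    ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
    // A write performed inside another write's computation on the same map.
    // Nested updates are unsupported: on the same key (or the same hash bin)
    // recent JDKs throw IllegalStateException("Recursive update"), while the
    // older cases mentioned above could spin or deadlock instead.
    map.computeIfAbsent("a", k -> map.computeIfAbsent("a", k2 -> 1));
  }
}
```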
We came across this bug using JDK 11 and Solr 8.11.2, so this is unlikely the same as ben-manes/caffeine#797.
Can you try JDK 12? Does the stacktrace look to be related to the linked patch?
We ran into the bug using JDK 11, JDK 17 and JDK 19. I'll try to dig out the stack trace / thread dump the threads got stuck in!
The threads got stuck in this infinite loop in `ConcurrentHashMap#put`.
Unlikely. The JIT loop unrolling bug is, to my knowledge, a false assert. The generated code looks correct; it just hits an assertion recently added to Hotspot (JDK 17). The problem here seems to exist since JDK 11.
Actually this loop sometimes hits the assertion recently added in JDK 17. I'd rewrite that infinite loop to something more CAS-like.
Unfortunately the challenge with this method is that CAS-style would be a `putIfAbsent` loop, so it would lock even if unsuccessful. In that case the loop could be removed, as locking is the costly work, so a `compute` would be preferable. You can think of this as a TTAS loop, which has a similar speedup in comparison to the more straightforward TAS approach.
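A rough sketch of the TTAS idea in map terms, with a hypothetical helper (this is not the actual Caffeine implementation): do a cheap lock-free read first, and only fall back to the locking operation when it misses.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class TtasStyle {
  // "Test" (lock-free get), then "test-and-set" (locking computeIfAbsent).
  static <K, V> V getOrLoad(ConcurrentHashMap<K, V> map, K key,
                            Function<? super K, ? extends V> loader) {
    V value = map.get(key); // no lock taken on a hit
    if (value != null) {
      return value;
    }
    return map.computeIfAbsent(key, loader); // locks the bin only on a miss
  }

  public static void main(String[] args) {
    ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
    System.out.println(getOrLoad(map, "a", k -> 42)); // 42
    System.out.println(getOrLoad(map, "a", k -> -1)); // 42 again, loader skipped
  }
}
```

The contrasting CAS-style approach would be a `putIfAbsent` loop that takes the bin lock on each failed attempt, which is the locking cost referred to above.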
Hi Ben, of course if this is a bug in your code leading to the infinite loop, then it is not a Hotspot problem. What I do not understand: how does the current PR fix the bug?
@uschindler the PR avoids the `ConcurrentHashMap#put` code path by going through `compute` instead.
We will need to see the stacktrace and debug details. Caffeine's loop retries only if the entry was a stale read and, once locked, is determined to have already been removed from the map. That state transition can only happen in the lock and within a map removal operation. That leaves a small window for a stale read, with the caveat of `Map.remove(key)`. That method discards a refresh, which could block if something weird happened during registration. Since refresh isn't being used here, I cannot deduce any reason for a bug in Caffeine's code. As ConcurrentHashMap also loops forever, and was the originally claimed source, we need the details to assess a possible cause.
If I remember correctly, the `prior` in line 2203 of `BoundedLocalCache` in Caffeine 3.1.1 was always non-null and had the keyReference `DEAD_STRONG_KEY`, such that the `isAlive()` check always returned false in line 2248. The thread spins in the enclosing for loop forever.
Would a stacktrace help? To reproduce the problem I would need to deploy a cluster on our side, as we were not able to trigger the bug with a test case. However, with a Gatling load test we were able to reproduce it reliably. To set everything up I need about 4 hours. We were able to pinpoint the endless loop by attaching to the debug endpoint and pausing all threads.
Thanks @DennisBerger1984.
I see that Solr 8.11.2 is using 2.9.2, whereas currently it is using 3.1.1. Can we confirm that your failure happens using 3.1.1? I don't recall any related bug fixes, but it would help us narrow in on the problem.
Hi @ben-manes,
If you do come by a stacktrace, please let me know. I am not having luck in a review of the code, but a fresh set of eyes is always welcome. If you can reproduce it, then look for other exceptions, like an OutOfMemoryError or StackOverflowError, which could leave an invalid state. In those cases the JVM may not be recoverable (e.g. the `ExitOnOutOfMemoryError` flag).
Thanks to @DennisBerger1984's help, we met to step through the debug session of the cache in this invalid state. Dennis captured this jstack output. We confirmed the findings that the filterCache is stuck in an infinite loop due to retrieving a dead entry.

When pairing on a code walkthrough, we came up with a scenario that could explain this problem and would require some Solr experts to advise us on. In short: is the cache key stable in the filter cache? The key is a `Query` object. During eviction, Caffeine removes the victim from the underlying hash table by its key, so if the key's hashCode or equals changed while it was in the map, the entry can never be found or removed.

@DennisBerger1984 suggested that the filter cache could use an immutable cache key object. Given the `Query` subtypes involved, is there an easy way to do this?
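The suspected failure mode can be demonstrated in isolation with a deliberately mutable key class (hypothetical, for illustration only):

```java
import java.util.concurrent.ConcurrentHashMap;

public class MutableKeyDemo {
  // A key whose hashCode/equals depend on mutable state, like a mutable Query.
  static final class MutableKey {
    int id;
    MutableKey(int id) { this.id = id; }
    @Override public int hashCode() { return id; }
    @Override public boolean equals(Object o) {
      return o instanceof MutableKey && ((MutableKey) o).id == id;
    }
  }

  public static void main(String[] args) {
    ConcurrentHashMap<MutableKey, String> map = new ConcurrentHashMap<>();
    MutableKey key = new MutableKey(1);
    map.put(key, "cached");

    key.id = 2; // the key mutates while it is inside the map

    // The entry is now stranded: it lives in the bin for the old hash, so
    // neither the mutated key nor a fresh key equal to the original can
    // find or remove it.
    System.out.println(map.get(key));                  // null
    System.out.println(map.remove(new MutableKey(1))); // null
    System.out.println(map.size());                    // still 1
  }
}
```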
Query is by default an immutable object. So there should not be a separate cache key.
Thanks @uschindler. I think we need to confirm this hypothesis first. Is there an easy way to make an immutable copy for us to test with? The actual fix might be different, but this would be the easiest way to verify our guess.
There is no clone method for queries, sorry. I don't know what the best way is to solve this. What type of query was causing this? But the bug with the infinite loop should be fixed, so it throws some exception.
You may temporarily use `toString` as the cache key. But this is not working safely for all query types.
I believe it was a `WrappedQuery`. I will think more about how to handle the infinite loop in this scenario. A retry is needed on a stale read, e.g. the reader might look up the entry, be context switched, the entry be evicted, and then observe it to be dead. That's fine, and a subsequent re-read should cause it to see a fresh result. I may be able to handle this case a little better, but since it can happen safely it is tricky.
WrappedQuery is a Solr class. Many of the queries there are old and were never checked for final fields and correctness.
This query must die. Please make the fields final and remove the setter: https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/WrappedQuery.java
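A minimal sketch of what the suggested immutable version might look like (hypothetical class name; a real Lucene `Query` subclass must implement `toString(String)`, `visit`, `equals`, and `hashCode`):

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

/** Hypothetical immutable reworking of WrappedQuery: final field, no setter. */
public final class ImmutableWrappedQuery extends Query {
  private final Query wrapped;

  public ImmutableWrappedQuery(Query wrapped) {
    this.wrapped = wrapped;
  }

  public Query getWrappedQuery() {
    return wrapped; // read-only access; no setQuery equivalent exists
  }

  @Override
  public String toString(String field) {
    return wrapped.toString(field);
  }

  @Override
  public void visit(QueryVisitor visitor) {
    wrapped.visit(visitor);
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof ImmutableWrappedQuery
        && wrapped.equals(((ImmutableWrappedQuery) other).wrapped);
  }

  @Override
  public int hashCode() {
    // Stable as long as the wrapped query is itself immutable.
    return 31 * getClass().hashCode() + wrapped.hashCode();
  }
}
```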
@uschindler to make sure I understand: anything in Solr that extends from https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/Query.java should have no setters and final fields, i.e. be effectively immutable. Makes sense to me. I filed https://issues.apache.org/jira/browse/SOLR-16509 to fix these.
For BooleanQuery we have many tests in Lucene core, and it has its own cache implementation. So there are three cases where it may break.
https://github.com/apache/solr/search?q=setWrappedQuery is definitely not used in Solr main.
@uschindler does this help?
Yes, this helps. Interestingly, the clauses of a BooleanQuery are non-mutable; the list is already wrapped with Collections.unmodifiableList(). So nobody can change the clauses after calling BooleanQuery.getClauses().
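A small demonstration against the current Lucene API (where the accessor is named `clauses()`):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ClausesDemo {
  public static void main(String[] args) {
    BooleanQuery bq = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("f", "a")), BooleanClause.Occur.MUST)
        .build();
    // clauses() returns the unmodifiable wrapper mentioned above, so a
    // mutation attempt fails fast instead of corrupting the query's hashCode.
    bq.clauses().add(new BooleanClause(
        new TermQuery(new Term("f", "b")), BooleanClause.Occur.SHOULD));
    // -> throws UnsupportedOperationException
  }
}
```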
Oh yes, I did not notice that.
This is a non-Lucene, non-Solr query from the external Querqy plugin. It looks fine, too. It also implements equals and hashCode.
DisjunctionMaxQuery unfortunately has a getDisjuncts that returns a mutable Multiset (a Lucene collection subclass): https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java
Hey, should we provide additional information, like tests without the Querqy plugin, or is it sufficient to wait for ticket #1146?
If you could temporarily disable Querqy it would be a good test. I was not able to find all the queries mentioned. At least we could rule out or confirm that Querqy is the problem. The cleanup work in #1146 is good to have, but as WrappedQuery's setQuery was not called from code, it was in fact immutable. I still have some problems with the DisjunctionMaxQueries (they are created by Querqy); this is a Lucene class and not fully sealed against changes in its clauses. Maybe Querqy modifies them; Solr at least does not do this.
@DennisBerger1984 Are you using the newish `qboost.similarityScore` setting or `multiMatchTie`?
Hi René, qboost.similarityScore is off, multiMatchTie is set to 0.2. I'm currently running a test with defType=edismax to further drill down.
Thanks @DennisBerger1984! There is a smell related to multiMatchTie (querqy/querqy#384). This FieldBoost implementation should be used under the hood with multiMatchTie under certain conditions and be referenced from queries, and it misses hashCode/equals (though I cannot see SingleFieldBoost in your object graph). It should still result just in a cache miss and not trigger a loop. Maybe for your test, you could just remove the `multiMatchTie` param?
When using defType=edismax the problem doesn't occur. Keeping Querqy as defType but removing multiMatchTie doesn't help. But to be honest, multiMatchTie was disabled anyway, I think.
A completely missing equals/hashCode should not cause problems, because the cache would simply not work. For the bug to occur, the hashCode or equals must suddenly change for an existing query.
For current Lucene queries the problem may occur when you pass a BytesRef (it is a reference) while building the query. If you change the BytesRef afterwards, it fails. IMHO queries should clone the terms passed in. Solr has the big problem of over-reusing mutable references, although this is really no problem anymore, as the JVM can handle that with escape analysis and won't allocate instances. Actually, "reusing" is bad in newer Java versions, as it will populate the heap with useless temporary (but reused) objects.
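A standalone sketch of that hazard (the `Term` javadoc warns that the bytes are not copied and must not be modified after construction):

```java
import java.nio.charset.StandardCharsets;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

public class BytesRefHazard {
  public static void main(String[] args) {
    byte[] backing = "foo".getBytes(StandardCharsets.UTF_8);
    // Term keeps a reference to the BytesRef; the bytes are not copied.
    TermQuery query = new TermQuery(new Term("field", new BytesRef(backing)));
    int before = query.hashCode();

    backing[0] = 'g'; // mutate the shared bytes after building the query

    int after = query.hashCode();
    // The query's identity silently changed: exactly the condition that can
    // strand an entry if this query is already a cache key.
    System.out.println(before == after); // false
  }
}
```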
I briefly checked this, but as far as I can see we are always using a fresh BytesRef. Manipulations at the Querqy level (i.e. before going to the analysis chain) happen via character arrays anyway, and they aren't used any longer once a Lucene query has been produced. @DennisBerger1984 Can I ask you to disable query rewriters and/or use empty rule sets with rewriters still enabled to narrow down the problem?
@renekrie the querqy.rewriters are already completely empty during these tests.
Released Caffeine 3.1.2 with this error detection.
I opened SOLR-16562 to upgrade to Caffeine 3.1.2, which will detect infinite loops (not fix the issue).
@tboeghk @DennisBerger1984 @mpetris what is the status here? Did you end up finding the Query impl that is changing hashCode/equals? Is this PR still necessary?
Hi @risdenk, I've been rather busy lately. In the meantime we upgraded to Caffeine 3.1.2 and removed our workaround. Since then we have not encountered the problem again, but neither have we seen contract-violation messages from Caffeine. So we didn't find it, sadly.
This PR is no longer needed now that we have the upgraded CaffeineCache version, right?
The PR is not necessary, but was the underlying bug fixed? The issue was corrupting the cache by modifying the key's equality while inside of a ConcurrentHashMap. The upgrade just yells if that is detected, but it was a usage error.
All in all it's very hard for us to reproduce the bug reliably. I saw a couple of "An invalid state was detected" messages two weeks ago and thought I had found an offending query, but unfortunately they've disappeared and the last week was free of those messages. I've scanned around 1TB and 100 million log entries.
This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!
I think we should close this PR. The underlying fixed impl now yells if an issue is detected and no longer hangs. If the warning message is still printed, it's a broken equals/hashCode and should be taken care of separately.
https://issues.apache.org/jira/browse/SOLR-16489
Description
Under heavy machine load, accessing any Solr CaffeineCache can lead to threads spinning in an endless loop. In our setup we had machines spinning at 70% CPU load without receiving any requests for hours.
Solution
The problem is caused by delegating the `SolrCache#put` method to `Cache#asMap#put`. Under heavy machine load with concurrent read and write access to the same key, the `ConcurrentHashMap#put` method does not terminate and spins in an endless loop.

Tests
The behavior happens under heavy load when the cache is under congestion and read/write access is happening in different threads. We have had this small fix live for a couple of weeks now without any stuck threads.
Checklist
Please review the following and check all that apply:
- I have developed this patch against the `main` branch.
- I have run `./gradlew check`.