Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SOLR-16489] CaffeineCache puts thread into infinite loop #1118

Closed
wants to merge 2 commits into from

Conversation

tboeghk
Copy link
Contributor

@tboeghk tboeghk commented Oct 24, 2022

https://issues.apache.org/jira/browse/SOLR-16489

Description

Under heavy machine load, accessing any Solr CaffeineCache can lead to threads spinning in an endless loop. In our setup we had machines spinning at 70% cpu load without receiving any request for hours.

Solution

The problem is caused by delegating the SolrCache#put method to Cache#asMap#put Under heavy machine load with concurrent read and write access to the same key, the ConcurrentHashMap#put method does not terminate an endless loop.

Tests

The behavior happens under heavy load when the cache is under congestion and read / write access is happening in different threads. We had this small fix live for a couple of weeks now without any stuck threads.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@tboeghk tboeghk changed the title Issue/solr 16489 [SOLR-16489] CaffeineCache puts thread into infinite loop Oct 24, 2022
@@ -275,7 +275,7 @@ public V computeIfAbsent(K key, IOFunction<? super K, ? extends V> mappingFuncti
@Override
public V put(K key, V val) {
inserts.increment();
V old = cache.asMap().put(key, val);
V old = cache.asMap().compute(key, (k, v) -> val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to use .asMap()? can we just use cache.get(key, mapping_function)? This looks to be the same concept w/o trying to access the underlying map.

https://www.javadoc.io/static/com.github.ben-manes.caffeine/caffeine/3.1.1/com.github.benmanes.caffeine/com/github/benmanes/caffeine/cache/Cache.html#get(K,java.util.function.Function)

if using asMap() - compute looks correct based on https://www.javadoc.io/static/com.github.ben-manes.caffeine/caffeine/3.1.1/com.github.benmanes.caffeine/com/github/benmanes/caffeine/cache/Cache.html#asMap(), but it looks like we don't even need asMap().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to stay in line with the other methods who also call cache.asMap()

@ben-manes
Copy link

Is this perhaps related to the JIT loop unrolling bug (ben-manes/caffeine#797, SOLR-16463)? (/cc @uschindler)

If you think this is instead a bug in Caffeine, please try to provide a reproducer so that we can isolate a fix. When I run a stress and Lincheck tests it is fine, but these can be tricky. The put is optimized as a retry look instead of a pessimistic compute (2x faster), but in a microbenchmark so likely not important in an actual workload and correctness is more important.

It looks like asMap().compute is the correct replacement as Cache.get is the same as Map.computeIfAbsent and this inserts or replaces an existing value.

@ben-manes
Copy link

ConcurrentHashMap#put method does not terminate an endless loop.

This could be a livelock if Solr is performing recursive writes (e.g. a write within a computation). That is not supported and should fail fast with an error. The last case that I know of that did not error early was fixed in jdk12, but prior versions could spin forever. That last case was related to modifying the map's size counters.

@tboeghk
Copy link
Contributor Author

tboeghk commented Oct 24, 2022

We came across this bug using JDK11 and Solr 8.11.2, so this is unlikely the same as ben-manes/caffeine#797.

@ben-manes
Copy link

Can you try jdk12? Does the stacktrace look to be related to the linked patch?

@tboeghk
Copy link
Contributor Author

tboeghk commented Oct 24, 2022

We ran into the bug using JDK11, JDK17 and JDK19. I'll try to dig out the stack trace / thread dump the threads got stuck in!

@tboeghk
Copy link
Contributor Author

tboeghk commented Oct 24, 2022

The threads got stuck in this infinite loop in BoundedLocalCache. I'll ask @DennisBerger1984 and @mpetris for a detailed stack trace tomorrow.

@uschindler
Copy link
Contributor

Is this perhaps related to the JIT loop unrolling bug (ben-manes/caffeine#797, SOLR-16463)? (/cc @uschindler)

Unlikely. The JIT loop unrolling bug is according to my knowledge a false assert. The generated code looks correct, it just hits an assertion recently added to Hotspot (JDK 17).

The problem here seems to exist since 11.

@uschindler
Copy link
Contributor

The threads got stuck in this infinite loop in BoundedLocalCache. I'll ask @DennisBerger1984 and @mpetris for a detailed stack trace tomorrow.

Actually this loop sometimes hits the assertion recently added in JDK 17. I'd rewrite that infinite loop to something more "CAS" like.

@ben-manes
Copy link

ben-manes commented Oct 25, 2022

Actually this loop sometimes hits the assertion recently added in JDK 17. I'd rewrite that infinite loop to something more "CAS" like.

Unfortunately the challenge with this method is that ConcurrentHashMap is pessimistic in its write operations by locking before searching for the entry in the hash bin (or segment in 5-7). Doug had concerns about a full prescreen due to interactions with biased locking, safepoints, and GC pressure. He did eventually add a partial prescreen for the head of the bin, but that cannot be relied upon as the table is designed to prefer a high collision rate (cheap/poor hash offset by red-black tree bins).

CAS-style would be a putIfAbsent loop, so it would lock even if unsuccessful. In that case then the loop could be removed as locking is the costly work, so a compute would be preferable. Some users have assumed that putIfAbsent is a cheap read if present (e.g. apache/openwhisk#2797), so the cache tries to read and fallback to locking only when appropriate.

You can think of this as a TTAS loop which has a similar speedup in comparison to the more straightforward TAS approach.

@uschindler
Copy link
Contributor

Hi Ben,
thanks for the insight. My comment was not directly about the type of loop (CAS, TTAS,...) it was more about the coding style. Hotspot sometimes tries to optimize loops where it thinks it has some common pattern. My proposal would be to mabye form the loop in a way that it follows a common pattern (like mentioned CAS, or use a while with condition up-front,...). This one is a for(;;) with very strange exit conditions and I had the feeling it causes some bugs in hotspot and maybe also triggers the infinite loop. In Lucene's bug we had some similar problems where loop invariants were incorrectly guessed.

Of course if this is a bug in your code leading to the infinite loop, then it is not a hotspot problem. What I do not understand: How does the current PR fixes the bug?

@tboeghk
Copy link
Contributor Author

tboeghk commented Oct 25, 2022

@uschindler the PR avoids the put() method which leads to the infinite loop by using the compute() method with a supplier. This will not reach the infinite loop.

@ben-manes
Copy link

We will need to see the stacktrace and debug details. Caffeine’s loop retries only if the entry was a stale read and once locked is determined to have been already removed from the map. That state transition can only happen in the lock and within a map removal operation. That leaves a small window for a stale read, with the caveat of Map.remove(key). That method discards a refresh, which could block if something weird happened during registration. Since refresh isn’t being used here, I cannot deduce any reason for a bug in Caffeine’s code. As ConcurrentHashMap also loops forever, and was the originally claimed source, we need the details to assess a possible cause.

@DennisBerger1984
Copy link

If I remember correctly the "prior" in line 2203 of BoundedLocalCache in caffeineCache 3.3.1 was always non null and had the keyReference DEAD_STRONG_KEY such that the isAlive() check always returned false in line 2248. The thread spins in the enclosing for loop forever.

@DennisBerger1984
Copy link

Would a stacktrace help? To reproduce the problem I would need to deploy a cluster on our side as we were not able to trigger the bug with a test case. However with a gatling loadtest we were able to reproduced it reliably. To set everything up I need about 4 hours. We were able to pin point the endless loop by attaching to the debug endpoint and pause all threads.

@ben-manes
Copy link

Thanks @DennisBerger1984.

We came across this bug using JDK11 and Solr 8.11.2
If I remember correctly ... in caffeineCache 3.3.1

I see that Solr 8.11.2 is using 2.9.2, whereas currently it is using 3.1.1. Can we confirm that your failure happens using 3.3.1? I don't recall any related bug fixes, but it would help us narrow in on the problem.

@DennisBerger1984
Copy link

DennisBerger1984 commented Oct 25, 2022

Hi @ben-manes,
I can confirm that. We've bumped the version in our gradle build and I just verified that in the resulting docker image the actual jar is 3.1.1.

@ben-manes
Copy link

If you do come by a stacktrace please let me know. I am not having luck in a review of the code, but a fresh set of eyes is always welcome. If you can reproduce then look for other exceptions, like out of memory error or stackoverflow, which could leave an invalid state. In those cases the jvm may not be recoverable, e.g. ExitOnOutOfMemoryError flag.

@ben-manes
Copy link

Thanks to @DennisBerger1984's help, we met to step through the debug session of the cache in this invalid state. Dennis captured this jstack output. We confirmed the findings that the filterCache is stuck in an infinite loop due to retrieving a dead entry. When pairing on a code walkthrough, we came up with a scenario that could explain this problem and would require some Solr experts to advise us on.

In short, is the cache key stable in filter cache? This is a org.apache.lucene.search.Query object and appears to be mutable. Can a query be modified after it was inserted into the cache? If so, then the cache might fail to remove it during eviction, eagerly mark it as dead, and rediscover it later by a lookup. As modifiable keys are violations to the Map contract it is not detected or handled.

During eviction, Caffeine uses a computeIfPresent to remove the victim entry. If the entry was not found in the map, then we assume it was due to an explicit, concurrent removal. This allows the cache to eagerly decrement the current size and avoid unnecessary additional evictions to be within bounds. If a concurrent removal then a RemovalTask is queued and will no-op when run. However if the key's hash/equals were modified then eviction would instead leave a dead entry in the map, and a subsequent lookup might discover it by a map lookup.

@DennisBerger1984 suggested that the filter cache could use an immutable cache key object. Given the subtypes, Is there an easy way to do this? Would using toString() result in a key with the right uniqueness properties? If we can apply this change then Dennis will retest and we'll see if our hypothesis holds.

@uschindler
Copy link
Contributor

Query is by default an immutable object. So there should not be a separate cache key.
If you found a buggy query with unstable hashcode/equals, please report this at Lucene. This is a bug. Please no workarounds here!

@ben-manes
Copy link

Thanks @uschindler. I think we need to confirm this hypothesis first. Is there an easy way to make an immutable copy for us to test with? The actual fix might be different, but this would be the easiest way to verify our guess.

@uschindler
Copy link
Contributor

There is no clone method for queries, sorry. I don't know what the best way is to solve this. What type of query was causing this?

But the bug with infinite loop should be fixed, so it throws some exception.

@uschindler
Copy link
Contributor

You may temporary use toString as cache key. But this is not working safely for all query types.

@ben-manes
Copy link

I believe it was a WrappedQuery holding a complex BooleanQuery, but we did not look enough. I only recall that the key equality was complex with hashmap comparisons, and when we inspected the key it had setters and non-final fields. We didn't think of capturing screenshots of the object states when debugging. Dennis might be able to offer more details.

I will think more about how to handle the infinite loop in this scenario. A retry is needed if a stale read, e.g. the reader might lookup the entry, be context switched, it evicted, and then observe it to be dead. That's fine and a subsequent re-read should cause it to see a fresh result. I may be able to handle this case a little better, but since it can happen safely it is tricky.

@uschindler
Copy link
Contributor

uschindler commented Oct 28, 2022

WrappedQuery is a solr class. Many of the queries there are old and we're never checked for final fields and correctness.
This would be a good job to start fixing the - sorry to be harsh - heavily broken legacy query bullshit in Solr. Solr should only use Lucene queries and not add those useless wrappers causing trouble.
In Lucene every query only has final fields and setters are forbidden. We had some problematic ones but this was fixed long ago.

@uschindler
Copy link
Contributor

This query must die, please make fieldss final and remove the setter: https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/WrappedQuery.java

@risdenk
Copy link
Contributor

risdenk commented Oct 28, 2022

@uschindler to make sure I understand - anything in Solr that extends from https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/Query.java should have no setters and final fields. effectively immutable. makes sense to me

I filed https://issues.apache.org/jira/browse/SOLR-16509 to fix these.

@uschindler
Copy link
Contributor

For booleanQuery we have many tests in Lucene core and its own cache implementation. So it is three cases where it may break:

  • somebody calls getClauses() and modifies them (adding entries). We should prevent that by making the clauses unmodifiable in Lucene. I'll open an issue.
  • somebody modified one query inside the clauses. I would need to get all clauses of the above example
  • the WrappedQuery itsself is the issue. I would start there, see above.

@risdenk
Copy link
Contributor

risdenk commented Nov 2, 2022

https://github.com/apache/solr/search?q=setWrappedQuery is definitely not used in Solr main.

@DennisBerger1984
Copy link

@uschindler does this help?

1
2
3

@uschindler
Copy link
Contributor

Yes this helps. Interestingly: The "clauses" of a BooleanQuery are non-mutable, it is already wrapped with Collections.unmodifiableList(). So nobody can change the clauses after calling BooleanQuery.getCauses().

@ben-manes
Copy link

ben-manes commented Nov 2, 2022

oh yes, I did not notice that Collection<Query> getClauses(Occur occur) is package-private. This is returning a mutable collection from clauseSets, but since it is not public then that is safe (maybe package-private for testing).

@uschindler
Copy link
Contributor

This is a non-Lucene or non-Solr query from the external Querqy plugin.

https://github.com/querqy/querqy/blob/1dbafde2eddc0938b309bbe955f127148671fd41/querqy-for-lucene/querqy-lucene/src/main/java/querqy/lucene/rewrite/FieldBoostTermQueryBuilder.java

It looks fine, too. It also implements equals and hashCode.

@uschindler
Copy link
Contributor

DisjunctionMaxQuery unfortunately has a getDisjuncts that returns a mutable Multiset (a Lucene collection subclass): https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java

@DennisBerger1984
Copy link

Hey should we provide additional information like tests without querqy plugin or is it sufficient to wait for ticket #1146?
What do you think?

@uschindler
Copy link
Contributor

Hey should we provide additional information like tests without querqy plugin or is it sufficient to wait for ticket #1146? What do you think?

If you could temporarily disable querqy it would be a good test. I was not able to find all the queries mentioned. At least we could rule out or confirm that querqy is the problem.

The cleanup work in #1146 is good to have, but as WrappedQuery's setQuery was not called from code it wa sin fact immutable.

I still have some problems with the DisjunctionMaxQueries.... (they are created by Querqy), this is a Lucene class and not fully sealed against changes in their clauses. Maybe Querqy modifies them, Solr at least does not do this.

@renekrie
Copy link

renekrie commented Nov 11, 2022

@DennisBerger1984 Are you using the newish multiMatchTie as a Querqy query parameter (related to querqy/querqy#281)? Also, are you setting uq.similarityScore or qboost.similarityScore (see https://querqy.org/docs/querqy/more-about-queries.html#reference) - I'm trying to think of a place where we are manipulating queries more heavily. On the other hand, we never touch anything after the 'parse' step and we never read from the cache, so I can't see how we could touch anything that has already been put into the cache.

@DennisBerger1984
Copy link

Hi René, qboost.similarityScore is off, multiMatchTie is set to 0.2. I'm currently running a test with defType=edismax to futher drill down.

@renekrie
Copy link

Thanks @DennisBerger1984! There is a smell related to multiMatchTie - querqy/querqy#384. This FieldBoost implementation should be used under the hood with multiMatchTie under certain conditions and be referenced from queries and it misses hashcode/equals (though I cannot see SingleFieldBoost in your object graph). It should still result just in a cache miss and not trigger a loop. Maybe for you test, you could just remove the multiMatchTie param in a first step.

@DennisBerger1984
Copy link

When using defType=edismax the problem doesn't occur. Keeping querqy as defType but removing multiMatchTie doesn't help. But to be honest multiMatchTie was disabled anyway I think.

@uschindler
Copy link
Contributor

A completely missing equals/hashCode should not bring problems, because the cache would simply not work. For the bug to occur the hashcode or equals must suddenly change for an existing query.

@uschindler
Copy link
Contributor

uschindler commented Nov 11, 2022

For current lucene queries the problem may occur when you pass a BytesRef (it is a ref) only while building the query. If you change the bytesRef afterwards it fails.

IMHO queries should clone the terms passed in. Solr has the big problem of over-reusing mutable references, although this is really no problem anymore as JVM can handle that with escape analysis and won't allocate instances. Actually "reusing" is bad in newer java versions, as it will populate the heap with useless temporary (but reused) objects.

@renekrie
Copy link

For current lucene queries the problem may occur when you pass a BytesRef (it is a ref) only while building the query. If you change the bytesRef afterwards it fails.

I briefly checked this but as far as I can see we are always using a fresh BytesRef. Manipulations at the Querqy level (= before going to the analysis chain) happen via character arrays anyway and they aren't being used any longer once a Lucene query has been produced.

@DennisBerger1984 Can I ask you to disable query rewriters or/and use empty rule sets with rewriters still being enabled to narrow down the problem?

@debe
Copy link
Contributor

debe commented Nov 11, 2022

@renekrie the querqy.rewriters are completely empty already during this tests

@ben-manes
Copy link

Released caffeine 3.1.2 with this error detection

@risdenk
Copy link
Contributor

risdenk commented Nov 28, 2022

I opened SOLR-16562 to upgrade to Caffeine 3.1.2 which will detect infinite loops (not fix the issue).

@risdenk
Copy link
Contributor

risdenk commented Jan 18, 2023

@tboeghk @DennisBerger1984 @mpetris what is the status here? Did you end up finding the Query impl that is changing hashcode/equals? Is this PR necessary still?

@DennisBerger1984
Copy link

Hi @risdenk I've been rather busy lately. In the mean time did we upgrade to caffeine 3.1.2 and removed our workaround. Since then we did not encountered the problem again, but neither did we saw contract violated messages from caffeine. So we didn't find it, sadly.

@janhoy
Copy link
Contributor

janhoy commented Feb 15, 2023

This PR is no longer needed when we have the upgraded CaffeineCache version, or not?

@ben-manes
Copy link

PR is not necessary but was the underlying bug fixed? The issue was corrupting the cache by modifying the key’s equality while inside of a ConcurrentHashMap. The upgrade just yells if detected, but it was a usage error.

@DennisBerger1984
Copy link

All in all it's very hard for us to reproduce the bug reliably. I've seen a couple "An invalid state was detected" message two weeks ago and thought now I've found an offending query, but unfortunatbut they've disappeared and the last week was free of those messages. I've scanned around 1TB and 100 million log entries.

Copy link

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Feb 18, 2024
@uschindler
Copy link
Contributor

I think we should close this PR. The underlying fixed impl now yells if an issue was detected and no longer hangs.

If the warning message is still printed, it's a broken equals/hashcode and should be taken care of separately.

@uschindler uschindler closed this Feb 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale PR not updated in 60 days
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants