feat: uplift Lucene to 4.10.4 #74

pgj · 2023-05-20T01:33:31Z

Update Clouseau to use the latest available Lucene version from the 4.x series. This change brings many bug fixes while it maintains the backward compatibility for the existing indexes. 4.10.4 has 45 API changes and 116 bug fixes since 4.6.1, for the details see the change log.

pgj · 2023-05-20T01:55:01Z

It passes the make elixir-search tests, but many tests fail when running with make mango-test:

COUCH_USER=adm COUCH_PASS=pass .venv/bin/python3 -m nose2  ... ........s.ss.s.s.............s.ss.s.s.......F.FF.s.ss.s.s........................................................................................................................................FF...................s...................................................FF...............................................................................F.....FF.......
======================================================================
FAIL: test_elem_match (03-operator-test.OperatorTextTests.test_elem_match)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/03-operator-test.py", line 48, in test_elem_match
    self.assertEqual(len(docs), 1)
AssertionError: 0 != 1

======================================================================
FAIL: test_eq_null_does_not_include_missing (03-operator-test.OperatorTextTests.test_eq_null_does_not_include_missing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/03-operator-test.py", line 113, in test_eq_null_does_not_include_missing
    self.assertUserIds(user_ids, docs)
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/03-operator-test.py", line 22, in assertUserIds
    self.assertEqual(user_ids, user_ids_returned)
AssertionError: Lists differ: [9] != []

First list contains 1 additional elements.
First extra element 0:
9

- [9]
?  -

+ []

======================================================================
FAIL: test_exists_false (03-operator-test.OperatorTextTests.test_exists_false)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/03-operator-test.py", line 106, in test_exists_false
    self.assertUserIds(user_ids, docs)
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/03-operator-test.py", line 22, in assertUserIds
    self.assertEqual(user_ids, user_ids_returned)
AssertionError: Lists differ: [2, 3, 5, 6, 7, 8, 10, 11, 12, 14] != []

First list contains 10 additional elements.
First extra element 0:
2

- [2, 3, 5, 6, 7, 8, 10, 11, 12, 14]
+ []

======================================================================
FAIL: test_gt (06-basic-text-test.BasicTextTests.test_gt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/06-basic-text-test.py", line 152, in test_gt
    assert len(docs) == 2
           ^^^^^^^^^^^^^^
AssertionError

======================================================================
FAIL: test_gte (06-basic-text-test.BasicTextTests.test_gte)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/06-basic-text-test.py", line 175, in test_gte
    assert len(docs) == 2
           ^^^^^^^^^^^^^^
AssertionError

======================================================================
FAIL: test_limit_bookmark (08-text-limit-test.LimitTests.test_limit_bookmark)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/08-text-limit-test.py", line 97, in test_limit_bookmark
    self.run_bookmark_check(i)
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/08-text-limit-test.py", line 115, in run_bookmark_check
    assert len(seen_docs) == len(limit_docs.DOCS)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

======================================================================
FAIL: test_limit_field (08-text-limit-test.LimitTests.test_limit_field)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/08-text-limit-test.py", line 23, in test_limit_field
    assert len(docs) == 8
           ^^^^^^^^^^^^^^
AssertionError

======================================================================
FAIL: test_guess_dup_type_sort (09-text-sort-test.SortTests.test_guess_dup_type_sort)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/09-text-sort-test.py", line 79, in test_guess_dup_type_sort
    self.assertEqual(len(docs), 15)
AssertionError: 0 != 15

======================================================================
FAIL: test_number_sort (09-text-sort-test.SortTests.test_number_sort)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/09-text-sort-test.py", line 22, in test_number_sort
    self.assertEqual(len(docs), 15)
AssertionError: 0 != 15

======================================================================
FAIL: test_number_sort_desc (09-text-sort-test.SortTests.test_number_sort_desc)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/09-text-sort-test.py", line 28, in test_number_sort_desc
    self.assertEqual(len(docs), 15)
AssertionError: 0 != 15

----------------------------------------------------------------------
Ran 362 tests in 26.674s

FAILED (failures=10, skipped=16)

The common reason for the failures is that Clouseau does not return any value. Based on my investigations so far, it works as expected up to 4.8.1, and it starts to fail for 4.9.0 and later versions. I have checked the change log, but none of the API changes directly indicate anything related.

jaydoane · 2023-05-20T22:02:58Z

Awesome work!

I tested it, and see the same test failures:

Ran 361 tests in 39.905s

FAILED (failures=10, skipped=16)

Also noting that my machine appears to be somewhat slower than yours (my ~40s vs your ~27s), which I wonder if accounts for why the write lock timeout error seems more reliable on your hardware?

FWIW, I also ran these 2 tests 1000 times in a row, and didn't see any errors:

test_disable_index_array_length (10-disable-array-length-field-test.DisableIndexArrayLengthsTest) ... ok
test_enable_index_array_length (10-disable-array-length-field-test.DisableIndexArrayLengthsTest) ... ok

pgj · 2023-05-20T23:13:17Z

Awesome work!

👍 But, as indicated, I will have to nag @rnewson for hints on why it does not fully work.

I tested it, and see the same test failures:

Thanks for the confirmation!

[M]y machine appears to be somewhat slower than yours [..], which I wonder if accounts for why the write lock timeout error seems more reliable on your hardware?

Yes, that is possible. However, I thought that you have an 2021 M1 Max too.

FWIW, I also ran these 2 tests 1000 times in a row, and didn't see any errors:

You will not, these tests are faulty. And I think I have found the cause for that -- I will soon submit a PR about that for apache/couchdb. Prepare for the facepalm :-D

jaydoane · 2023-05-20T23:19:32Z

these tests are faulty

Yeah, I've been debugging them a bit today. Seems like the user docs get bulk loaded into the db, but then db info only shows the ddocs, but also no deleted docs. I'm looking forward to the 🤦

pgj · 2023-05-21T00:41:29Z

Here are the bugs of 10-disable-array-length-field-test with their proposed fixes: apache/couchdb#4610.

pgj · 2023-05-21T11:40:36Z

One of the queries that return 0 hits starting from 4.9.0:

03-operator-test.test_elem_match, Mango query:

{
  "_id": {"$gt": null},
  "bang": {"$elemMatch": {"foo": {"$gte": 1}, "bam": true}},
}

which translates to:

($fieldnames:_5fid_3a* $fieldnames:_5fid.*) +(+bang._5b_5d.foo_3anumber:[1.0 TO Infinity] +bang._5b_5d.bam_3aboolean:true)

Note that the following queries work at the same time:

($fieldnames:_5fid_3a* $fieldnames:_5fid.*) +bang._5b_5d.foo_3anumber:[1.0 TO Infinity]

and

($fieldnames:_5fid_3a* $fieldnames:_5fid.*) +bang._5b_5d.bam_3aboolean:true

pgj · 2023-05-21T14:27:55Z

After taking a closer look at the rest of the failing tests cases, I noticed that it is the ranges on numbers that cause the problems. More specifically, ranges with no lower or upper limits, i.e. the use of the Infinity constant. Apparently, Lucene 4.9 and higher do not handle properly if Infinity is written with a capital I.

The previous query works if the Infinity is either written as infinity or infinite, that is:

($fieldnames:_5fid_3a* $fieldnames:_5fid.*) +(+bang._5b_5d.foo_3anumber:[1.0 TO infinity] +bang._5b_5d.bam_3aboolean:true)

If CouchDB is modified according to this observation, make mango-test passes:

diff --git a/src/mango/src/mango_selector_text.erl b/src/mango/src/mango_selector_text.erl
index 1f8609ac2..4b46031eb 100644
--- a/src/mango/src/mango_selector_text.erl
+++ b/src/mango/src/mango_selector_text.erl
@@ -304,11 +304,11 @@ range(gt, Arg) ->
     [<<"{">>, value_str(Arg), <<" TO ", Max/binary, "]">>].

 get_range(min, Arg) when is_number(Arg) ->
-    <<"-Infinity">>;
+    <<"-infinity">>;
 get_range(min, _Arg) ->
     <<"\"\"">>;
 get_range(max, Arg) when is_number(Arg) ->
-    <<"Infinity">>;
+    <<"infinity">>;
 get_range(max, _Arg) ->
     <<"\u0x10FFFF">>.

pgj · 2023-05-21T16:59:08Z

Unfortunately, this latest fix does not solve the problem, only masks it. That is because using infinity (basically any other text literal) causes Lucene to return all the documents which are then filtered on the CouchDB side by Mango. That is, more documents are returned than required in result and this is not efficient.

I was playing further with the Lucene queries. For the user_docs sample database used by the Mango integration tests, I noticed the following behavior. Documents are there and all of them are returned when no filtering is applied:

*:*

Infinity as upper bound works when the lower bound is set to the minimum age value, e.g.:

age_3anumber:[22.0 TO Infinity]

This can be lowered until 2.0 when Lucene stops returning any hit, e.g.:

age_3anumber:[2.0 TO Infinity]

When Infinity is not used, even 0.0 (as used in many tests) works as lower bound, e.g.:

age_3anumber:[0.0 TO 100.0]

At the same time, the inclusive ranges do not include the value of the lower bound. That is the following query will not include 73 unless the lower bound is lowered to 72.0:

age_3anumber:[73.0 TO Infinity]

pgj · 2023-05-21T17:06:30Z

For the reference, here are the commands that I used for the investigation:

# Run in the CouchDB git clone:
MANGO_TESTS_KEEP_DBS=1 make mango-test MANGO_TEST_OPTS="03-operator-test.OperatorTextTests.test_elem_match"
# Start CouchDB manually via dev/run (-a adm:pass -n 1), then:
MANGO_DB=$(curl -sS http://adm:pass@localhost:15984/_all_dbs | jq -r '.[]|select(startswith("mango_test_"))')
# Set the query:
QUERY="age_3anumber:[2.1 TO Infinity]"
# Run the query:
curl -s -X POST -H 'Content-Type: application/json' "http://adm:pass@localhost:15984/$MANGO_DB/_design/cedc01a027213706d7260b5e5b73c70b9233743a/_search/cedc01a027213706d7260b5e5b73c70b9233743a" -d '{"q":"'$QUERY'","include_docs":true}' | jq -r '.'

pgj · 2023-05-21T19:21:46Z

I think I have the source of the problem. LUCENE-5609 applies an optimization which technically implies a change in the run-time behavior because DoubleRanges should be created with the precision step of 16 instead of 8. I changed it to this value but we may want to lose it completely, because it is optional.

rnewson · 2023-05-22T17:21:58Z

src/main/scala/com/cloudant/clouseau/ClouseauQueryParser.scala

@@ -62,7 +62,7 @@ class ClouseauQueryParser(version: Version,
                             startInclusive: Boolean,
                             endInclusive: Boolean): Query = {
    if (isNumber(lower) && isNumber(upper)) {
-      NumericRangeQuery.newDoubleRange(field, 8, lower.toDouble,
+      NumericRangeQuery.newDoubleRange(field, 16, lower.toDouble,


This triggered some investigation on my part, coupled with the other comments about precision step default change.

Seems we currently index with precision step of 4 (default for 4.6.1) but query with precision step 8. That's probably a bug, albeit one we've had for a long time.

I don't think we can change this when we upgrade as it might cause search results to change.

Instead, we need to keep the above as 8 and change how we index doubles to continue to do so with precision step of 4 not the 'new' default of 16.

See 742670b for my attempt at doing so.

I have had a feeling that this is going to be something that is problematic. The optimization leads to smaller indexes which means they need to be rebuilt, right? The integration tests work with fresh indexes, but for existing deployments this could be a problem.

Thank you very much for the proposal, I will have take a look at it and integrate soon.

3519e68 rather

no, it means that search results could change. some double values will be quantised to different values before and after this change. Existing indexes will have a mixture of the old and new precision step. Simply put, we can change precision step (and other things) but it requires an index rebuild to get the right answers.

also quoting from 5609;

To be simple, i think a change we make has to be a multiple of 4: the current default.
This way we have easy backwards compat with existing indexes for users that just go with the default (since NumericRangeQuery has ctors that use the default precisionStep, and multiples of the index-time work at query-time).

Sorry, I do not see how does all this translates to the current situation. Should we change the Double field type wrapper to set the numeric precision step to 8 instead of 4?

Sorry, there are several issues here.

Clouseau has apparently always been wrong on this. We are indexing numeric values with precision step of 4 but only ever querying with precision step of 8. Half of those terms are never queried, there's no possible path to them, but we took the time to index them and store them.

Later version of Lucene have changed the default (from 4 to 16), which triggered this conversation.

I don't think we should change to 16 as it would change the results of current queries (I think, but this should be challenged, perhaps I misunderstand the implications).

So, the least we must do is prevent the default from changing to 16, and that requires the FieldType changes I show in my commit above.

The only question remaining is whether we keep exactly what we have today (4 for indexing, 8 for querying), or whether we address that mistake and switch to 8 for indexing and querying.

If I'm wrong about the implications of 16 then we could of course do nothing, but someone other than me will need to do that investigation, or it's just an echo chamber.

Specifically, I worry that a document might no longer match a given query if we go from 4 to 16 because the numeric term that would match in the former case simply doesn't exist in the latter case.

(as a primer, the way old versions of Lucene index numbers is to convert them into a series of specially formatted strings which are then indexed as terms. Newer versions of Lucene use a better system called points and the notion of precision step no longer applies).

Noting there's a good chance that precision step is "just" about performance. the actual numeric value is indexed in all cases, but not all of the speed-up / lower precision terms are. This is all ancient history and poorly remembered (by me).

Thank you for the explanation.

This uplift would require a validation to eliminate the generic worries around the potential index format changes anyway. Once we had the related end-to-end tests implemented, we can also experiment with the various configurations on precision step.

src/main/scala/com/cloudant/clouseau/IndexManagerService.scala

src/main/scala/com/cloudant/clouseau/ClouseauTypeFactory.scala

rnewson · 2023-05-31T13:17:17Z

src/main/scala/com/cloudant/clouseau/IndexService.scala

-                    }
-                    drilldownQuery.add(categoryPaths: _*)
+                val dim = category.head
+                for (path <- category.tail) {


Is this the same effect? It appears to loop over each item in the tail and add each of them to the drilldownQuery. whereas the original code only calls drilldownQuery.add once.

The Facet API has changed, it is not possible to call DrillDownQuery/add with the full list any more, c.f. 4.6.1 vs. 4.10.4. Note that there is a loop inside the 4.6.1 version. It uses the following definition to determine the dim:

String dim = paths[0].components[0];

and that is what I considered the head here, also in accordance with the removed else branch (the one with the note about the removal once leaving Lucene 4.6 behind).

src/main/scala/com/cloudant/clouseau/IndexService.scala

Update Clouseau to use the latest available Lucene version from the 4.x series. This commit brings many bug fixes while it maintains the backward compatibility for the existing indexes. 4.10.4 has 45 API changes and 116 bug fixes since 4.6.1, for the details see the change log [1]. [1] https://lucene.apache.org/core/4_10_4/changes/Changes.html

pgj · 2023-08-21T20:31:15Z

I am closing now in favor of #80. I will leave the changes around and we can revisit the uplift later.

pgj mentioned this pull request May 20, 2023

fix(IndexWriter): work around resource leak on failure of rollback() #73

Closed

pgj force-pushed the lucene-4-10 branch from b9be3fc to 7e7fd66 Compare May 21, 2023 19:15

pgj changed the title ~~feat: Uplift Lucene to 4.10.4~~ feat: uplift Lucene to 4.10.4 May 22, 2023

rnewson reviewed May 22, 2023

View reviewed changes

pgj force-pushed the lucene-4-10 branch from 7e7fd66 to 5b4cb99 Compare May 23, 2023 11:30

rnewson requested changes May 31, 2023

View reviewed changes

pgj force-pushed the lucene-4-10 branch 2 times, most recently from 29df361 to 37c6329 Compare June 2, 2023 16:52

pgj force-pushed the lucene-4-10 branch from 37c6329 to 8892e0f Compare June 6, 2023 13:02

pgj closed this Aug 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: uplift Lucene to 4.10.4 #74

feat: uplift Lucene to 4.10.4 #74

pgj commented May 20, 2023 •

edited

Loading

pgj commented May 20, 2023 •

edited

Loading

jaydoane commented May 20, 2023

pgj commented May 20, 2023

jaydoane commented May 20, 2023 •

edited

Loading

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023 •

edited

Loading

rnewson May 22, 2023

rnewson May 22, 2023

pgj May 22, 2023

rnewson May 22, 2023

rnewson May 22, 2023

rnewson May 23, 2023 •

edited

Loading

pgj May 23, 2023

rnewson May 23, 2023 •

edited

Loading

rnewson May 23, 2023

pgj May 23, 2023

rnewson May 31, 2023

pgj Jun 1, 2023

pgj commented Aug 21, 2023

feat: uplift Lucene to 4.10.4 #74

feat: uplift Lucene to 4.10.4 #74

Conversation

pgj commented May 20, 2023 • edited Loading

pgj commented May 20, 2023 • edited Loading

jaydoane commented May 20, 2023

pgj commented May 20, 2023

jaydoane commented May 20, 2023 • edited Loading

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023

pgj commented May 21, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rnewson May 23, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rnewson May 23, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgj commented Aug 21, 2023

pgj commented May 20, 2023 •

edited

Loading

pgj commented May 20, 2023 •

edited

Loading

jaydoane commented May 20, 2023 •

edited

Loading

pgj commented May 21, 2023 •

edited

Loading

rnewson May 23, 2023 •

edited

Loading

rnewson May 23, 2023 •

edited

Loading