Speed up advancing on the disjunction iterator. #14052

Merged: 3 commits, Dec 16, 2024

@jpountz (Contributor) commented Dec 10, 2024

Currently, the disjunction iterator puts all clauses in a heap so that it can merge doc IDs in a streaming fashion. This works well for exhaustive evaluation, where only one clause moves to a different doc ID on average and the per-iteration cost is on the order of O(log(N)), where N is the number of clauses.

However, if a selective filter is applied, many clauses may need to move to a different doc ID on each advance. In the worst case, all clauses move and the cost of maintaining heap invariants grows to O(N * log(N)), since every clause incurs an O(log(N)) cost. With many clauses, this is much higher than the cost of checking all clauses sequentially, which is O(N).
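To put rough numbers on the gap, here is a small sketch that counts abstract "steps" per advance in the worst case. The class and method names are hypothetical, and the step counts are a simplified model (one unit per heap level or per clause checked), not measurements of Lucene.

```java
/** Hypothetical cost model for illustration; units are abstract steps, not time. */
final class AdvanceCostSketch {

  /** Smallest k such that 2^k >= n, for n >= 1. */
  static long log2Ceil(long n) {
    return 64 - Long.numberOfLeadingZeros(n - 1);
  }

  /** Worst case with a heap: all n clauses move, each paying about log2(n) for reordering. */
  static long heapWorstCase(long n) {
    return n * log2Ceil(n);
  }

  /** Linear scan: every clause is checked once, regardless of how many move. */
  static long linearScan(long n) {
    return n;
  }

  public static void main(String[] args) {
    for (long n : new long[] {8, 64, 1024}) {
      System.out.println(
          "N=" + n + " heap worst case=" + heapWorstCase(n) + " linear=" + linearScan(n));
    }
  }
}
```

Under this model, the heap's worst case is a factor of about log2(N) more expensive than the linear scan, so the gap widens as the number of clauses grows.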

To protect against this reordering overhead, DisjunctionDISIApproximation now only puts the cheapest clauses in a heap, chosen so that at most about 1.5 clauses move to a different doc ID on average. More expensive clauses are checked linearly.
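The idea can be sketched roughly as follows. This is a simplified illustration, not the code from this change: `HybridDisjunctionSketch`, `Clause`, and the `leadCost` parameter are hypothetical names, clauses are plain sorted arrays, and the split rule (add the cheapest clauses to the heap while the sum of min(1, cost / leadCost) stays at or below 1.5) is an assumed reading of "up to 1.5 clauses moving on average".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only; not Lucene's DisjunctionDISIApproximation. */
final class HybridDisjunctionSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** A clause backed by a sorted array of doc IDs; its cost is the number of docs. */
  static final class Clause {
    final int[] docs;
    int pos = -1;

    Clause(int... docs) {
      this.docs = docs;
    }

    long cost() {
      return docs.length;
    }

    int doc() {
      return pos < 0 ? -1 : pos >= docs.length ? NO_MORE_DOCS : docs[pos];
    }

    void advance(int target) {
      while (doc() < target) {
        pos++;
      }
    }
  }

  final PriorityQueue<Clause> heap = new PriorityQueue<>(Comparator.comparingInt(Clause::doc));
  final List<Clause> linear = new ArrayList<>();

  /** leadCost (assumed > 0) estimates how many docs the leading iterator will visit. */
  HybridDisjunctionSketch(List<Clause> clauses, long leadCost) {
    List<Clause> byCost = new ArrayList<>(clauses);
    byCost.sort(Comparator.comparingLong(Clause::cost)); // cheapest first
    double expectedMoves = 0; // assumed estimate of heap clauses moving per advance
    for (Clause c : byCost) {
      double moves = Math.min(1.0, (double) c.cost() / leadCost);
      if (linear.isEmpty() && expectedMoves + moves <= 1.5) {
        heap.add(c);
        expectedMoves += moves;
      } else {
        linear.add(c); // this clause and all costlier ones are scanned linearly
      }
    }
  }

  /** Returns the smallest doc ID >= target across all clauses, or NO_MORE_DOCS. */
  int advance(int target) {
    // Heap part: only clauses behind target are touched, each costing O(log n).
    while (!heap.isEmpty() && heap.peek().doc() < target) {
      Clause top = heap.poll();
      top.advance(target);
      heap.add(top);
    }
    int min = heap.isEmpty() ? NO_MORE_DOCS : heap.peek().doc();
    // Linear part: every clause is checked once, with no reordering.
    for (Clause c : linear) {
      c.advance(target);
      min = Math.min(min, c.doc());
    }
    return min;
  }

  public static void main(String[] args) {
    HybridDisjunctionSketch d =
        new HybridDisjunctionSketch(
            List.of(
                new Clause(1, 50),
                new Clause(3, 7, 60),
                new Clause(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)),
            4);
    System.out.println("heap=" + d.heap.size() + " linear=" + d.linear.size());
    System.out.println(d.advance(0) + " " + d.advance(5) + " " + d.advance(19));
  }
}
```

In this sketch the two cheap clauses end up in the heap, where they rarely need reordering, while the dense clause that moves on nearly every call is checked with a plain loop.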

@jpountz jpountz added this to the 10.1.0 milestone Dec 10, 2024
@jpountz (Contributor Author) commented Dec 10, 2024

luceneutil suggests that this change trades a small slowdown when a DisjunctionDISIApproximation leads iteration (AndHighOrMedMed, CombinedOrHighMed, CombinedAndHighMed, CombinedOrHighHigh, CombinedAndHighHigh) for a speedup when it has to catch up with another clause (CountFilteredOrHighMed, AndMedOrHighHigh, CountFilteredOrHighHigh, CountFilteredOrMany). (Note that the slowdown is a fixed cost due to a few additional checks, while the speedup scales with the number of clauses.)

                            Task  QPS baseline    StdDev  QPS my_modified_version    StdDev                Pct diff  p-value
                          IntNRQ      115.88     (17.1%)      110.62     (12.7%)   -4.5% ( -29% -   30%) 0.366
                  FilteredIntNRQ      114.03     (16.1%)      109.46     (12.6%)   -4.0% ( -28% -   29%) 0.405
                 AndHighOrMedMed       46.62      (2.1%)       45.46      (1.3%)   -2.5% (  -5% -    0%) 0.000
                         Prefix3      130.51      (4.4%)      127.43      (5.9%)   -2.4% ( -12% -    8%) 0.175
               CombinedOrHighMed       74.76      (1.1%)       73.29      (1.9%)   -2.0% (  -4% -    1%) 0.000
              CombinedAndHighMed       57.30      (1.8%)       56.19      (2.0%)   -1.9% (  -5% -    1%) 0.002
              CombinedOrHighHigh       19.72      (1.3%)       19.36      (2.0%)   -1.8% (  -5% -    1%) 0.002
             CombinedAndHighHigh       15.75      (1.9%)       15.47      (2.2%)   -1.8% (  -5% -    2%) 0.012
             FilteredAndHighHigh       63.99      (1.8%)       63.06      (1.3%)   -1.4% (  -4% -    1%) 0.006
            FilteredAndStopWords       49.00      (1.7%)       48.40      (1.2%)   -1.2% (  -4% -    1%) 0.013
                     CountPhrase        4.34      (3.0%)        4.29      (3.1%)   -1.2% (  -7% -    5%) 0.251
                    AndStopWords       32.46      (3.0%)       32.11      (3.9%)   -1.1% (  -7% -    5%) 0.344
                        Wildcard       76.33      (3.2%)       75.52      (3.8%)   -1.1% (  -7% -    6%) 0.367
                    CombinedTerm       32.39      (2.7%)       32.08      (3.5%)   -1.0% (  -6% -    5%) 0.349
                       CountTerm     9409.48      (3.1%)     9321.18      (3.3%)   -0.9% (  -7% -    5%) 0.378
                          Fuzzy1       81.72      (2.4%)       81.00      (2.7%)   -0.9% (  -5% -    4%) 0.304
                  FilteredPhrase       30.89      (2.0%)       30.62      (1.8%)   -0.9% (  -4% -    2%) 0.165
                    FilteredTerm      154.99      (3.1%)      153.79      (2.3%)   -0.8% (  -6% -    4%) 0.400
                   TermTitleSort      157.71      (1.9%)      156.57      (1.7%)   -0.7% (  -4% -    2%) 0.235
                            Term      480.15      (6.3%)      476.76      (5.3%)   -0.7% ( -11% -   11%) 0.717
               FilteredAnd3Terms      198.21      (1.5%)      196.86      (1.8%)   -0.7% (  -3% -    2%) 0.218
     FilteredAnd2Terms2StopWords      200.76      (1.1%)      199.40      (1.0%)   -0.7% (  -2% -    1%) 0.055
                 FilteredPrefix3      123.17      (4.3%)      122.39      (5.9%)   -0.6% ( -10% -    9%) 0.712
                     OrStopWords       33.58      (5.5%)       33.37      (6.7%)   -0.6% ( -12% -   12%) 0.764
                          Fuzzy2       76.93      (2.0%)       76.46      (2.3%)   -0.6% (  -4% -    3%) 0.401
                  CountOrHighMed      140.99      (1.6%)      140.14      (1.1%)   -0.6% (  -3% -    2%) 0.198
               TermDayOfYearSort      637.24      (2.3%)      633.60      (2.6%)   -0.6% (  -5% -    4%) 0.484
                        PKLookup      279.61      (1.9%)      278.15      (2.9%)   -0.5% (  -5% -    4%) 0.517
                 CountOrHighHigh       75.89      (2.0%)       75.50      (1.7%)   -0.5% (  -4% -    3%) 0.398
                  FilteredOrMany       17.03      (3.2%)       16.95      (4.1%)   -0.5% (  -7% -    7%) 0.699
                        Or3Terms      172.16      (3.4%)      171.35      (3.7%)   -0.5% (  -7% -    6%) 0.689
                       And3Terms      181.15      (2.0%)      180.35      (2.9%)   -0.4% (  -5% -    4%) 0.597
              FilteredAndHighMed      132.36      (1.6%)      131.82      (2.3%)   -0.4% (  -4% -    3%) 0.535
                     CountOrMany        7.52      (1.9%)        7.49      (1.7%)   -0.4% (  -3% -    3%) 0.522
                     AndHighHigh       45.42      (2.4%)       45.30      (1.9%)   -0.3% (  -4% -    4%) 0.717
                DismaxOrHighHigh      115.82      (5.0%)      115.54      (4.4%)   -0.2% (  -9% -    9%) 0.876
                   TermMonthSort     3413.43      (1.9%)     3405.69      (2.0%)   -0.2% (  -4% -    3%) 0.729
                      AndHighMed      132.72      (2.0%)      132.45      (1.8%)   -0.2% (  -3% -    3%) 0.747
                          OrMany       19.85      (3.4%)       19.82      (2.6%)   -0.2% (  -5% -    6%) 0.880
             CountFilteredPhrase       26.01      (1.8%)       25.97      (1.6%)   -0.1% (  -3% -    3%) 0.832
             And2Terms2StopWords      166.77      (2.4%)      166.60      (2.9%)   -0.1% (  -5% -    5%) 0.908
                 CountAndHighMed      161.46      (1.5%)      161.39      (2.2%)   -0.0% (  -3% -    3%) 0.947
               FilteredOrHighMed      155.06      (1.2%)      155.02      (1.2%)   -0.0% (  -2% -    2%) 0.955
                 DismaxOrHighMed      168.40      (3.1%)      168.37      (2.5%)   -0.0% (  -5% -    5%) 0.981
                FilteredOr3Terms      167.26      (1.2%)      167.34      (1.0%)    0.1% (  -2% -    2%) 0.889
              Or2Terms2StopWords      162.55      (3.5%)      162.65      (3.7%)    0.1% (  -6% -    7%) 0.956
                      OrHighHigh       52.07      (4.3%)       52.11      (4.9%)    0.1% (  -8% -    9%) 0.960
             FilteredOrStopWords       43.70      (2.7%)       43.75      (2.7%)    0.1% (  -5% -    5%) 0.904
      FilteredOr2Terms2StopWords      148.84      (1.3%)      149.02      (1.3%)    0.1% (  -2% -    2%) 0.790
                      DismaxTerm      600.72      (5.2%)      601.50      (3.8%)    0.1% (  -8% -    9%) 0.932
              FilteredOrHighHigh       64.94      (2.5%)       65.03      (2.5%)    0.1% (  -4% -    5%) 0.872
                      OrHighRare      261.76     (10.9%)      262.11     (10.4%)    0.1% ( -19% -   24%) 0.970
                CountAndHighHigh       55.31      (1.8%)       55.45      (1.7%)    0.2% (  -3% -    3%) 0.677
                          Phrase       15.80      (5.4%)       15.84      (4.5%)    0.3% (  -9% -   10%) 0.865
                      TermDTSort      286.63      (5.6%)      287.51      (6.4%)    0.3% ( -11% -   13%) 0.878
                       OrHighMed      191.75      (3.8%)      192.45      (3.6%)    0.4% (  -6% -    8%) 0.766
          CountFilteredOrHighMed       68.17      (1.7%)       68.93      (1.4%)    1.1% (  -1% -    4%) 0.034
                AndMedOrHighHigh       60.11      (2.0%)       62.55      (2.2%)    4.1% (   0% -    8%) 0.000
         CountFilteredOrHighHigh       57.67      (2.0%)       64.28      (2.0%)   11.5% (   7% -   15%) 0.000
             CountFilteredOrMany        3.88      (4.8%)        8.71      (3.3%)  124.6% ( 111% -  139%) 0.000

@javanna javanna modified the milestones: 10.1.0, 10.2.0 Dec 14, 2024
@jpountz jpountz merged commit bc341f2 into apache:main Dec 16, 2024
5 checks passed
@jpountz jpountz deleted the speed_up_disjunction_disi branch December 16, 2024 14:33
jpountz added a commit that referenced this pull request Dec 16, 2024