Query: Adds hybrid search query pipeline stage #4794

neildsh · 2024-10-10T22:11:59Z

Description

Adds a hybrid search query pipeline stage. The feature lights up only once the new Direct package and gateway are available.

Given an input SQL query such as:

      SELECT TOP 100 c.text, c.abstract
      FROM c
      ORDER BY RANK RRF(FullTextScore(c.text, ['swim', 'run']), FullTextScore(c.abstract, ['energy']))

The new query plan (encoded below as XML instead of JSON to help readability) is as follows:

        <queryRanges>
          <Item>{"min":[],"max":"Infinity","isMinInclusive":true,"isMaxInclusive":false}</Item>
        </queryRanges>
        <hybridSearchQueryInfo>
          <globalStatisticsQuery><![CDATA[
SELECT 
    COUNT(1) AS documentCount,
    [
        {
            totalWordCount: SUM(_FullTextWordCount(c.text)),
            hitCounts: [
                COUNTIF(FullTextContains(c.text, "swim")),
                COUNTIF(FullTextContains(c.text, "run"))
            ]
        },
        {
            totalWordCount: SUM(_FullTextWordCount(c.abstract)),
            hitCounts: [
                COUNTIF(FullTextContains(c.abstract, "energy"))
            ]
        }
    ] AS fullTextStatistics
FROM c
]]></globalStatisticsQuery>
          <componentQueryInfos>
            <Item>
              <distinctType>None</distinctType>
              <top>200</top>
              <orderBy>
                <Item>Descending</Item>
              </orderBy>
              <orderByExpressions>
                <Item>_FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0})</Item>
              </orderByExpressions>
              <hasSelectValue>false</hasSelectValue>
              <rewrittenQuery><![CDATA[
SELECT TOP 200 
    c._rid,
    [
        {
            item: _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0})
        }
    ] AS orderByItems,
    {
        payload: {
            text: c.text,
            abstract: c.abstract
        },
        componentScores: [
            _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0}),
            _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})
        ]
    } AS payload
FROM c
WHERE {documentdb-formattableorderbyquery-filter}
ORDER BY _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0}) DESC
]]></rewrittenQuery>
              <hasNonStreamingOrderBy>true</hasNonStreamingOrderBy>
            </Item>
            <Item>
              <distinctType>None</distinctType>
              <top>200</top>
              <orderBy>
                <Item>Descending</Item>
              </orderBy>
              <orderByExpressions>
                <Item>_FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})</Item>
              </orderByExpressions>
              <hasSelectValue>false</hasSelectValue>
              <rewrittenQuery><![CDATA[
SELECT TOP 200 
    c._rid,
    [
        {
            item: _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})
        }
    ] AS orderByItems,
    {
        payload: {
            text: c.text,
            abstract: c.abstract
        },
        componentScores: [
            _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0}),
            _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})
        ]
    } AS payload
FROM c
WHERE {documentdb-formattableorderbyquery-filter}
ORDER BY _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1}) DESC
]]></rewrittenQuery>
              <hasNonStreamingOrderBy>true</hasNonStreamingOrderBy>
            </Item>
          </componentQueryInfos>
          <take>100</take>
          <requiresGlobalStatistics>true</requiresGlobalStatistics>
        </hybridSearchQueryInfo>

The global statistics query uses nested aggregates, so HybridSearchCrossPartitionQueryPipelineStage has a custom implementation for it. Each component query in the hybrid search query plan is cross-partition, and we run them using the existing cross-partition query pipelines.
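
Each component query returns every component score for the documents it selects, and the `RRF` in the `ORDER BY RANK` clause suggests the results are then combined by reciprocal rank fusion. The following Python sketch illustrates that idea only. It is not the SDK's C# implementation: the function name `rrf_fuse`, the dictionary shape, and the constant `k=60` (a conventional default for reciprocal rank fusion) are all assumptions made for illustration.

```python
def rrf_fuse(results, take, k=60):
    """Fuse component query results with reciprocal rank fusion (sketch).

    results: dicts with a "rid" and a "componentScores" list, one score per
    component, mirroring the payload shape in the rewritten queries.
    k: smoothing constant; 60 is an assumption, not a value from the PR.
    """
    # The same document can be returned by several component queries,
    # so keep one entry per rid.
    by_rid = {r["rid"]: r for r in results}
    docs = list(by_rid.values())
    n_components = len(docs[0]["componentScores"]) if docs else 0

    fused = {d["rid"]: 0.0 for d in docs}
    for i in range(n_components):
        # Rank all documents by the i-th component score, best first.
        ranked = sorted(docs, key=lambda d: d["componentScores"][i], reverse=True)
        for rank, d in enumerate(ranked, start=1):
            fused[d["rid"]] += 1.0 / (k + rank)

    # Highest fused score first, then apply the take from the query plan.
    ordered = sorted(docs, key=lambda d: fused[d["rid"]], reverse=True)
    return [d["rid"] for d in ordered[:take]]


results = [
    {"rid": "a", "componentScores": [3.0, 0.5]},
    {"rid": "b", "componentScores": [2.0, 2.5]},
    {"rid": "c", "componentScores": [1.0, 1.0]},
    {"rid": "a", "componentScores": [3.0, 0.5]},  # duplicate from a second component query
]
top = rrf_fuse(results, take=2)
print(top)
```

In this example "b" ranks second on the first component and first on the second, which gives it a higher fused score than "a", which ranks first and third.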

Note the placeholders in the query plan, such as {documentdb-formattablehybridsearchquery-totaldocumentcount}. These must be replaced with values from the global statistics before the component queries are executed.
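
As a rough illustration of that substitution, the Python sketch below sums per-partition results of the global statistics query and then fills the placeholders in a rewritten expression. The placeholder names come from the query plan above. Everything else is an assumption for illustration: the helper names `merge_statistics` and `fill_placeholders`, the idea that per-partition results are combined by simple summation, and the way values are formatted into the query text. The `{documentdb-formattableorderbyquery-filter}` placeholder is left untouched here.

```python
PREFIX = "documentdb-formattablehybridsearchquery"


def merge_statistics(per_partition):
    """Sum the global statistics returned by each partition (assumed behavior)."""
    merged = {"documentCount": 0, "fullTextStatistics": []}
    for stats in per_partition:
        merged["documentCount"] += stats["documentCount"]
        for i, fts in enumerate(stats["fullTextStatistics"]):
            if i == len(merged["fullTextStatistics"]):
                # First time this component index is seen: start from zeros.
                merged["fullTextStatistics"].append(
                    {"totalWordCount": 0, "hitCounts": [0] * len(fts["hitCounts"])}
                )
            target = merged["fullTextStatistics"][i]
            target["totalWordCount"] += fts["totalWordCount"]
            target["hitCounts"] = [
                a + b for a, b in zip(target["hitCounts"], fts["hitCounts"])
            ]
    return merged


def fill_placeholders(query, stats):
    """Replace the hybrid search placeholders with merged statistics."""
    query = query.replace(
        "{%s-totaldocumentcount}" % PREFIX, str(stats["documentCount"])
    )
    for i, fts in enumerate(stats["fullTextStatistics"]):
        query = query.replace(
            "{%s-totalwordcount-%d}" % (PREFIX, i), str(fts["totalWordCount"])
        )
        hits = "[" + ",".join(str(h) for h in fts["hitCounts"]) + "]"
        query = query.replace("{%s-hitcountsarray-%d}" % (PREFIX, i), hits)
    return query


partitions = [
    {"documentCount": 10, "fullTextStatistics": [
        {"totalWordCount": 100, "hitCounts": [2, 1]},
        {"totalWordCount": 40, "hitCounts": [3]}]},
    {"documentCount": 5, "fullTextStatistics": [
        {"totalWordCount": 50, "hitCounts": [0, 4]},
        {"totalWordCount": 20, "hitCounts": [1]}]},
]
merged = merge_statistics(partitions)

template = (
    '_FullTextScore(c.text, ["swim", "run"], '
    "{documentdb-formattablehybridsearchquery-totaldocumentcount}, "
    "{documentdb-formattablehybridsearchquery-totalwordcount-0}, "
    "{documentdb-formattablehybridsearchquery-hitcountsarray-0})"
)
filled = fill_placeholders(template, merged)
print(filled)
```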

Type of change

New feature (non-breaking change which adds functionality)

github-actions

All good!

fix build errors; Add more integration tests with better validation; Add more integration tests with better validation; deleted gratuitous rewrite of sql query spec; fix up typo that causes build break; Fix up build break in OrderByPipelineStageBenchmark


sboshra · 2024-10-16T18:08:08Z

Disable ODE for hybrid search queries. #Pending

Refers to: Microsoft.Azure.Cosmos/src/Query/Core/Pipeline/CosmosQueryExecutionContextFactory.cs:282 in 458b10c.

github-actions bot reviewed Oct 10, 2024

neildsh changed the title ~~Query: Add hybrid search query pipeline stage~~ Query: Adds hybrid search query pipeline stage Oct 10, 2024

neildsh marked this pull request as ready for review October 15, 2024 01:57

neildsh requested review from khdang, sboshra, adityasa, kirankumarkolli, FabianMeiswinkel and kirillg as code owners October 15, 2024 01:57

neildsh added 4 commits October 15, 2024 11:43

Draft of Hybrid Search implementation

c2f9a00

first draft of RRF implementation; add code for paginating results and respect skip/take

A bunch of bug fixes for Hybrid search, and the first successful E2E run of Hybrid Search!

35b3e95

Minor clean up and fix regression caused by moving SqlQuerySpec rewrite

94b14ba

neildsh force-pushed the users/ndeshpan/hybridSearch branch from 8272288 to 94b14ba October 15, 2024 19:05

neildsh added 4 commits October 15, 2024 12:08

mark hybrid search integration tests as ignore while we wait for the direct package upgrade

50c151e

Add optimization for single component pipeline

4f9f5cd

Add an oracle for hybrid search in the form of Lucene

4e6355e

Prevent build break due to lucene

fa9e5ff

sboshra reviewed Oct 16, 2024

...osoft.Azure.Cosmos/src/Query/Core/Pipeline/CrossPartition/HybridSearch/FullTextStatistics.cs (Outdated)

sboshra reviewed Oct 16, 2024

...ry/Core/Pipeline/CrossPartition/HybridSearch/HybridSearchCrossPartitionQueryPipelineStage.cs (Outdated)

sboshra reviewed Oct 16, 2024

...ry/Core/Pipeline/CrossPartition/HybridSearch/HybridSearchCrossPartitionQueryPipelineStage.cs (Outdated)

sboshra reviewed Oct 16, 2024

...ry/Core/Pipeline/CrossPartition/HybridSearch/HybridSearchCrossPartitionQueryPipelineStage.cs (Outdated)

sboshra reviewed Oct 16, 2024

...ry/Core/Pipeline/CrossPartition/HybridSearch/HybridSearchCrossPartitionQueryPipelineStage.cs (Outdated)

neildsh added 4 commits October 16, 2024 12:27

remove skip take counter from hybrid search query pipeline stage

828952f

fix up build break

923e52e

Minor cleanup in CosmosQueryExecutionContextFactory

4b3b206

incorporate code review feedback

9498e64

sboshra reviewed Oct 17, 2024

...ry/Core/Pipeline/CrossPartition/HybridSearch/HybridSearchCrossPartitionQueryPipelineStage.cs (Outdated)

sboshra reviewed Oct 17, 2024

...ry/Core/Pipeline/CrossPartition/HybridSearch/HybridSearchCrossPartitionQueryPipelineStage.cs (Outdated)

sboshra previously approved these changes Oct 17, 2024

neildsh commented Oct 17, 2024

...ry/Core/Pipeline/CrossPartition/HybridSearch/HybridSearchCrossPartitionQueryPipelineStage.cs (Outdated)

neildsh added 2 commits October 17, 2024 15:29

More lucene junk

4e34937

Remove references to Lucene from the project

cb5c5f0

Rename a couple of Hybrid Search methods to conform to code review feedback; Update comment to be more helpful; Tiny bit of clean up

neildsh dismissed sboshra’s stale review via cb5c5f0 October 18, 2024 01:01

Merge branch 'master' into users/ndeshpan/hybridSearch

143b77e

neildsh added the QUERY and auto-merge labels Oct 18, 2024

microsoft-github-policy-service bot enabled auto-merge (squash) October 18, 2024 01:02

sboshra approved these changes Oct 18, 2024

sc978345 approved these changes Oct 18, 2024

microsoft-github-policy-service bot merged commit 4e1c033 into master Oct 18, 2024
24 checks passed

microsoft-github-policy-service bot deleted the users/ndeshpan/hybridSearch branch October 18, 2024 03:12

kirankumarkolli mentioned this pull request Oct 19, 2024

Hybrid search support #4824

Open
