Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various changes and fixes to UNNEST. #13892

Merged
merged 11 commits into from
Mar 10, 2023

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Mar 7, 2023

Native changes:

  1. UnnestDataSource: Replace "column" and "outputName" with "virtualColumn".
    This enables pushing expressions into the datasource. This in turn
    allows us to do the next thing...

  2. UnnestStorageAdapter: Logically apply query-level filters and virtual
    columns after the unnest operation. (Physically, filters are pushed down,
    when possible.) This is beneficial because it allows filters and
    virtual columns to reference the unnested column, and because it is
    consistent with how the join datasource works.

  3. Various documentation updates, including declaring "unnest" as an
    experimental feature for now.

SQL changes:

  1. Rename DruidUnnestRel (& Rule) to DruidUnnestRel (& Rule). The rel
    is simplified: it only handles the UNNEST part of a correlated join.
    Constant UNNESTs are handled with regular inline rels.

  2. Rework DruidCorrelateUnnestRule to focus on pulling Projects from
    the left side up above the Correlate. Additionally, fix issues related to
    tracking correlation variables.New test testUnnestTwice verifies
    that this works even when two UNNESTs are stacked on the same table.

  3. Include ProjectCorrelateTransposeRule from Calcite to encourage
    pushing mappings down below the left-hand side of the Correlate.

  4. Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule
    to handle pulling Filters up above the Correlate. New tests
    testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify
    this behavior.

  5. Require a context feature flag for SQL UNNEST, since it's undocumented.
    As part of this, also cleaned up how we handle feature flags in SQL.
    They're now hooked into EngineFeatures, which is useful because not
    all engines support all features.

Native changes:

1) UnnestDataSource: Replace "column" and "outputName" with "virtualColumn".
   This enables pushing expressions into the datasource. This in turn
   allows us to do the next thing...

2) UnnestStorageAdapter: Logically apply query-level filters and virtual
   columns after the unnest operation. (Physically, filters are pulled up,
   when possible.) This is beneficial because it allows filters and
   virtual columns to reference the unnested column, and because it is
   consistent with how the join datasource works.

3) Various documentation updates, including declaring "unnest" as an
   experimental feature for now.

SQL changes:

1) Rename DruidUnnestRel (& Rule) to DruidUnnestRel (& Rule). The rel
   is simplified: it only handles the UNNEST part of a correlated join.
   Constant UNNESTs are handled with regular inline rels.

2) Rework DruidCorrelateUnnestRule to focus on pulling Projects from
   the left side up above the Correlate. New test testUnnestTwice verifies
   that this works even when two UNNESTs are stacked on the same table.

3) Include ProjectCorrelateTransposeRule from Calcite to encourage
   pushing mappings down below the left-hand side of the Correlate.

4) Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule
   to handle pulling Filters up above the Correlate. New tests
   testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify
   this behavior.

5) Require a context feature flag for SQL UNNEST, since it's undocumented.
   As part of this, also cleaned up how we handle feature flags in SQL.
   They're now hooked into EngineFeatures, which is useful because not
   all engines support all features.
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems nice to me, and fixes the problem I was having using unnest with array columns in #13803

Screenshot 2023-03-07 at 5 36 51 AM

* thereby reducing the logical plan to:
* {@link DruidUnnestRule} takes care of the Uncollect(last 3 lines) to generate a {@link DruidUnnestRel}
* thereby reducing the logical plan to:
* <pre>
* LogicalCorrelate
* / \
* DruidRel DruidUnnestDataSourceRel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this javadoc needs updated i suppose

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Updated it.

@@ -76,6 +76,8 @@ public class CastOperatorConversion implements SqlOperatorConversion
builder.put(type, ExprType.LONG);
}

builder.put(SqlTypeName.ARRAY, ExprType.ARRAY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't probably that cool because it loses the array element type. I think maybe #13890 has a better solution for this by switching cast to use Calcites.getColumnTypeForRelDataType

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, yours is better. Merged it in. Thanks.

Copy link
Contributor

@somu-imply somu-imply left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a general Q. These 2 queries have the same plan. Both are unnest over a TableDataSource.

SELECT d3 FROM numFoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where dim2 IN ('a','ab')
SELECT d3 FROM (select * from numFoo where dim2 IN ('a','ab')), UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3)

But these 2 queries have different plans, the selector filter inside the table for the first query below plans as an unnest over a QueryDataSource while the second one is an Unnest over a TableDataSource

SELECT d3 FROM (select * from "numFoo" where dim2='a'), UNNEST(MV_TO_ARRAY(dim3)) as unnested(d3) 
SELECT d3 FROM numFoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where dim2='a'

outputColumnName,
allowSet
);
}
return retVal;
return PostJoinCursor.wrap(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, there's no need to pass in the allowFilter inside the data source now and the PostJoinCursor will apply the filters on unnested column which are the post join filters

@somu-imply
Copy link
Contributor

somu-imply commented Mar 7, 2023

I think the reason is that there is an additional level of LogicalProject between Correlate and LogicalFilter in the query where SELECT d3 FROM (select * from druid.numfoo where dim2='a'), UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) it plans to

136:LogicalProject(d3=[$17])
  134:LogicalCorrelate(subset=[rel#135:Subset#7.NONE.[]], correlation=[$cor0], joinType=[inner], requiredColumns=[{3}])
    125:LogicalProject(subset=[rel#126:Subset#2.NONE.[]], __time=[$0], dim1=[$1], dim2=[CAST('a':VARCHAR):VARCHAR], dim3=[$3], dim4=[$4], dim5=[$5], dim6=[$6], d1=[$7], d2=[$8], f1=[$9], f2=[$10], l1=[$11], l2=[$12], cnt=[$13], m1=[$14], m2=[$15], unique_dim1=[$16])
      123:LogicalFilter(subset=[rel#124:Subset#1.NONE.[]], condition=[=($2, 'a')])
        9:LogicalTableScan(subset=[rel#122:Subset#0.NONE.[]], table=[[druid, numfoo]])
    130:Uncollect(subset=[rel#131:Subset#5.NONE.[]])
      128:LogicalProject(subset=[rel#129:Subset#4.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor0.dim3)])
        12:LogicalValues(subset=[rel#127:Subset#3.NONE.[0]], tuples=[[{ 0 }]])

which does not match the CorrelateFilterTranspose rule (L or R). We might need to address that but can be done with a later PR as well. While the 2nd query SELECT d3 FROM numFoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where dim2='a' plans to

121:LogicalProject(d3=[$17])
  119:LogicalCorrelate(subset=[rel#120:Subset#6.NONE.[]], correlation=[$cor0], joinType=[inner], requiredColumns=[{3}])
    110:LogicalFilter(subset=[rel#111:Subset#1.NONE.[]], condition=[=($2, 'a')])
      8:LogicalTableScan(subset=[rel#109:Subset#0.NONE.[]], table=[[druid, numfoo]])
    115:Uncollect(subset=[rel#116:Subset#4.NONE.[]])
      113:LogicalProject(subset=[rel#114:Subset#3.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor0.dim3)])
        9:LogicalValues(subset=[rel#112:Subset#2.NONE.[0]], tuples=[[{ 0 }]])

For that extra level of LogicalProject in between #13799 had an additional rule https://github.com/apache/druid/pull/13799/files#diff-048f881d1285568ceba4eb4e798527937950063d472d4eb2623bb95cd8b86758R87. Since the Rels have changed that might not be directly applicable

Copy link
Contributor

@somu-imply somu-imply left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM ! We can merge this and then I can fix up the boundary conditions

new Object[]{"10.1", ImmutableList.of("b", "c"), ImmutableList.of("10", "1"), "1", "bxx"},
new Object[]{"10.1", ImmutableList.of("b", "c"), ImmutableList.of("10", "1"), "1", "cxx"},
new Object[]{"2", ImmutableList.of("d"), ImmutableList.of("2"), "2", "dxx"},
new Object[]{"1", null, ImmutableList.of("1"), "1", "xx"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace with new Object[]{"1", useDefault ? null: ImmutableList.of(""), ImmutableList.of("1"), "1", "xx"}

@gianm
Copy link
Contributor Author

gianm commented Mar 9, 2023

But these 2 queries have different plans, the selector filter inside the table for the first query below plans as an unnest over a QueryDataSource while the second one is an Unnest over a TableDataSource

Looks like you're right about the extra Project being the reason. It prevents the filter from being pulled out since it's in the way. One approach we could take here is make a rule that recognizes the situation, and pulls the Project and Filter both out.

Thanks for taking a look. I pushed up fixes to the merge conflicts and the one test you pointed out. I also simplified DruidUnnestRel— in the latest patch it only contains a single RexNode. Finally I also expanded the filter-rewriting logic in UnnestStorageAdapter so it rewrites certain filters to use the input column of the unnest rather than the output column. It's limited to string types and non-expression filters, but it's s start. This piece could be improved in the future.

* input table, this rel resolves the unnest part and delegates the rel to be consumed by other
* rule ({@link org.apache.druid.sql.calcite.rule.DruidCorrelateUnnestRule}
*/
public class DruidUnnestRel extends DruidRel<DruidUnnestRel>

Check failure

Code scanning / CodeQL

No clone method

No clone method, yet implements Cloneable.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't realize RelNode implements Cloneable. This hasn't been an issue for any of the other DruidRels; none of them implement clone. Calcite's builtin rels don't either, except for MutableRel, which isn't involved here. I think it's fine to not implement this. I'll add an impl to DruidRel that throws an exception and write a comment remarking that it isn't needed.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, 👍 after CI

Comment on lines +296 to +302
// Try to move filter pre-correlate if possible.
final Filter newFilter = rewriteFilterOnUnnestColumnIfPossible(filter, inputColumn, inputColumnCapabilites);
if (newFilter != null) {
preFilters.add(newFilter);
} else {
postFilters.add(filter);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 👍

@somu-imply
Copy link
Contributor

Thanks, there is also an unused import in DruidQueryRel which would also come in the checkstyle fix.

// Try to move filter pre-correlate if possible.
final Filter newFilter = rewriteFilterOnUnnestColumnIfPossible(filter, inputColumn, inputColumnCapabilites);
if (newFilter != null) {
preFilters.add(newFilter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I think we want to always add the filter as a post filter even when we push down since the pushed down filter returns the whole row if any values match, so we want the value matcher to clean it up to exact matches

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this once CI passes, I'll fix them up. I understand the case where it returns the full row even for a match in MVDs. I'll also need to remove the allowSet and minor other changes. Will followup as soon as this gets merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thanks for pointing this out. I've written a patch to fix this. In the interests of having this PR pass CI so @somu-imply can keep working on top of it, I'm ok to commit this PR first, and then do another PR fixing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the patch https://gist.github.com/gianm/86d55fb6e5ee58c935152761a9df4a4d. Feel free to roll this into what you're doing @somu-imply.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR with the fix is at: #13919. If you want to incorporate this into your own work then feel free and I will close my PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll add this up

protected Object clone() throws CloneNotSupportedException
{
// RelNode implements Cloneable, but our class of rels is not cloned, so does not need to implement clone().
throw new UnsupportedOperationException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntelliJ inspections is flagging it as it does not throw a CloneNotSupportedException

@somu-imply
Copy link
Contributor

Additionally there is a CI failure for this test testGroupByFloatMinExpressionVsVirtualColumnWithExplicitStringVirtualColumnTypedInput.

The problem seems to trigger from running with -Ddruid.generic.useDefaultValueForNull=false. The fix for this is to change the lines

"minVc",
NullHandling.replaceWithDefault() ? Float.POSITIVE_INFINITY : null

with

"minVc",
Float.POSITIVE_INFINITY

Although I am unsure what caused this to change

@gianm
Copy link
Contributor Author

gianm commented Mar 10, 2023

Although I am unsure what caused this to change

Ah. That's because ExpressionVirtualColumn now becomes a pass-through in cases where the expression doesn't do anything. So the behavior of min and minVc are now the same. I updated the test case to reflect the new expected behavior.

@abhishekagarwal87 abhishekagarwal87 merged commit 4b1ffbc into apache:master Mar 10, 2023
@gianm gianm deleted the unnest-updates branch March 10, 2023 16:04
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants