Various changes and fixes to UNNEST. (#13892)

* Various changes and fixes to UNNEST. Native changes: 1) UnnestDataSource: Replace "column" and "outputName" with "virtualColumn". This enables pushing expressions into the datasource. This in turn allows us to do the next thing... 2) UnnestStorageAdapter: Logically apply query-level filters and virtual columns after the unnest operation. (Physically, filters are pulled up, when possible.) This is beneficial because it allows filters and virtual columns to reference the unnested column, and because it is consistent with how the join datasource works. 3) Various documentation updates, including declaring "unnest" as an experimental feature for now. SQL changes: 1) Rename DruidUnnestRel (& Rule) to DruidUnnestRel (& Rule). The rel is simplified: it only handles the UNNEST part of a correlated join. Constant UNNESTs are handled with regular inline rels. 2) Rework DruidCorrelateUnnestRule to focus on pulling Projects from the left side up above the Correlate. New test testUnnestTwice verifies that this works even when two UNNESTs are stacked on the same table. 3) Include ProjectCorrelateTransposeRule from Calcite to encourage pushing mappings down below the left-hand side of the Correlate. 4) Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule to handle pulling Filters up above the Correlate. New tests testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify this behavior. 5) Require a context feature flag for SQL UNNEST, since it's undocumented. As part of this, also cleaned up how we handle feature flags in SQL. They're now hooked into EngineFeatures, which is useful because not all engines support all features.
apache · Mar 10, 2023 · 4b1ffbc · 4b1ffbc
1 parent 64b67c2
commit 4b1ffbc
Show file tree

Hide file tree

Showing 61 changed files with 2,045 additions and 1,254 deletions.
diff --git a/docs/misc/math-expr.md b/docs/misc/math-expr.md
@@ -54,7 +54,7 @@ For logical operators, a number is true if and only if it is positive (0 or nega
 
 [Multi-value string dimensions](../querying/multi-value-dimensions.md) are supported and may be treated as either scalar or array typed values, as follows:  
 * When treated as a scalar type, the expression is automatically transformed so that the scalar operation is applied across all values of the multi-valued type, mimicking Druid's native behavior. 
-* Druid coerces values that result in arrays back into the native Druid string type for grouping and aggregation. Grouping on multi-value string dimensions in Druid groups by the individual values, not the 'array'. This behavior produces results similar to the `UNNEST` operator available in many SQL dialects. Alternatively, you can use the `array_to_string` function to perform the aggregation on a _stringified_ version of the complete array and therefore preserve the complete row. To transform the stringified dimension back into the true native array type, use `string_to_array` in an expression post-aggregator.
+* Druid coerces values that result in arrays back into the native Druid string type for grouping and aggregation. Grouping on multi-value string dimensions in Druid groups by the individual values, not the 'array'. This behavior produces results similar to an implicit SQL `UNNEST` operation. Alternatively, you can use the `array_to_string` function to perform the aggregation on a _stringified_ version of the complete array and therefore preserve the complete row. To transform the stringified dimension back into the true native array type, use `string_to_array` in an expression post-aggregator.
 
 
 The following built-in functions are available.

diff --git a/docs/querying/datasource.md b/docs/querying/datasource.md
@@ -371,9 +371,10 @@ future versions:
 
 ### `unnest`
 
-> The unnest datasource is currently only available as part of a native query.
+> The unnest datasource is [experimental](../development/experimental.md). Its API and behavior are subject
+> to change in future releases. It is not recommended to use this feature in production at this time.
 
-Use the `unnest` datasource to unnest a column with multiple values in an array. 
+Use the `unnest` datasource to unnest a column with multiple values in an array.
 For example, you have a source column that looks like this:
 
 | Nested | 
@@ -407,7 +408,10 @@ The `unnest` datasource uses the following syntax:
       "type": "table",
       "name": "nested_data"
     },
-    "column": "nested_source_column",
+    "virtualColumn": {
+      "type": "expression",
+      "expression": "\"column_reference\""
+    },
     "outputName": "unnested_target_column",
     "allowList": []
   },
@@ -416,9 +420,7 @@ The `unnest` datasource uses the following syntax:
 * `dataSource.type`: Set this to `unnest`.
 * `dataSource.base`: Defines the datasource you want to unnest.
   * `dataSource.base.type`: The type of datasource you want to unnest, such as a table.
-  * `dataSource.base.name`: The name of the datasource you want to unnest.
-* `dataSource.column`: The name of the source column that contains the nested values.
-* `dataSource.outputName`: The name you want to assign to the column that will contain the unnested values. You can replace the source column with the unnested column by specifying the source column's name or a new column by specifying a different name. Outputting it to a new column can help you verify that you get the results that you expect but isn't required.
+* `dataSource.virtualColumn`: [Virtual column](virtual-columns.md) that references the nested values. The output name of this column is reused as the name of the column that contains unnested values. You can replace the source column with the unnested column by specifying the source column's name or a new column by specifying a different name. Outputting it to a new column can help you verify that you get the results that you expect but isn't required.
 * `dataSource.allowList`: Optional. The subset of values you want to unnest.
 
 To learn more about how to use the `unnest` datasource, see the [unnest tutorial](../tutorials/tutorial-unnest-datasource.md).
diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md
@@ -140,12 +140,11 @@ This "selector" filter would match row4 of the dataset above:
 ### Grouping
 
 topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
-from matching rows will be used to generate one group per value. This can be thought of as the equivalent to the
-`UNNEST` operator used on an `ARRAY` type that many SQL dialects support. This means it's possible for a query to return
-more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match
-only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
-your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
-improve performance.
+from matching rows will be used to generate one group per value. This behaves similarly to an implicit SQL `UNNEST`
+operation. This means it's possible for a query to return more groups than there are rows. For example, a topN on the
+dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups:
+`t1`, `t2`, and `t3`. If you only need to include values that match your filter, you can use a
+[filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance.
 
 ## Example: GroupBy query with no filtering
 

diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md
@@ -78,8 +78,8 @@ applied to all values for each row individually. Multi-value string dimensions c
 [multi-value string functions](sql-multivalue-string-functions.md), which can perform powerful array-aware operations.
 
 Grouping by a multi-value expression observes the native Druid multi-value aggregation behavior, which is similar to
-the `UNNEST` functionality available in some other SQL dialects. Refer to the documentation on
-[multi-value string dimensions](multi-value-dimensions.md) for additional details.
+an implicit SQL `UNNEST`. Refer to the documentation on [multi-value string dimensions](multi-value-dimensions.md)
+for additional details.
 
 > Because multi-value dimensions are treated by the SQL planner as `VARCHAR`, there are some inconsistencies between how
 > they are handled in Druid SQL and in native queries. For example, expressions involving multi-value dimensions may be

diff --git a/docs/tutorials/tutorial-unnest-datasource.md b/docs/tutorials/tutorial-unnest-datasource.md
@@ -25,7 +25,8 @@ title: "Tutorial: Unnest data in a column"
 
 > If you're looking for information about how to unnest `COMPLEX<json>` columns, see [Nested columns](../querying/nested-columns.md).
 
-> The unnest datasource is currently only available as part of a native query.
+> The unnest datasource is [experimental](../development/experimental.md). Its API and behavior are subject
+> to change in future releases. It is not recommended to use this feature in production at this time.
 
 This tutorial demonstrates how to use the unnest datasource to unnest a column that has data stored in arrays. For example, if you have a column named `dim3` with values like `[a,b]` or `[c,d,f]`, the unnest datasource can output the data to a new column with individual rows that contain single values like `a` and `b`. When doing this, be mindful of the following:
 
@@ -161,9 +162,11 @@ The following native Scan query returns the rows of the datasource and unnests t
       "type": "table",
       "name": "nested_data"
     },
-    "column": "dim3",
-    "outputName": "unnest-dim3",
-    "allowList": []
+    "virtualColumn": {
+      "type": "expression",
+      "name": "unnest-dim3",
+      "expression": "\"dim3\""
+    }
   },
   "intervals": {
     "type": "intervals",
@@ -222,9 +225,11 @@ The following query returns an unnested version of the column `dim3` as the colu
   "dataSource": {
     "type": "unnest",
     "base": "nested_data",
-    "column": "dim3",
-    "outputName": "unnest-dim3",
-    "allowList": []
+    "virtualColumn": {
+      "type": "expression",
+      "name": "unnest-dim3",
+      "expression": "\"dim3\""
+    }
   },
   "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],
   "granularity": "all",
@@ -264,8 +269,11 @@ The example topN query unnests `dim3` into the column `unnest-dim3`. The query u
       "type": "table",
       "name": "nested_data"
     },
-    "column": "dim3",
-    "outputName": "unnest-dim3",
+    "virtualColumn": {
+      "type": "expression",
+      "name": "unnest-dim3",
+      "expression": "\"dim3\""
+    },
     "allowList": null
   },
   "dimension": {
@@ -369,9 +377,11 @@ This query joins the `nested_data` table with itself and outputs the unnested da
         "condition": "(\"m1\" == \"j0.v0\")",
         "joinType": "INNER"
       },
-    "column": "dim3",
-    "outputName": "unnest-dim3",
-    "allowList": []
+    "virtualColumn": {
+      "type": "expression",
+      "name": "unnest-dim3",
+      "expression": "\"dim3\""
+    }
     },
   "intervals": {
     "type": "intervals",
@@ -520,13 +530,15 @@ When you run the query, pay special attention to how the total number of rows ha
         "type": "table",
         "name": "nested_data2"
       },
-      "column": "dim3",
-      "outputName": "unnest-dim3",
+    "virtualColumn": {
+      "type": "expression",
+      "name": "unnest-dim3",
+      "expression": "\"dim3\""
+    },
       "allowList": []
     },
     "column": "dim2",
-    "outputName": "unnest-dim2",
-    "allowList": []
+    "outputName": "unnest-dim2"
   },
   "intervals": {
     "type": "intervals",
@@ -583,9 +595,11 @@ You can also use the `unnest` datasource to unnest an inline datasource. The fol
         ]
       ]
     },
-    "column": "inline_data",
-    "outputName": "output",
-    "allowList": []
+    "virtualColumn": {
+      "type": "expression",
+      "name": "output",
+      "expression": "\"inline_data\""
+    }
   },
   "intervals": {
     "type": "intervals",
@@ -625,8 +639,11 @@ The following Scan query uses the `nested_data2` table you created in [Load data
       "type": "table",
       "name": "nested_data2"
     },
-    "column": "v0",
-    "outputName": "unnest-v0"
+    "virtualColumn": {
+      "type": "expression",
+      "name": "unnest-v0",
+      "expression": "\"v0\""
+    }
   }
   "intervals": {
     "type": "intervals",

diff --git a/...sions-core/multi-stage-query/src/main/java/org/apache/druid/msq/sql/MSQTaskSqlEngine.java b/...sions-core/multi-stage-query/src/main/java/org/apache/druid/msq/sql/MSQTaskSqlEngine.java
@@ -104,14 +104,17 @@ public RelDataType resultTypeForInsert(RelDataTypeFactory typeFactory, RelDataTy
   }
 
   @Override
-  public boolean feature(EngineFeature feature, PlannerContext plannerContext)
+  public boolean featureAvailable(EngineFeature feature, PlannerContext plannerContext)
   {
     switch (feature) {
       case ALLOW_BINDABLE_PLAN:
       case ALLOW_BROADCAST_RIGHTY_JOIN:
       case TIMESERIES_QUERY:
       case TOPN_QUERY:
       case TIME_BOUNDARY_QUERY:
+      case GROUPING_SETS:
+      case WINDOW_FUNCTIONS:
+      case UNNEST:
         return false;
       case CAN_SELECT:
       case CAN_INSERT:

diff --git a/processing/src/main/java/org/apache/druid/query/UnnestDataSource.java b/processing/src/main/java/org/apache/druid/query/UnnestDataSource.java
@@ -26,6 +26,7 @@
 import org.apache.druid.query.planning.DataSourceAnalysis;
 import org.apache.druid.segment.SegmentReference;
 import org.apache.druid.segment.UnnestSegmentReference;
+import org.apache.druid.segment.VirtualColumn;
 import org.apache.druid.utils.JvmUtils;
 
 import javax.annotation.Nullable;
@@ -48,32 +49,28 @@
 public class UnnestDataSource implements DataSource
 {
   private final DataSource base;
-  private final String column;
-  private final String outputName;
+  private final VirtualColumn virtualColumn;
   private final LinkedHashSet<String> allowList;
 
   private UnnestDataSource(
       DataSource dataSource,
-      String columnName,
-      String outputName,
+      VirtualColumn virtualColumn,
       LinkedHashSet<String> allowList
   )
   {
     this.base = dataSource;
-    this.column = columnName;
-    this.outputName = outputName;
+    this.virtualColumn = virtualColumn;
     this.allowList = allowList;
   }
 
   @JsonCreator
   public static UnnestDataSource create(
       @JsonProperty("base") DataSource base,
-      @JsonProperty("column") String columnName,
-      @JsonProperty("outputName") String outputName,
+      @JsonProperty("virtualColumn") VirtualColumn virtualColumn,
       @Nullable @JsonProperty("allowList") LinkedHashSet<String> allowList
   )
   {
-    return new UnnestDataSource(base, columnName, outputName, allowList);
+    return new UnnestDataSource(base, virtualColumn, allowList);
   }
 
   @JsonProperty("base")
@@ -82,16 +79,10 @@ public DataSource getBase()
     return base;
   }
 
-  @JsonProperty("column")
-  public String getColumn()
+  @JsonProperty("virtualColumn")
+  public VirtualColumn getVirtualColumn()
   {
-    return column;
-  }
-
-  @JsonProperty("outputName")
-  public String getOutputName()
-  {
-    return outputName;
+    return virtualColumn;
   }
 
   @JsonProperty("allowList")
@@ -118,7 +109,7 @@ public DataSource withChildren(List<DataSource> children)
     if (children.size() != 1) {
       throw new IAE("Expected [1] child, got [%d]", children.size());
     }
-    return new UnnestDataSource(children.get(0), column, outputName, allowList);
+    return new UnnestDataSource(children.get(0), virtualColumn, allowList);
   }
 
   @Override
@@ -151,30 +142,21 @@ public Function<SegmentReference, SegmentReference> createSegmentMapFunction(
     );
     return JvmUtils.safeAccumulateThreadCpuTime(
         cpuTimeAccumulator,
-        () -> {
-          if (column == null) {
-            return segmentMapFn;
-          } else if (column.isEmpty()) {
-            return segmentMapFn;
-          } else {
-            return
-                baseSegment ->
-                    new UnnestSegmentReference(
-                        segmentMapFn.apply(baseSegment),
-                        column,
-                        outputName,
-                        allowList
-                    );
-          }
-        }
+        () ->
+            baseSegment ->
+                new UnnestSegmentReference(
+                    segmentMapFn.apply(baseSegment),
+                    virtualColumn,
+                    allowList
+                )
     );
 
   }
 
   @Override
   public DataSource withUpdatedDataSource(DataSource newSource)
   {
-    return new UnnestDataSource(newSource, column, outputName, allowList);
+    return new UnnestDataSource(newSource, virtualColumn, allowList);
   }
 
   @Override
@@ -205,24 +187,22 @@ public boolean equals(Object o)
       return false;
     }
     UnnestDataSource that = (UnnestDataSource) o;
-    return column.equals(that.column)
-           && outputName.equals(that.outputName)
+    return virtualColumn.equals(that.virtualColumn)
            && base.equals(that.base);
   }
 
   @Override
   public int hashCode()
   {
-    return Objects.hash(base, column, outputName);
+    return Objects.hash(base, virtualColumn);
   }
 
   @Override
   public String toString()
   {
     return "UnnestDataSource{" +
            "base=" + base +
-           ", column='" + column + '\'' +
-           ", outputName='" + outputName + '\'' +
+           ", column='" + virtualColumn + '\'' +
            ", allowList=" + allowList +
            '}';
   }

diff --git a/processing/src/main/java/org/apache/druid/query/scan/ScanQueryQueryToolChest.java b/processing/src/main/java/org/apache/druid/query/scan/ScanQueryQueryToolChest.java
@@ -36,6 +36,7 @@
 import org.apache.druid.query.QueryToolChest;
 import org.apache.druid.query.aggregation.MetricManipulationFn;
 import org.apache.druid.segment.VirtualColumn;
+import org.apache.druid.segment.column.ColumnCapabilities;
 import org.apache.druid.segment.column.ColumnType;
 import org.apache.druid.segment.column.RowSignature;
 import org.apache.druid.utils.CloseableUtils;
@@ -172,7 +173,8 @@ public RowSignature resultArraySignature(final ScanQuery query)
 
         final VirtualColumn virtualColumn = query.getVirtualColumns().getVirtualColumn(columnName);
         if (virtualColumn != null) {
-          columnType = virtualColumn.capabilities(columnName).toColumnType();
+          final ColumnCapabilities capabilities = virtualColumn.capabilities(c -> null, columnName);
+          columnType = capabilities != null ? capabilities.toColumnType() : null;
         } else {
           // Unknown type. In the future, it would be nice to have a way to fill these in.
           columnType = null;