Skip to content

Commit

Permalink
Various changes and fixes to UNNEST. (#13892)
Browse files Browse the repository at this point in the history
* Various changes and fixes to UNNEST.

Native changes:

1) UnnestDataSource: Replace "column" and "outputName" with "virtualColumn".
   This enables pushing expressions into the datasource. This in turn
   allows us to do the next thing...

2) UnnestStorageAdapter: Logically apply query-level filters and virtual
   columns after the unnest operation. (Physically, filters are pulled up,
   when possible.) This is beneficial because it allows filters and
   virtual columns to reference the unnested column, and because it is
   consistent with how the join datasource works.

3) Various documentation updates, including declaring "unnest" as an
   experimental feature for now.

SQL changes:

1) Rename DruidUnnestRel (& Rule) to DruidUnnestRel (& Rule). The rel
   is simplified: it only handles the UNNEST part of a correlated join.
   Constant UNNESTs are handled with regular inline rels.

2) Rework DruidCorrelateUnnestRule to focus on pulling Projects from
   the left side up above the Correlate. New test testUnnestTwice verifies
   that this works even when two UNNESTs are stacked on the same table.

3) Include ProjectCorrelateTransposeRule from Calcite to encourage
   pushing mappings down below the left-hand side of the Correlate.

4) Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule
   to handle pulling Filters up above the Correlate. New tests
   testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify
   this behavior.

5) Require a context feature flag for SQL UNNEST, since it's undocumented.
   As part of this, also cleaned up how we handle feature flags in SQL.
   They're now hooked into EngineFeatures, which is useful because not
   all engines support all features.
  • Loading branch information
gianm authored Mar 10, 2023
1 parent 64b67c2 commit 4b1ffbc
Show file tree
Hide file tree
Showing 61 changed files with 2,045 additions and 1,254 deletions.
2 changes: 1 addition & 1 deletion docs/misc/math-expr.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ For logical operators, a number is true if and only if it is positive (0 or nega

[Multi-value string dimensions](../querying/multi-value-dimensions.md) are supported and may be treated as either scalar or array typed values, as follows:
* When treated as a scalar type, the expression is automatically transformed so that the scalar operation is applied across all values of the multi-valued type, mimicking Druid's native behavior.
* Druid coerces values that result in arrays back into the native Druid string type for grouping and aggregation. Grouping on multi-value string dimensions in Druid groups by the individual values, not the 'array'. This behavior produces results similar to the `UNNEST` operator available in many SQL dialects. Alternatively, you can use the `array_to_string` function to perform the aggregation on a _stringified_ version of the complete array and therefore preserve the complete row. To transform the stringified dimension back into the true native array type, use `string_to_array` in an expression post-aggregator.
* Druid coerces values that result in arrays back into the native Druid string type for grouping and aggregation. Grouping on multi-value string dimensions in Druid groups by the individual values, not the 'array'. This behavior produces results similar to an implicit SQL `UNNEST` operation. Alternatively, you can use the `array_to_string` function to perform the aggregation on a _stringified_ version of the complete array and therefore preserve the complete row. To transform the stringified dimension back into the true native array type, use `string_to_array` in an expression post-aggregator.


The following built-in functions are available.
Expand Down
14 changes: 8 additions & 6 deletions docs/querying/datasource.md
Original file line number Diff line number Diff line change
Expand Up @@ -371,9 +371,10 @@ future versions:

### `unnest`

> The unnest datasource is currently only available as part of a native query.
> The unnest datasource is [experimental](../development/experimental.md). Its API and behavior are subject
> to change in future releases. It is not recommended to use this feature in production at this time.
Use the `unnest` datasource to unnest a column with multiple values in an array.
Use the `unnest` datasource to unnest a column with multiple values in an array.
For example, you have a source column that looks like this:

| Nested |
Expand Down Expand Up @@ -407,7 +408,10 @@ The `unnest` datasource uses the following syntax:
"type": "table",
"name": "nested_data"
},
"column": "nested_source_column",
"virtualColumn": {
"type": "expression",
"expression": "\"column_reference\""
},
"outputName": "unnested_target_column",
"allowList": []
},
Expand All @@ -416,9 +420,7 @@ The `unnest` datasource uses the following syntax:
* `dataSource.type`: Set this to `unnest`.
* `dataSource.base`: Defines the datasource you want to unnest.
* `dataSource.base.type`: The type of datasource you want to unnest, such as a table.
* `dataSource.base.name`: The name of the datasource you want to unnest.
* `dataSource.column`: The name of the source column that contains the nested values.
* `dataSource.outputName`: The name you want to assign to the column that will contain the unnested values. You can replace the source column with the unnested column by specifying the source column's name or a new column by specifying a different name. Outputting it to a new column can help you verify that you get the results that you expect but isn't required.
* `dataSource.virtualColumn`: [Virtual column](virtual-columns.md) that references the nested values. The output name of this column is reused as the name of the column that contains unnested values. You can replace the source column with the unnested column by specifying the source column's name or a new column by specifying a different name. Outputting it to a new column can help you verify that you get the results that you expect but isn't required.
* `dataSource.allowList`: Optional. The subset of values you want to unnest.

To learn more about how to use the `unnest` datasource, see the [unnest tutorial](../tutorials/tutorial-unnest-datasource.md).
11 changes: 5 additions & 6 deletions docs/querying/multi-value-dimensions.md
Original file line number Diff line number Diff line change
Expand Up @@ -140,12 +140,11 @@ This "selector" filter would match row4 of the dataset above:
### Grouping

topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. This can be thought of as the equivalent to the
`UNNEST` operator used on an `ARRAY` type that many SQL dialects support. This means it's possible for a query to return
more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match
only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.
from matching rows will be used to generate one group per value. This behaves similarly to an implicit SQL `UNNEST`
operation. This means it's possible for a query to return more groups than there are rows. For example, a topN on the
dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups:
`t1`, `t2`, and `t3`. If you only need to include values that match your filter, you can use a
[filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance.

## Example: GroupBy query with no filtering

Expand Down
4 changes: 2 additions & 2 deletions docs/querying/sql-data-types.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,8 +78,8 @@ applied to all values for each row individually. Multi-value string dimensions c
[multi-value string functions](sql-multivalue-string-functions.md), which can perform powerful array-aware operations.

Grouping by a multi-value expression observes the native Druid multi-value aggregation behavior, which is similar to
the `UNNEST` functionality available in some other SQL dialects. Refer to the documentation on
[multi-value string dimensions](multi-value-dimensions.md) for additional details.
an implicit SQL `UNNEST`. Refer to the documentation on [multi-value string dimensions](multi-value-dimensions.md)
for additional details.

> Because multi-value dimensions are treated by the SQL planner as `VARCHAR`, there are some inconsistencies between how
> they are handled in Druid SQL and in native queries. For example, expressions involving multi-value dimensions may be
Expand Down
59 changes: 38 additions & 21 deletions docs/tutorials/tutorial-unnest-datasource.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ title: "Tutorial: Unnest data in a column"

> If you're looking for information about how to unnest `COMPLEX<json>` columns, see [Nested columns](../querying/nested-columns.md).
> The unnest datasource is currently only available as part of a native query.
> The unnest datasource is [experimental](../development/experimental.md). Its API and behavior are subject
> to change in future releases. It is not recommended to use this feature in production at this time.
This tutorial demonstrates how to use the unnest datasource to unnest a column that has data stored in arrays. For example, if you have a column named `dim3` with values like `[a,b]` or `[c,d,f]`, the unnest datasource can output the data to a new column with individual rows that contain single values like `a` and `b`. When doing this, be mindful of the following:

Expand Down Expand Up @@ -161,9 +162,11 @@ The following native Scan query returns the rows of the datasource and unnests t
"type": "table",
"name": "nested_data"
},
"column": "dim3",
"outputName": "unnest-dim3",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -222,9 +225,11 @@ The following query returns an unnested version of the column `dim3` as the colu
"dataSource": {
"type": "unnest",
"base": "nested_data",
"column": "dim3",
"outputName": "unnest-dim3",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],
"granularity": "all",
Expand Down Expand Up @@ -264,8 +269,11 @@ The example topN query unnests `dim3` into the column `unnest-dim3`. The query u
"type": "table",
"name": "nested_data"
},
"column": "dim3",
"outputName": "unnest-dim3",
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
},
"allowList": null
},
"dimension": {
Expand Down Expand Up @@ -369,9 +377,11 @@ This query joins the `nested_data` table with itself and outputs the unnested da
"condition": "(\"m1\" == \"j0.v0\")",
"joinType": "INNER"
},
"column": "dim3",
"outputName": "unnest-dim3",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -520,13 +530,15 @@ When you run the query, pay special attention to how the total number of rows ha
"type": "table",
"name": "nested_data2"
},
"column": "dim3",
"outputName": "unnest-dim3",
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
},
"allowList": []
},
"column": "dim2",
"outputName": "unnest-dim2",
"allowList": []
"outputName": "unnest-dim2"
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -583,9 +595,11 @@ You can also use the `unnest` datasource to unnest an inline datasource. The fol
]
]
},
"column": "inline_data",
"outputName": "output",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "output",
"expression": "\"inline_data\""
}
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -625,8 +639,11 @@ The following Scan query uses the `nested_data2` table you created in [Load data
"type": "table",
"name": "nested_data2"
},
"column": "v0",
"outputName": "unnest-v0"
"virtualColumn": {
"type": "expression",
"name": "unnest-v0",
"expression": "\"v0\""
}
}
"intervals": {
"type": "intervals",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,14 +104,17 @@ public RelDataType resultTypeForInsert(RelDataTypeFactory typeFactory, RelDataTy
}

@Override
public boolean feature(EngineFeature feature, PlannerContext plannerContext)
public boolean featureAvailable(EngineFeature feature, PlannerContext plannerContext)
{
switch (feature) {
case ALLOW_BINDABLE_PLAN:
case ALLOW_BROADCAST_RIGHTY_JOIN:
case TIMESERIES_QUERY:
case TOPN_QUERY:
case TIME_BOUNDARY_QUERY:
case GROUPING_SETS:
case WINDOW_FUNCTIONS:
case UNNEST:
return false;
case CAN_SELECT:
case CAN_INSERT:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
import org.apache.druid.query.planning.DataSourceAnalysis;
import org.apache.druid.segment.SegmentReference;
import org.apache.druid.segment.UnnestSegmentReference;
import org.apache.druid.segment.VirtualColumn;
import org.apache.druid.utils.JvmUtils;

import javax.annotation.Nullable;
Expand All @@ -48,32 +49,28 @@
public class UnnestDataSource implements DataSource
{
private final DataSource base;
private final String column;
private final String outputName;
private final VirtualColumn virtualColumn;
private final LinkedHashSet<String> allowList;

private UnnestDataSource(
DataSource dataSource,
String columnName,
String outputName,
VirtualColumn virtualColumn,
LinkedHashSet<String> allowList
)
{
this.base = dataSource;
this.column = columnName;
this.outputName = outputName;
this.virtualColumn = virtualColumn;
this.allowList = allowList;
}

@JsonCreator
public static UnnestDataSource create(
@JsonProperty("base") DataSource base,
@JsonProperty("column") String columnName,
@JsonProperty("outputName") String outputName,
@JsonProperty("virtualColumn") VirtualColumn virtualColumn,
@Nullable @JsonProperty("allowList") LinkedHashSet<String> allowList
)
{
return new UnnestDataSource(base, columnName, outputName, allowList);
return new UnnestDataSource(base, virtualColumn, allowList);
}

@JsonProperty("base")
Expand All @@ -82,16 +79,10 @@ public DataSource getBase()
return base;
}

@JsonProperty("column")
public String getColumn()
@JsonProperty("virtualColumn")
public VirtualColumn getVirtualColumn()
{
return column;
}

@JsonProperty("outputName")
public String getOutputName()
{
return outputName;
return virtualColumn;
}

@JsonProperty("allowList")
Expand All @@ -118,7 +109,7 @@ public DataSource withChildren(List<DataSource> children)
if (children.size() != 1) {
throw new IAE("Expected [1] child, got [%d]", children.size());
}
return new UnnestDataSource(children.get(0), column, outputName, allowList);
return new UnnestDataSource(children.get(0), virtualColumn, allowList);
}

@Override
Expand Down Expand Up @@ -151,30 +142,21 @@ public Function<SegmentReference, SegmentReference> createSegmentMapFunction(
);
return JvmUtils.safeAccumulateThreadCpuTime(
cpuTimeAccumulator,
() -> {
if (column == null) {
return segmentMapFn;
} else if (column.isEmpty()) {
return segmentMapFn;
} else {
return
baseSegment ->
new UnnestSegmentReference(
segmentMapFn.apply(baseSegment),
column,
outputName,
allowList
);
}
}
() ->
baseSegment ->
new UnnestSegmentReference(
segmentMapFn.apply(baseSegment),
virtualColumn,
allowList
)
);

}

@Override
public DataSource withUpdatedDataSource(DataSource newSource)
{
return new UnnestDataSource(newSource, column, outputName, allowList);
return new UnnestDataSource(newSource, virtualColumn, allowList);
}

@Override
Expand Down Expand Up @@ -205,24 +187,22 @@ public boolean equals(Object o)
return false;
}
UnnestDataSource that = (UnnestDataSource) o;
return column.equals(that.column)
&& outputName.equals(that.outputName)
return virtualColumn.equals(that.virtualColumn)
&& base.equals(that.base);
}

@Override
public int hashCode()
{
return Objects.hash(base, column, outputName);
return Objects.hash(base, virtualColumn);
}

@Override
public String toString()
{
return "UnnestDataSource{" +
"base=" + base +
", column='" + column + '\'' +
", outputName='" + outputName + '\'' +
", column='" + virtualColumn + '\'' +
", allowList=" + allowList +
'}';
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@
import org.apache.druid.query.QueryToolChest;
import org.apache.druid.query.aggregation.MetricManipulationFn;
import org.apache.druid.segment.VirtualColumn;
import org.apache.druid.segment.column.ColumnCapabilities;
import org.apache.druid.segment.column.ColumnType;
import org.apache.druid.segment.column.RowSignature;
import org.apache.druid.utils.CloseableUtils;
Expand Down Expand Up @@ -172,7 +173,8 @@ public RowSignature resultArraySignature(final ScanQuery query)

final VirtualColumn virtualColumn = query.getVirtualColumns().getVirtualColumn(columnName);
if (virtualColumn != null) {
columnType = virtualColumn.capabilities(columnName).toColumnType();
final ColumnCapabilities capabilities = virtualColumn.capabilities(c -> null, columnName);
columnType = capabilities != null ? capabilities.toColumnType() : null;
} else {
// Unknown type. In the future, it would be nice to have a way to fill these in.
columnType = null;
Expand Down
Loading

0 comments on commit 4b1ffbc

Please sign in to comment.