Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various changes and fixes to UNNEST. #13892

Merged
merged 11 commits into from
Mar 10, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/misc/math-expr.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ For logical operators, a number is true if and only if it is positive (0 or nega

[Multi-value string dimensions](../querying/multi-value-dimensions.md) are supported and may be treated as either scalar or array typed values, as follows:
* When treated as a scalar type, the expression is automatically transformed so that the scalar operation is applied across all values of the multi-valued type, mimicking Druid's native behavior.
* Druid coerces values that result in arrays back into the native Druid string type for grouping and aggregation. Grouping on multi-value string dimensions in Druid groups by the individual values, not the 'array'. This behavior produces results similar to the `UNNEST` operator available in many SQL dialects. Alternatively, you can use the `array_to_string` function to perform the aggregation on a _stringified_ version of the complete array and therefore preserve the complete row. To transform the stringified dimension back into the true native array type, use `string_to_array` in an expression post-aggregator.
* Druid coerces values that result in arrays back into the native Druid string type for grouping and aggregation. Grouping on multi-value string dimensions in Druid groups by the individual values, not the 'array'. This behavior produces results similar to an implicit SQL `UNNEST` operation. Alternatively, you can use the `array_to_string` function to perform the aggregation on a _stringified_ version of the complete array and therefore preserve the complete row. To transform the stringified dimension back into the true native array type, use `string_to_array` in an expression post-aggregator.


The following built-in functions are available.
Expand Down
14 changes: 8 additions & 6 deletions docs/querying/datasource.md
Original file line number Diff line number Diff line change
Expand Up @@ -373,9 +373,10 @@ always be correct.

### `unnest`

> The unnest datasource is currently only available as part of a native query.
> The unnest datasource is [experimental](../development/experimental.md). Its API and behavior are subject
> to change in future releases. It is not recommended to use this feature in production at this time.

Use the `unnest` datasource to unnest a column with multiple values in an array.
Use the `unnest` datasource to unnest a column with multiple values in an array.
For example, you have a source column that looks like this:

| Nested |
Expand Down Expand Up @@ -409,7 +410,10 @@ The `unnest` datasource uses the following syntax:
"type": "table",
"name": "nested_data"
},
"column": "nested_source_column",
"virtualColumn": {
"type": "expression",
"expression": "\"column_reference\""
},
"outputName": "unnested_target_column",
"allowList": []
},
Expand All @@ -418,9 +422,7 @@ The `unnest` datasource uses the following syntax:
* `dataSource.type`: Set this to `unnest`.
* `dataSource.base`: Defines the datasource you want to unnest.
* `dataSource.base.type`: The type of datasource you want to unnest, such as a table.
* `dataSource.base.name`: The name of the datasource you want to unnest.
* `dataSource.column`: The name of the source column that contains the nested values.
* `dataSource.outputName`: The name you want to assign to the column that will contain the unnested values. You can replace the source column with the unnested column by specifying the source column's name or a new column by specifying a different name. Outputting it to a new column can help you verify that you get the results that you expect but isn't required.
* `dataSource.virtualColumn`: [Virtual column](virtual-columns.md) that references the nested values. The output name of this column is reused as the name of the column that contains unnested values. You can replace the source column with the unnested column by specifying the source column's name or a new column by specifying a different name. Outputting it to a new column can help you verify that you get the results that you expect but isn't required.
* `dataSource.allowList`: Optional. The subset of values you want to unnest.

To learn more about how to use the `unnest` datasource, see the [unnest tutorial](../tutorials/tutorial-unnest-datasource.md).
11 changes: 5 additions & 6 deletions docs/querying/multi-value-dimensions.md
Original file line number Diff line number Diff line change
Expand Up @@ -140,12 +140,11 @@ This "selector" filter would match row4 of the dataset above:
### Grouping

topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. This can be thought of as the equivalent to the
`UNNEST` operator used on an `ARRAY` type that many SQL dialects support. This means it's possible for a query to return
more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match
only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.
from matching rows will be used to generate one group per value. This behaves similarly to an implicit SQL `UNNEST`
operation. This means it's possible for a query to return more groups than there are rows. For example, a topN on the
dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups:
`t1`, `t2`, and `t3`. If you only need to include values that match your filter, you can use a
[filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance.

## Example: GroupBy query with no filtering

Expand Down
4 changes: 2 additions & 2 deletions docs/querying/sql-data-types.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,8 +78,8 @@ applied to all values for each row individually. Multi-value string dimensions c
[multi-value string functions](sql-multivalue-string-functions.md), which can perform powerful array-aware operations.

Grouping by a multi-value expression observes the native Druid multi-value aggregation behavior, which is similar to
the `UNNEST` functionality available in some other SQL dialects. Refer to the documentation on
[multi-value string dimensions](multi-value-dimensions.md) for additional details.
an implicit SQL `UNNEST`. Refer to the documentation on [multi-value string dimensions](multi-value-dimensions.md)
for additional details.

> Because multi-value dimensions are treated by the SQL planner as `VARCHAR`, there are some inconsistencies between how
> they are handled in Druid SQL and in native queries. For example, expressions involving multi-value dimensions may be
Expand Down
59 changes: 38 additions & 21 deletions docs/tutorials/tutorial-unnest-datasource.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ title: "Tutorial: Unnest data in a column"

> If you're looking for information about how to unnest `COMPLEX<json>` columns, see [Nested columns](../querying/nested-columns.md).

> The unnest datasource is currently only available as part of a native query.
> The unnest datasource is [experimental](../development/experimental.md). Its API and behavior are subject
> to change in future releases. It is not recommended to use this feature in production at this time.

This tutorial demonstrates how to use the unnest datasource to unnest a column that has data stored in arrays. For example, if you have a column named `dim3` with values like `[a,b]` or `[c,d,f]`, the unnest datasource can output the data to a new column with individual rows that contain single values like `a` and `b`. When doing this, be mindful of the following:

Expand Down Expand Up @@ -161,9 +162,11 @@ The following native Scan query returns the rows of the datasource and unnests t
"type": "table",
"name": "nested_data"
},
"column": "dim3",
"outputName": "unnest-dim3",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -222,9 +225,11 @@ The following query returns an unnested version of the column `dim3` as the colu
"dataSource": {
"type": "unnest",
"base": "nested_data",
"column": "dim3",
"outputName": "unnest-dim3",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],
"granularity": "all",
Expand Down Expand Up @@ -264,8 +269,11 @@ The example topN query unnests `dim3` into the column `unnest-dim3`. The query u
"type": "table",
"name": "nested_data"
},
"column": "dim3",
"outputName": "unnest-dim3",
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
},
"allowList": null
},
"dimension": {
Expand Down Expand Up @@ -369,9 +377,11 @@ This query joins the `nested_data` table with itself and outputs the unnested da
"condition": "(\"m1\" == \"j0.v0\")",
"joinType": "INNER"
},
"column": "dim3",
"outputName": "unnest-dim3",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -520,13 +530,15 @@ When you run the query, pay special attention to how the total number of rows ha
"type": "table",
"name": "nested_data2"
},
"column": "dim3",
"outputName": "unnest-dim3",
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
},
"allowList": []
},
"column": "dim2",
"outputName": "unnest-dim2",
"allowList": []
"outputName": "unnest-dim2"
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -583,9 +595,11 @@ You can also use the `unnest` datasource to unnest an inline datasource. The fol
]
]
},
"column": "inline_data",
"outputName": "output",
"allowList": []
"virtualColumn": {
"type": "expression",
"name": "output",
"expression": "\"inline_data\""
}
},
"intervals": {
"type": "intervals",
Expand Down Expand Up @@ -625,8 +639,11 @@ The following Scan query uses the `nested_data2` table you created in [Load data
"type": "table",
"name": "nested_data2"
},
"column": "v0",
"outputName": "unnest-v0"
"virtualColumn": {
"type": "expression",
"name": "unnest-v0",
"expression": "\"v0\""
}
}
"intervals": {
"type": "intervals",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,13 +104,16 @@ public RelDataType resultTypeForInsert(RelDataTypeFactory typeFactory, RelDataTy
}

@Override
public boolean feature(EngineFeature feature, PlannerContext plannerContext)
public boolean featureAvailable(EngineFeature feature, PlannerContext plannerContext)
{
switch (feature) {
case ALLOW_BINDABLE_PLAN:
case TIMESERIES_QUERY:
case TOPN_QUERY:
case TIME_BOUNDARY_QUERY:
case GROUPING_SETS:
case WINDOW_FUNCTIONS:
case UNNEST:
return false;
case CAN_SELECT:
case CAN_INSERT:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
import org.apache.druid.query.planning.DataSourceAnalysis;
import org.apache.druid.segment.SegmentReference;
import org.apache.druid.segment.UnnestSegmentReference;
import org.apache.druid.segment.VirtualColumn;
import org.apache.druid.utils.JvmUtils;

import javax.annotation.Nullable;
Expand All @@ -48,32 +49,28 @@
public class UnnestDataSource implements DataSource
{
private final DataSource base;
private final String column;
private final String outputName;
private final VirtualColumn virtualColumn;
private final LinkedHashSet<String> allowList;

private UnnestDataSource(
DataSource dataSource,
String columnName,
String outputName,
VirtualColumn virtualColumn,
LinkedHashSet<String> allowList
)
{
this.base = dataSource;
this.column = columnName;
this.outputName = outputName;
this.virtualColumn = virtualColumn;
this.allowList = allowList;
}

@JsonCreator
public static UnnestDataSource create(
@JsonProperty("base") DataSource base,
@JsonProperty("column") String columnName,
@JsonProperty("outputName") String outputName,
@JsonProperty("virtualColumn") VirtualColumn virtualColumn,
@Nullable @JsonProperty("allowList") LinkedHashSet<String> allowList
)
{
return new UnnestDataSource(base, columnName, outputName, allowList);
return new UnnestDataSource(base, virtualColumn, allowList);
}

@JsonProperty("base")
Expand All @@ -82,16 +79,10 @@ public DataSource getBase()
return base;
}

@JsonProperty("column")
public String getColumn()
@JsonProperty("virtualColumn")
public VirtualColumn getVirtualColumn()
{
return column;
}

@JsonProperty("outputName")
public String getOutputName()
{
return outputName;
return virtualColumn;
}

@JsonProperty("allowList")
Expand All @@ -118,7 +109,7 @@ public DataSource withChildren(List<DataSource> children)
if (children.size() != 1) {
throw new IAE("Expected [1] child, got [%d]", children.size());
}
return new UnnestDataSource(children.get(0), column, outputName, allowList);
return new UnnestDataSource(children.get(0), virtualColumn, allowList);
}

@Override
Expand Down Expand Up @@ -151,30 +142,21 @@ public Function<SegmentReference, SegmentReference> createSegmentMapFunction(
);
return JvmUtils.safeAccumulateThreadCpuTime(
cpuTimeAccumulator,
() -> {
if (column == null) {
return segmentMapFn;
} else if (column.isEmpty()) {
return segmentMapFn;
} else {
return
baseSegment ->
new UnnestSegmentReference(
segmentMapFn.apply(baseSegment),
column,
outputName,
allowList
);
}
}
() ->
baseSegment ->
new UnnestSegmentReference(
segmentMapFn.apply(baseSegment),
virtualColumn,
allowList
)
);

}

@Override
public DataSource withUpdatedDataSource(DataSource newSource)
{
return new UnnestDataSource(newSource, column, outputName, allowList);
return new UnnestDataSource(newSource, virtualColumn, allowList);
}

@Override
Expand Down Expand Up @@ -205,24 +187,22 @@ public boolean equals(Object o)
return false;
}
UnnestDataSource that = (UnnestDataSource) o;
return column.equals(that.column)
&& outputName.equals(that.outputName)
return virtualColumn.equals(that.virtualColumn)
&& base.equals(that.base);
}

@Override
public int hashCode()
{
return Objects.hash(base, column, outputName);
return Objects.hash(base, virtualColumn);
}

@Override
public String toString()
{
return "UnnestDataSource{" +
"base=" + base +
", column='" + column + '\'' +
", outputName='" + outputName + '\'' +
", column='" + virtualColumn + '\'' +
", allowList=" + allowList +
'}';
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@
import org.apache.druid.query.QueryToolChest;
import org.apache.druid.query.aggregation.MetricManipulationFn;
import org.apache.druid.segment.VirtualColumn;
import org.apache.druid.segment.column.ColumnCapabilities;
import org.apache.druid.segment.column.ColumnType;
import org.apache.druid.segment.column.RowSignature;
import org.apache.druid.utils.CloseableUtils;
Expand Down Expand Up @@ -172,7 +173,8 @@ public RowSignature resultArraySignature(final ScanQuery query)

final VirtualColumn virtualColumn = query.getVirtualColumns().getVirtualColumn(columnName);
if (virtualColumn != null) {
columnType = virtualColumn.capabilities(columnName).toColumnType();
final ColumnCapabilities capabilities = virtualColumn.capabilities(c -> null, columnName);
columnType = capabilities != null ? capabilities.toColumnType() : null;
} else {
// Unknown type. In the future, it would be nice to have a way to fill these in.
columnType = null;
Expand Down
Loading