docs: sql unnest and cleanup unnest datasource #13736
Conversation
I think maybe it is a mistake to document this before we document ARRAY types themselves... #12549
@@ -147,6 +147,12 @@ only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.

## Unnesting

You can unnest a column that contains multi-value dimensions (arrays) by using either the [UNNEST function (SQL)](../querying/sql.md#unnest) and the helper function MV_TO_ARRAY, or the [`unnest` datasource (native)](../querying/datasource.md#unnest).
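To make the SQL route concrete, here is a minimal hedged sketch (the table `my_table` and multi-value column `tags` are hypothetical names, not from the docs under review):

```sql
-- Convert the multi-value string to an array, then unnest it.
-- The table alias ("t") and output column name ("tag") are user-chosen.
SELECT t.tag
FROM my_table, UNNEST(MV_TO_ARRAY(tags)) AS t (tag)
```

The native route expresses the same operation as an `unnest` datasource wrapping the base table in the query JSON.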
> ... multi-value dimensions (arrays) ...

Please do not conflate multi-value dimensions with actual Druid array types; they are totally separate internally. Multi-value strings are represented internally as the native `STRING` type and show up as `VARCHAR` in the SQL schema. Druid also has `ARRAY` types, such as `ARRAY<STRING>`, `ARRAY<LONG>`, etc., which are deliberately not well documented yet so that we don't lock ourselves into a specific behavior prematurely.

Also, it's very nearly pointless to use `UNNEST` on a multi-value string column, because all multi-value strings are implicitly unnested when grouping. The only case for using it on a `STRING` is as part of a scan query.
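A sketch of the contrast this comment draws (hypothetical table `my_table` with multi-value string column `tags`):

```sql
-- Grouping on a multi-value string column already fans out one group per value,
-- so this query needs no UNNEST at all:
SELECT tags, COUNT(*) FROM my_table GROUP BY tags

-- Explicit UNNEST on a STRING column is mainly useful in a scan-style query,
-- where no grouping (and thus no implicit unnesting) happens:
SELECT t.tag
FROM my_table, UNNEST(MV_TO_ARRAY(tags)) AS t (tag)
```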
docs/querying/sql-functions.md
Outdated

`UNNEST(source)) as UNNESTED (target)`

Unnests a source column that includes arrays (multi-value dimensions) into a target column.
Again, please don't conflate multi-value strings with arrays. They are not arrays; they just have special functions that allow interacting with them as if they were actual arrays.
docs/querying/sql.md
Outdated

@@ -82,6 +83,27 @@ FROM clause, metadata tables are not considered datasources. They exist only in
For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
documentation.

## UNNEST

The UNNEST clause unnests values stored in arrays within a column. It's the SQL equivalent to the [unnest datasource](./datasource.md#unnest).
Unnest isn't limited to working on arrays in a column (we don't currently have any array-typed columns). `MV_TO_ARRAY` effectively allows casting a multi-value string to an `ARRAY<STRING>`, but arrays could also come from other virtual columns, such as anything using the `ARRAY` constructor or anything that can build arrays, such as `ARRAY_AGG`.
@@ -1357,6 +1357,13 @@ Truncates a numerical expression to a specific number of decimal digits.

Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.

## UNNEST
Should we be consistent and call it Unnest or Unnesting in all places? `multi-value-dimensions.md` calls this Unnesting.
Based on Clint's comment, I've removed it from the mvd page.
docs/querying/sql.md
Outdated

* The `datasource` for UNNEST can be any of the following:
  * A table, such as `FROM a_table`
  * A subset of a table based on a query, such as `FROM (SELECT columnA,columnB,columnC from a_table)` or a filter.
Can also be a query or a join data source. Basically the data source can be any data source in Druid.
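A hedged sketch of unnesting on top of a join datasource, per this comment (the tables `my_table` and `my_lookup` and their columns are hypothetical):

```sql
-- The left side of the implicit cross join can itself be a join datasource,
-- here wrapped in a subquery that projects the multi-value column.
SELECT u.tag
FROM (SELECT t.tags FROM my_table t JOIN my_lookup l ON t.k = l.k),
  UNNEST(MV_TO_ARRAY(tags)) AS u (tag)
```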
docs/querying/sql.md
Outdated

The following is the general syntax for UNNEST, specifically a query that returns the column that gets unnested:

```sql
SELECT target_column FROM datasource, UNNEST(source) as UNNESTED(target_column)
```
source_expression might be better
The same query can be invoked as `SELECT foo.target_column FROM datasource, UNNEST(source) as foo(target_column)`.

UNNESTED is not a mandatory keyword; the user can use any name they want as a table alias for the unnest. Think of unnest as a cross join between two data sources, each of which can be aliased as a table name.
docs/querying/sql.md
Outdated

* A table, such as `FROM a_table`
* A subset of a table based on a query, such as `FROM (SELECT columnA,columnB,columnC from a_table)` or a filter.
* An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`
* The `source` for the UNNEST function must be an array that exists in the `datasource`. Depending on how the `source` column is formatted, you may need to use helper functions. For example, if your column includes multi-value strings, you'll need to use MV_TO_ARRAY. Or if you're trying to join 2 columns with arrays, you'd need to use `ARRAY_CONCAT(column1,column2)` as the source.
I would reformat slightly differently.

If the dimension you are unnesting is an MVD, you have to specify `MV_TO_ARRAY(dimension)` to convert it to an implicit `ARRAY<>` type. You can also specify any expression that has a SQL array datatype, for example `ARRAY[dim1,dim2]` if you want to make an array out of 2 dimensions, or `ARRAY_CONCAT(dim1,dim2)` if you have to concat two MVDs. The goal is to pass Unnest an ARRAY; the array can come from any expression.
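The three array-producing expressions this comment mentions, sketched as UNNEST sources (the column names `mvd_col`, `dim1`, `dim2`, `mvd1`, `mvd2` are hypothetical; the point is only that each expression yields a SQL array):

```sql
-- 1. A multi-value string dimension, cast via MV_TO_ARRAY:
SELECT u.v FROM my_table, UNNEST(MV_TO_ARRAY(mvd_col)) AS u (v)

-- 2. An array built from two scalar dimensions with the ARRAY constructor:
SELECT u.v FROM my_table, UNNEST(ARRAY[dim1, dim2]) AS u (v)

-- 3. Two MVDs concatenated into one array:
SELECT u.v FROM my_table, UNNEST(ARRAY_CONCAT(mvd1, mvd2)) AS u (v)
```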
docs/querying/sql-functions.md
Outdated

@@ -1357,6 +1357,13 @@ Truncates a numerical expression to a specific number of decimal digits.

Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.

## UNNEST

`UNNEST(source)) as UNNESTED (target)`
This actually should be `UNNEST(source)) as table_alias_name (column_alias_name)`
I am not a fan of always calling the table alias UNNESTED; this can be anything the user wants.
Here are a couple of things on SQL unnest:

- The unnest SQL function does not remove any duplicates or nulls in an array. Nulls are treated like any other value in an array; if there are multiple nulls within the array, a record is created for each null.
- The native unnest has an option of specifying an allowList, which cannot be specified through the SQL counterpart.
- Unnest does not work on arrays inside complex JSON types yet.
- Unnest cannot be used at ingestion time.
- Unnest preserves the ordering of the array being unnested.
- Unnest is not supported in MSQ yet (some work needs to be done there).
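If the duplicate/null and ordering semantics listed above hold, a query like the following (an inline array, so no backing datasource is needed) should produce one row per element, in array order, nulls and duplicates included. Treat this as an illustrative sketch rather than verified output:

```sql
-- Sketch: per the semantics above, expect four rows: 'a', NULL, NULL, 'a'
SELECT * FROM UNNEST(ARRAY['a', NULL, NULL, 'a'])
```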
docs/querying/sql.md
Outdated

@@ -55,6 +55,7 @@ Druid SQL supports SELECT queries with the following structure:
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM { <table> | (<subquery>) | <o1> [ INNER | LEFT ] JOIN <o2> ON condition }
[, UNNEST(<input>) as unnested(<output>) ]
`UNNEST(<input>) as <table_alias>(<output>)`
docs/querying/sql.md
Outdated

* The `datasource` for UNNEST can be any Druid datasource, such as the following:
  * A table, such as `FROM a_table`
  * A subset of a table based on a query, such as `FROM (SELECT columnA,columnB,columnC from a_table)`, a filter, or a JOIN.
  * An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`
Suggested change:
- * An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`
+ * An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`.
select d45 from nested_data, UNNEST(ARRAY_CONCAT(dim4,dim5)) AS UNNESTED (d45)
```

Decide which method to use based on what your goals are.
This seems kind of vague. Expand on it? Or does it need to be added at all?
@@ -55,6 +55,7 @@ Druid SQL supports SELECT queries with the following structure:
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM { <table> | (<subquery>) | <o1> [ INNER | LEFT ] JOIN <o2> ON condition }
[, UNNEST(source_expression) as table_alias_name(column_alias_name) ]
This is the right one; we should update the previous one to use the same name, `<table_alias_name>`.
docs/querying/sql.md
Outdated

* The `datasource` for UNNEST can be any Druid datasource, such as the following:
  * A table, such as `FROM a_table`.
  * A subset of a table based on a query, a filter, or a JOIN. For example, `FROM (SELECT columnA,columnB,columnC from a_table)`.
  * An inline array, which is treated as the `datasource` and the `source_expression`, such as `FROM UNNEST(ARRAY[1,2,3])`.
If we want to unnest a constant inline expression like [1,2,3], we do not need the datasource; the query comes out as `SELECT * FROM UNNEST(ARRAY[1,2,3])`. We need to make the distinction that Unnest can be used independently if you are operating on an explicit inline data source.
docs/querying/sql.md
Outdated

* `ARRAY_CONCAT(dim1,dim2)` if you have to concatenate two multi-value dimensions.
* The `AS table_alias_name(column_alias_name)` clause is not required but is highly recommended. Use it to specify the output, which can be an existing column or a new one. Replace `table_alias_name` and `column_alias_name` with a table and column name you want to alias the unnested results to. If you don't provide this, Druid uses a nondescriptive name, such as `EXPR$0`.

Notice the comma between the datasource and the UNNEST function. This is needed in most cases of the UNNEST function. Specifically, it is not needed when you're unnesting an inline array since the array itself is the datasource.
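The comma rule described above, side by side (the table `my_table` and column `tags` are hypothetical):

```sql
-- Comma form: unnesting a column of an existing datasource (an implicit cross join).
SELECT u.v FROM my_table, UNNEST(MV_TO_ARRAY(tags)) AS u (v)

-- No comma: an inline array is itself the datasource, so nothing joins to it.
SELECT * FROM UNNEST(ARRAY[1, 2, 3])
```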
This is correct; the same should be reflected in the syntax.
If you unnest that same inline array while using a table as the datasource, Druid treats this as a JOIN between a left datasource and a constant datasource. For example:

```sql
SELECT longs FROM nested_data, UNNEST(ARRAY[1,2,3]) AS example_table (longs)
```
Keep consistent on whether or not there's a space between the table alias and column alias? The general syntax in sql.md lists no space: `SELECT column_alias_name FROM datasource, UNNEST(source_expression) AS table_alias_name(column_alias_name)`
The following query uses only three columns from the `nested_data` table as the datasource. From that subset, it unnests the column `dim3` into `d3` and returns `d3`.

```sql
SELECT d3 FROM (select dim1, dim2, dim3 from "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
```
Suggested change:
- SELECT d3 FROM (select dim1, dim2, dim3 from "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
+ SELECT d3 FROM (SELECT dim1, dim2, dim3 FROM "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
* Returns the records for the unnested `d3` that have a `dim2` record that matches the filter

```sql
SELECT d3 FROM (SELCT * FROM nested_data WHERE dim2 IN ('abc')), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
```
Suggested change:
- SELECT d3 FROM (SELCT * FROM nested_data WHERE dim2 IN ('abc')), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
+ SELECT d3 FROM (SELECT * FROM nested_data WHERE dim2 IN ('abc')), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
LGTM after CI pass
Raised a few nits. Overall LGTM
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Description
Adds the documentation for the SQL UNNEST function (#13576) based on the existing layout of the Druid docs. Examples can be found in the tutorial.
It also does some cleanup on the tutorial material that was added for the native version of unnest, mainly condensing the two tables into one so that users only need to ingest one datasource for the tutorial.
If you build the docs locally, here are the URLs for the pages:
This PR has: