
docs: sql unnest and cleanup unnest datasource #13736

Merged: 48 commits, Apr 4, 2023

Conversation

317brian (Contributor) commented Feb 1, 2023

Description

Adds the documentation for the SQL UNNEST function (#13576) based on the existing layout of the Druid docs. Examples can be found in the tutorial.

It also cleans up the tutorial content that was added for the native version of unnest, mainly condensing the two tables into one so that users only need to ingest one thing for the tutorial.

If you build the docs locally, here are the URLs for the pages:

This PR has:

  • been self-reviewed.

clintropolis (Member) left a comment

I think it may be a mistake to document this before we document ARRAY types themselves... #12549

@@ -147,6 +147,12 @@ only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.

## Unnesting

You can unnest a column that contains multi-value dimensions (arrays) by using either the [UNNEST function (SQL)](../querying/sql.md#unnest) together with the helper function MV_TO_ARRAY, or the [`unnest` datasource (native)](../querying/datasource.md#unnest).
Member

... multi-value dimensions (arrays) ...

Please do not conflate multi-value dimensions with actual Druid array types; they are totally separate internally. Multi-value strings are represented internally as the native STRING type and show up as VARCHAR in the SQL schema. Druid also has ARRAY types, such as ARRAY<STRING>, ARRAY<LONG>, etc., that are deliberately not well documented yet, to avoid locking ourselves into a specific behavior prematurely.

Also, it's very nearly pointless to use UNNEST on a multi-value string column, because all multi-value strings have an implicit unnesting that occurs when grouping. The only case for using it on a STRING is as part of a scan query.
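A sketch of the distinction described above, using a hypothetical table `page_data` with a multi-value string column `tags` (names are illustrative, not from the PR):

```sql
-- Grouping on a multi-value string implicitly unnests it,
-- so no UNNEST is needed here:
SELECT tags, COUNT(*) AS cnt
FROM page_data
GROUP BY tags

-- In a scan (non-grouping) query, an explicit UNNEST is the way
-- to get one row per array element:
SELECT __time, tag
FROM page_data, UNNEST(MV_TO_ARRAY(tags)) AS u(tag)
```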


`UNNEST(source) as UNNESTED (target)`

Unnests a source column that includes arrays (multi-value dimensions) into a target column.
Member

again please don't conflate multi-value strings with arrays, they are not arrays, they just have special functions that allow interacting with them as if they were actual arrays.

@@ -82,6 +83,27 @@ FROM clause, metadata tables are not considered datasources. They exist only in
For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
documentation.

## UNNEST

The UNNEST clause unnests values stored in arrays within a column. It's the SQL equivalent to the [unnest datasource](./datasource.md#unnest).
Member

unnest isn't limited to working on arrays in a column (we don't currently have any array typed columns). MV_TO_ARRAY effectively allows casting a multi-value string to an ARRAY<STRING>, but arrays could also come from other virtual columns, such as anything using the ARRAY constructor or anything that can build arrays, such as ARRAY_AGG
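A hedged sketch of the point above, using a hypothetical table `t`: the argument to UNNEST just has to evaluate to an array, whatever expression produced it:

```sql
-- a multi-value string cast to an array with MV_TO_ARRAY:
SELECT v FROM t, UNNEST(MV_TO_ARRAY(mv_col)) AS u(v)

-- an array built on the fly with the ARRAY constructor:
SELECT v FROM t, UNNEST(ARRAY[col_a, col_b]) AS u(v)
```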

@@ -1357,6 +1357,13 @@ Truncates a numerical expression to a specific number of decimal digits.

Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.

## UNNEST
Contributor

Should we be consistent and call it Unnest or Unnesting in all places? multi-value-dimensions.md calls this Unnesting.

Contributor Author

Based on Clint's comment, I've removed it from the mvd page.


* The `datasource` for UNNEST can be any of the following:
* A table, such as `FROM a_table`
* A subset of a table based on a query, such as `FROM (SELECT columnA,columnB,columnC from a_table)` or a filter.
Contributor

Can also be a query or a join datasource. Basically, the datasource can be any datasource in Druid.
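For example, a join can serve as the datasource being unnested. A minimal sketch, assuming hypothetical tables `a_table` and `b_table` and a multi-value column `tags`:

```sql
SELECT u.tag
FROM (
  SELECT a.id, a.tags
  FROM a_table a
  INNER JOIN b_table b ON a.id = b.id
), UNNEST(MV_TO_ARRAY(tags)) AS u(tag)
```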

The following is the general syntax for UNNEST, specifically a query that returns the column that gets unnested:

```sql
SELECT target_column FROM datasource, UNNEST(source) as UNNESTED(target_column)
Contributor

source_expression might be better

Contributor

The same query can be invoked as

SELECT foo.target_column FROM datasource, UNNEST(source) as foo(target_column)

`UNNESTED` is not a mandatory keyword; users can use any name they want as the table alias for the unnest. Think of unnest as a cross join between two datasources, each of which can be aliased as a table name.

* A table, such as `FROM a_table`
* A subset of a table based on a query, such as `FROM (SELECT columnA,columnB,columnC from a_table)` or a filter.
* An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`
* The `source` for the UNNEST function must be an array that exists in the `datasource`. Depending on how the `source` column is formatted, you may need to use helper functions. For example, if your column includes multi-value strings, you'll need to use MV_TO_ARRAY. Or if you're trying to join two columns with arrays, you'd need to use `ARRAY_CONCAT(column1,column2)` as the source.
Contributor

I would reformat slightly differently.

If the dimension you are unnesting is an MVD, you have to specify MV_TO_ARRAY(dimension) to convert it to an implicit ARRAY<> type. You can also specify any expression that has a SQL array datatype: for example, ARRAY[dim1,dim2] if you want to make an array out of two dimensions, or ARRAY_CONCAT(dim1,dim2) if you have to concatenate two MVDs. The goal is to pass UNNEST an ARRAY. The array can come from any expression.
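The three source forms mentioned above, sketched with hypothetical dimensions `dim1` and `dim2` on a hypothetical `a_table`:

```sql
-- an MVD converted with MV_TO_ARRAY:
SELECT v FROM a_table, UNNEST(MV_TO_ARRAY(dim1)) AS u(v)

-- an array built from two dimensions:
SELECT v FROM a_table, UNNEST(ARRAY[dim1, dim2]) AS u(v)

-- two MVDs concatenated:
SELECT v FROM a_table, UNNEST(ARRAY_CONCAT(dim1, dim2)) AS u(v)
```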

@@ -1357,6 +1357,13 @@ Truncates a numerical expression to a specific number of decimal digits.

Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.

## UNNEST

`UNNEST(source) as UNNESTED (target)`
Contributor

This actually should be UNNEST(source) as table_alias_name(column_alias_name)

Contributor

I am not a fan of always calling the table alias UNNESTED; this can be anything the user wants.

Contributor

Here are a couple of things on SQL unnest:

  1. The unnest SQL function does not remove any duplicates or nulls in an array. Nulls are treated like any other value in an array; if there are multiple nulls within the array, a record is created for each of them.
  2. The native unnest has an option of specifying an allowList, which cannot be specified through the SQL counterpart.
  3. Unnest does not work on arrays inside complex JSON types yet.
  4. Unnest cannot be used at ingestion time.
  5. Unnest preserves the ordering of the array being unnested.
  6. Unnest is not supported in MSQ yet (some work needs to be done there).
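If points 1 and 5 hold as described, a standalone inline query would behave like this sketch (illustrative only, not output taken from the PR):

```sql
-- Duplicates and nulls are kept, and array order is preserved,
-- so this should return five rows: 1, NULL, 2, NULL, 2
SELECT n FROM UNNEST(ARRAY[1, NULL, 2, NULL, 2]) AS t(n)
```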

@@ -55,6 +55,7 @@ Druid SQL supports SELECT queries with the following structure:
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM { <table> | (<subquery>) | <o1> [ INNER | LEFT ] JOIN <o2> ON condition }
[, UNNEST(<input>) as unnested(<output>) ]
somu-imply (Contributor) commented Feb 3, 2023

UNNEST(<input>) as <table_alias>(<output>)


Resolved review threads: docs/querying/sql-functions.md, docs/querying/sql.md (3)
* The `datasource` for UNNEST can be any Druid datasource, such as the following:
* A table, such as `FROM a_table`
* A subset of a table based on a query, such as `FROM (SELECT columnA,columnB,columnC from a_table)`, a filter, or a JOIN.
* An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`
Member

Suggested change
* An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`
* An inline array, which is treated as the `datasource` and the `source`, such as `FROM UNNEST(ARRAY[1,2,3])`.

Resolved review thread: docs/tutorials/tutorial-unnest-arrays.md
select d45 from nested_data, UNNEST(ARRAY_CONCAT(dim4,dim5)) AS UNNESTED (d45)
```

Decide which method to use based on what your goals are.
Member

This seems kind of vague. Expand on it? Or does it need to be added at all?

Resolved review threads: docs/tutorials/tutorial-unnest-arrays.md (3)
@@ -55,6 +55,7 @@ Druid SQL supports SELECT queries with the following structure:
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM { <table> | (<subquery>) | <o1> [ INNER | LEFT ] JOIN <o2> ON condition }
[, UNNEST(source_expression) as table_alias_name(column_alias_name) ]
Contributor

This is the right one; we should use the same name, <table_alias_name>, in the previous instance too.

* The `datasource` for UNNEST can be any Druid datasource, such as the following:
* A table, such as `FROM a_table`.
* A subset of a table based on a query, a filter, or a JOIN. For example, `FROM (SELECT columnA,columnB,columnC from a_table)`.
* An inline array, which is treated as the `datasource` and the `source_expression`, such as `FROM UNNEST(ARRAY[1,2,3])`.
Contributor

If we want to unnest a constant inline expression like [1,2,3], we do not need the datasource. The query comes up as `SELECT * FROM UNNEST(ARRAY[1,2,3])`. We need to make the distinction that UNNEST can be used independently when you are operating on an explicit inline datasource.

* `ARRAY_CONCAT(dim1,dim2)` if you have to concatenate two multi-value dimensions.
* The `AS table_alias_name(column_alias_name)` clause is not required but is highly recommended. Use it to specify the output, which can be an existing column or a new one. Replace `table_alias_name` and `column_alias_name` with a table and column name you want to alias the unnested results to. If you don't provide this, Druid uses a nondescriptive name, such as `EXPR$0`.

Notice the comma between the datasource and the UNNEST function. The comma is needed in most uses of UNNEST; it is not needed when you're unnesting an inline array, since the array itself is the datasource.
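Both cases side by side, with a hypothetical `some_table`:

```sql
-- Comma required: UNNEST joined against an existing datasource.
SELECT v FROM some_table, UNNEST(MV_TO_ARRAY(dim1)) AS u(v)

-- No comma: the inline array is itself the datasource.
SELECT * FROM UNNEST(ARRAY[1, 2, 3]) AS u(n)
```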
Contributor

This is correct; the same should be reflected in the syntax.


@317brian 317brian marked this pull request as ready for review February 9, 2023 19:59
@317brian 317brian requested a review from somu-imply February 9, 2023 23:45
If you unnest that same inline array while using a table as the datasource, Druid treats this as a JOIN between a left datasource and a constant datasource. For example:

```sql
SELECT longs FROM nested_data, UNNEST(ARRAY[1,2,3]) AS example_table (longs)
Member

Be consistent about whether there's a space between the table alias and the column alias. The general syntax in sql.md lists no space:

SELECT column_alias_name FROM datasource, UNNEST(source_expression) AS table_alias_name(column_alias_name)

The following query uses only three columns from the `nested_data` table as the datasource. From that subset, it unnests the column `dim3` into `d3` and returns `d3`.

```sql
SELECT d3 FROM (select dim1, dim2, dim3 from "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
Member

Suggested change
SELECT d3 FROM (select dim1, dim2, dim3 from "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
SELECT d3 FROM (SELECT dim1, dim2, dim3 FROM "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)

* Returns the records for the unnested `d3` that have a `dim2` record that matches the filter

```sql
SELECT d3 FROM (SELCT * FROM nested_data WHERE dim2 IN ('abc')), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
Member

Suggested change
SELECT d3 FROM (SELCT * FROM nested_data WHERE dim2 IN ('abc')), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)
SELECT d3 FROM (SELECT * FROM nested_data WHERE dim2 IN ('abc')), UNNEST(MV_TO_ARRAY(dim3)) AS example_table (d3)

317brian and others added 11 commits February 22, 2023 13:19
You can now do the following operations with TupleSketches in Post Aggregation Step

Get the Sketch Output as Base64 String
Provide a constant Tuple Sketch in post-aggregation step that can be used in Set Operations
Get the Estimated Value(Sum) of Summary/Metrics Objects associated with Tuple Sketch
* Bump CycloneDX module to address POM errors

* Including web-console in the PR

---------

Co-authored-by: Elliott Freis <elliottfreis@Elliott-Freis.earth.dynamic.blacklight.net>
…service (apache#13872)

Allow druid-kubernetes-overlord-extensions to be loaded in any druid service
vogievetsky and others added 20 commits March 10, 2023 13:49
* initial renames

* add comaction history diff

* final fixes

* update snapshots

* use maps

* update test
This function is notorious for causing memory exhaustion and excessive
CPU usage; so much so that it was valuable to work around it in the
SQL planner in apache#13206. Hopefully, a warning comment will encourage
developers to stay away and come up with solutions that do not involve
computing all possible buckets.
…sion (apache#13890)

* use Calcites.getColumnTypeForRelDataType for SQL CAST operator conversion

* fix comment

* intervals are strings but also longs
* Use TaskConfig to get task dir in KubernetesTaskRunner

* Use the first path specified in baseTaskDirPaths instead of deprecated baseTaskDirPath

* Use getBaseTaskDirPaths in generate command
* Add validation for aggregations on __time
…apache#13815)

Improved error message when topic name changes within same supervisor

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
This PR is apache#13899 plus spotbugs fix to fix the failures introduced by apache#13815
There was an unused parameter causing the unpack to fail.
* Sort-merge join and hash shuffles for MSQ.

The main changes are in the processing, multi-stage-query, and sql modules.

processing module:

1) Rename SortColumn to KeyColumn, replace boolean descending with KeyOrder.
   This makes it nicer to model hash keys, which use KeyOrder.NONE.

2) Add nullability checkers to the FieldReader interface, and an
   "isPartiallyNullKey" method to FrameComparisonWidget. The join
   processor uses this to detect null keys.

3) Add WritableFrameChannel.isClosed and OutputChannel.isReadableChannelReady
   so callers can tell which OutputChannels are ready for reading and which
   aren't.

4) Specialize FrameProcessors.makeCursor to return FrameCursor, a random-access
   implementation. The join processor uses this to rewind when it needs to
   replay a set of rows with a particular key.

5) Add MemoryAllocatorFactory, which is embedded inside FrameWriterFactory
   instead of a particular MemoryAllocator. This allows FrameWriterFactory
   to be shared in more scenarios.

multi-stage-query module:

1) ShuffleSpec: Add hash-based shuffles. New enum ShuffleKind helps callers
   figure out what kind of shuffle is happening. The change from SortColumn
   to KeyColumn allows ClusterBy to be used for both hash-based and sort-based
   shuffling.

2) WorkerImpl: Add ability to handle hash-based shuffles. Refactor the logic
   to be more readable by moving the work-order-running code to the inner
   class RunWorkOrder, and the shuffle-pipeline-building code to the inner
   class ShufflePipelineBuilder.

3) Add SortMergeJoinFrameProcessor and factory.

4) WorkerMemoryParameters: Adjust logic to reserve space for output frames
   for hash partitioning. (We need one frame per partition.)

sql module:

1) Add sqlJoinAlgorithm context parameter; can be "broadcast" or
   "sortMerge". With native, it must always be "broadcast", or it's a
   validation error. MSQ supports both. Default is "broadcast" in
   both engines.

2) Validate that MSQs do not use broadcast join with RIGHT or FULL join,
   as results are not correct for broadcast join with those types. Allow
   this in native for two reasons: legacy (the docs caution against it,
   but it's always been allowed), and the fact that it actually *does*
   generate correct results in native when the join is processed on the
   Broker. It is much less likely that MSQ will plan in such a way that
   generates correct results.

3) Remove subquery penalty in DruidJoinQueryRel when using sort-merge
   join, because subqueries are always required, so there's no reason
   to penalize them.

4) Move previously-disabled join reordering and manipulation rules to
   FANCY_JOIN_RULES, and enable them when using sort-merge join. Helps
   get to better plans where projections and filters are pushed down.

* Work around compiler problem.

* Updates from static analysis.

* Fix @param tag.

* Fix declared exception.

* Fix spelling.

* Minor adjustments.

* wip

* Merge fixups

* fixes

* Fix CalciteSelectQueryMSQTest

* Empty keys are sortable.

* Address comments from code review. Rename mux -> mix.

* Restore inspection config.

* Restore original doc.

* Reorder imports.

* Adjustments

* Fix.

* Fix imports.

* Adjustments from review.

* Update header.

* Adjust docs.
* fix KafkaInputFormat when used with Sampler API

* handle key format sampling the same as value format sampling
fix OOMs using a different logic for generating tombstones

---------

Co-authored-by: Paul Rogers <paul-rogers@users.noreply.github.com>

* Avoid creating new RelDataTypeFactory during SQL planning.

Reduces unnecessary CPU cycles.

* Fix.
…#13897)

* use custom case operator conversion instead of direct operator conversion, to produce native nvl expression for SQL NVL and 2 argument COALESCE, and add optimization for certain case filters from coalesce and nvl statements
somu-imply (Contributor) left a comment

LGTM after CI pass

vtlim (Member) left a comment

Raised a few nits. Overall LGTM

Resolved review threads: docs/querying/sql.md (4), docs/tutorials/tutorial-unnest-arrays.md (4)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Resolved review threads: docs/querying/sql.md (2)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@clintropolis clintropolis added this to the 26.0 milestone Apr 1, 2023
@vtlim vtlim merged commit 7e572ee into apache:master Apr 4, 2023