Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: extend unnest to support Struct datatype #10429

Merged
merged 22 commits into from
May 22, 2024

Conversation

duongcongtoai
Copy link
Contributor

@duongcongtoai duongcongtoai commented May 8, 2024

Which issue does this PR close?

Closes #10264.

Rationale for this change

What changes are included in this PR?

  • Unnest logical plan is aware of the type of columns to be unnested
  • UnnestExec (physical plan) expands the input batch both vertically and horizontally, depends on the set of columns to be unnested
  • Modify try_process_unnest so it also interested in unnest(struct) expr, and translate this Expr into the projection of multiple columns for each subfield of the current struct in the column. Detailed expected behavior is in this test
  • Add unnest(struct) scehanrios to unnest.slt, and one case where unnest is executed for different columns of type list/struct simultaneously

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels May 8, 2024
@duongcongtoai duongcongtoai changed the title feat: extend unnest for struct feat: extend unnest to support Struct datatype May 8, 2024
@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels May 9, 2024
@github-actions github-actions bot added the optimizer Optimizer rules label May 12, 2024
@duongcongtoai duongcongtoai marked this pull request as ready for review May 19, 2024 20:15
@@ -123,7 +123,8 @@ impl ExprSchemable for Expr {
Ok(field.data_type().clone())
}
DataType::Struct(_) => {
not_impl_err!("unnest() does not support struct yet")
// TODO: this is not correct, because unnest(struct) wll result into multiple data_type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still finding a way to solve this TODO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we end up with an unest Expr (vs an LogicalPlan::Unnest).

Could we simply return not_yet_implemented error here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we returns error here, the error will stop the query plan from continue.
Initially unnest comes into projection step as an Expr here:

exprs.iter().map(|e| e.to_field(input_schema)).collect()
, where it check data types of each expr, then later on we call
try_process_unnest to do the transformation from unnest(struct) -> struct.field1, struct.field 2 ...

I think unnest is the only expr so far that represent a set of fields instead of singular value here, so i'm thinking of adding an extra check at this function

exprs.iter().map(|e| e.to_field(input_schema)).collect()

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @duongcongtoai -- I think this is really cool.

I played around with this PR a bit locally and it works nicely. I have a few suggested tests and some simplification suggestions, but I also think those could be done as follow on PRs. Please let me know what you think

@@ -123,7 +123,8 @@ impl ExprSchemable for Expr {
Ok(field.data_type().clone())
}
DataType::Struct(_) => {
not_impl_err!("unnest() does not support struct yet")
// TODO: this is not correct, because unnest(struct) wll result into multiple data_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we end up with an unest Expr (vs an LogicalPlan::Unnest).

Could we simply return not_yet_implemented error here?

unnest_with_options(input, columns, UnnestOptions::default())
}

pub fn get_unnested_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add some documentation explaining what this function does?

Also, I wonder if it needs to be pub or if we could leave it pub(crate) or not pub?

5 [4, 5] 3 4 {c0: 3, c1: 4}
6 [6] NULL NULL NULL
12 [12] 7 8 {c0: 7, c1: 8}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add some other tests:

  1. Test the output type (e.g. select arrow_typeof(unnest(column1)))
  2. Test nested structs (I tested it works well, but I think we should have some coverage)

Nested structs

> create or replace table t as values (struct('a', 'b', struct('c'))), (struct('d', 'e', struct('f')));
0 row(s) fetched.
Elapsed 0.031 seconds.

> select * from t;
+-----------------------------+
| column1                     |
+-----------------------------+
| {c0: a, c1: b, c2: {c0: c}} |
| {c0: d, c1: e, c2: {c0: f}} |
+-----------------------------+
2 row(s) fetched.
Elapsed 0.006 seconds.

> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a                    | b                    | {c0: c}              |
| d                    | e                    | {c0: f}              |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
Elapsed 0.013 seconds.

And

> create or replace table t as values (struct('a', 'b', [1,2,3])), (struct('x', 'y', [10,20]));
0 row(s) fetched.
Elapsed 0.010 seconds.

> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a                    | b                    | [1, 2, 3]            |
| x                    | y                    | [10, 20]             |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly the output type doesn't seem to work (which probably makes sense as unnest needs to be converted to a LogicalPlan). I don't think we have to fix this in the PR (but adding coverage would be good)

> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a                    | b                    | [1, 2, 3]            |
| x                    | y                    | [10, 20]             |
+----------------------+----------------------+----------------------+
2 row(s) fetched.

> select arrow_typeof(unnest(column1)) from t;
Internal error: unnest on struct can ony be applied at the root level of select expression.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, i added more coverage


// unnest(struct_col)
let original_expr = Expr::Unnest(Unnest {
expr: Box::new(Expr::Column(Column::from_name("struct_col"))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can make this a lot more concise using the expr_fn API, like:

Suggested change
expr: Box::new(Expr::Column(Column::from_name("struct_col"))),
expr: Box::new(col("struct_col"))

In fact maybe we should add an expr_fn for unnest

/// Call `unnest` on the expression
fn unnest(arg: Expr) -> Expr { 
  Expr::Unnest(Unnest { expr:: Box::new(arg) } )
}

let original_expr = Expr::Unnest(Unnest {
expr: Box::new(Expr::Column(Column::from_name("array_col"))),
})
.add(Expr::Literal(ScalarValue::Int64(Some(1))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.add(Expr::Literal(ScalarValue::Int64(Some(1))));
.add(lit(1i64));

There are similar simplifications we could do below too

@alamb alamb changed the title feat: extend unnest to support Struct datatype feat: extend unnest to support Struct datatype May 21, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @duongcongtoai -- really nice improvement

@alamb alamb merged commit 0355713 into apache:main May 22, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: extend unnest for struct

* compile err

* debugging

* finish basic

* chore: complete impl

* chore: clean garbage

* chore: more test

* test: fix df test

* prettify display

* fix unit test

* chore: compile err

* chore: fix physical exec err

* add sqllogic test

* chore: more doc

* chore: refactor

* fix doc

* fmt

* fix doc

* ut for recursive transform unnest

* a small integration test

* fix comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support unnest for struct data type
2 participants