-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: extend unnest
to support Struct datatype
#10429
Conversation
datafusion/expr/src/expr_schema.rs
Outdated
@@ -123,7 +123,8 @@ impl ExprSchemable for Expr { | |||
Ok(field.data_type().clone()) | |||
} | |||
DataType::Struct(_) => { | |||
not_impl_err!("unnest() does not support struct yet") | |||
// TODO: this is not correct, because unnest(struct) wll result into multiple data_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still finding a way to solve this TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we end up with an unest
Expr (vs an LogicalPlan::Unnest
).
Could we simply return not_yet_implemented
error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we returns error here, the error will stop the query plan from continue.
Initially unnest comes into projection step as an Expr here:
datafusion/datafusion/expr/src/utils.rs
Line 734 in d58bae4
exprs.iter().map(|e| e.to_field(input_schema)).collect() |
try_process_unnest to do the transformation from unnest(struct) -> struct.field1, struct.field 2 ...
I think unnest is the only expr so far that represent a set of fields instead of singular value here, so i'm thinking of adding an extra check at this function
datafusion/datafusion/expr/src/utils.rs
Line 734 in d58bae4
exprs.iter().map(|e| e.to_field(input_schema)).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @duongcongtoai -- I think this is really cool.
I played around with this PR a bit locally and it works nicely. I have a few suggested tests and some simplification suggestions, but I also think those could be done as follow on PRs. Please let me know what you think
datafusion/expr/src/expr_schema.rs
Outdated
@@ -123,7 +123,8 @@ impl ExprSchemable for Expr { | |||
Ok(field.data_type().clone()) | |||
} | |||
DataType::Struct(_) => { | |||
not_impl_err!("unnest() does not support struct yet") | |||
// TODO: this is not correct, because unnest(struct) wll result into multiple data_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we end up with an unest
Expr (vs an LogicalPlan::Unnest
).
Could we simply return not_yet_implemented
error here?
unnest_with_options(input, columns, UnnestOptions::default()) | ||
} | ||
|
||
pub fn get_unnested_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add some documentation explaining what this function does?
Also, I wonder if it needs to be pub
or if we could leave it pub(crate)
or not pub?
5 [4, 5] 3 4 {c0: 3, c1: 4} | ||
6 [6] NULL NULL NULL | ||
12 [12] 7 8 {c0: 7, c1: 8} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add some other tests:
- Test the output type (e.g.
select arrow_typeof(unnest(column1))
) - Test nested structs (I tested it works well, but I think we should have some coverage)
Nested structs
> create or replace table t as values (struct('a', 'b', struct('c'))), (struct('d', 'e', struct('f')));
0 row(s) fetched.
Elapsed 0.031 seconds.
> select * from t;
+-----------------------------+
| column1 |
+-----------------------------+
| {c0: a, c1: b, c2: {c0: c}} |
| {c0: d, c1: e, c2: {c0: f}} |
+-----------------------------+
2 row(s) fetched.
Elapsed 0.006 seconds.
> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a | b | {c0: c} |
| d | e | {c0: f} |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
Elapsed 0.013 seconds.
And
> create or replace table t as values (struct('a', 'b', [1,2,3])), (struct('x', 'y', [10,20]));
0 row(s) fetched.
Elapsed 0.010 seconds.
> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a | b | [1, 2, 3] |
| x | y | [10, 20] |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly the output type doesn't seem to work (which probably makes sense as unnest needs to be converted to a LogicalPlan). I don't think we have to fix this in the PR (but adding coverage would be good)
> select unnest(column1) from t;
+----------------------+----------------------+----------------------+
| unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
+----------------------+----------------------+----------------------+
| a | b | [1, 2, 3] |
| x | y | [10, 20] |
+----------------------+----------------------+----------------------+
2 row(s) fetched.
> select arrow_typeof(unnest(column1)) from t;
Internal error: unnest on struct can ony be applied at the root level of select expression.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you, i added more coverage
datafusion/sql/src/utils.rs
Outdated
|
||
// unnest(struct_col) | ||
let original_expr = Expr::Unnest(Unnest { | ||
expr: Box::new(Expr::Column(Column::from_name("struct_col"))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can make this a lot more concise using the expr_fn API, like:
expr: Box::new(Expr::Column(Column::from_name("struct_col"))), | |
expr: Box::new(col("struct_col")) |
In fact maybe we should add an expr_fn for unnest
/// Call `unnest` on the expression
fn unnest(arg: Expr) -> Expr {
Expr::Unnest(Unnest { expr:: Box::new(arg) } )
}
datafusion/sql/src/utils.rs
Outdated
let original_expr = Expr::Unnest(Unnest { | ||
expr: Box::new(Expr::Column(Column::from_name("array_col"))), | ||
}) | ||
.add(Expr::Literal(ScalarValue::Int64(Some(1)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.add(Expr::Literal(ScalarValue::Int64(Some(1)))); | |
.add(lit(1i64)); |
There are similar simplifications we could do below too
unnest
to support Struct datatype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @duongcongtoai -- really nice improvement
* feat: extend unnest for struct * compile err * debugging * finish basic * chore: complete impl * chore: clean garbage * chore: more test * test: fix df test * prettify display * fix unit test * chore: compile err * chore: fix physical exec err * add sqllogic test * chore: more doc * chore: refactor * fix doc * fmt * fix doc * ut for recursive transform unnest * a small integration test * fix comment
Which issue does this PR close?
Closes #10264.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?