
feat: Add GetStructField expression #731

Merged
21 commits merged into apache:main on Aug 3, 2024

Conversation

Contributor

@Kimahriman Kimahriman commented Jul 26, 2024

Which issue does this PR close?

Closes #730

Rationale for this change

To support struct types in expressions, you need to be able to pull out values from structs, which Spark does through the GetStructField expression.

What changes are included in this PR?

Adds a new PhysicalExpr GetStructField that gets a field from within a struct.

Additionally, to support this and to test it:

  • Creates a new trait DataTypeSupport that lets operators opt in to supporting certain data types. This will help incrementally add complex data type support.
  • Updates CometRowToColumnar to support columnar sources, so I can use Spark's vectorized Parquet reader to read complex data and then immediately hand it off to Comet via COMET_ROW_TO_COLUMNAR_SUPPORTED_OPERATOR_LIST. I do this in the unit test I added.
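As a rough sketch of the opt-in idea behind DataTypeSupport (an illustrative Rust stand-in; the actual trait is Scala, and all names and types here are hypothetical):

```rust
// Hypothetical model of an opt-in data type support trait; this is not
// Comet's actual API, just the shape of the idea.
enum DataType {
    Int,
    Utf8,
    Struct(Vec<DataType>),
}

trait DataTypeSupport {
    // Operators override this to opt in to extra (complex) types.
    fn is_additionally_supported(&self, _dt: &DataType) -> bool {
        false
    }

    // Primitive types are always supported; complex types only if the
    // operator opts in, checked recursively for nested fields.
    fn is_type_supported(&self, dt: &DataType) -> bool {
        match dt {
            DataType::Int | DataType::Utf8 => true,
            DataType::Struct(fields) => {
                self.is_additionally_supported(dt)
                    && fields.iter().all(|f| self.is_type_supported(f))
            }
        }
    }
}

// A scan that opts in to structs, and a plain operator that does not.
struct ScanWithStructs;
impl DataTypeSupport for ScanWithStructs {
    fn is_additionally_supported(&self, dt: &DataType) -> bool {
        matches!(dt, DataType::Struct(_))
    }
}

struct PlainOperator;
impl DataTypeSupport for PlainOperator {}
```

The point of the design is that the default stays conservative: an operator that doesn't override anything keeps rejecting complex types, so struct support can be enabled operator by operator.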

How are these changes tested?

A new unit test showing the Comet operators take effect, plus existing Spark tests for functionality.

Contributor Author

@Kimahriman Kimahriman left a comment

Did my best at implementing a new expression. Had to shuffle some type checking around, since nearly all of the type checking is driven by one or two lists of supported types, but I'm only trying to add some struct support. Got all the tests to pass at least.

@codecov-commenter commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 74.54545% with 14 lines in your changes missing coverage. Please review.

Project coverage is 33.66%. Comparing base (bd7834c) to head (9b96633).
Report is 3 commits behind head on main.

Files Patch % Lines
...org/apache/comet/CometSparkSessionExtensions.scala 38.46% 4 Missing and 4 partials ⚠️
.../main/scala/org/apache/comet/DataTypeSupport.scala 76.47% 1 Missing and 3 partials ⚠️
.../scala/org/apache/spark/sql/comet/util/Utils.scala 0.00% 1 Missing ⚠️
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main     #731       +/-   ##
=============================================
- Coverage     53.78%   33.66%   -20.13%     
- Complexity      815      860       +45     
=============================================
  Files           107      111        +4     
  Lines         10279    42679    +32400     
  Branches       1934     9379     +7445     
=============================================
+ Hits           5529    14366     +8837     
- Misses         3773    25350    +21577     
- Partials        977     2963     +1986     


@@ -1128,7 +1133,7 @@ object CometSparkSessionExtensions extends Logging {
// Only consider converting leaf nodes to columnar currently, so that all the following
// operators can have a chance to be converted to columnar.
// TODO: consider converting other intermediate operators to columnar.
op.isInstanceOf[LeafExecNode] && !op.supportsColumnar && isSchemaSupported(op.schema) &&
Member

Could you update the comments here to cover this change in functionality?

Contributor Author

Added a little more, but it's still a bit weird since it's technically still called RowToColumnar.

Member

@andygrove andygrove left a comment

Thanks for the contribution @Kimahriman. I think this looks good.

}
}

impl PhysicalExpr for GetStructField {
Contributor

FWIW, if it's possible to implement these as ScalarUDFImpl instead of PhysicalExpr, that makes reusing them elsewhere easier (at least for me, but maybe for others too, since I think going from ScalarUDF -> PhysicalExpr is easier than the other way around) :)

Contributor Author

I don't know enough about DataFusion to really know what the difference is. On the Spark side, UDFs are usually slightly less performant, so if you don't have to use a UDF you're usually better off. It looks like DataFusion already has a get_field ScalarUDF, but that works by name rather than by index, and there seems to be a lot more ceremony around checking all the input types, whereas the PhysicalExpr is tailored to what we already know about the input data.

Member

I agree that we should start implementing functions as ScalarUDFImpl instead of PhysicalExpr. I think it would be fine to convert this one as a follow on PR.

Contributor Author

What's the tl;dr on the benefits of that? Easier to use outside of the Comet/Spark use case?

Member

Yes, I think it makes it easier for DataFusion users to switch between different function implementations, and it can be used from the logical plan as well as from the physical plan.

Contributor Author

In this particular case GetStructField might not even make sense as a UDF, because it's created by the Spark analyzer from an ExtractValue expression, which resolves the name to an ordinal in the struct. It would be odd to use GetStructField directly, since you would normally want a nested column by name and not by index. I think the existing get_field UDF in DataFusion covers that use case, but using this PhysicalExpr on the Spark side seems to make more sense than converting the ordinal back into a field name just to use get_field.
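To illustrate the split described here, a minimal hypothetical sketch (not Spark's or Comet's actual classes): the analyzer resolves a field name to an ordinal once, so execution-time evaluation is purely index-based.

```rust
// Hypothetical stand-ins for a struct schema and a struct row.
struct StructSchema {
    field_names: Vec<&'static str>,
}

#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

struct StructValue {
    values: Vec<Value>,
}

// Analysis time: ExtractValue-style resolution of a name to an ordinal.
fn resolve_ordinal(schema: &StructSchema, name: &str) -> Option<usize> {
    schema.field_names.iter().position(|f| *f == name)
}

// Execution time: GetStructField-style lookup by ordinal only; no name
// lookup happens per row.
fn get_struct_field(row: &StructValue, ordinal: usize) -> &Value {
    &row.values[ordinal]
}
```

This is why converting the ordinal back to a name just to call a name-based get_field would be backwards: the expensive resolution has already happened at analysis time.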

Member

@andygrove andygrove left a comment

LGTM. Thanks @Kimahriman

@@ -59,12 +59,13 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim
logWarning(s"Comet native execution is disabled due to: $reason")
}

def supportedDataType(dt: DataType): Boolean = dt match {
def supportedDataType(dt: DataType, allowComplex: Boolean = false): Boolean = dt match {
Contributor

Just thinking: do we really need allowComplex? It looks more like a supportComplexTypes, or we could use a Comet config param like it's done for operators/expressions.

Moreover, isSchemaSupported doesn't have this flag, which causes some inconsistency IMHO.

Contributor Author

Yes, there are a lot of oddities in how supported data types work right now. It should really just be operator/expression dependent, and that should bubble up into whether the whole plan is supported or not.

I'm not sure what you mean by

or we can use the Comet param like its done for operator/expressions.

Contributor

Like

  private[comet] def isCometAllOperatorEnabled(conf: SQLConf): Boolean = {
    COMET_EXEC_ALL_OPERATOR_ENABLED.get(conf)
  }

Contributor Author

It'd be a little odd to have a config for struct type support IMO, since either the code supports it or it doesn't. Maybe you would want an override to disable struct support, but you would still need all the same code to support it as well.

Contributor

@parthchandra parthchandra left a comment

lgtm (some minor comments)

@@ -1115,6 +1119,7 @@ object CometSparkSessionExtensions extends Logging {
BinaryType | StringType | _: DecimalType | DateType | TimestampType =>
true
case t: DataType if t.typeName == "timestamp_ntz" => true
case s: StructType => isSchemaSupported(s)
Contributor

Is this change needed? In general, structs are not supported (yet) and since this method is not operator specific, we probably shouldn't have this here.

Contributor Author

See #731 (comment); it's still needed by shouldApplyRowToColumnar.

Contributor

Hmm. I'm concerned that someone might use this method and get unexpected behavior. A comment to explain why this is here would be justified, I think?

Contributor Author

Yeah, it probably makes sense just to move it into CometRowToColumnarExec.

@@ -59,12 +59,13 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim
logWarning(s"Comet native execution is disabled due to: $reason")
}

def supportedDataType(dt: DataType): Boolean = dt match {
def supportedDataType(dt: DataType, allowComplex: Boolean = false): Boolean = dt match {
Contributor

nit: Can we consider renaming this to allowStruct to make it explicit that this is only for structs (and not maps and arrays).

Contributor Author

renamed
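A minimal sketch of what such a flag-gated type check could look like (an illustrative Rust stand-in for the Scala supportedDataType; the DataType variants and names here are hypothetical, not Comet's real code):

```rust
// Hypothetical stand-in for Spark's DataType hierarchy.
enum DataType {
    Boolean,
    Integer,
    Struct(Vec<DataType>),
}

// Primitive types always pass; structs pass only when the caller opts in
// via the flag, and every nested field type must also be supported.
fn supported_data_type(dt: &DataType, allow_struct: bool) -> bool {
    match dt {
        DataType::Boolean | DataType::Integer => true,
        DataType::Struct(fields) => {
            allow_struct
                && fields.iter().all(|f| supported_data_type(f, allow_struct))
        }
    }
}
```

The flag keeps the default behavior unchanged for existing call sites while letting struct-aware expressions opt in, which is the trade-off the renaming discussion above is about.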

@andygrove
Member

Thanks for addressing the feedback @Kimahriman. Could you fix the merge conflict?

@Kimahriman
Contributor Author

Thanks for addressing the feedback @Kimahriman. Could you fix the merge conflict?

Done

Member

@andygrove andygrove left a comment

LGTM. I think this is ready to merge @parthchandra / @kazuyukitanimura ?

Contributor

@comphead comphead left a comment

LGTM, thanks @Kimahriman. This PR definitely brings the benefit.

Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Thanks @Kimahriman. Would you mind merging with the latest main to resolve the conflict?

@Kimahriman
Contributor Author

Thanks @Kimahriman. Would you mind merging with the latest main to resolve the conflict?

Done again!

@andygrove andygrove merged commit 5b5142b into apache:main Aug 3, 2024
74 checks passed
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* Add GetStructField support

* Add custom types to CometBatchScanExec

* Remove test explain

* Rust fmt

* Fix struct type support checks

* Support converting StructArray to native

* fix style

* Attempt to fix scalar subquery issue

* Fix other unit test

* Cleanup

* Default query plan supporting complex type to false

* Migrate struct expressions to spark-expr

* Update shouldApplyRowToColumnar comment

* Add nulls to test

* Rename to allowStruct

* Add DataTypeSupport trait

* Fix parquet datatype test

(cherry picked from commit 5b5142b)