[VL] Fix wrong plan equality due to case class inheritance #4893

zhztheplayer · 2024-03-08T05:47:16Z

Some of our classes inherited from Spark's case classes BatchScanExec, FileSourceScanExec, or HiveTableScanExec. Inheriting from a case class is usually considered bad practice because it breaks the case class equality convention.
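
To make the equality problem concrete, here is a minimal sketch with hypothetical stand-in types (`VanillaScan`, `ScanTransformer`), not the actual Spark or Gluten classes. Scala rejects a case class that extends another case class, so the sketch uses a plain class extending a case class. Such a class inherits the `equals` and `hashCode` generated for the parent, which compare only the parent's constructor fields.

```scala
object CaseClassInheritanceDemo {
  // Hypothetical stand-in for a vanilla Spark scan operator such as FileSourceScanExec.
  case class VanillaScan(table: String)

  // Hypothetical stand-in for a transformer that inherits from the case class.
  // It adds a field but does not override equals, hashCode, or canEqual.
  class ScanTransformer(table: String, val pushedFilter: String) extends VanillaScan(table)

  val vanilla: VanillaScan = VanillaScan("t")
  val withFilterA = new ScanTransformer("t", "a > 1")
  val withFilterB = new ScanTransformer("t", "b < 2")

  // The inherited equals only compares `table`, so the extra field is ignored.
  val transformersEqual: Boolean = withFilterA == withFilterB

  // A transformer is even considered equal to the vanilla operator.
  val vanillaEqualsTransformer: Boolean = vanilla == withFilterA
}
```

Both comparisons evaluate to `true` even though the operators differ, which is the kind of wrong plan equality this PR is about.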

This patch fixes the issue by copying vanilla Spark's code into the shim layers as abstract classes.
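
The idea of moving the shared logic into an abstract class can be sketched as follows. The names are hypothetical stand-ins, and the real shim-layer classes copied from Spark are far larger. The point is only that each concrete operator becomes its own case class, so the compiler-generated `equals` covers all of that operator's fields and never matches a different operator type.

```scala
object AbstractBaseDemo {
  // Hypothetical stand-in for the copied vanilla code: a plain abstract class,
  // not a case class, so it contributes no generated equals or hashCode.
  abstract class AbstractScan {
    def table: String
    // Shared logic lives here and is inherited by every concrete operator.
    def describe: String = s"scan of $table"
  }

  // Each concrete operator is a leaf case class with its own generated equality.
  case class VanillaScan(table: String) extends AbstractScan
  case class ScanTransformer(table: String, pushedFilter: String) extends AbstractScan

  val a = ScanTransformer("t", "a > 1")
  val b = ScanTransformer("t", "b < 2")
  val sameAsA = ScanTransformer("t", "a > 1")
  val vanilla: AbstractScan = VanillaScan("t")
}
```

With this layout, `a == b` is false, `a == sameAsA` is true, and `vanilla == a` is false, while `describe` is still shared by both operators.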

Another pending PR requires this change and will serve as a test of this patch, along with all the existing UTs.

The following UTs will be disabled by this patch since they rely on vanilla Spark's non-overridable code that checks for exact scan class types:

File source v2: support passing data filters to FileScan without partitionFilters
File source v2: support partition pruning
disable bucketing when the output doesn't contain all bucketing columns
Fallback Parquet V2 to V1
Aggregates with no groupby over tables having 1 BUCKET, return multiple rows
SPARK-32859: disable unnecessary bucketed table scan - other operators test
SPARK-32859: disable unnecessary bucketed table scan - multiple bucketed columns test
SPARK-32859: disable unnecessary bucketed table scan - multiple joins test
SPARK-32859: disable unnecessary bucketed table scan - basic test
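
Why such tests break can be illustrated with a toy model, again using hypothetical names rather than Spark's real plan classes. A check that collects nodes by matching on the exact vanilla scan type keeps finding a transformer only while the transformer is a subclass of that type. Once the transformer extends a separate abstract base, the same pattern no longer matches.

```scala
object ExactTypeCheckDemo {
  sealed trait Plan

  // Hypothetical shared base, standing in for the copied abstract class.
  abstract class AbstractScan extends Plan { def table: String }
  case class VanillaScan(table: String) extends AbstractScan
  case class ScanTransformer(table: String) extends AbstractScan

  // Mimics a test helper that counts scans by the exact vanilla class type.
  def countVanillaScans(plans: Seq[Plan]): Int =
    plans.collect { case s: VanillaScan => s }.size

  // A check written against the shared base type would see both operators.
  def countAnyScans(plans: Seq[Plan]): Int =
    plans.collect { case s: AbstractScan => s }.size

  val plans: Seq[Plan] = Seq(VanillaScan("t1"), ScanTransformer("t2"))
}
```

Here `countVanillaScans(plans)` finds one node and `countAnyScans(plans)` finds two. In the vanilla UTs the matching code cannot be overridden, which is why the tests above are excluded rather than adapted.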

github-actions · 2024-03-08T05:47:33Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2024-03-08T05:47:48Z

Run Gluten Clickhouse CI

zhztheplayer · 2024-03-08T08:19:57Z

/Benchmark Velox

ulysses-you · 2024-03-12T01:29:22Z

Thank you @zhztheplayer for the improvement. I'm wondering why we need to make our transformer plan inherit from the Spark operator. Is there a historical reason?

Would it be possible to just pass the Spark operator to the transformer plan as a parameter? For example, as Comet does with CometScanExec.

zhztheplayer · 2024-03-12T01:55:24Z

> Thank you @zhztheplayer for the improvement. I'm wondering why we need to make our transformer plan inherit from the Spark operator. Is there a historical reason?

Yes, it was mainly for historical reasons. In the project's earliest phase, we had developers from different backgrounds with varying levels of Scala knowledge.

> Would it be possible to just pass the Spark operator to the transformer plan as a parameter? For example, as Comet does with CometScanExec.

That looks like a good idea, and we should check whether that approach is feasible for Gluten too. It can probably be done in the shim layer without altering too much core module code, although it may require considerable refactoring. Once that is done, we can simply remove the pasted files from this PR.
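
The suggested alternative, passing the Spark operator to the transformer as a parameter, could look roughly like the sketch below. The types are hypothetical stand-ins; this is not how CometScanExec or any Gluten class is actually defined. The transformer wraps the original operator and delegates to it instead of inheriting from it, so equality is derived from the wrapper's own fields.

```scala
object WrappedOperatorDemo {
  // Hypothetical stand-in for a vanilla Spark scan operator.
  case class VanillaScan(table: String, columns: Seq[String]) {
    def outputColumns: Seq[String] = columns
  }

  // The transformer holds the original operator as a field (composition)
  // and forwards the calls it needs rather than extending VanillaScan.
  case class ScanTransformer(wrapped: VanillaScan, pushedFilter: Option[String]) {
    def table: String = wrapped.table
    def outputColumns: Seq[String] = wrapped.outputColumns
  }

  val scan = VanillaScan("t", Seq("a", "b"))
  val plain = ScanTransformer(scan, None)
  val filtered = ScanTransformer(scan, Some("a > 1"))
}
```

In this sketch `plain` and `filtered` wrap the same scan but are not equal, because the generated `equals` of `ScanTransformer` includes `pushedFilter`. Whether the real operators can be wrapped this cheaply is exactly the feasibility question raised above.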

zhztheplayer · 2024-03-12T05:00:57Z

gluten-ut/spark33/src/test/scala/io/glutenproject/utils/clickhouse/ClickHouseTestSettings.scala

+    // DISABLED: GLUTEN-4893 Vanilla UT checks scan operator by exactly matching the class type
+    .exclude("File source v2: support passing data filters to FileScan without partitionFilters")
+    // DISABLED: GLUTEN-4893 Vanilla UT checks scan operator by exactly matching the class type
+    .exclude("File source v2: support partition pruning")


@zzcclp

I am disabling some UTs (including CH ones) that rely on strict type checking of the scan plan. We will fix them case by case in later PRs, probably by pasting Spark's code into our UT folder, though there may be better solutions and we can keep thinking about it. For now it is more important to correct the case class inheritance issue, and I don't have enough resources to do everything in one patch. Would that work for you?

zzcclp · 2024-03-12

It's OK with me.

Yohahaha · 2024-03-12T06:54:43Z

It seems that copying Apache Spark source files into Gluten brings a limitation: users must use the same commit ID or tag as the copied source files. If users have modified these source files in their own Spark, they may need to apply the same changes to Gluten's copies too. Correct me if I'm wrong.

I believe the approach below is better than copying more and more Spark source files into Gluten.

> Is it possible that just pass spark operator to transformer plan as a parameter ? For example, like the comet: CometScanExec

zhztheplayer · 2024-03-12T08:11:05Z

> if user has modified these source file in their own Spark

Thanks for the comment. That sounds like a very valid use case that we should consider. Do you already know of people doing this, especially changing the Spark scan code? That may drive us to migrate to the other solution as soon as possible.

BTW, although ideally there should be no guarantee that a Gluten transformer relies on any vanilla Spark code, that is still the approach we have been following. So, based on my understanding, this is more of a backward-compatibility matter.

zhztheplayer · 2024-03-12T08:35:40Z

> I believe the way below is better than copy more and more Spark source files into Gluten.

To avoid ambiguity: the files are pasted into Gluten with an 'Abstract' prefix, so they should not cause any class type conflicts with vanilla Spark. I understand your point under this assumption: if someone changes vanilla Spark's FileSourceScanExec, and Gluten's FileSourceScanExecTransformer worked fine with it before, it may stop working after this patch because Gluten now uses the unmodified code from vanilla Spark. That may be considered a backward-compatibility issue. However, it is not about directly overriding a class with the same name in Gluten, which could lead to a bunch of problems such as class loading issues. Are we aligned here?

Yohahaha · 2024-03-12T08:38:26Z

> > if user has modified these source file in their own Spark
>
> Thanks for the comment here. It sounds like a very valid use case we should consider about. Do you already know some people are doing things with this way, especially changing Spark scan's code? That may drive us to migrate to the other solution as soon as possible.

I'm not sure how others do it. We maintain an internal branch of vanilla Spark with lots of changes, but we have not changed the scan's interface, so I guess this PR may not introduce conflicts, though I have not verified that yet.

We ran into conflicts before, when ParquetFileFormat was copied in spark33. So copying vanilla Spark's source files into Gluten is a risky approach.

Yohahaha · 2024-03-12T08:44:53Z

> However it's not about directly overriding the same class with same name in Gluten which may lead to a bunch of problems like class loadings. Are we aligned here?

Yes, I'm not going to block this PR. The shim layer is always complex, but copying source code is risky and hard to maintain. I hope we can keep iterating to make the shim layer cleaner. Thank you!

zhztheplayer · 2024-03-12T08:57:28Z

> shim layer always complex, but copy source code is risky and hard to maintain,

Once the abstract classes are added, I actually don't want them to be maintained frequently. They should remain unchanged, just like essential copies. If we observe that they need to be "maintained", we'd remove them quickly anyway. I think we are on the same page here.

zhztheplayer marked this pull request as draft March 8, 2024 05:47

zhztheplayer force-pushed the wip-equality branch from 8d8cfea to b13fc32 Compare March 8, 2024 07:00

zwangsheng mentioned this pull request Mar 8, 2024

[GLUTEN-4896] Upgrade Spark33 version to Spark3.3.4 #4897

Closed

zhztheplayer changed the title ~~WIP: [VL] Fix wrong plan equality due to case class inheritance~~ [VL] Fix wrong plan equality due to case class inheritance Mar 8, 2024

zhztheplayer marked this pull request as ready for review March 8, 2024 08:27

zhztheplayer mentioned this pull request Mar 12, 2024

[CORE] Prior to #4893, add vanilla Spark's original scan source code to keep git history #4931

Merged

zhztheplayer force-pushed the wip-equality branch from d7e35f5 to 6ef4cdc Compare March 12, 2024 06:29

zhztheplayer force-pushed the wip-equality branch from ad9a249 to b48ea32 Compare March 13, 2024 01:12

fixup

061fce2

fixup fixup fixup fixup fixup fixup Unit testing fixup fixup fixup fixup fixup fixup fixes spark34 ut spark33 ut spark32 ut fixup fixup fixup style Spark35 ? Spark34 Spark33 Spark32 Spark32

zhztheplayer added a commit that referenced this pull request Mar 13, 2024

[CORE] Prior to #4893, add vanilla Spark's original scan source code …

5644fc2

…to keep git history

zhztheplayer force-pushed the wip-equality branch from 1d1e5bc to 061fce2 Compare March 13, 2024 08:46

zhztheplayer merged commit 0f0da89 into apache:main Mar 13, 2024
3 checks passed

taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Mar 25, 2024

[CORE] Prior to apache#4893, add vanilla Spark's original scan source…

bd67839

… code to keep git history

yma11 mentioned this pull request Apr 2, 2024

[GLUTEN-5252] Fix condition eliminate for Iceberg/delta scan with filter rel #5246

Closed

taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 8, 2024

[CORE] Prior to apache#4893, add vanilla Spark's original scan source…

3af0c84

… code to keep git history

taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 9, 2024

[CORE] Prior to apache#4893, add vanilla Spark's original scan source…

10de26c

… code to keep git history
