Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VL] Fix wrong plan equality due to case class inheritance #4893

Merged
merged 1 commit into from
Mar 13, 2024

Conversation

zhztheplayer
Copy link
Member

@zhztheplayer zhztheplayer commented Mar 8, 2024

We had some case classes inheriting Spark's case class BatchScanExec or FileSourceScanExec or HiveTableScanExec. The case class inheritance is usually considered a bad practice since it breaks case class's equality convention.

The patch will fix the issue by putting Vanilla spark's code as abstract classes into shim layers.

Another pending PR would require for this change so as the test of this patch, along with all the existing UTs.

The following UTs will be disabled for this path since they are based on vanilla Spark's non-overridable code that checks against exact scan class types:

File source v2: support passing data filters to FileScan without partitionFilters
File source v2: support partition pruning
disable bucketing when the output doesn't contain all bucketing columns
Fallback Parquet V2 to V1
Aggregates with no groupby over tables having 1 BUCKET, return multiple rows
SPARK-32859: disable unnecessary bucketed table scan - other operators test
SPARK-32859: disable unnecessary bucketed table scan - multiple bucketed columns test
SPARK-32859: disable unnecessary bucketed table scan - multiple joins test
SPARK-32859: disable unnecessary bucketed table scan - basic test

@zhztheplayer zhztheplayer marked this pull request as draft March 8, 2024 05:47
Copy link

github-actions bot commented Mar 8, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

github-actions bot commented Mar 8, 2024

Run Gluten Clickhouse CI

1 similar comment
Copy link

github-actions bot commented Mar 8, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Mar 8, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Mar 8, 2024

Run Gluten Clickhouse CI

1 similar comment
Copy link

github-actions bot commented Mar 8, 2024

Run Gluten Clickhouse CI

@zhztheplayer
Copy link
Member Author

/Benchmark Velox

Copy link

github-actions bot commented Mar 8, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer changed the title WIP: [VL] Fix wrong plan equality due to case class inheritance [VL] Fix wrong plan equality due to case class inheritance Mar 8, 2024
@zhztheplayer zhztheplayer marked this pull request as ready for review March 8, 2024 08:27
Copy link

github-actions bot commented Mar 8, 2024

Run Gluten Clickhouse CI

4 similar comments
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@ulysses-you
Copy link
Contributor

Thank you @zhztheplayer for the improvement. I'm wondering why we need to make our transformer plan inherit spark operator. Is there any history reason ?

Is it possible that just pass spark operator to transformer plan as a parameter ? For example, like the comet: CometScanExec

@zhztheplayer
Copy link
Member Author

Thank you @zhztheplayer for the improvement. I'm wondering why we need to make our transformer plan inherit spark operator. Is there any history reason ?

Yes it was mainly because of historical reasons of this project. We used to have different backgrounds of developers with various knowledge on Scala at the very first launching phase of this project.

Is it possible that just pass spark operator to transformer plan as a parameter ? For example, like the comet: CometScanExec

That looks like a good idea and we may need to check if it's feasible to use that approach for Gluten too. Probably it can be done in shim layer without altering too much core module code. Although it may require for considerable refactor work. Once doing that, we can just remove the pasted files from this PR.

Copy link

Run Gluten Clickhouse CI

Comment on lines +342 to +345
// DISABLED: GLUTEN-4893 Vanilla UT checks scan operator by exactly matching the class type
.exclude("File source v2: support passing data filters to FileScan without partitionFilters")
// DISABLED: GLUTEN-4893 Vanilla UT checks scan operator by exactly matching the class type
.exclude("File source v2: support partition pruning")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zzcclp

I am disabling some UTs (including CH ones) that are based on strict type checking of the scan plan. We'd fix them case by case in later PRs but probably by pasting Spark's code to our UT folder. Or there might be better solutions, we can keep thinking of it. At this time it's more important to have the case class inheritance issue corrected, I don't have enough resource to do everything in one patch. So would that work for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to me.

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@Yohahaha
Copy link
Contributor

seems copy Apache Spark source file into Gluten brings a limitation that require user must use same commit id or tag of these copied source file, if user has modified these source file in their own Spark, they may need apply changes in Gluten copied source file too, correct me if I was wrong.

I believe the way below is better than copy more and more Spark source files into Gluten.

Is it possible that just pass spark operator to transformer plan as a parameter ? For example, like the comet: CometScanExec

@zhztheplayer
Copy link
Member Author

zhztheplayer commented Mar 12, 2024

f user has modified these source file in their own Spark

Thanks for the comment here. It sounds like a very valid use case we should consider about. Do you already know some people are doing things with this way, especially changing Spark scan's code? That may drive us to migrate to the other solution as soon as possible.

BTW although ideally a Gluten's transformer should not be guaranteed to rely any code of vanilla Spark, but this is still the way we have been adopting. So based on my understanding it's more related to backward-compatibility.

@zhztheplayer
Copy link
Member Author

I believe the way below is better than copy more and more Spark source files into Gluten.

In case of ambiguity, the files are pasted to Gluten with 'Abstract' prefix so it should not have any class type conflictions with vanilla Spark. I understood your point based on this assumption: If one changes vanilla Spark's FileSourceScanExec, then Gluten's FileSourceScanExecTransformer works fine with it before, then it may not work after this patch since Gluten now uses unmodified code from Vanilla Spark. That may be considered a backward-compatibility issue. However it's not about directly overriding the same class with same name in Gluten which may lead to a bunch of problems like class loadings. Are we aligned here?

@Yohahaha
Copy link
Contributor

f user has modified these source file in their own Spark

Thanks for the comment here. It sounds like a very valid use case we should consider about. Do you already know some people are doing things with this way, especially changing Spark scan's code? That may drive us to migrate to the other solution as soon as possible.

I'm not sure others' way, we maintain an internal branch of vanilla spark with lots of changes, but not change scan's interface, so I guess this PR may not introduce conflicts but not verified yet.

We found conflicts before, when ParquetFileFormat was copied in spark33. So, copy vanilla spark's source file into Gluten is a risky way.

@Yohahaha
Copy link
Contributor

However it's not about directly overriding the same class with same name in Gluten which may lead to a bunch of problems like class loadings. Are we aligned here?

yes, I'm not going to blocking this PR, shim layer always complex, but copy source code is risky and hard to maintain, hope we can keep iterating shim layer more clean, thank you!

Copy link

Run Gluten Clickhouse CI

@zhztheplayer
Copy link
Member Author

zhztheplayer commented Mar 12, 2024

shim layer always complex, but copy source code is risky and hard to maintain,

Once the abstract classes are added, I actually don't want them to be maintained frequently. They should remain no change and just be like some essential copies. If we observed that they need to be "maintained", we'd remove them quickly anyway. I think we are on the same tune at this perspective.

Copy link

Run Gluten Clickhouse CI

2 similar comments
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

2 similar comments
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

fixup

fixup

fixup

fixup

fixup

fixup

Unit testing

fixup

fixup

fixup

fixup

fixup

fixup

fixes

spark34 ut

spark33 ut

spark32 ut

fixup

fixup

fixup

style

Spark35 ?

Spark34

Spark33

Spark32

Spark32
zhztheplayer added a commit that referenced this pull request Mar 13, 2024
Copy link

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer merged commit 0f0da89 into apache:main Mar 13, 2024
3 checks passed
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Mar 25, 2024
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 8, 2024
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants