[VL] Fix wrong plan equality due to case class inheritance #4893
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/oap-project/gluten/issues Then could you also rename the commit message and pull request title in the following format?
See also:
Thank you @zhztheplayer for the improvement. I'm wondering why we need to make our transformer plan inherit the Spark operator. Is there any historical reason? Would it be possible to just pass the Spark operator to the transformer plan as a parameter? For example, like Comet's CometScanExec.
Yes, it was mainly for historical reasons. In the very first launching phase of this project, developers came from different backgrounds with varying levels of Scala knowledge.
That looks like a good idea, and we may need to check whether it's feasible to use that approach for Gluten too. It could probably be done in the shim layer without altering too much core module code, although it may require considerable refactoring work. Once that's done, we can just remove the pasted files from this PR.
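For illustration, a minimal sketch of the composition approach under discussion, using hypothetical stand-in classes (Comet's CometScanExec similarly keeps the vanilla scan as a field rather than extending it):

```scala
// Hypothetical stand-in for a Spark scan case class.
case class SparkScan(table: String, filters: Seq[String])

// The transformer does not extend the vanilla operator; it receives it as a
// constructor parameter and delegates to it where vanilla behaviour is needed.
case class ScanTransformer(original: SparkScan) {
  def table: String = original.table
  def filters: Seq[String] = original.filters
}

object CompositionDemo extends App {
  val scan = SparkScan("t1", Seq("a > 1"))
  val transformer = ScanTransformer(scan)

  // Equality stays well-defined: a transformer never compares equal to a vanilla scan.
  println(transformer.equals(scan))             // false
  println(transformer == ScanTransformer(scan)) // true
}
```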
// DISABLED: GLUTEN-4893 Vanilla UT checks scan operator by exactly matching the class type
.exclude("File source v2: support passing data filters to FileScan without partitionFilters")
// DISABLED: GLUTEN-4893 Vanilla UT checks scan operator by exactly matching the class type
.exclude("File source v2: support partition pruning")
I am disabling some UTs (including CH ones) that rely on strict type checking of the scan plan. We'd fix them case by case in later PRs, probably by pasting Spark's code into our UT folder, though there might be better solutions we can keep thinking about. At this point it's more important to get the case class inheritance issue corrected, and I don't have enough resources to do everything in one patch. Would that work for you?
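To illustrate what "exactly matching the class type" means here, a minimal sketch with hypothetical stand-in classes (the real UTs match on Spark's scan operators such as FileSourceScanExec):

```scala
// Hypothetical stand-ins; the vanilla tests collect Spark's own scan nodes.
trait PlanNode
case class VanillaFileScan(table: String) extends PlanNode
case class FileScanTransformer(table: String) extends PlanNode // no longer a subclass of the vanilla scan

object ExactTypeCheckDemo extends App {
  // The kind of check the disabled UTs perform: collect scans by their exact class.
  def vanillaScans(plan: Seq[PlanNode]): Seq[VanillaFileScan] =
    plan.collect { case s: VanillaFileScan => s }

  val glutenPlan: Seq[PlanNode] = Seq(FileScanTransformer("t1"))

  // With the old case class inheritance the transformer passed this check;
  // now that it is a separate class, the vanilla assertion fails.
  println(vanillaScans(glutenPlan).nonEmpty) // false
}
```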
It's ok to me.
It seems that copying Apache Spark source files into Gluten brings a limitation: it requires users to use the same commit id or tag as the copied source files. If users have modified those source files in their own Spark, they may need to apply the changes to Gluten's copies too; correct me if I'm wrong. I believe the approach below is better than copying more and more Spark source files into Gluten.
Thanks for the comment. It sounds like a very valid use case we should consider. Do you already know of people working this way, especially changing Spark's scan code? That may drive us to migrate to the other solution as soon as possible. BTW, although ideally a Gluten transformer shouldn't need to rely on any vanilla Spark code, this is still the way we have been adopting, so based on my understanding it's more of a backward-compatibility concern.
To avoid ambiguity, the files are pasted into Gluten with an 'Abstract' prefix, so there should not be any class type conflicts with vanilla Spark. I understood your point based on this assumption: if one changes vanilla Spark's …
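Concretely, the layout is roughly like the sketch below (hypothetical names; the real pasted files live in the shim layers):

```scala
// Pasted vanilla scan code, renamed with an 'Abstract' prefix so its class name
// can never clash with Spark's own FileSourceScanExec.
abstract class AbstractFileSourceScanExec(val table: String) {
  def scanDescription: String = s"vanilla scan logic for $table"
}

// Gluten's transformer is the only case class in the hierarchy, so its generated
// equals/hashCode only ever compare transformers with transformers.
case class FileSourceScanExecTransformer(override val table: String)
  extends AbstractFileSourceScanExec(table)
```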
I'm not sure how others work, but we maintain an internal branch of vanilla Spark with lots of changes. We haven't changed the scan interface, so I guess this PR may not introduce conflicts, though that's not verified yet. We did hit conflicts before, when ParquetFileFormat was copied for spark33. So copying vanilla Spark's source files into Gluten is a risky way.
Yes, I'm not going to block this PR. The shim layer is always complex, but copying source code is risky and hard to maintain. I hope we can keep iterating to make the shim layer cleaner. Thank you!
Once the abstract classes are added, I actually don't want them to be maintained frequently. They should remain unchanged and just act as essential copies. If we observe that they need to be "maintained", we'd remove them quickly anyway. I think we are on the same page in this respect.
We had some case classes inheriting Spark's case classes BatchScanExec, FileSourceScanExec, or HiveTableScanExec. Case class inheritance is usually considered a bad practice since it breaks the case class equality convention. This patch fixes the issue by putting vanilla Spark's code into the shim layers as abstract classes.
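For reference, a minimal sketch of how the equality convention breaks, using hypothetical stand-in classes (in the real code the Gluten transformers extended the Spark scan case classes listed above):

```scala
// Stand-in for a Spark scan case class.
case class VanillaScan(table: String)

// A subclass inherits the generated equals/hashCode, which only look at the
// parent's constructor fields and accept any VanillaScan instance via canEqual.
class ScanTransformer(table: String, val pushedFilters: Seq[String])
  extends VanillaScan(table)

object BrokenEqualityDemo extends App {
  val vanilla = VanillaScan("t1")
  val t1 = new ScanTransformer("t1", Seq("a > 1"))
  val t2 = new ScanTransformer("t1", Seq.empty)

  println(vanilla == t1) // true: a vanilla scan compares equal to a transformer
  println(t1 == t2)      // true: the transformer-only field is ignored entirely
}
```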
Another pending PR will depend on this change, so that PR, along with all the existing UTs, serves as the test of this patch.
The following UTs will be disabled for this patch since they are based on vanilla Spark's non-overridable code that checks against exact scan class types: