-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part #37706
Conversation
cc @cloud-fan |
*/ | ||
protected def addTagForRootRepartition(plan: LogicalPlan): LogicalPlan = { | ||
var isRootRepartition = true | ||
plan.transformDownWithPruning(_.containsPattern(repartitionTreePattern)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to use a manual traversal here. We can stop traversal once we hit a node that is not repartition/project/filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -1310,6 +1310,14 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper { | |||
assert(topKs.size == 1) | |||
assert(sorts.isEmpty) | |||
} | |||
|
|||
test("SPARK-39915: Dataset.repartition(N) may not create N partitions") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a planner bug... We can probably add an end-to-end test in DataFrameSuite
thanks, merging to master! |
@ulysses-you can you open a backport PR for 3.3? it has conflicts. |
… Non-AQE part Skip optimize the root user-specified repartition in `PropagateEmptyRelation`. Spark should preserve the final repatition which can affect the final output partition which is user-specified. For example: ```scala spark.sql("select * from values(1) where 1 < rand()").repartition(1) // before: == Optimized Logical Plan == LocalTableScan <empty>, [col1#0] // after: == Optimized Logical Plan == Repartition 1, true +- LocalRelation <empty>, [col1#0] ``` yes, the empty plan may change add test Closes apache#37706 from ulysses-you/empty. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thank you @cloud-fan , craeted #37730 |
* Add a [[ROOT_REPARTITION]] tag for the root user-specified repartition so this rule can | ||
* skip optimize it. | ||
*/ | ||
private def addTagForRootRepartition(plan: LogicalPlan): LogicalPlan = plan match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: we can skip this earlier with something like if (!plan.containsPattern(REPARTITION))
…tions Non-AQE part ### What changes were proposed in this pull request? backport #37706 for branch-3.3 Skip optimize the root user-specified repartition in `PropagateEmptyRelation`. ### Why are the changes needed? Spark should preserve the final repatition which can affect the final output partition which is user-specified. For example: ```scala spark.sql("select * from values(1) where 1 < rand()").repartition(1) // before: == Optimized Logical Plan == LocalTableScan <empty>, [col1#0] // after: == Optimized Logical Plan == Repartition 1, true +- LocalRelation <empty>, [col1#0] ``` ### Does this PR introduce _any_ user-facing change? yes, the empty plan may change ### How was this patch tested? add test Closes #37730 from ulysses-you/empty-3.3. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? This pr upgrade Apache Arrow from 13.0.0 to 14.0.0. ### Why are the changes needed? The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes. In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable length vectors([GH-37829](apache/arrow#37829)) and includes C++ libraries for MacOS AARCH 64 in Java-Jars([GH-38076](apache/arrow#38076)). The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class([GH-37246](apache/arrow#37246)). The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages([GH-37702](apache/arrow#37702)). Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String`([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data([GH-37863](apache/arrow#37863)). The full release notes as follows: - https://arrow.apache.org/release/14.0.0.html ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43650 from LuciferYang/arrow-14. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Skip optimize the root user-specified repartition in
PropagateEmptyRelation
.Why are the changes needed?
Spark should preserve the final repatition which can affect the final output partition which is user-specified.
For example:
Does this PR introduce any user-facing change?
yes, the empty plan may change
How was this patch tested?
add test