[SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata #47347
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
LGTM
mllib/src/main/scala/org/apache/spark/mllib/classification/impl/GLMClassificationModel.scala
Thank you for making a PR. I have a question about this pattern, because many of the changes in this PR follow it.
- sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
+ spark.createDataFrame(Seq(Tuple1(metadata))).write.text(Loader.metadataPath(path))
Could you shed some light on why we need this in this PR?
The reasons for doing it are as follows:
I can open a separate PR for replacing RDD with DataFrame when writing the text out (#47347 (comment)). I piggybacked it here because I thought it was trivial. What the code does is virtually the same except for the differences above. This is the same pattern as #47341. We have also changed many RDD usages to Dataset; here is another example: 50e3644. I can bring more examples if you need.
Thank you for explaining the reason. It makes sense.
For the following, if you don't mind, please split it out once more, because it is not related to avoiding repartition.
+1, LGTM (except one minor comment about spin-off). Thank you.
Sure, I will separate the PR. Thanks for reviewing this closely 👍
#47366 👍
Thank you, @HyukjinKwon and all.
…ting metadata

### What changes were proposed in this pull request?

This PR proposes to use SparkSession over SparkContext when writing metadata

### Why are the changes needed?

See #47347 (comment)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47366 from HyukjinKwon/SPARK-48909.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.

### Why are the changes needed?

In order to remove unnecessary shuffle, see also apache#47341

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47347 from HyukjinKwon/SPARK-48896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.

### Why are the changes needed?

In order to remove unnecessary shuffle, see also #47341

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No
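
As a rough illustration of the description's point that the metadata write "already writes one file", the sketch below builds a one-row DataFrame from a local `Seq`, as in the `spark.createDataFrame(Seq(Tuple1(metadata)))` pattern discussed in the review, and reports how many partitions it has before writing it as text without `repartition(1)`. It is not the code from this PR: the object name `MetadataWriteSketch`, the helper `writeMetadata`, the sample JSON string, the temporary output path, and the `local[2]` master are all assumptions made for illustration, and it needs Spark SQL on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical sketch for illustration only; not taken from the PR.
object MetadataWriteSketch {

  // Writes one metadata string as a text file and returns the number of
  // partitions of the DataFrame that was written.
  def writeMetadata(spark: SparkSession, metadata: String, path: String): Int = {
    // A one-element local Seq becomes a one-row DataFrame with a single
    // string column, which is the shape DataFrameWriter.text expects.
    val df = spark.createDataFrame(Seq(Tuple1(metadata)))
    val numPartitions = df.rdd.getNumPartitions
    // No repartition(1) before the write.
    df.write.text(path)
    numPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run
      .appName("metadata-write-sketch")
      .getOrCreate()
    try {
      // The default save mode fails if the target exists, so write to a
      // fresh subdirectory of a temporary directory.
      val path = Files.createTempDirectory("metadata-sketch")
        .resolve("metadata").toString
      val n = writeMetadata(spark, """{"class":"example.Model"}""", path)
      println(s"partitions written: $n")
    } finally {
      spark.stop()
    }
  }
}
```

If the printed partition count is 1, the write produces a single part file, so the extra `repartition(1)` would only add a shuffle.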