[SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata #47347

HyukjinKwon · 2024-07-14T23:45:22Z

What changes were proposed in this pull request?

This PR proposes to remove repartition(1) when writing metadata in ML/MLlib. It already writes one file.
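
A minimal sketch of why the extra `repartition(1)` is redundant for a single-row write, assuming a local `SparkSession`. The object name `SingleRowWriteSketch`, the helper `writeAndCountPartFiles`, and the temporary output path are hypothetical and not part of this PR; only `createDataFrame`, `write.text`, and `rdd.getNumPartitions` are existing Spark APIs.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration only; not code from this PR.
object SingleRowWriteSketch {

  /** Writes one metadata string as text and returns
    * (number of partitions of the DataFrame, number of part files written). */
  def writeAndCountPartFiles(
      spark: SparkSession,
      metadata: String,
      path: String): (Int, Int) = {
    // A DataFrame built from a one-element local Seq, with no repartition(1).
    val df = spark.createDataFrame(Seq(Tuple1(metadata)))
    val numPartitions = df.rdd.getNumPartitions
    df.write.text(path)
    // Count only data files; _SUCCESS and hidden .crc files are skipped.
    val partFiles = new File(path).listFiles().count(_.getName.startsWith("part-"))
    (numPartitions, partFiles)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("single-row-write-sketch")
      .getOrCreate()
    try {
      // The output directory must not exist yet, so resolve a child of a temp dir.
      val out = Files.createTempDirectory("metadata-sketch").resolve("metadata").toString
      val (numPartitions, partFiles) =
        writeAndCountPartFiles(spark, """{"class":"example"}""", out)
      println(s"partitions=$numPartitions, part files=$partFiles")
    } finally {
      spark.stop()
    }
  }
}
```

On a local session this is expected to report one partition and one part file, which is the behavior the description relies on; a shuffle introduced by `repartition(1)` would not change the number of files written.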

Why are the changes needed?

To remove an unnecessary shuffle. See also #47341.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests should cover this change.

Was this patch authored or co-authored using generative AI tooling?

No

HyukjinKwon · 2024-07-14T23:45:43Z

cc @WeichenXu123 @zhengruifeng @dongjoon-hyun

mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

WeichenXu123

LGTM

mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala

mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

mllib/src/main/scala/org/apache/spark/mllib/classification/impl/GLMClassificationModel.scala

dongjoon-hyun

Thank you for making a PR. I have a question about this pattern because it appears in many of the changes here.

- sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
+ spark.createDataFrame(Seq(Tuple1(metadata))).write.text(Loader.metadataPath(path))

Could you shed some light on why we need this in this PR?

HyukjinKwon · 2024-07-16T00:15:19Z

Could you shed some light on why we need this in this PR?

The reasons for doing it are as follows:

This is because of consistency. We're already using SparkSession to write Parquet.
Using the Data Source API has benefits over using RDDs, as in earlier changes such as [SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding #29063, [SPARK-18362][SQL] Use TextFileFormat in implementation of CSVFileFormat #15813, [SPARK-19918][SQL] Use TextFileFormat in implementation of TextInputJsonDataSource #17255 and SPARK-19918. For example, we can use the compression option when writing the data out (although we don't have a dedicated SQLConf for it yet).
It will use UTF-8 encoded strings, which should be cheaper than plain Unicode JDK string ser/de.
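
A rough sketch of the two write paths from the diff above, plus the compression option mentioned in the second point. The object and method names (`MetadataTextWriteSketch`, `writeWithRdd`, `writeWithDataFrame`, `readBack`) are hypothetical; the `compression` option with values such as `none` and `gzip` is the existing option of the text data source. It assumes the metadata is a single-line string, as compact JSON metadata would be.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helpers for illustration only; not code from this PR.
object MetadataTextWriteSketch {

  /** RDD-based write, mirroring the removed line of the diff. */
  def writeWithRdd(spark: SparkSession, metadata: String, path: String): Unit =
    spark.sparkContext.parallelize(Seq(metadata), 1).saveAsTextFile(path)

  /** DataFrame-based write, mirroring the added line of the diff.
    * The codec goes through the text data source's "compression" option. */
  def writeWithDataFrame(
      spark: SparkSession,
      metadata: String,
      path: String,
      compression: String = "none"): Unit =
    spark.createDataFrame(Seq(Tuple1(metadata)))
      .write
      .option("compression", compression)
      .text(path)

  /** Reads the single metadata line back; compressed files are decoded on read. */
  def readBack(spark: SparkSession, path: String): String =
    spark.read.text(path).first().getString(0)
}
```

With these helpers, `writeWithDataFrame(spark, json, path, "gzip")` followed by `readBack(spark, path)` should return the original string, and the same `readBack` call also works on output produced by `writeWithRdd`, since both paths write plain text lines.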

I can open a separate PR for replacing RDD with DataFrame when writing the text out (#47347 (comment)). I piggybacked it here because I thought it was trivial. What the code does is virtually the same, apart from the differences above.

This is the same pattern as #47341

We have also changed many RDD usages to Dataset. Here is another example: 50e3644. I can provide more examples if needed.

dongjoon-hyun · 2024-07-16T06:44:55Z

Thank you for the reason. It makes sense.

This is because of consistency. We're already using SparkSession to write Parquet.

For the following, if you don't mind, please split it out into a separate PR because it's not related to avoiding repartition.

I can open a separate PR for replacing RDD with DataFrame when writing the text out (#47347 (comment)). I piggybacked it here because I thought it was trivial. What the code does is virtually the same, apart from the differences above.

dongjoon-hyun

+1, LGTM (except one minor comment about spin-off). Thank you.

HyukjinKwon · 2024-07-16T09:46:22Z

Sure, I will separate the PR. Thanks for reviewing this closely 👍

HyukjinKwon · 2024-07-16T10:31:44Z

#47366 👍

dongjoon-hyun · 2024-07-16T15:56:27Z

Thank you, @HyukjinKwon and all.
Merged to master for Apache Spark 4.0.0-preview2.

…ting metadata

### What changes were proposed in this pull request?

This PR proposes to use SparkSession over SparkContext when writing metadata

### Why are the changes needed?

See #47347 (comment)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47366 from HyukjinKwon/SPARK-48909.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

### What changes were proposed in this pull request?

This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.

### Why are the changes needed?

In order to remove unnecessary shuffle, see also apache#47341

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47347 from HyukjinKwon/SPARK-48896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>


Avoid repartition when writing out the metadata

430ed99

HyukjinKwon mentioned this pull request Jul 14, 2024

[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API #47341

Closed

github-actions bot added ML MLLIB labels Jul 14, 2024

HyukjinKwon added 2 commits July 15, 2024 08:54

fixup

4fe591b

consistency

da168c3

HyukjinKwon commented Jul 15, 2024

mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

zhengruifeng approved these changes Jul 15, 2024

WeichenXu123 approved these changes Jul 15, 2024

dongjoon-hyun reviewed Jul 15, 2024

mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala

dongjoon-hyun reviewed Jul 15, 2024

mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

dongjoon-hyun reviewed Jul 15, 2024

mllib/src/main/scala/org/apache/spark/mllib/classification/impl/GLMClassificationModel.scala

dongjoon-hyun requested changes Jul 15, 2024

dongjoon-hyun approved these changes Jul 16, 2024

HyukjinKwon mentioned this pull request Jul 16, 2024

[SPARK-48909][ML][MLLIB] Uses SparkSession over SparkContext when writing metadata #47366

Closed

separate the PR

8349453

dongjoon-hyun closed this in 2e1a39c Jul 16, 2024
