Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata #47347

Closed
wants to merge 4 commits into from

Conversation

HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR proposes to remove repartition(1) when writing metadata in ML/MLlib. It already writes one file.

Why are the changes needed?

In order to remove unnecessary shuffle, see also #47341

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests should verify them.

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jul 14, 2024

Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR. I have a question about this pattern because there are too many changes about this.

- sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
+ spark.createDataFrame(Seq(Tuple1(metadata))).write.text(Loader.metadataPath(path))

Could you shed some lights about why we need this in this PR?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jul 16, 2024

Could you shed some lights about why we need this in this PR?

The reasons of doing it is as follows:

  1. This is because of consistency. We're already using SparkSession to write Parquet.
  2. Using Data Source can benefit vs. using RDD as we have made such changes in [SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding #29063, [SPARK-18362][SQL] Use TextFileFormat in implementation of CSVFileFormat #15813, [SPARK-19918][SQL] Use TextFileFormat in implementation of TextInputJsonDataSource #17255 and SPARK-19918. For example, we can use compress option when writing the data out (although we don't have a dedicated SQLConf for it yet).
  3. It will use UTF-8 encoded strings which should be cheaper than plan unicode JDK string ser/de

I can separate PR for replacing RDD to DataFrame when writing the text out (#47347 (comment)). I piggy backed there because I thought it's trivial. The code what it does is virtually same except those differences above.

This is the same pattern as #47341

We also changed many RDD things to Dataset. Here is another example 50e3644. I can bring more examples if you need.

@dongjoon-hyun
Copy link
Member

Thank you for the reason. It makes sense.

This is because of consistency. We're already using SparkSession to write Parquet.

For the following, if you don't mind, please split them once more because it's not related to avoiding repartition.

I can separate PR for replacing RDD to DataFrame when writing the text out (#47347 (comment)). I piggy backed there because I thought it's trivial. The code what it does is virtually same except those differences above.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (except one minor comment about spin-off). Thank you.

@HyukjinKwon
Copy link
Member Author

Sure, I will separate the PR. Thanks for reviewing this closely 👍

@HyukjinKwon
Copy link
Member Author

#47366 👍

@dongjoon-hyun
Copy link
Member

Thank you, @HyukjinKwon and all.
Merged to master for Apache Spark 4.0.0-preview2.

dongjoon-hyun pushed a commit that referenced this pull request Jul 16, 2024
…ting metadata

### What changes were proposed in this pull request?

This PR proposes to use SparkSession over SparkContext when writing metadata

### Why are the changes needed?

See #47347 (comment)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47366 from HyukjinKwon/SPARK-48909.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
### What changes were proposed in this pull request?

This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.

### Why are the changes needed?

In order to remove unnecessary shuffle, see also apache#47341

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47347 from HyukjinKwon/SPARK-48896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…ting metadata

### What changes were proposed in this pull request?

This PR proposes to use SparkSession over SparkContext when writing metadata

### Why are the changes needed?

See apache#47347 (comment)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47366 from HyukjinKwon/SPARK-48909.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.

### Why are the changes needed?

In order to remove unnecessary shuffle, see also apache#47341

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47347 from HyukjinKwon/SPARK-48896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…ting metadata

### What changes were proposed in this pull request?

This PR proposes to use SparkSession over SparkContext when writing metadata

### Why are the changes needed?

See apache#47347 (comment)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47366 from HyukjinKwon/SPARK-48909.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants