
[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API #47328

Closed

Conversation


@WeichenXu123 (Contributor) commented Jul 12, 2024

What changes were proposed in this pull request?

Replace RDD read / write API invocation with Dataframe read / write API

Why are the changes needed?

In certain storage configurations, the RDD read / write API has issues with storage types that require an account key, while the Dataframe read / write API works.
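
A minimal sketch of the pattern this PR adopts; the `rMetadataJson` / `rMetadataPath` names are taken from the diff discussed below, and `sc` / `sparkSession` are the usual SparkContext and SparkSession handles:

```
// Before: write a one-line JSON metadata string through the RDD API.
sc.parallelize(Seq(rMetadataJson), 1).saveAsTextFile(rMetadataPath)

// After: write the same string through the DataFrame API.
sparkSession.createDataFrame(Seq(Tuple1(rMetadataJson)))
  .repartition(1)
  .write.text(rMetadataPath)
```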

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@zhengruifeng changed the title from [SPARK-48883][ML] Replace RDD read / write API invocation with Dataframe read / write API to [SPARK-48883][ML][R} Replace RDD read / write API invocation with Dataframe read / write API on Jul 12, 2024
@zhengruifeng changed the title from [SPARK-48883][ML][R} Replace RDD read / write API invocation with Dataframe read / write API to [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API on Jul 12, 2024
@WeichenXu123 deleted the ml-df-writer-save-2 branch July 12, 2024 14:21
@dongjoon-hyun (Member) left a comment


Hi, @WeichenXu123 and @zhengruifeng. This is an Apache Software Foundation repository. Please avoid mentioning a specific commercial vendor in the PR description and commit logs.

> In databricks runtime, RDD read / write API has some issue for certain storage types that requires the account key, but Dataframe read / write API works.

```
@@ -129,7 +129,9 @@ private[r] object AFTSurvivalRegressionWrapper extends MLReadable[AFTSurvivalReg
       val rMetadata = ("class" -> instance.getClass.getName) ~
         ("features" -> instance.features.toImmutableArraySeq)
       val rMetadataJson: String = compact(render(rMetadata))
-      sc.parallelize(Seq(rMetadataJson), 1).saveAsTextFile(rMetadataPath)
+      sparkSession.createDataFrame(
```
Member

This is a regression because SparkSession is heavier than SparkContext.

Contributor Author

I think it is not a regression; we are using the active Spark session, not creating a new one.
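
A small sketch of the point being made (not code from the PR): obtaining the session this way reuses whatever session is already active rather than constructing a new one.

```
import org.apache.spark.sql.SparkSession

// getOrCreate() returns the session already active in this JVM if one exists;
// a brand-new session is only built when none is running.
val spark = SparkSession.getActiveSession
  .getOrElse(SparkSession.builder().getOrCreate())
```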

@zhengruifeng (Contributor) commented Jul 13, 2024

This is what the ML Scala code uses for model save and load:

```
private case class Data(coefficients: Vector, intercept: Double, scale: Double)

override protected def saveImpl(path: String): Unit = {
  // Save metadata and Params
  DefaultParamsWriter.saveMetadata(instance, path, sc)
  // Save model data: coefficients, intercept, scale
  val data = Data(instance.coefficients, instance.intercept, instance.scale)
  val dataPath = new Path(path, "data").toString
  sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
}

private class AFTSurvivalRegressionModelReader extends MLReader[AFTSurvivalRegressionModel] {

  /** Checked against metadata when loading model */
  private val className = classOf[AFTSurvivalRegressionModel].getName

  override def load(path: String): AFTSurvivalRegressionModel = {
    val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
    val dataPath = new Path(path, "data").toString
    val data = sparkSession.read.parquet(dataPath)
    val Row(coefficients: Vector, intercept: Double, scale: Double) =
      MLUtils.convertVectorColumnsToML(data, "coefficients")
        .select("coefficients", "intercept", "scale")
        .head()
    val model = new AFTSurvivalRegressionModel(metadata.uid, coefficients, intercept, scale)
    metadata.getAndSetParams(model)
    model
  }
}
```

For this R wrapper, I think it is fine to do it in the same way.

@WeichenXu123 (Contributor Author)

> Hi, @WeichenXu123 and @zhengruifeng. This is an Apache Software Foundation repository. Please avoid mentioning a specific commercial vendor in the PR description and commit logs.
>
> In databricks runtime, RDD read / write API has some issue for certain storage types that requires the account key, but Dataframe read / write API works.

Oh, my bad. I updated the PR description. Thanks for the reminder.

```
-      sc.parallelize(Seq(rMetadataJson), 1).saveAsTextFile(rMetadataPath)
+      sparkSession.createDataFrame(
+        Seq(Tuple1(rMetadataJson))
+      ).repartition(1).write.text(rMetadataPath)
```
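
For completeness, a sketch of what the matching DataFrame-based read side could look like (an illustration under the same variable names, not necessarily the exact code in this PR):

```
// Read the single-line metadata file back and take its one string value.
val rMetadataJson = sparkSession.read.text(rMetadataPath)
  .first().getString(0)
```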
Member

Can we avoid repartition?

@zhengruifeng (Contributor) commented Jul 13, 2024

No, the single partition is needed to avoid creating empty files, even though there is only one row:

```
spark.range(1).select(sf.lit("123")).write.text("/tmp/test-test")

(base) ➜  runtime git:(master) ls /tmp/test-test
_SUCCESS                                                 part-00011-e548682e-21a5-41ea-93bc-0089efccbf4f-c000.txt
part-00000-e548682e-21a5-41ea-93bc-0089efccbf4f-c000.txt
(base) ➜  runtime git:(master) wc -l /tmp/test-test/*
       0 /tmp/test-test/_SUCCESS
       0 /tmp/test-test/part-00000-e548682e-21a5-41ea-93bc-0089efccbf4f-c000.txt
       1 /tmp/test-test/part-00011-e548682e-21a5-41ea-93bc-0089efccbf4f-c000.txt
       1 total
```

But I think we can use `coalesce(1)` to avoid this small shuffle.
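
A sketch of the suggested `coalesce(1)` variant, reusing the names from the diff above:

```
// coalesce(1) narrows the plan to a single partition without a shuffle,
// whereas repartition(1) inserts an Exchange.
sparkSession.createDataFrame(Seq(Tuple1(rMetadataJson)))
  .coalesce(1)
  .write.text(rMetadataPath)
```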

Member

This is a single row and will produce one partition; see `LocalTableScanExec`.

```
spark.createDataFrame(Seq(Tuple1("a"))).write.text("/tmp/abc")
```

writes a single file.
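
A quick spark-shell sketch to check this claim (assumptions: a running `spark` session and an arbitrary writable output path):

```
// A single-row local DataFrame is planned as a LocalTableScan with one partition,
// so the text writer emits just one part file.
val df = spark.createDataFrame(Seq(Tuple1("a")))
println(df.rdd.getNumPartitions) // expected: 1
df.write.text("/tmp/abc-check")
```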

Member

Made a followup: #47341

Contributor

Ah, I see, this `LocalTableScanExec` guarantees single row -> single partition.

It would be better if we could specify `numPartitions` somewhere.

Member

Yeah, I think so, but removing it should work for now.

@HyukjinKwon (Member)

Reverted per the dev mailing list discussion

@dongjoon-hyun (Member)

Thank you for reverting this, Hyukjin.

HyukjinKwon added a commit that referenced this pull request Jul 16, 2024
…aframe read / write API

### What changes were proposed in this pull request?

This PR is a retry of #47328, which replaces the RDD API with the Dataset API to write SparkR metadata; in addition, this PR removes `repartition(1)`. We don't actually need it when the input is a single row, as that already creates only a single partition:

https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57

### Why are the changes needed?

In order to leverage the Catalyst optimizer and SQL engine. For example, we now use UTF-8 encoding instead of plain JDK serialization for strings. We have made similar changes in the past, e.g., #29063, #15813, #17255 and SPARK-19918.

Also, we remove `repartition(1)` to avoid an unnecessary shuffle.

With `repartition(1)`:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
   +- LocalTableScan [_1#0]
```

Without `repartition(1)`:

```
== Physical Plan ==
LocalTableScan [_1#2]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should verify the change

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47341 from HyukjinKwon/SPARK-48883-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
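
For reference, a minimal spark-shell sketch (not part of the commit) that reproduces plans like the ones quoted above; the literal "meta" stands in for the metadata JSON string:

```
val df = spark.createDataFrame(Seq(Tuple1("meta")))
df.repartition(1).explain() // AdaptiveSparkPlan + Exchange SinglePartition over LocalTableScan
df.explain()                // LocalTableScan only
```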
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024