[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API #47328
@@ -129,7 +129,9 @@ private[r] object AFTSurvivalRegressionWrapper extends MLReadable[AFTSurvivalReg
     val rMetadata = ("class" -> instance.getClass.getName) ~
       ("features" -> instance.features.toImmutableArraySeq)
     val rMetadataJson: String = compact(render(rMetadata))
-    sc.parallelize(Seq(rMetadataJson), 1).saveAsTextFile(rMetadataPath)
+    sparkSession.createDataFrame(
+      Seq(Tuple1(rMetadataJson))
+    ).repartition(1).write.text(rMetadataPath)

     instance.pipeline.save(pipelinePath)
   }

Review thread on the `repartition(1)` call:

> Can we avoid `repartition`?

> No, the single partition is needed to avoid creating empty files, even though there is only one row. But I think we can use […]

> This is a single row, and it will produce one partition; see `spark.createDataFrame(Seq(Tuple1("a"))).write.text("/tmp/abc")`, which writes a single file.

> Made a followup: #47341

> Ah, I see. It will be better if we can specify the […]

> Yeah, I think so, but removing it should work for now.

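To make the write-side change concrete, here is a minimal, self-contained Scala sketch of the two approaches being swapped. This is a non-authoritative illustration, not code from the PR: the `spark` session, the `outputPath` value, and the sample JSON string are all assumptions introduced for the example.

```
// Hypothetical sketch of the API swap discussed above; `spark`,
// `outputPath`, and the sample JSON are assumptions, not PR names.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sketch").getOrCreate()
val outputPath = "/tmp/rMetadata-sketch"
val rMetadataJson = """{"class":"ExampleWrapper","features":[]}"""

// Old approach: RDD API, with the partition count passed explicitly.
// spark.sparkContext.parallelize(Seq(rMetadataJson), 1).saveAsTextFile(outputPath)

// New approach: DataFrame writer. repartition(1) keeps a single
// part file; without it, extra partitions could produce empty files.
spark.createDataFrame(Seq(Tuple1(rMetadataJson)))
  .repartition(1)
  .write.text(outputPath)
```

As the thread notes, a single-row local DataFrame typically yields one partition anyway, which is why the follow-up (#47341) discusses whether `repartition(1)` can be dropped.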
@@ -142,7 +144,8 @@ private[r] object AFTSurvivalRegressionWrapper extends MLReadable[AFTSurvivalReg
     val rMetadataPath = new Path(path, "rMetadata").toString
     val pipelinePath = new Path(path, "pipeline").toString

-    val rMetadataStr = sc.textFile(rMetadataPath, 1).first()
+    val rMetadataStr = sparkSession.read.text(rMetadataPath)
+      .first().getString(0)
     val rMetadata = parse(rMetadataStr)
     val features = (rMetadata \ "features").extract[Array[String]]

Review thread on using `SparkSession` instead of `SparkContext`:

> This is a regression because `SparkSession` is heavier than `SparkContext`.

> I think it is not a regression: we are using the active Spark session, not creating a new one.

> This is what ML Scala uses for model save and load: spark/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala, lines 493 to 523 in 3ae78c4. For this R wrapper, I think it is fine to do it the same way.
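For completeness, the read side of the change can be sketched the same way. Again this is an illustrative sketch, not the PR's code: the `spark` session and `rMetadataPath` are assumed to exist, and the old RDD call is shown in a comment for contrast.

```
// Hypothetical read-side sketch; `spark` and `rMetadataPath` are assumptions.
// Old approach: RDD text read, taking the first line directly.
// val rMetadataStr = spark.sparkContext.textFile(rMetadataPath, 1).first()

// New approach: DataFrame reader. read.text returns a DataFrame with a
// single "value" column, so the string is pulled out with getString(0).
val rMetadataStr = spark.read.text(rMetadataPath)
  .first().getString(0)
```

The extra `.getString(0)` is the only shape difference: `sc.textFile(...).first()` returns a `String`, while `spark.read.text(...).first()` returns a `Row` whose first column holds the line.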