[SPARK-48988][ML] Make `DefaultParamsReader/Writer` handle metadata with spark session #47467

zhengruifeng · 2024-07-24T02:53:58Z

What changes were proposed in this pull request?

Make DefaultParamsReader/Writer load and save metadata with the Spark session.

Why are the changes needed?

In the existing ML implementations, loading or saving a model reads or writes the metadata with SparkContext and then reads or writes the coefficients with SparkSession.

This PR makes the metadata load/save go through SparkSession as well, by introducing new helper functions.

Note I: third-party libraries (e.g. xgboost) likely depend on the existing implementation of saveMetadata/loadMetadata, so we cannot simply remove them even though they are private[ml].
Note II: this PR only handles loadMetadata and saveMetadata. There are similar cases in the meta algorithms and in param read/write, but I am leaving those for later to avoid touching too many files in a single PR.
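To make the distinction concrete, here is a minimal sketch of the two I/O paths, written in PySpark terms for brevity. It assumes the metadata is a single JSON line stored as text; the four helper names are hypothetical and are not the helpers this PR adds. Only the Spark calls themselves (`parallelize`, `saveAsTextFile`, `textFile`, `createDataFrame`, `read.text`, `write.text`) are real API.

```python
import json


def save_metadata_with_context(sc, metadata_path, metadata):
    """Context-based path: write the JSON line as a one-partition RDD."""
    line = json.dumps(metadata)
    sc.parallelize([line], 1).saveAsTextFile(metadata_path)


def load_metadata_with_context(sc, metadata_path):
    """Context-based path: read the first text line back through an RDD."""
    return json.loads(sc.textFile(metadata_path, 1).first())


def save_metadata_with_session(spark, metadata_path, metadata):
    """Session-based path: write the JSON line as a one-row, one-column DataFrame."""
    line = json.dumps(metadata)
    df = spark.createDataFrame([(line,)], schema="value string")
    df.write.text(metadata_path)


def load_metadata_with_session(spark, metadata_path):
    """Session-based path: read the first row of the text DataFrame."""
    return json.loads(spark.read.text(metadata_path).first()[0])
```

Both paths store the same line of text, so in this sketch metadata written through one should be readable through the other, which is what allows the older context-based helpers to stay in place for external callers.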

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

No

mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala

WeichenXu123

LGTM

HyukjinKwon · 2024-07-24T10:13:20Z

Merged to master.

dongjoon-hyun

+1, LGTM.

Thank you so much, @zhengruifeng , @HyukjinKwon , @WeichenXu123 .

…ith spark session

### What changes were proposed in this pull request?

`DefaultParamsReader/Writer` handle metadata with spark session

### Why are the changes needed?

In the existing ML implementations, loading or saving a model reads or writes the metadata with `SparkContext` and then reads or writes the coefficients with `SparkSession`.

This PR aims to also load/save the metadata with `SparkSession`, by introducing new helper functions.

- Note I: third-party libraries (e.g. [xgboost](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/ml/util/XGBoostReadWrite.scala#L38-L53)) likely depend on the existing implementation of saveMetadata/loadMetadata, so we cannot simply remove them even though they are `private[ml]`.
- Note II: this PR only handles `loadMetadata` and `saveMetadata`. There are similar cases in the meta algorithms and in param read/write, but I am leaving those for later to avoid touching too many files in a single PR.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47467 from zhengruifeng/ml_load_with_spark.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…t spark session

### What changes were proposed in this pull request?

Make model save/load helper functions accept spark session

### Why are the changes needed?

1. avoid unnecessary spark session creations;
2. to be consistent with scala side changes: #47467 and #47477

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47527 from zhengruifeng/py_ml_save_metadata_spark.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
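The first motivation above, avoiding unnecessary session creations, can be illustrated with a small sketch. The helper names `resolve_session` and `read_metadata_line` are hypothetical and are not taken from the PySpark change; the idea shown is only that a helper which accepts the caller's session never needs to look one up or build one itself.

```python
def resolve_session(spark=None):
    """Return the caller's session when given; fall back to creating one otherwise."""
    if spark is not None:
        return spark
    # Imported lazily: only the fallback path needs pyspark to be importable.
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()


def read_metadata_line(metadata_path, spark=None):
    """Read the first text line under metadata_path using the resolved session."""
    session = resolve_session(spark)
    return session.read.text(metadata_path).first()[0]
```

A caller that already holds a session passes it through, for example `read_metadata_line(path, spark=my_session)`, and the fallback branch is never reached.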


github-actions bot added the ML label Jul 24, 2024

zhengruifeng commented Jul 24, 2024

mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala (outdated, resolved)

zhengruifeng added 3 commits July 24, 2024 12:40

init (a237131)

init (1ad1303)

address comments (d829fce)

zhengruifeng force-pushed the ml_load_with_spark branch from f9d6ee8 to d829fce Compare July 24, 2024 04:40

zhengruifeng changed the title ~~[WIP][ML] DefaultParamsReader/Writer handle metadata with spark session~~ [WIP][ML] Make DefaultParamsReader/Writer handle metadata with spark session Jul 24, 2024

HyukjinKwon approved these changes Jul 24, 2024

zhengruifeng changed the title ~~[WIP][ML] Make DefaultParamsReader/Writer handle metadata with spark session~~ [SPARK-48988][ML] Make DefaultParamsReader/Writer handle metadata with spark session Jul 24, 2024

zhengruifeng requested review from dongjoon-hyun and WeichenXu123 July 24, 2024 04:57

WeichenXu123 approved these changes Jul 24, 2024

HyukjinKwon closed this in 8597b78 Jul 24, 2024

zhengruifeng deleted the ml_load_with_spark branch July 24, 2024 10:45

dongjoon-hyun reviewed Jul 24, 2024

dongjoon-hyun mentioned this pull request Jul 24, 2024

[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer #47453

Closed

zhengruifeng mentioned this pull request Jul 30, 2024

[SPARK-49053][PYTHON][ML] Make model save/load helper functions accept spark session #47527

Closed
