[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer #47453

WeichenXu123 · 2024-07-23T00:39:20Z

What changes were proposed in this pull request?

SparkSession.getActiveSession returns a thread-local session, but the Spark ML reader / writer may run in a different thread, in which case SparkSession.getActiveSession returns None. This PR avoids using it in the Spark ML reader / writer.

Why are the changes needed?

It fixes errors such as:

        spark = SparkSession.getActiveSession()
>       spark.createDataFrame(  # type: ignore[union-attr]
            [(metadataJson,)], schema=["value"]
        ).coalesce(1).write.text(metadataPath)
E       AttributeError: 'NoneType' object has no attribute 'createDataFrame'
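For readers unfamiliar with thread-local lookups, the failure mode can be reproduced in plain Python without Spark. The sketch below is only an analogy: `set_active_session`, `get_active_session`, and `lookup_in_worker` are hypothetical helpers, not PySpark APIs, and the "session" is just a string.

```python
import threading

# Hypothetical stand-in for a per-thread "active session" slot (not PySpark code).
_active = threading.local()


def set_active_session(session):
    """Record the session for the calling thread only."""
    _active.session = session


def get_active_session():
    """Return the calling thread's session, or None if it never set one."""
    return getattr(_active, "session", None)


def lookup_in_worker():
    """Run the lookup in a fresh thread and return what that thread saw."""
    seen = []
    worker = threading.Thread(target=lambda: seen.append(get_active_session()))
    worker.start()
    worker.join()
    return seen[0]


set_active_session("main-thread-session")
main_view = get_active_session()    # set by this thread, so it is visible here
worker_view = lookup_in_worker()    # the worker thread never set a session
print(main_view, worker_view)
```

Calling a method on `worker_view` would raise the same kind of `AttributeError` on `NoneType` as in the traceback above.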

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually.

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2024-07-23T11:20:04Z

merged to master.

dongjoon-hyun · 2024-07-23T13:58:34Z

mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala

@@ -588,7 +588,7 @@ private[ml] object DefaultParamsReader {
   */
  def loadMetadata(path: String, sc: SparkContext, expectedClassName: String = ""): Metadata = {
    val metadataPath = new Path(path, "metadata").toString
-    val spark = SparkSession.getActiveSession.get
+    val spark = SparkSession.builder().sparkContext(sc).getOrCreate()


Hi, @WeichenXu123 , @HyukjinKwon , @zhengruifeng .

This sounds like a regression of

[SPARK-48909][ML][MLLIB] Uses SparkSession over SparkContext when writing metadata #47366

If we cannot get an existing one, I believe we should not create SparkSession here.

Can we recover the existing code?

HyukjinKwon · Jul 24, 2024 (reply)

It will not be a regression. This is Spark ML, which is DataFrame-based MLlib by definition, so there should always be a default session running. The active session is specific to a thread, so it might not exist in the thread that runs the reader / writer. Alternatively, we could use SparkSession.getDefaultSession.
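The distinction between an active (per-thread) session and a default (process-wide) session, and the option of never creating one implicitly, might be sketched as follows. This is a plain-Python model under stated assumptions: `SessionRegistry` and its methods are hypothetical and do not mirror Spark's actual implementation.

```python
import threading
from typing import Optional


class SessionRegistry:
    """Hypothetical registry: a per-thread active slot plus one shared default."""

    def __init__(self) -> None:
        self._active = threading.local()
        self._default: Optional[str] = None
        self._lock = threading.Lock()

    def set_active(self, session: str) -> None:
        # Visible only to the calling thread.
        self._active.session = session

    def set_default(self, session: str) -> None:
        # Visible to every thread in the process.
        with self._lock:
            self._default = session

    def resolve(self) -> str:
        """Prefer the active session, fall back to the default, never create one."""
        active = getattr(self._active, "session", None)
        if active is not None:
            return active
        with self._lock:
            default = self._default
        if default is None:
            raise RuntimeError("no active or default session available")
        return default


registry = SessionRegistry()
registry.set_default("default-session")
registry.set_active("active-session")

seen_in_worker = []
worker = threading.Thread(target=lambda: seen_in_worker.append(registry.resolve()))
worker.start()
worker.join()
```

In this model the main thread resolves its own active session, while the worker thread, which has no active session, falls back to the shared default instead of failing or creating a new one.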

dongjoon-hyun · 2024-07-23T13:59:14Z

python/pyspark/ml/util.py

-        spark.createDataFrame(  # type: ignore[union-attr]
-            [(metadataJson,)], schema=["value"]
-        ).coalesce(1).write.text(metadataPath)
+        spark = SparkSession._getActiveSessionOrCreate()


dongjoon-hyun · 2024-07-23T13:59:21Z

python/pyspark/ml/util.py

@@ -580,8 +580,8 @@ def loadMetadata(path: str, sc: "SparkContext", expectedClassName: str = "") ->
            If non empty, this is checked against the loaded metadata.
        """
        metadataPath = os.path.join(path, "metadata")
-        spark = SparkSession.getActiveSession()
-        metadataStr = spark.read.text(metadataPath).first()[0]  # type: ignore[union-attr,index]
+        spark = SparkSession._getActiveSessionOrCreate()


dongjoon-hyun · 2024-07-23T14:14:22Z

Initially, the existing PRs assumed that there would be no regression because we use the active session. AFAIK, the same assumption was made in the dev mailing list discussion.

https://lists.apache.org/thread/s24lqtmno0xtoxxz6pk6tyn726bfwp8q

Is this regression inevitable, @HyukjinKwon?

If so, could you add documentation stating that the ML module now always uses SparkSession instead of SparkContext?
If that is the module's new minimum requirement, we don't need to discuss this topic again.

dongjoon-hyun · 2024-07-23T14:33:22Z

I replied on the existing thread.

https://lists.apache.org/thread/ks68nys5n5cr129gx35gnx55t4k1x2qb

HyukjinKwon · 2024-07-24T00:25:39Z

There is no regression. This is Spark ML, which is DataFrame-based MLlib. There should always be a running Spark session.

zhengruifeng · 2024-07-24T01:06:29Z

@dongjoon-hyun
DefaultParamsReader.loadMetadata is only used to load the metadata of ML models. Let me take LogisticRegressionModel as an example:

spark/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Lines 1322 to 1358 in 2e1a39c

    
    private class LogisticRegressionModelReader extends MLReader[LogisticRegressionModel] {
      /** Checked against metadata when loading model */
      private val className = classOf[LogisticRegressionModel].getName
      override def load(path: String): LogisticRegressionModel = {
        val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
        val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
        val dataPath = new Path(path, "data").toString
        val data = sparkSession.read.format("parquet").load(dataPath)
        val model = if (major < 2 || (major == 2 && minor == 0)) {
          // 2.0 and before
          val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) =
            MLUtils.convertVectorColumnsToML(data, "coefficients")
              .select("numClasses", "numFeatures", "intercept", "coefficients")
              .head()
          val coefficientMatrix =
            new DenseMatrix(1, coefficients.size, coefficients.toArray, isTransposed = true)
          val interceptVector = Vectors.dense(intercept)
          new LogisticRegressionModel(metadata.uid, coefficientMatrix,
            interceptVector, numClasses, isMultinomial = false)
        } else {
          // 2.1+
          val Row(numClasses: Int, numFeatures: Int, interceptVector: Vector,
          coefficientMatrix: Matrix, isMultinomial: Boolean) = data
            .select("numClasses", "numFeatures", "interceptVector", "coefficientMatrix",
              "isMultinomial").head()
          new LogisticRegressionModel(metadata.uid, coefficientMatrix, interceptVector,
            numClasses, isMultinomial)
        }
        metadata.getAndSetParams(model)
        model
      }
    }

val metadata = DefaultParamsReader.loadMetadata(path, sc, className)

loads the metadata

val data = sparkSession.read.format("parquet").load(dataPath)

then loads the model coefficients. As you can see, the sparkSession is already available for model loading.
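The version branch in the reader above (`major < 2 || (major == 2 && minor == 0)`) can be illustrated in isolation. The following is a small Python sketch of that check only; `major_minor_version` and `uses_legacy_layout` are hypothetical names for illustration and are not a port of `VersionUtils`.

```python
import re
from typing import Tuple


def major_minor_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.1.0'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def uses_legacy_layout(spark_version: str) -> bool:
    """True for metadata written by 2.0 and before, mirroring the branch above."""
    major, minor = major_minor_version(spark_version)
    return major < 2 or (major == 2 and minor == 0)


for candidate in ("1.6.3", "2.0.2", "2.1.0", "4.0.0-SNAPSHOT"):
    print(candidate, uses_legacy_layout(candidate))
```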

zhengruifeng · 2024-07-24T01:07:52Z

I think probably we can change the signature of

def loadMetadata(path: String, sc: SparkContext, expectedClassName: String = ""): Metadata

to

def loadMetadata(path: String, spark: SparkSession, expectedClassName: String = ""): Metadata

to avoid such confusion.

I will give it a try.
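The idea of passing the session explicitly, rather than looking it up from thread-local or global state, might look like the sketch below. Everything here is an assumption made for illustration: `InMemorySession` is a hypothetical stand-in for a real session, `load_metadata` is not the PySpark function, and the `"class"` and `"uid"` keys are assumed metadata fields.

```python
import json
import posixpath
from typing import Any, Dict, Mapping


class InMemorySession:
    """Hypothetical stand-in for a session: maps paths to file contents."""

    def __init__(self, files: Mapping[str, str]) -> None:
        self._files = dict(files)

    def read_text(self, path: str) -> str:
        return self._files[path]


def load_metadata(
    path: str, session: InMemorySession, expected_class_name: str = ""
) -> Dict[str, Any]:
    """Load metadata with the session supplied by the caller.

    If expected_class_name is non-empty, it is checked against the loaded metadata.
    """
    metadata_path = posixpath.join(path, "metadata")
    metadata = json.loads(session.read_text(metadata_path))
    # "class" is an assumed key name for the stored class name.
    if expected_class_name and metadata["class"] != expected_class_name:
        raise ValueError(
            f"expected class {expected_class_name!r} but found {metadata['class']!r}"
        )
    return metadata


session = InMemorySession(
    {"/models/lr/metadata": json.dumps({"class": "LogisticRegressionModel", "uid": "lr_1"})}
)
loaded = load_metadata("/models/lr", session, "LogisticRegressionModel")
```

Because the caller hands over the session, the function behaves the same regardless of which thread runs it, which is the ambiguity the signature change aims to remove.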

dongjoon-hyun · 2024-07-24T01:41:21Z

Thank you, @HyukjinKwon and @zhengruifeng. I'm +1 for both, to have clear semantics.

1. Using SparkSession.getDefaultSession instead of *OrCreate.

   > Alternatively we could use SparkSession.getDefaultSession.

2. Having clear semantics: def loadMetadata(path: String, spark: SparkSession, expectedClassName: String = ""): Metadata.

dongjoon-hyun · 2024-07-24T15:06:43Z

For the record and the other reviewers, (2) is implemented and merged to Apache Spark 4.0.0.

[SPARK-48988][ML] Make DefaultParamsReader/Writer handle metadata with spark session #47467

init

d25f1ad

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

github-actions bot added ML PYTHON labels Jul 23, 2024

WeichenXu123 requested a review from HyukjinKwon July 23, 2024 00:39

update

215a204

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

HyukjinKwon approved these changes Jul 23, 2024

HyukjinKwon changed the title ~~[SPARK-48970] Avoid using SparkSession.getActiveSession in spark ML reader/writer~~ [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer Jul 23, 2024

zhengruifeng approved these changes Jul 23, 2024

WeichenXu123 added 2 commits July 23, 2024 12:51

format

ec7bec0

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

1f0381a

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 closed this in fba4c8c Jul 23, 2024

dongjoon-hyun reviewed Jul 23, 2024
