Commit

[SPARK-49263][CONNECT] Spark Connect python client: Consistently handle boolean Dataframe reader options

### What changes were proposed in this pull request?

Using `spark.read.option("Foo", True)` resulted in an uppercase `'True'` string in the Python Spark Connect client, while in all other cases (Scala with and without Spark Connect, and PySpark without Spark Connect) it is normalized to `'true'`. This happened because the option value was converted with `str`, where the `to_str` helper should be used instead.
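For illustration, here is a minimal sketch of the kind of normalization involved; `to_str_sketch` below is a hypothetical stand-in for Spark's `to_str` helper, not its actual implementation:

```python
def to_str_sketch(value):
    # Hypothetical stand-in for Spark's `to_str` helper (not the real code):
    # booleans become lowercase strings, and None stays None instead of
    # becoming the string "None".
    if value is None:
        return None
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)


assert str(True) == "True"            # what the Connect client used to store
assert to_str_sketch(True) == "true"  # what the other clients store
assert to_str_sketch(None) is None    # None survives, so it can be dropped later
```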

### Why are the changes needed?

The Python Spark Connect client is currently inconsistent with the other clients. Passing `"True"` as a boolean option also appears to break the Delta CDF reader (to be fixed separately, so that it handles the literal case-insensitively).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A unit test was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47790 from juliuszsompolski/SPARK-49263.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Julek Sompolski <Juliusz Sompolski>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon and HyukjinKwon committed Aug 19, 2024
1 parent 542b24a commit a6a62e5
Showing 3 changed files with 8 additions and 3 deletions.
8 changes: 6 additions & 2 deletions python/pyspark/sql/connect/plan.py
```diff
@@ -281,9 +281,13 @@ def __init__(
         assert schema is None or isinstance(schema, str)

         if options is not None:
+            new_options = {}
             for k, v in options.items():
-                assert isinstance(k, str)
-                assert isinstance(v, str)
+                if v is not None:
+                    assert isinstance(k, str)
+                    assert isinstance(v, str)
+                    new_options[k] = v
+            options = new_options

         if paths is not None:
             assert isinstance(paths, list)
```
2 changes: 1 addition & 1 deletion python/pyspark/sql/connect/readwriter.py
```diff
@@ -94,7 +94,7 @@ def schema(self, schema: Union[StructType, str]) -> "DataFrameReader":
     schema.__doc__ = PySparkDataFrameReader.schema.__doc__

     def option(self, key: str, value: "OptionalPrimitiveType") -> "DataFrameReader":
-        self._options[key] = str(value)
+        self._options[key] = cast(str, to_str(value))
         return self

     option.__doc__ = PySparkDataFrameReader.option.__doc__
```
1 change: 1 addition & 0 deletions python/pyspark/sql/tests/test_datasources.py
```diff
@@ -212,6 +212,7 @@ def test_checking_csv_header(self):
         )
         df = (
             self.spark.read.option("header", "true")
+            .option("quote", None)
             .schema(schema)
             .csv(path, enforceSchema=False)
         )
```
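To tie the pieces together, here is a minimal self-contained sketch (assumed values and variable names, not Spark code) of the behavior the new `.option("quote", None)` line exercises: `to_str` keeps `None` as `None`, and the patched plan.py logic then drops the entry instead of failing the `isinstance(v, str)` assertion.

```python
# Illustrative only: mirrors the filtering the patched plan.py applies to the
# options dict collected by the reader.
options = {"header": "true", "quote": None}  # as if set via .option("quote", None)

new_options = {}
for k, v in options.items():
    if v is not None:  # None-valued options are dropped rather than asserted on
        assert isinstance(k, str)
        assert isinstance(v, str)
        new_options[k] = v

assert new_options == {"header": "true"}
```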
