
[SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs #20531

Closed
HyukjinKwon wants to merge 5 commits into master from HyukjinKwon:pudf-cleanup

Conversation

@HyukjinKwon (Member) commented Feb 7, 2018

What changes were proposed in this pull request?

This PR explicitly specifies the supported types in Pandas UDFs.
The main change is to add deduplicated, explicit type checking on `returnType` up front, together with documentation for it; along the way, it also fixes several other things.

1. Currently, we don't support `BinaryType` in Pandas UDFs; for example:

    ```python
    from pyspark.sql.functions import pandas_udf
    pudf = pandas_udf(lambda x: x, "binary")
    df = spark.createDataFrame([[bytearray(1)]])
    df.select(pudf("_1")).show()
    ```
    ```
    ...
    TypeError: Unsupported type in conversion to Arrow: BinaryType
    ```

    We should document this behaviour in the usage guide.

2. Also, grouped aggregate Pandas UDFs fail fast on `ArrayType`, but it seems we can support this case:

    ```python
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
    df = spark.range(100).selectExpr("id", "array(id) as value")
    df.groupBy("id").agg(foo("value")).show()
    ```
    ```
    ...
    NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
    ```
3. Since we can check the return type ahead of time, we can fail fast before actual execution:

    ```python
    # we can fail fast at this stage because we know the schema ahead
    pandas_udf(lambda x: x, BinaryType())
    ```
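For reference, a sketch of the error this now produces at definition time (message format taken from the checks added in this PR):

```python
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BinaryType

pandas_udf(lambda x: x, BinaryType())
# NotImplementedError: Invalid returnType with a scalar Pandas UDF:
# BinaryType is not supported
```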

How was this patch tested?

Manually tested, and unit tests for `BinaryType` and `ArrayType(...)` were added.

@HyukjinKwon (Member, Author) commented Feb 7, 2018

@ueshin, @BryanCutler and @icexelloss, mind taking a look when you are available?

```diff
@@ -116,7 +116,7 @@ def wrap_grouped_agg_pandas_udf(f, return_type):
     def wrapped(*series):
         import pandas as pd
         result = f(*series)
-        return pd.Series(result)
+        return pd.Series([result])
```
@HyukjinKwon (Member, Author):

This change seems to be required:

```python
>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(np.array([1, 2, 3]))
0    1
1    2
2    3
dtype: int64
>>> pd.Series([np.array([1, 2, 3])])
0    [1, 2, 3]
dtype: object
>>> pd.Series(1)
0    1
dtype: int64
>>> pd.Series([1])
0    1
dtype: int64
```

@SparkQA commented Feb 7, 2018

Test build #87160 has finished for PR 20531 at commit ec708d5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 7, 2018

Test build #87163 has finished for PR 20531 at commit 7cbebaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p

 ### Supported SQL Types

-Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
```
Contributor:

I thought binary type is supported... I am curious, what's the reason that it doesn't work now?

@HyukjinKwon (Member, Author):

I was under the impression that we don't support this. Arrow doesn't seem to behave consistently with what Spark does; I think it's actually related to #20507.

I say this cautiously, but I believe the root cause is how `str` is handled in Python 2. Technically it is bytes, but it is named "string". As you might already know, due to this confusion, `unicode` became `str` and `str` became `bytes` in Python 3. Spark generally handles this as `StringType`, whereas Arrow seems to treat it as binary.

I think we shouldn't support this for now, until we get consistent behaviour.
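For reference, a minimal sketch (not from the PR) of the Python 2 vs Python 3 difference described above:

```python
import sys

s = "abc"
if sys.version_info[0] == 2:
    # Python 2: str is the byte type (bytes is an alias of str);
    # unicode is the text type.
    assert isinstance(s, bytes)
else:
    # Python 3: str is the text type; bytes is a separate byte type.
    assert isinstance(s, str) and not isinstance(s, bytes)
```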

Contributor:

I see. Thanks for the explanation.

Member:

I agree; we need to look into these details more before we can support this type.

Member:

Should `BinaryType` be added to the unsupported types for `arrow.enabled` in SQLConf.scala?

```diff
 def foo(x):
     return x
-self.assertEqual(foo.returnType, schema)
+self.assertEqual(foo.returnType, schema[0].dataType)
```
Contributor:

Should we just use `self.assertEqual(foo.returnType, DoubleType())`?

```diff
-from pyspark.sql.functions import pandas_udf, PandasUDFType
-df = self.data
+from pyspark.sql.functions import pandas_udf, PandasUDFType, array, col
+df = self.data.withColumn("arr", array(col("id")))
```
@icexelloss (Contributor) commented Feb 7, 2018:

minor: It seems a bit arbitrary to mix the array type into this test. Array probably belongs in a new test (if one doesn't exist yet): test_array, test_complex_types, or something like test_all_types.

```diff
 from pyspark.sql.functions import pandas_udf, PandasUDFType

 with QuietTest(self.sc):
     with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
-        @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
+        @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
```
Contributor:

Why is ArrayType(TimestampType()) a special case that is not supported? (I hadn't fully tested this when implementing this feature; is only an array of primitives supported?)

@HyukjinKwon (Member, Author):

It seems to be because we don't handle the timezone issue when it's nested. There are a few TODOs; the same comment appears in four places, for example:

```python
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
```

```python
    to_arrow_schema(self._returnType_placeholder)
except TypeError:
    raise NotImplementedError(
        "Invalid returnType with a grouped map Pandas UDF: "
```
Contributor:

nit: a grouped map Pandas UDF -> grouped map Pandas UDFs?


```python
result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
mean_arr_udf = pandas_udf(
    self.pandas_agg_mean_udf.func,
```
Contributor:

For arrays, can we add tests for:

  • Type coercion, e.g., the specified type is array<double> and the returned array is [0, 1, 2]?
  • Exception handling: the function returns an array of mixed types like [0, "hello"]?
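A sketch of the two suggested cases (hypothetical snippets; the discussion below concludes that type coercion does not yet work properly for arrays):

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Case 1 (coercion): declared array<double>, but the function returns ints.
int_arr_udf = pandas_udf(lambda v: [0, 1, 2], 'array<double>',
                         PandasUDFType.GROUPED_AGG)

# Case 2 (mismatch): declared array<double>, but the function returns mixed types.
mixed_arr_udf = pandas_udf(lambda v: [0, "hello"], 'array<double>',
                           PandasUDFType.GROUPED_AGG)
```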

Contributor:

Btw, this can be a follow-up, and I can do it too.

@HyukjinKwon (Member, Author) commented Feb 7, 2018:

If you meant type coercion (did I understand correctly?), I already tested this locally. It doesn't seem to work properly. A similar thing was discussed in #20163 (comment) (thanks @ueshin).

I will reread the comments when I am more awake tomorrow ...

Contributor:

I think certain type coercion is supported with Pandas UDFs, e.g., when the user specifies the type as double and the function returns a pd.Series of int, it will automatically be cast to a pd.Series of double. This behavior is different from a regular Python UDF, which would return null in this case. Most of the type coercion is done by pyarrow. (Btw, I think type coercion in Pandas UDFs is a huge improvement over Python UDFs, because that's one of the biggest frustrations our PySpark users have...)
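A minimal sketch of the coercion being described (assuming an active SparkSession named `spark`):

```python
from pyspark.sql.functions import pandas_udf

# The declared return type is double, but the function returns an int64 pd.Series;
# pyarrow casts it to double instead of producing nulls as a plain Python UDF would.
@pandas_udf('double')
def plus_one(v):
    return v + 1

spark.range(3).select(plus_one('id')).show()  # 1.0, 2.0, 3.0
```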

Contributor:

Btw, if type coercion is not working with the array type, I think it's still fine to allow the array type and fix type coercion separately.

@HyukjinKwon (Member, Author) commented Feb 10, 2018:

Hm ... let's handle the type coercion work separately in another issue. I think we need another iteration for it, and I'm actually not yet sure whether we should officially document and support type coercion, given our past discussion.

@icexelloss (Contributor):

@HyukjinKwon Looks good to me at a high level. I left some comments.

@HyukjinKwon (Member, Author):

Yup, let me try to address them tomorrow. Thanks for your review.

"Invalid returnType with a grouped map Pandas UDF: "
"%s is not supported" % str(self._returnType_placeholder))
else:
raise TypeError("Invalid returnType for a grouped map Pandas "
Contributor:

nit: a grouped map Pandas UDF -> grouped map Pandas UDFs?

```python
    to_arrow_type(self._returnType_placeholder)
except TypeError:
    raise NotImplementedError(
        "Invalid returnType with a scalar Pandas UDF: %s is "
```
Contributor:

ditto

```python
    to_arrow_type(self._returnType_placeholder)
except TypeError:
    raise NotImplementedError(
        "Invalid returnType with a grouped aggregate Pandas UDF: "
```
Contributor:

ditto

@SparkQA commented Feb 7, 2018

Test build #87164 has finished for PR 20531 at commit 68662ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```python
    raise NotImplementedError(
        "ArrayType, StructType and MapType are not supported with "
        "PandasUDFType.GROUPED_AGG")
if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
```
@ueshin (Member) commented Feb 8, 2018:

nit: I'd prefer to keep the checks in the order of the definitions in PythonEvalType, unless you have a particular reason not to.

E.g.,

```python
if self.evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF:
    ...
elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
    ...
elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
    ...
```

```diff
 from pyspark.sql.functions import pandas_udf, PandasUDFType

 with QuietTest(self.sc):
     with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
-        @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
+        @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
         def mean_and_std_udf(v):
```
Member:

nit: should we rename this?

@HyukjinKwon (Member, Author):

@ueshin and @icexelloss, thanks for your review. I've tried my best to address the comments.

@SparkQA commented Feb 8, 2018

Test build #87214 has finished for PR 20531 at commit 36617e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a review comment:

LGTM. I just mentioned that we might want to include BinaryType as unsupported in the SQLConf doc. Thanks for doing some cleanup too!

```diff
@@ -1676,7 +1676,7 @@ Using the above optimizations with Arrow will produce the same results as when A
 enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
 DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
 data types are currently supported and an error can be raised if a column has an unsupported type,
-see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
+see [Supported SQL Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
```
Member:

Nice catch!


@icexelloss (Contributor) commented Feb 9, 2018

@HyukjinKwon LGTM! My only remaining comment is #20531 (comment), but we can have a separate PR for testing type coercion with the array type.

@SparkQA commented Feb 10, 2018

Test build #87280 has finished for PR 20531 at commit 07f2d78.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

retest this please

@SparkQA commented Feb 10, 2018

Test build #87281 has finished for PR 20531 at commit 07f2d78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

@ueshin, does this look fine to you too?

@ueshin (Member) commented Feb 12, 2018

@HyukjinKwon Yes, LGTM.

@HyukjinKwon (Member, Author):

retest this please

@SparkQA commented Feb 12, 2018

Test build #87325 has finished for PR 20531 at commit 07f2d78.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

retest this please

@SparkQA commented Feb 12, 2018

Test build #87328 has finished for PR 20531 at commit 07f2d78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 12, 2018

Test build #87329 has finished for PR 20531 at commit 07f2d78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

Merged to master.

@asfgit closed this in c338c8c on Feb 12, 2018
@HyukjinKwon (Member, Author):

Thank you for reviewing this, @icexelloss, @ueshin and @BryanCutler.

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Feb 12, 2018
(The commit message repeats the PR description above.)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20531 from HyukjinKwon/pudf-cleanup.

(cherry picked from commit c338c8c)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
```diff
@@ -1638,6 +1638,8 @@ def to_arrow_type(dt):
         # Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read
         arrow_type = pa.timestamp('us', tz='UTC')
     elif type(dt) == ArrayType:
+        if type(dt.elementType) == TimestampType:
+            raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
```
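A sketch of the check this diff adds, assuming `to_arrow_type` is importable from `pyspark.sql.types` as in this file:

```python
from pyspark.sql.types import ArrayType, TimestampType, to_arrow_type

# Nested timestamps are now rejected up front:
# TypeError: Unsupported type in conversion to Arrow: ArrayType(TimestampType,true)
to_arrow_type(ArrayType(TimestampType()))
```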
Member:

What is the behavior before this PR?

@HyukjinKwon (Member, Author):

I think it's the timestamp localisation issue. See #20531 (comment).

asfgit pushed a commit that referenced this pull request Feb 13, 2018
…in Pandas UDFs

## What changes were proposed in this pull request?

This PR backports #20531:

(The remainder of the backport description repeats the PR description above; it was manually tested, and unit tests for `BinaryType` and `ArrayType(...)` were added.)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20588 from HyukjinKwon/PR_TOOL_PICK_PR_20531_BRANCH-2.3.
@HyukjinKwon deleted the pudf-cleanup branch on October 16, 2018