Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify supported types in Pandas UDFs #20588

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Feb 12, 2018

What changes were proposed in this pull request?

This PR backports #20531:

It explicitly specifies supported types in Pandas UDFs.
The main change here is to add a deduplicated and explicit type checking in returnType ahead with documenting this; however, it happened to fix multiple things.

  1. Currently, we don't support BinaryType in Pandas UDFs, for example, see:

    from pyspark.sql.functions import pandas_udf
    pudf = pandas_udf(lambda x: x, "binary")
    df = spark.createDataFrame([[bytearray(1)]])
    df.select(pudf("_1")).show()
    ...
    TypeError: Unsupported type in conversion to Arrow: BinaryType
    

    We can document this behaviour for its guide.

  2. Since we can check the return type ahead, we can fail fast before actual execution.

    # we can fail fast at this stage because we know the schema ahead
    pandas_udf(lambda x: x, BinaryType())

How was this patch tested?

Manually tested and unit tests for BinaryType and ArrayType(...) were added.

This PR targets to explicitly specify supported types in Pandas UDFs.
The main change here is to add a deduplicated and explicit type checking in `returnType` ahead with documenting this; however, it happened to fix multiple things.

1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see:

    ```python
    from pyspark.sql.functions import pandas_udf
    pudf = pandas_udf(lambda x: x, "binary")
    df = spark.createDataFrame([[bytearray(1)]])
    df.select(pudf("_1")).show()
    ```
    ```
    ...
    TypeError: Unsupported type in conversion to Arrow: BinaryType
    ```

    We can document this behaviour for its guide.

2. Also, the grouped aggregate Pandas UDF fails fast on `ArrayType` but seems we can support this case.

    ```python
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
    df = spark.range(100).selectExpr("id", "array(id) as value")
    df.groupBy("id").agg(foo("value")).show()
    ```

    ```
    ...
     NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
    ```

3. Since we can check the return type ahead, we can fail fast before actual execution.

    ```python
    # we can fail fast at this stage because we know the schema ahead
    pandas_udf(lambda x: x, BinaryType())
    ```

Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20531 from HyukjinKwon/pudf-cleanup.

(cherry picked from commit c338c8c)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
@HyukjinKwon
Copy link
Member Author

cc @ueshin

@ueshin
Copy link
Member

ueshin commented Feb 12, 2018

LGTM.

@SparkQA
Copy link

SparkQA commented Feb 12, 2018

Test build #87333 has finished for PR 20588 at commit 44fe840.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def test_simple(self):
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = self.data
def test_supported_types(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I start to worry about the test coverage of vectorized udfs and arrow-based to/from pandas df. Do we have any plan in PySpark to test all the data types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, agree. Yup, I was thinking of doing it. But if you (or your colleagues) are working on that or have a plan, no need to block it by me :). please go ahead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe open a JIRA and ask the OSS community to do it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Filed SPARK-23401.

@gatorsmile
Copy link
Member

@HyukjinKwon Could you update the PR description? This will be part of the commit. Thus, it would be nice to document the exact changes made in this PR.

@HyukjinKwon
Copy link
Member Author

Yup, will update soon.

@gatorsmile
Copy link
Member

This PR contains multiple fixes. This is not good especially for the ones targeting to 2.3.0. We should split it to multiple independent PRs if possible.

cc @ueshin

Thanks! Merged to 2.3.

asfgit pushed a commit that referenced this pull request Feb 13, 2018
…in Pandas UDFs

## What changes were proposed in this pull request?

This PR backports #20531:

It explicitly specifies supported types in Pandas UDFs.
The main change here is to add a deduplicated and explicit type checking in `returnType` ahead with documenting this; however, it happened to fix multiple things.

1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see:

    ```python
    from pyspark.sql.functions import pandas_udf
    pudf = pandas_udf(lambda x: x, "binary")
    df = spark.createDataFrame([[bytearray(1)]])
    df.select(pudf("_1")).show()
    ```
    ```
    ...
    TypeError: Unsupported type in conversion to Arrow: BinaryType
    ```

    We can document this behaviour for its guide.

2. Since we can check the return type ahead, we can fail fast before actual execution.

    ```python
    # we can fail fast at this stage because we know the schema ahead
    pandas_udf(lambda x: x, BinaryType())
    ```

## How was this patch tested?

Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20588 from HyukjinKwon/PR_TOOL_PICK_PR_20531_BRANCH-2.3.
@gatorsmile
Copy link
Member

Could you close it?

@HyukjinKwon
Copy link
Member Author

This PR contained one targeted change that fixes multiple problems, to be more clear.

@HyukjinKwon HyukjinKwon deleted the PR_TOOL_PICK_PR_20531_BRANCH-2.3 branch October 16, 2018 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants