Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] Pickle vs. Arrow Type Coercion Difference #41706

Closed
wants to merge 2 commits into from

Conversation

xinrong-meng
Copy link
Member

@xinrong-meng xinrong-meng commented Jun 22, 2023

As a temporary supporting material for #41503.

xinrong-meng added a commit that referenced this pull request Jun 29, 2023
…return type in Arrow Python UDF

### What changes were proposed in this pull request?
Explicit Arrow casting for the mismatched return type of Arrow Python UDF.

### Why are the changes needed?
A more standardized and coherent type coercion.

Please refer to #41706 for a comprehensive comparison between type coercion rules of Arrow and Pickle(used by the default Python UDF) separately.

See more at [[Design] Type-coercion in Arrow Python UDFs](https://docs.google.com/document/d/e/2PACX-1vTEGElOZfhl9NfgbBw4CTrlm-8F_xQCAKNOXouz-7mg5vYobS7lCGUsGkDZxPY0wV5YkgoZmkYlxccU/pub).

### Does this PR introduce _any_ user-facing change?
Yes.

FROM
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
df.select(pandas_udf(lambda x: x, 'int')('value')).show()
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
...
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int32
```

TO
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
+---------------+
|<lambda>(value)|
+---------------+
|              1|
|              2|
+---------------+
```
### How was this patch tested?
Unit tests.

Closes #41503 from xinrong-meng/type_coersion.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2023
…return type in Arrow Python UDF

### What changes were proposed in this pull request?
Explicit Arrow casting for the mismatched return type of Arrow Python UDF.

### Why are the changes needed?
A more standardized and coherent type coercion.

Please refer to #41706 for a comprehensive comparison between type coercion rules of Arrow and Pickle(used by the default Python UDF) separately.

See more at [[Design] Type-coercion in Arrow Python UDFs](https://docs.google.com/document/d/e/2PACX-1vTEGElOZfhl9NfgbBw4CTrlm-8F_xQCAKNOXouz-7mg5vYobS7lCGUsGkDZxPY0wV5YkgoZmkYlxccU/pub).

### Does this PR introduce _any_ user-facing change?
Yes.

FROM
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
df.select(pandas_udf(lambda x: x, 'int')('value')).show()
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
...
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int32
```

TO
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
+---------------+
|<lambda>(value)|
+---------------+
|              1|
|              2|
+---------------+
```
### How was this patch tested?
Unit tests.

Closes #41800 from xinrong-meng/snd_type_coersion.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant