[SPARK-46663][PYTHON] Disable memory profiler for pandas UDFs with iterators #44668

xinrong-meng · 2024-01-10T19:15:38Z

What changes were proposed in this pull request?

When a pandas UDF takes or returns iterators and the user enables a profiling Spark conf, PySpark should raise a warning that profiling is not supported and disable profiling for that UDF.

Currently, however, the memory profiler remains enabled after the not-supported warning is raised.

This PR fixes that.
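
The intended control flow can be sketched in plain Python. This is a minimal illustration, not the actual code in `pyspark/sql/udf.py`: the function name `choose_profiler`, the dict-based `conf`, and the string labels for iterator UDF kinds are assumptions made for the example. The two conf keys are the ones PySpark uses for its performance and memory profilers.

```python
import warnings
from typing import Mapping, Optional

# Hypothetical labels for UDF kinds whose input/output are iterators.
ITERATOR_UDF_KINDS = frozenset(
    {"scalar_pandas_iter", "map_pandas_iter", "map_arrow_iter"}
)


def choose_profiler(conf: Mapping[str, str], udf_kind: str) -> Optional[str]:
    """Return "perf", "memory", or None (no profiling) for a UDF."""
    perf_enabled = conf.get("spark.python.profile", "false") == "true"
    memory_enabled = conf.get("spark.python.profile.memory", "false") == "true"

    if not (perf_enabled or memory_enabled):
        return None

    if udf_kind in ITERATOR_UDF_KINDS:
        # Warn, then select no profiler at all. The bug was that the
        # memory profiler stayed enabled after this warning.
        warnings.warn(
            "Profiling UDFs with iterators input/output is not supported.",
            UserWarning,
        )
        return None

    return "perf" if perf_enabled else "memory"
```

The point of the sketch is that the iterator check comes before either profiler is selected, so a single warning path covers both confs.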

Why are the changes needed?

A bug fix to eliminate misleading behavior.

Does this PR introduce any user-facing change?

The change is noticeable only to users of the PySpark shell. There, the memory profiler previously raised an error that blocked execution of the UDF; with this fix, the UDF runs after the warning is raised.
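
One plausible reason the failure is specific to the shell, assuming the memory profiler needs to read the UDF's source text: functions defined interactively have no source file on disk, and Python's `inspect` module raises an `OSError` when source cannot be retrieved. The sketch below imitates that situation with the standard library only. The helper `source_available` and the pseudo-filename `<interactive-sketch>` are made up for this illustration and are not part of PySpark.

```python
import inspect


def source_available(func) -> bool:
    """Return True if inspect can locate the source text of func."""
    try:
        inspect.getsource(func)
        return True
    except (OSError, TypeError):
        return False


# Compile a function from a string under a pseudo-filename. This is roughly
# the situation of a function typed into a shell: no file backs its code.
namespace = {"__name__": "__sketch__"}
code = compile(
    "def plus_one(x):\n    return x + 1\n", "<interactive-sketch>", "exec"
)
exec(code, namespace)

print(source_available(namespace["plus_one"]))
```

A UDF defined in a regular `.py` file does not hit this limitation, which is consistent with the change being visible only in the shell.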

How was this patch tested?

Manual test.

Was this patch authored or co-authored using generative AI tooling?

Setup:

$ ./bin/pyspark --conf spark.python.profile.memory=true

>>> from typing import Iterator
>>> from pyspark.sql.functions import pandas_udf
>>> import pandas as pd
>>> @pandas_udf("long")
... def plus_one(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
...     for s in iterator:
...         yield s + 1
... 
>>> df = spark.createDataFrame(pd.DataFrame([1, 2, 3], columns=["v"]))

Before:

>>> df.select(plus_one(df.v)).show()
UserWarning: Profiling UDFs with iterators input/output is not supported.
Traceback (most recent call last):
...
OSError: could not get source code

After:

>>> df.select(plus_one(df.v)).show()
/Users/xinrong.meng/spark/python/pyspark/sql/udf.py:417: UserWarning: Profiling UDFs with iterators input/output is not supported.
+-----------+                                                                   
|plus_one(v)|
+-----------+
|          2|
|          3|
|          4|
+-----------+

xinrong-meng · 2024-01-11T19:27:35Z

@ueshin @HyukjinKwon @zhengruifeng may I get a review please?

ueshin

Could you add a test for this?
Otherwise, LGTM.

xinrong-meng · 2024-01-16T19:22:47Z

Thanks all! Merged to master; will do a manual cherry-pick to branch-3.5.

fix

55a6cc4

github-actions bot added the SQL and PYTHON labels Jan 10, 2024

xinrong-meng added 3 commits January 10, 2024 13:54

fix

d412cf7

comment

2012eb4

rmv

ecc38ac

xinrong-meng changed the title ~~Disable memory profiler for iterator UDFs~~ [SPARK-46663][PYTHON] Disable memory profiler for pandas UDFs with iterators Jan 10, 2024

rmv import

7ec983f

xinrong-meng marked this pull request as ready for review January 11, 2024 19:26

xinrong-meng requested a review from ueshin January 11, 2024 19:27

ueshin reviewed Jan 12, 2024

test

a0ea1d4

HyukjinKwon approved these changes Jan 12, 2024

xinrong-meng closed this in 48152b1 Jan 16, 2024
