
[SPARK-48561][PS][CONNECT] Throw PandasNotImplementedError for unsupported plotting functions #46911

Closed

Conversation

zhengruifeng (Contributor)

What changes were proposed in this pull request?

Throw PandasNotImplementedError for unsupported plotting functions:

  • {Frame, Series}.plot.hist
  • {Frame, Series}.plot.kde
  • {Frame, Series}.plot.density
  • {Frame, Series}.plot(kind="hist", ...)
  • {Frame, Series}.plot(kind="kde", ...)
  • {Frame, Series}.plot(kind="density", ...)
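The pattern this PR applies (visible in the "after" traceback below) can be sketched in isolation. This is a minimal, self-contained version of the `unsupported_function` factory from `pyspark/pandas/missing/__init__.py` combined with the remote-session check; the error class and the `remote` flag are simplified stand-ins for `pyspark.errors.PandasNotImplementedError` and `pyspark.sql.utils.is_remote()`, not the actual PySpark implementation:

```python
# Sketch: under Spark Connect, plotting entry points delegate to a
# factory-made function that raises immediately with a clear message.

class PandasNotImplementedError(NotImplementedError):
    # Simplified stand-in for pyspark.errors.PandasNotImplementedError.
    def __init__(self, class_name, method_name, reason=""):
        self.class_name = class_name
        self.method_name = method_name
        msg = f"The method `{class_name}.{method_name}()` is not implemented yet."
        super().__init__(msg + (f" {reason}" if reason else ""))


def unsupported_function(class_name, method_name, reason=""):
    """Return a callable that raises PandasNotImplementedError when invoked."""
    def _unsupported(*args, **kwargs):
        raise PandasNotImplementedError(class_name, method_name, reason)
    return _unsupported


def hist(bins=10, remote=True, **kwds):
    # Stand-in for PandasOnSparkPlotAccessor.hist; `remote` mimics
    # is_remote(), which is True under a Spark Connect session.
    if remote:
        return unsupported_function(class_name="pd.DataFrame", method_name="hist")()
    return ("hist", bins)  # placeholder for the real plotting path
```

With `remote=True`, calling `hist()` raises the clear, catchable error instead of the deep `JVM_ATTRIBUTE_NOT_SUPPORTED` traceback shown below.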

Why are the changes needed?

The previous error message was confusing:

In [3]: psdf.plot.hist()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1017: PandasAPIOnSparkAdviceWarning: The config 'spark.sql.ansi.enabled' is set to True. This can cause unexpected behavior from pandas API on Spark since pandas API on Spark follows the behavior of pandas, not SQL.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
[*********************************************-----------------------------------] 57.14% Complete (0 Tasks running, 1s, Scanned
---------------------------------------------------------------------------
PySparkAttributeError                     Traceback (most recent call last)
Cell In[3], line 1
----> 1 psdf.plot.hist()

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:951, in PandasOnSparkPlotAccessor.hist(self, bins, **kwds)
    903 def hist(self, bins=10, **kwds):
    904     """
    905     Draw one histogram of the DataFrame’s columns.
    906     A `histogram`_ is a representation of the distribution of data.
   (...)
    949         >>> df.plot.hist(bins=12, alpha=0.5)  # doctest: +SKIP
    950     """
--> 951     return self(kind="hist", bins=bins, **kwds)

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:580, in PandasOnSparkPlotAccessor.__call__(self, kind, backend, **kwargs)
    577 kind = {"density": "kde"}.get(kind, kind)
    578 if hasattr(plot_backend, "plot_pandas_on_spark"):
    579     # use if there's pandas-on-Spark specific method.
--> 580     return plot_backend.plot_pandas_on_spark(plot_data, kind=kind, **kwargs)
    581 else:
    582     # fallback to use pandas'
    583     if not PandasOnSparkPlotAccessor.pandas_plot_data_map[kind]:

File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:41, in plot_pandas_on_spark(data, kind, **kwargs)
     39     return plot_pie(data, **kwargs)
     40 if kind == "hist":
---> 41     return plot_histogram(data, **kwargs)
     42 if kind == "box":
     43     return plot_box(data, **kwargs)

File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:87, in plot_histogram(data, **kwargs)
     85 psdf, bins = HistogramPlotBase.prepare_hist_data(data, bins)
     86 assert len(bins) > 2, "the number of buckets must be higher than 2."
---> 87 output_series = HistogramPlotBase.compute_hist(psdf, bins)
     88 prev = float("%.9f" % bins[0])  # to make it prettier, truncate.
     89 text_bins = []

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:189, in HistogramPlotBase.compute_hist(psdf, bins)
    183 for group_id, (colname, bucket_name) in enumerate(zip(colnames, bucket_names)):
    184     # creates a Bucketizer to get corresponding bin of each value
    185     bucketizer = Bucketizer(
    186         splits=bins, inputCol=colname, outputCol=bucket_name, handleInvalid="skip"
    187     )
--> 189     bucket_df = bucketizer.transform(sdf)
    191     if output_df is None:
    192         output_df = bucket_df.select(
    193             F.lit(group_id).alias("__group_id"), F.col(bucket_name).alias("__bucket")
    194         )

File ~/Dev/spark/python/pyspark/ml/base.py:260, in Transformer.transform(self, dataset, params)
    258         return self.copy(params)._transform(dataset)
    259     else:
--> 260         return self._transform(dataset)
    261 else:
    262     raise TypeError("Params must be a param map but got %s." % type(params))

File ~/Dev/spark/python/pyspark/ml/wrapper.py:412, in JavaTransformer._transform(self, dataset)
    409 assert self._java_obj is not None
    411 self._transfer_params_to_java()
--> 412 return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sparkSession)

File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:1696, in DataFrame.__getattr__(self, name)
   1694 def __getattr__(self, name: str) -> "Column":
   1695     if name in ["_jseq", "_jdf", "_jmap", "_jcols", "rdd", "toJSON"]:
-> 1696         raise PySparkAttributeError(
   1697             error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name}
   1698         )
   1700     if name not in self.columns:
   1701         raise PySparkAttributeError(
   1702             error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name}
   1703         )

PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.

after this PR:

In [3]: psdf.plot.hist()
---------------------------------------------------------------------------
PandasNotImplementedError                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 psdf.plot.hist()

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:957, in PandasOnSparkPlotAccessor.hist(self, bins, **kwds)
    909 """
    910 Draw one histogram of the DataFrame’s columns.
    911 A `histogram`_ is a representation of the distribution of data.
   (...)
    954     >>> df.plot.hist(bins=12, alpha=0.5)  # doctest: +SKIP
    955 """
    956 if is_remote():
--> 957     return unsupported_function(class_name="pd.DataFrame", method_name="hist")()
    959 return self(kind="hist", bins=bins, **kwds)

File ~/Dev/spark/python/pyspark/pandas/missing/__init__.py:23, in unsupported_function.<locals>.unsupported_function(*args, **kwargs)
     22 def unsupported_function(*args, **kwargs):
---> 23     raise PandasNotImplementedError(
     24         class_name=class_name, method_name=method_name, reason=reason
     25     )

PandasNotImplementedError: The method `pd.DataFrame.hist()` is not implemented yet.

Does this PR introduce any user-facing change?

Yes, the error message is improved.

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

No

@zhengruifeng (Contributor, Author)

cc @itholic and @HyukjinKwon

@HyukjinKwon (Member) left a comment:

Nice!

@@ -24,6 +24,10 @@
class SeriesPlotMatplotlibParityTests(
SeriesPlotMatplotlibTestsMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
):
@unittest.skip("Test depends on Spark ML which is not supported from Spark Connect.")

zhengruifeng (Contributor, Author):

It failed with "Empty 'DataFrame': no numeric data to plot" before; now it fails with PandasNotImplementedError.
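The parity-test change above can be illustrated with a self-contained sketch. The class and method names here are hypothetical placeholders; only the skip reason string comes from the diff:

```python
import unittest

class SeriesPlotParitySketch(unittest.TestCase):
    # Hypothetical test method: the diff above adds this decorator to the
    # Connect parity suite because histogram plotting goes through Spark ML,
    # which Spark Connect does not support.
    @unittest.skip("Test depends on Spark ML which is not supported from Spark Connect.")
    def test_hist_plot(self):
        self.fail("never runs: the whole test is skipped under Spark Connect")

# Running the suite records the test as skipped rather than failed.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SeriesPlotParitySketch)
result = unittest.TestResult()
suite.run(result)
```

Skipping at the decorator level means the suite stays green without ever invoking the Spark ML code path.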

@zhengruifeng deleted the ps_plotting_unsupported branch June 7, 2024 08:37
@zhengruifeng (Contributor, Author)

merged to master

@itholic (Contributor) commented Jun 9, 2024

Late LGTM. Thanks!
