[SPARK-49530][PYTHON][CONNECT] Support kde/density plots #48492

Open: wants to merge 4 commits into base: master

xinrong-meng (Member) commented Oct 16, 2024

What changes were proposed in this pull request?

Support kde/density plots with the plotly backend on both Spark Connect and Spark Classic.

Why are the changes needed?

While Pandas on Spark supports plotting, PySpark currently lacks this feature. The proposed API will enable users to generate visualizations. This will provide users with an intuitive, interactive way to explore and understand large datasets directly from PySpark DataFrames, streamlining the data analysis workflow in distributed environments.

See the PySpark Plotting API Specification (in progress) for more details.

Part of https://issues.apache.org/jira/browse/SPARK-49530.

Does this PR introduce any user-facing change?

Yes. kde/density plots are supported as shown below.

>>> data = [
...     (1.0, 4.0),
...     (2.0, 4.0),
...     (2.5, 4.5),
...     (3.0, 5.0),
...     (3.5, 5.5),
...     (4.0, 6.0),
...     (5.0, 6.0)
... ]
>>> columns = ["x", "y"]
>>> df = spark.createDataFrame(data, columns)
>>> fig1 = df.plot.kde(column=["x", "y"], bw_method=0.3, ind=100)
>>> fig1.show()  # see below
>>> fig2 = df.plot(kind="kde", column="x", bw_method=0.3, ind=20)
>>> fig2.show()  # see below

fig1: [image: newplot (23)]

fig2: [image: newplot (22)]
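For readers unfamiliar with the `bw_method` and `ind` arguments, the following is a minimal, pure-Python sketch of a one-dimensional Gaussian KDE. It is not the code in this PR. It assumes that a scalar `bw_method` scales the sample standard deviation (the convention used by SciPy's `gaussian_kde`) and that an integer `ind` means that many evenly spaced evaluation points over the data range extended by half its width on each side (the convention pandas uses). The function name `gaussian_kde_sketch` is hypothetical, and the actual PySpark implementation may choose the grid and bandwidth differently.

```python
import math
import statistics


def gaussian_kde_sketch(values, bw_method, ind):
    """Evaluate a 1-D Gaussian KDE on `ind` evenly spaced points (ind >= 2).

    Assumptions (not taken from the PR): the bandwidth is
    bw_method * sample standard deviation, and the grid spans
    [min - 0.5 * range, max + 0.5 * range].
    """
    n = len(values)
    bandwidth = bw_method * statistics.stdev(values)

    lo, hi = min(values), max(values)
    span = hi - lo
    start = lo - 0.5 * span
    step = 2.0 * span / (ind - 1)
    grid = [start + i * step for i in range(ind)]

    # Average of Gaussian kernels centred on each observation.
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    density = [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values) / norm
        for g in grid
    ]
    return grid, density


# Column "x" from the example above, with the same arguments as fig2.
x = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
grid, density = gaussian_kde_sketch(x, bw_method=0.3, ind=20)
```

Under these assumptions, a smaller `bw_method` gives a narrower kernel and a bumpier curve, while a larger `ind` only increases the number of points at which the curve is sampled.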

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@@ -388,6 +396,127 @@ def box(
        """
        return self(kind="box", column=column, precision=precision, **kwargs)

    def kde(
        self,
        column: Union[str, List[str]],
xinrong-meng (Member, Author) commented:

Created https://issues.apache.org/jira/browse/SPARK-49999 as a follow-up on optional "column" parameter support in box, kde and hist plots.

xinrong-meng (Member, Author) commented:

Created https://issues.apache.org/jira/browse/SPARK-50000 for an optional "bw_method" in both Pandas on Spark and PySpark.

@xinrong-meng changed the title from "[WIP][SPARK-49530][PYTHON][CONNECT] Support kde/density plots" to "[SPARK-49530][PYTHON][CONNECT] Support kde/density plots" on Oct 17, 2024
@xinrong-meng marked this pull request as ready for review on October 17, 2024 03:21
xinrong-meng (Member, Author) commented:

@zhengruifeng @HyukjinKwon may I get a review please?
