[SPARK-49382][PS] Make frame box plot properly render the fliers/outliers #47866

Closed
wants to merge 6 commits into from

@zhengruifeng zhengruifeng commented Aug 26, 2024

What changes were proposed in this pull request?

Fliers/outliers were ignored in the initial implementation (#36317), so the frame box plot did not render them.
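For context, here is a minimal sketch of what a flier is, assuming the common Tukey convention: a point more than 1.5×IQR beyond the quartiles, which matches matplotlib's default `whis=1.5`. The helper `box_stats` is hypothetical and uses exact quantiles from the Python standard library; it is not the code in this PR, which runs on Spark.

```python
from statistics import quantiles


def box_stats(values, whis=1.5):
    """Whisker ends and fliers under the whis * IQR convention (illustrative only)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers extend to the most extreme points that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Everything outside the fences is drawn as an individual flier.
    fliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "median": median,
        "fliers": fliers,
    }


# The 'length' column from the example DataFrame in this PR.
length = [5.1, 4.9, 7.0, 6.4, 5.9, 100]
stats = box_stats(length)
print(stats["fliers"])
```

Under these assumptions the value 100 in the `length` column falls outside the upper fence, so it is the point a box plot should draw as a flier.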

Why are the changes needed?

Feature parity with pandas and with the Series box plot, both of which render fliers.

Does this PR introduce any user-facing change?

import pyspark.pandas as ps
df = ps.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1], [6.4, 3.2, 1], [5.9, 3.0, 2], [100, 200, 300]], columns=['length', 'width', 'species'])
df.boxplot()

df.length.plot.box()
image

before:
df.boxplot()
image

after:
df.boxplot()
image

How was this patch tested?

CI and manual checks.

Was this patch authored or co-authored using generative AI tooling?

No

@zhengruifeng zhengruifeng changed the title [SPARK-49382][PS] Make frame box plot properly render the fliers/outlier [SPARK-49382][PS] Make frame box plot properly render the fliers/outliers Aug 26, 2024
for i, colname in enumerate(colnames):
    formated_colname = "`{}`".format(colname)
    outlier_colname = "__{}_outlier".format(colname)
    min_val = multicol_whiskers[colname]["min"]

It feels odd to select the outliers by the distance |value - lower_whisker|, which is what the Series box plot does.

It should be something like |value - median| or |value - mean|. I will revisit this later.
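To illustrate the concern, here is a minimal sketch. It assumes that only a capped number of fliers is kept and that they are ranked by distance from some reference value. The helper `top_fliers`, the cap of 1, and the sample numbers are all hypothetical and are not taken from the PR.

```python
def top_fliers(fliers, reference, limit):
    """Keep the `limit` fliers farthest from `reference` (hypothetical helper)."""
    return sorted(fliers, key=lambda v: abs(v - reference), reverse=True)[:limit]


# Made-up fliers on both sides of the box, with made-up box statistics.
fliers = [-10.0, 20.0]
lower_whisker = 4.9
median = 6.15

# Ranking by |value - lower_whisker| penalizes low-side fliers,
# because they start out closer to the lower whisker.
by_whisker = top_fliers(fliers, lower_whisker, limit=1)

# Ranking by |value - median| treats both sides symmetrically.
by_median = top_fliers(fliers, median, limit=1)

print(by_whisker, by_median)
```

In this sketch the low-side flier is farther from the median than the high-side one, yet ranking by distance from the lower whisker drops it first. That asymmetry is the motivation for a centered reference such as the median or mean.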

@zhengruifeng zhengruifeng deleted the plot_hist_fly branch August 26, 2024 05:13
@zhengruifeng

merged to master

IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
…iers

### What changes were proposed in this pull request?
Fliers/outliers were ignored in the initial implementation apache#36317

### Why are the changes needed?
Feature parity with pandas and with the Series box plot

### Does this PR introduce _any_ user-facing change?

```python
import pyspark.pandas as ps
df = ps.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1], [6.4, 3.2, 1], [5.9, 3.0, 2], [100, 200, 300]], columns=['length', 'width', 'species'])
df.boxplot()
```

`df.length.plot.box()`
![image](https://github.com/user-attachments/assets/43da563c-5f68-4305-ad27-a4f04815dfd1)

before:
`df.boxplot()`
![image](https://github.com/user-attachments/assets/e25c2760-c12a-4801-a730-3987a020f889)

after:
`df.boxplot()`
![image](https://github.com/user-attachments/assets/c19f13b1-b9e4-423e-bcec-0c47c1c8df32)

### How was this patch tested?
CI and manual checks

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47866 from zhengruifeng/plot_hist_fly.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024