feat(python): unify Series/DataFrame describe code #13720

Merged 5 commits on Jan 24, 2024

Conversation

alexander-beedie

@alexander-beedie alexander-beedie commented Jan 14, 2024

This PR rationalises the Series/DataFrame describe code by deferring Series describe to the more comprehensive DataFrame method. As a result, Series now produces a couple of additional statistics for some dtypes, and the DataFrame describe method now produces median values for temporal columns, which was previously a Series-only result for some reason...

Series improvements:

  • deletes ~40 lines of redundant code and guarantees that each method will now return the same statistics.
  • no longer fails for nested dtypes (as we can still produce meaningful count/null_count stats).
  • now returns min/max results for string dtype values.

DataFrame improvements:

  • temporal results now return the median statistic.

Note: the only minor casualty of unification is that DataFrame no longer returns a min/max of False/True for Boolean columns, but Series already didn't do this so we were inconsistent and... it's a boolean column, there are only two possible values! This isn't really a statistic, it's a fundamental property of the type. I'd say we follow Series on this one.

Also: while min/max may not be very useful for Boolean, @JulianCologne made a good case for supporting mean as it gives an indication of the average "truthiness" of a given column (eg: what percentage of non-null values are True), so have added that support as a trivial drive-by: closes #13735.
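To make the "truthiness" reading concrete, here is a minimal pure-Python sketch of what a mean over a Boolean column with nulls represents. It is not the Polars implementation; the helper name bool_mean and the use of None to mark nulls are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def bool_mean(values: Sequence[Optional[bool]]) -> Optional[float]:
    """Fraction of non-null values that are True (None marks a null)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # no data: the mean is undefined
    # True counts as 1 and False as 0, so the mean is the share of True.
    return sum(non_null) / len(non_null)


column = [True, False, None, True, True]
print(bool_mean(column))  # 0.75 (3 of the 4 non-null values are True)
```

The null is excluded from the denominator, which is what makes the result read as "percentage of non-null values that are True".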


Update: (2024-01-18)

  • Additional temporal metrics (mean, and all percentile results, not just median).
  • Refactored so that all metrics are determined in a single pass over the frame schema.
  • Exposed optional interpolation parameter for percentile calculation.
  • Reinstated bool min/max (shown as 0.0/1.0 now that we're also returning bool mean).
  • Output name (of the Series or DataFrame column) harmonised as statistic.

Example

from datetime import date, time
import polars as pl

df = pl.DataFrame({
    "float": [1.0, 2.8, 3.0],
    "int": [40, 50, None],
    "str": ["zz", "xx", "yy"],
    "date": [date(2020, 1, 1), date(2021, 7, 5), date(2022, 12, 31)],
    "time": [time(10, 20, 30), time(14, 45, 50), time(23, 15, 10)],
})

df.describe(
    percentiles=[0.1, 0.3, 0.5, 0.7, 0.9],
    interpolation="linear",
)
# shape: (11, 6)
# ┌────────────┬──────────┬──────────┬──────┬────────────┬──────────┐
# │ statistic  ┆ float    ┆ int      ┆ str  ┆ date       ┆ time     │
# │ ---        ┆ ---      ┆ ---      ┆ ---  ┆ ---        ┆ ---      │
# │ str        ┆ f64      ┆ f64      ┆ str  ┆ str        ┆ str      │
# ╞════════════╪══════════╪══════════╪══════╪════════════╪══════════╡
# │ count      ┆ 3.0      ┆ 2.0      ┆ 3    ┆ 3          ┆ 3        │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0    ┆ 0          ┆ 0        │
# │ mean       ┆ 2.266667 ┆ 45.0     ┆ null ┆ 2021-07-02 ┆ 16:07:10 │
# │ std        ┆ 1.101514 ┆ 7.071068 ┆ null ┆ null       ┆ null     │
# │ min        ┆ 1.0      ┆ 40.0     ┆ xx   ┆ 2020-01-01 ┆ 10:20:30 │
# │ 10%        ┆ 1.36     ┆ 41.0     ┆ null ┆ 2020-04-20 ┆ 11:13:34 │
# │ 30%        ┆ 2.08     ┆ 43.0     ┆ null ┆ 2020-11-26 ┆ 12:59:42 │
# │ 50%        ┆ 2.8      ┆ 45.0     ┆ null ┆ 2021-07-05 ┆ 14:45:50 │
# │ 70%        ┆ 2.88     ┆ 47.0     ┆ null ┆ 2022-02-07 ┆ 18:09:34 │
# │ 90%        ┆ 2.96     ┆ 49.0     ┆ null ┆ 2022-09-13 ┆ 21:33:18 │
# │ max        ┆ 3.0      ┆ 50.0     ┆ zz   ┆ 2022-12-31 ┆ 23:15:10 │
# └────────────┴──────────┴──────────┴──────┴────────────┴──────────┘
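For readers checking the percentile rows by hand, the following is a small pure-Python sketch of linear interpolation between neighbouring sorted values. It is an illustration of the technique, not the Polars implementation; the function name percentile_linear is hypothetical. Rounded to two decimals, it reproduces the percentile rows of the float and int columns in the table above.

```python
import math
from typing import Sequence


def percentile_linear(values: Sequence[float], q: float) -> float:
    """Percentile q (between 0 and 1) using linear interpolation."""
    data = sorted(values)
    if not data:
        raise ValueError("need at least one non-null value")
    # Fractional position of the percentile within the sorted data.
    pos = q * (len(data) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(data) - 1)
    # Blend the two neighbouring values by the fractional part.
    return data[lo] + (pos - lo) * (data[hi] - data[lo])


floats = [1.0, 2.8, 3.0]
ints = [40, 50]  # the null in the "int" column is skipped

for q in (0.1, 0.3, 0.5, 0.7, 0.9):
    f = round(percentile_linear(floats, q), 2)
    i = round(percentile_linear(ints, q), 2)
    print(f"{q:.0%}", f, i)
```

For example, the 10% value of the float column sits at position 0.1 * (3 - 1) = 0.2 between 1.0 and 2.8, giving 1.0 + 0.2 * 1.8 = 1.36.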

@github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels on Jan 14, 2024

@stinodego stinodego left a comment

Nice improvements here. A few small comments!


@stinodego stinodego left a comment

Looks good. My only reservation is that I don't think we should expose an interpolation parameter.

@ritchie46 ritchie46 merged commit e0964a5 into pola-rs:main Jan 24, 2024
14 checks passed
@alexander-beedie alexander-beedie deleted the improve-describe branch January 24, 2024 07:02
r-brink pushed a commit to r-brink/polars that referenced this pull request Jan 24, 2024
@0x26res
0x26res commented May 9, 2024

Hey, I just wanted to point out that this change caused the following error for me: ColumnNotFoundError: describe. The reason is that the "describe" column was renamed to "statistic".

I'm just pointing it out in case someone gets the same error. Thanks.
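Code that has to work on both sides of this rename could look the label column up by name instead of hard-coding it. The sketch below is one possible approach under stated assumptions: the helper describe_label_column is hypothetical (not part of Polars), and it only inspects a list of column names such as the one returned by the columns attribute of the describe output.

```python
from typing import Sequence


def describe_label_column(columns: Sequence[str]) -> str:
    """Return the name of the label column in a describe() result.

    Hypothetical helper: newer output names the column "statistic",
    older output named it "describe".
    """
    for name in ("statistic", "describe"):
        if name in columns:
            return name
    raise KeyError("no 'statistic' or 'describe' column found")


print(describe_label_column(["statistic", "float", "int"]))  # statistic
print(describe_label_column(["describe", "float", "int"]))   # describe
```

With a Polars frame this might be used as name = describe_label_column(stats.columns), followed by stats.filter(pl.col(name) == "mean"), where stats is the result of df.describe().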

Labels
enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars)
Successfully merging this pull request may close these issues.

Support "mean" in describe for bools
4 participants