Support mean for bool columns in DataFrame.Describe and Series.Describe #13884

@taki-mekhalfa taki-mekhalfa commented Jan 21, 2024

Fixes: #13735

I had to use str_value from Series so that values in non-numerical columns (here, the float representing the mean of a bool column) get the same representation as everywhere else. This means the mean for bool columns is printed the same way as for other columns and respects polars.Config.set_fmt_float.

True and False become true and false, to be consistent with what happens when you just print df, but that is easy to change.

If we only use str(v):

shape: (9, 7)
┌────────────┬──────────┬──────────┬────────────────────┬──────┬──────┬────────────┐
│ describe   ┆ a        ┆ b        ┆ c                  ┆ d    ┆ e    ┆ f          │
│ ---        ┆ ---      ┆ ---      ┆ ---                ┆ ---  ┆ ---  ┆ ---        │
│ str        ┆ f64      ┆ f64      ┆ str                ┆ str  ┆ str  ┆ str        │
╞════════════╪══════════╪══════════╪════════════════════╪══════╪══════╪════════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3                  ┆ 2    ┆ 2    ┆ 3          │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0                  ┆ 1    ┆ 1    ┆ 0          │
│ mean       ┆ 2.266667 ┆ 4.5      ┆ 0.6666666666666666 ┆ null ┆ null ┆ null       │
│ std        ┆ 1.101514 ┆ 0.707107 ┆ null               ┆ null ┆ null ┆ null       │
│ min        ┆ 1.0      ┆ 4.0      ┆ False              ┆ b    ┆ null ┆ 2020-01-01 │
│ 25%        ┆ 2.8      ┆ 4.0      ┆ null               ┆ null ┆ null ┆ null       │
│ 50%        ┆ 2.8      ┆ 5.0      ┆ null               ┆ null ┆ null ┆ null       │
│ 75%        ┆ 3.0      ┆ 5.0      ┆ null               ┆ null ┆ null ┆ null       │
│ max        ┆ 3.0      ┆ 5.0      ┆ True               ┆ c    ┆ null ┆ 2022-01-01 │
└────────────┴──────────┴──────────┴────────────────────┴──────┴──────┴────────────┘

If we use str_value:

shape: (9, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬──────┬────────────┐
│ describe   ┆ a        ┆ b        ┆ c        ┆ d    ┆ e    ┆ f          │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---  ┆ ---        │
│ str        ┆ f64      ┆ f64      ┆ str      ┆ str  ┆ str  ┆ str        │
╞════════════╪══════════╪══════════╪══════════╪══════╪══════╪════════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3        ┆ 2    ┆ 2    ┆ 3          │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0        ┆ 1    ┆ 1    ┆ 0          │
│ mean       ┆ 2.266667 ┆ 4.5      ┆ 0.666667 ┆ null ┆ null ┆ null       │
│ std        ┆ 1.101514 ┆ 0.707107 ┆ null     ┆ null ┆ null ┆ null       │
│ min        ┆ 1.0      ┆ 4.0      ┆ false    ┆ b    ┆ null ┆ 2020-01-01 │
│ 25%        ┆ 2.8      ┆ 4.0      ┆ null     ┆ null ┆ null ┆ null       │
│ 50%        ┆ 2.8      ┆ 5.0      ┆ null     ┆ null ┆ null ┆ null       │
│ 75%        ┆ 3.0      ┆ 5.0      ┆ null     ┆ null ┆ null ┆ null       │
│ max        ┆ 3.0      ┆ 5.0      ┆ true     ┆ c    ┆ null ┆ 2022-01-01 │
└────────────┴──────────┴──────────┴──────────┴──────┴──────┴────────────┘
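
The difference between the two outputs above can be reproduced in plain Python, without polars. This is only a sketch: the three boolean values are hypothetical (chosen to give the same mean as column c), and the six-decimal format is an assumption that mimics the default table display rather than polars' actual formatter.

```python
# Hypothetical bool column with two true values out of three.
values = [True, False, True]

# True counts as 1 and False as 0, so the mean is the share of true values.
mean_of_bools = sum(values) / len(values)

# str() returns the shortest string that round-trips the float exactly.
as_str = str(mean_of_bools)

# Fixed six-decimal formatting (assumed here to imitate the table display).
as_fixed = f"{mean_of_bools:.6f}"

# Python capitalizes booleans; lowercasing gives the spelling used in the table.
bool_python = str(True)
bool_lower = str(True).lower()

print(as_str)       # 0.6666666666666666
print(as_fixed)     # 0.666667
print(bool_python)  # True
print(bool_lower)   # true
```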

If we change config:

with pl.Config(set_fmt_float="full"):
    print(df.describe())

shape: (9, 7)
┌────────────┬────────────────────┬────────────────────┬────────────────────┬──────┬──────┬────────────┐
│ describe   ┆ float              ┆ int                ┆ bool               ┆ str  ┆ str2 ┆ date       │
│ ---        ┆ ---                ┆ ---                ┆ ---                ┆ ---  ┆ ---  ┆ ---        │
│ str        ┆ f64                ┆ f64                ┆ str                ┆ str  ┆ str  ┆ str        │
╞════════════╪════════════════════╪════════════════════╪════════════════════╪══════╪══════╪════════════╡
│ count      ┆ 3                  ┆ 2                  ┆ 3                  ┆ 2    ┆ 2    ┆ 3          │
│ null_count ┆ 0                  ┆ 1                  ┆ 0                  ┆ 1    ┆ 1    ┆ 0          │
│ mean       ┆ 2.2666666666666666 ┆ 4.5                ┆ 0.6666666666666666 ┆ null ┆ null ┆ null       │
│ std        ┆ 1.1015141094572205 ┆ 0.7071067811865476 ┆ null               ┆ null ┆ null ┆ null       │
│ min        ┆ 1                  ┆ 4                  ┆ false              ┆ b    ┆ eur  ┆ 2020-01-01 │
│ 25%        ┆ 2.8                ┆ 4                  ┆ null               ┆ null ┆ null ┆ null       │
│ 50%        ┆ 2.8                ┆ 5                  ┆ null               ┆ null ┆ null ┆ null       │
│ 75%        ┆ 3                  ┆ 5                  ┆ null               ┆ null ┆ null ┆ null       │
│ max        ┆ 3                  ┆ 5                  ┆ true               ┆ c    ┆ usd  ┆ 2022-01-01 │
└────────────┴────────────────────┴────────────────────┴────────────────────┴──────┴──────┴────────────┘

@taki-mekhalfa taki-mekhalfa changed the title from "Support mean for boolcolumns" to "Support mean for bool columns" Jan 21, 2024
@Wainberg
Contributor

I think the underlying issue should be fixed by #13725, right?

@taki-mekhalfa
Contributor Author

> I think the underlying issue should be fixed by #13725, right?

Sorry, which issue? The float display?

@taki-mekhalfa taki-mekhalfa changed the title from "Support mean for bool columns" to "Support mean for bool columns for DataFrame.Describe and Series.Describe" Jan 21, 2024
@taki-mekhalfa taki-mekhalfa changed the title from "Support mean for bool columns for DataFrame.Describe and Series.Describe" to "Support mean for bool columns in DataFrame.Describe and Series.Describe" Jan 21, 2024
@taki-mekhalfa
Contributor Author

I went ahead and added mean for bool Series too:

>>> s = pl.Series([True, False, True, None, True])
>>> s.describe()
shape: (4, 2)
┌────────────┬───────┐
│ statistic  ┆ value │
│ ---        ┆ ---   │
│ str        ┆ f64   │
╞════════════╪═══════╡
│ count      ┆ 4.0   │
│ null_count ┆ 1.0   │
│ sum        ┆ 3.0   │
│ mean       ┆ 0.75  │
└────────────┴───────┘
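
For reference, the numbers in this output follow from simple arithmetic over the non-null values. The following is a plain-Python sketch of that arithmetic, not the polars implementation; the `stats` dictionary is only an illustrative container.

```python
# Same values as the Series above; None stands in for the null entry.
values = [True, False, True, None, True]

# Nulls are excluded from count, sum and mean, and counted separately.
non_null = [v for v in values if v is not None]

stats = {
    "count": float(len(non_null)),
    "null_count": float(len(values) - len(non_null)),
    # True counts as 1 and False as 0.
    "sum": float(sum(non_null)),
    "mean": sum(non_null) / len(non_null),
}

for name, value in stats.items():
    print(name, value)
```

The mean of a bool column is therefore the fraction of non-null values that are true, here 3 out of 4.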

@taki-mekhalfa
Contributor Author

This is already being addressed here: #13720. Closing.

Successfully merging this pull request may close these issues.

Support "mean" in describe for bools