fix: Use appropriate bins in `hist` when `bin_count` specified #16942
Conversation
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff             @@
##             main   #16942      +/-   ##
==========================================
- Coverage   80.30%   80.21%    -0.10%
==========================================
  Files        1499     1500        +1
  Lines      198744   198861      +117
  Branches     2837     2837
==========================================
- Hits       159604   159513       -91
- Misses      38613    38820      +207
- Partials      527      528        +1
CodSpeed Performance Report

Merging #16942 will not alter performance.
Hi @mcrumiller, as I mentioned in your issue (#16912 (comment)), the problem is twofold:

Problem

Your example has numbers from 1-8 and 4 bins.

pd.cut([1, 3, 8, 8, 2, 1, 3], bins=4)
[(0.993, 2.75], (2.75, 4.5], (6.25, 8.0], (6.25, 8.0], (0.993, 2.75], (0.993, 2.75], (2.75, 4.5]]
Categories (4, interval[float64, right]): [(0.993, 2.75] < (2.75, 4.5] < (4.5, 6.25] < (6.25, 8.0]]

pandas uses an "interesting" 0.1% rule to include the leftmost element, which is why the first bin looks odd (see pandas cut).
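To make that 0.1% rule concrete, here is a minimal sketch (mine, not from the thread) of the edge computation that would reproduce the intervals above, assuming pandas simply widens the lower edge by 0.1% of the data range when right-closed bins are requested:

```python
import numpy as np

data = [1, 3, 8, 8, 2, 1, 3]
mn, mx = min(data), max(data)  # 1 and 8

# Tight equal-width edges over the data range: [1, 2.75, 4.5, 6.25, 8.0]
edges = np.linspace(mn, mx, 4 + 1)

# Nudge the lower edge down by 0.1% of the range so the minimum value
# still lands inside the first right-closed bin: 1 - 0.001 * 7 = 0.993
edges[0] -= 0.001 * (mx - mn)

print(edges)  # approximately [0.993, 2.75, 4.5, 6.25, 8.0]
```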
Do we want to match pandas' behavior? I suppose if it doesn't cost us anything (discounting developer time, which is minimal here) and we don't have a good argument against it, it reduces the barrier to entry. I don't really see why they extended the bins, since histogram/cut is performed on existing data, and if the extremes already fit into the outer bins, why extend them?
@JulianCologne a bit confused about pandas'
But the resulting categories show that they 1) define the bins using a tight interval without the extension, and 2) only the first bin appears to be extended:

>>> s = pd.Series([1, 3, 8, 8, 2, 1, 3])
>>> pd.cut([1, 3, 8, 8, 2, 1, 3], bins=4, precision=9)
[(0.993, 2.75], (2.75, 4.5], (6.25, 8.0], (6.25, 8.0], (0.993, 2.75], (0.993, 2.75], (2.75, 4.5]]
Categories (4, interval[float64, right]): [(0.993, 2.75] < (2.75, 4.5] < (4.5, 6.25] < (6.25, 8.0]]

In an easier-to-read format (note that 0.1% of the range is 0.007):
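One way to see both points directly is to ask pandas for the computed edges; this is an illustrative check of my own (using the real `retbins=True` parameter of `pd.cut`), not part of the original comment:

```python
import numpy as np
import pandas as pd

# retbins=True also returns the bin edges pandas actually used
_, edges = pd.cut([1, 3, 8, 8, 2, 1, 3], bins=4, retbins=True)

print(edges)                 # approximately [0.993, 2.75, 4.5, 6.25, 8.0]
print(np.linspace(1, 8, 5))  # tight grid: [1.0, 2.75, 4.5, 6.25, 8.0]
# Only the first edge differs, by 0.1% of the range (0.007).
```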
Their example here also shows the same thing, where they use
If we use the
Right.. This was snowed under. I think this makes sense, as always returning an empty bin is not very useful. Thanks!
Resolves #16912.
This implementation affects `pl.hist(x, bin_count=...)` and makes two behavioral modifications.

This implementation is a little simpler than the previous one, and I believe it removes the need for the special stability logic for rounding near integers. It mimics the behavior of pandas' `cut`, which is reasonable behavior. I don't see any tests that cover that logic, unless the `test_hist_rand` function covers it already. If someone could chime in with an example case where floating point errors cause the required fix even in this PR version, let me know and I'll add that back in along with some tests.

Existing behavior
New behavior
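To illustrate what mimicking pandas' `cut` means for the motivating example, here is a minimal numpy sketch of equal-width, right-closed bins spanning exactly [min, max], with the first bin also including the minimum. This is my own illustration of the binning rule, not the PR's actual implementation:

```python
import numpy as np

data = np.array([1, 3, 8, 8, 2, 1, 3], dtype=float)
bin_count = 4

# Equal-width edges spanning exactly the data range (no extension beyond it)
edges = np.linspace(data.min(), data.max(), bin_count + 1)

# Right-closed bins: edges[i] < x <= edges[i+1]; the clip makes the first
# bin also include the minimum value itself (x == data.min())
idx = np.clip(np.searchsorted(edges, data, side="left") - 1, 0, bin_count - 1)
counts = np.bincount(idx, minlength=bin_count)

print(edges)   # [1.0, 2.75, 4.5, 6.25, 8.0]
print(counts)  # [3 2 0 2] -- no extra empty bin, no bins past the data range
```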