fix: Use appropriate bins in `hist` when `bin_count` specified #16942
Conversation
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff             @@
##             main   #16942      +/-   ##
==========================================
- Coverage   80.30%   80.21%    -0.10%
==========================================
  Files        1499     1500        +1
  Lines      198744   198861      +117
  Branches     2837     2837
==========================================
- Hits       159604   159513       -91
- Misses      38613    38820      +207
- Partials      527      528        +1
CodSpeed Performance Report

Merging #16942 will not alter performance.
Hi @mcrumiller, as I mentioned in your issue (#16912 (comment)), the problem is twofold:

Problem

Your example has numbers from 1-8 and 4 bins.

pd.cut([1, 3, 8, 8, 2, 1, 3], bins=4)
[(0.993, 2.75], (2.75, 4.5], (6.25, 8.0], (6.25, 8.0], (0.993, 2.75], (0.993, 2.75], (2.75, 4.5]]
Categories (4, interval[float64, right]): [(0.993, 2.75] < (2.75, 4.5] < (4.5, 6.25] < (6.25, 8.0]]

pandas uses an "interesting" 0.1% rule to include the leftmost element, which is why the first bin looks odd (see pandas cut).
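To make that 0.1% rule concrete, here is a minimal sketch (mine, not from the thread) of the edge computation that would reproduce the intervals above, assuming pandas simply widens the lower edge by 0.1% of the data range when right-closed bins are requested:

```python
import numpy as np

data = [1, 3, 8, 8, 2, 1, 3]
mn, mx = min(data), max(data)  # 1 and 8

# Tight equal-width edges over the data range: [1, 2.75, 4.5, 6.25, 8.0]
edges = np.linspace(mn, mx, 4 + 1)

# Nudge the lower edge down by 0.1% of the range so the minimum value
# still lands inside the first right-closed bin: 1 - 0.001 * 7 = 0.993
edges[0] -= 0.001 * (mx - mn)

print(edges)  # approximately [0.993, 2.75, 4.5, 6.25, 8.0]
```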
Do we want to match pandas' behavior? I suppose if it doesn't cost us anything (discounting developer time, which is minimal here) and we don't have a good argument against it, it reduces the barrier to entry. I don't really see why they extended the bins, since histogram/cut is performed on existing data, and if the extremes already fit into the outer bins, why extend them?
@JulianCologne a bit confused about pandas'
But the resulting categories show that they 1) define the bins using a tight interval without the extension, and 2) only the first bin appears to be extended:

>>> s = pd.Series([1, 3, 8, 8, 2, 1, 3])
>>> pd.cut([1, 3, 8, 8, 2, 1, 3], bins=4, precision=9)
[(0.993, 2.75], (2.75, 4.5], (6.25, 8.0], (6.25, 8.0], (0.993, 2.75], (0.993, 2.75], (2.75, 4.5]]
Categories (4, interval[float64, right]): [(0.993, 2.75] < (2.75, 4.5] < (4.5, 6.25] < (6.25, 8.0]]

In an easier-to-read format (note that 0.1% of the range is 0.007):
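One way to see both points directly is to ask pandas for the computed edges; this is an illustrative check of my own (using the real `retbins=True` parameter of `pd.cut`), not part of the original comment:

```python
import numpy as np
import pandas as pd

# retbins=True also returns the bin edges pandas actually used
_, edges = pd.cut([1, 3, 8, 8, 2, 1, 3], bins=4, retbins=True)

print(edges)                 # approximately [0.993, 2.75, 4.5, 6.25, 8.0]
print(np.linspace(1, 8, 5))  # tight grid: [1.0, 2.75, 4.5, 6.25, 8.0]
# Only the first edge differs, by 0.1% of the range (0.007).
```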
Their example here also shows the same thing, where they use
If we use the
Right.. This was snowed under. I think this makes sense, as always returning an empty bin is not very useful. Thanks!
Resolves #16912.
This implementation affects `pl.hist(x, bin_count=...)` and makes two behavioral modifications.

This implementation is a little simpler than the previous one, and I believe it removes the need for the special stability logic for rounding near integers. It mimics the behavior of pandas' `cut`, which is reasonable behavior. I don't see any tests that cover that logic, unless the `test_hist_rand` function covers it already. If someone could chime in with an example case where floating point errors cause the required fix even in this PR version, let me know and I'll add that back in along with some tests.

Existing behavior
New behavior
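To illustrate what mimicking pandas' `cut` means for the motivating example, here is a minimal numpy sketch of equal-width, right-closed bins spanning exactly [min, max], with the first bin also including the minimum. This is my own illustration of the binning rule, not the PR's actual implementation:

```python
import numpy as np

data = np.array([1, 3, 8, 8, 2, 1, 3], dtype=float)
bin_count = 4

# Equal-width edges spanning exactly the data range (no extension beyond it)
edges = np.linspace(data.min(), data.max(), bin_count + 1)

# Right-closed bins: edges[i] < x <= edges[i+1]; the clip makes the first
# bin also include the minimum value itself (x == data.min())
idx = np.clip(np.searchsorted(edges, data, side="left") - 1, 0, bin_count - 1)
counts = np.bincount(idx, minlength=bin_count)

print(edges)   # [1.0, 2.75, 4.5, 6.25, 8.0]
print(counts)  # [3 2 0 2] -- no extra empty bin, no bins past the data range
```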