Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(types): fix histogram bin allocation #9711

Merged
merged 7 commits into from
Jul 30, 2024

Conversation

christophertitchen
Copy link
Contributor

Description of changes

import ibis

ibis.options.interactive = True
ibis.options.repr.interactive.max_rows = 20

t = ibis.range(1000).unnest().name("index").as_table()
t.select(hist=t["index"].histogram(nbins=10)).value_counts()
┏━━━━━━━┳━━━━━━━━━━━━┓
┃ hist  ┃ hist_count ┃
┡━━━━━━━╇━━━━━━━━━━━━┩
│ int64 │ int64      │
├───────┼────────────┤
│     5 │        100 │
│     9 │        100 │
│     0 │        100 │
│     3 │        100 │
│     6 │        100 │
│     2 │        100 │
│     7 │        100 │
│     8 │        100 │
│     1 │        100 │
│     4 │        100 │
└───────┴────────────┘

Issues closed

@gforsyth
Copy link
Member

🎉 thanks for putting this together @christopher-titchen !

Can we add the example above as a test in ibis/backends/tests/test_numeric.py?
Maybe drop it to range=100 and then assert that nunique over the hist_count column is 1?

@christophertitchen
Copy link
Contributor Author

🎉 thanks for putting this together @christopher-titchen !

Can we add the example above as a test in ibis/backends/tests/test_numeric.py? Maybe drop it to range=100 and then assert that nunique over the hist_count column is 1?

Thanks! I need a little guidance regarding the tests. I added the test to check that each bin has the same number of values when the number of values is completely divisible by nbins, but it is causing a few backends to fail, mostly because they have not mapped the range operation in the test case.

Should these backends be marked as not implemented, despite histogram working just fine on them, or should I try to find a formulation of the test case that works on them all?

@gforsyth
Copy link
Member

Hey @christopher-titchen !

Should these backends be marked as not implemented, despite histogram working just fine on them, or should I try to find a formulation of the test case that works on them all?

That's an interesting hiccup. For the purposes of the test, how about we make use of a memtable instead, because that will work on any of the backends (except possibly Druid).

Something like:

def test_histogram_distribution(con, monkeypatch):
    monkeypatch.setattr(ibis.options, "default_backend", con)

    t = ibis.memtable({"index": range(100)})
    ...

@christophertitchen
Copy link
Contributor Author

That's an interesting hiccup. For the purposes of the test, how about we make use of a memtable instead, because that will work on any of the backends (except possibly Druid).

Thanks for the help @gforsyth! I was thinking about using a memtable, but I was unsure about how many backends supported them. I swapped to a memtable in the test case and it works great.

I had to make a slight change to histogram to account for an edge case that was tested for Impala. It would fail if nbins was not passed, which is a rather niche use case because np.histogram for example requires the number of bins to be passed either explicitly or implicitly. I also found a slight quirk with the current design when fixing this because if a base is passed that is not the minimum value, it would assign those out-of-bound values smaller than the base a negative bin index. It now clips those out-of-bound values to the bin index of -1 to group them together, rather than potentially having bin indices of -1 and -2 for example, so this now aligns with how np.histogram assigns a bin index of 0 for out-of-bound values.

Anyway, everything is wrapped up now. I think this PR leaves histogram in a good state without sweeping changes and while sticking to the original design principles for it, and it should last until somebody ends up redesigning it to add more functionality.

Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work @christopher-titchen!

Thanks for tracking down those edge-cases!

@gforsyth gforsyth merged commit 6634864 into ibis-project:main Jul 30, 2024
82 checks passed
@lostmygithubaccount
Copy link
Member

thanks @christopher-titchen!

@cpcloud cpcloud added this to the 9.3 milestone Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: histogram(nbins=N) off by one error
4 participants