Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: GroupBy.indices no longer includes unobserved categories #38642

Closed
jorisvandenbossche opened this issue Dec 22, 2020 · 12 comments · Fixed by #38649
Closed

REGR: GroupBy.indices no longer includes unobserved categories #38642

jorisvandenbossche opened this issue Dec 22, 2020 · 12 comments · Fixed by #38649
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Does anybody know if this was an intentional change? (I don't directly find something about it in the whatsnew)

In [9]: pd.__version__ 
Out[9]: '1.0.5'

In [10]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)}) 

In [11]: gb = df.groupby("key")   

In [12]: list(gb.indices)  
Out[12]: ['a', 'b', 'c', 'd']

vs

In [1]: pd.__version__
Out[1]: '1.3.0.dev0+92.ga2d10ba88a'

In [2]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})

In [3]: gb = df.groupby("key")

In [4]: list(gb.indices)
Out[4]: ['b']

This already changed in pandas 1.1, so not a recent change.

The consequence of this is that iterating over gb vs iterating over gb.indices is not consistent anymore.

cc @mroeschke @rhshadrach

@jorisvandenbossche
Copy link
Member Author

So for example, those two APIs still return all values (both for pandas 1.0 and master):

In [6]: gb.groups
Out[6]: {'a': [], 'b': [0, 1, 2, 3, 4], 'c': [], 'd': []}

In [7]: [key for key, group in gb]
Out[7]: ['a', 'b', 'c', 'd']

So it seems .indices should be consistent with it?

@mroeschke
Copy link
Member

This may have been unintentionally changed by me in https://github.com/pandas-dev/pandas/pull/36911/files

@phofl
Copy link
Member

phofl commented Dec 22, 2020

I looked into this, I think the new case is more consistent maybe?
on 1.0.5 and master we get:

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "key1": [1,1,1,2,2,3], "col": range(6)})

gb = df.groupby(["key", "key1"])
gb.groups
gb.indices

returned

{('b', 1): Int64Index([0, 1, 2], dtype='int64'), ('b', 2): Int64Index([3, 4], dtype='int64'), ('c', 3): Int64Index([5], dtype='int64')}
{('b', 1): array([0, 1, 2]), ('b', 2): array([3, 4]), ('c', 3): array([5])}

While a one dimensional group key returned what you showed above. The missing categories case would be tricky to handle with multidimensional keys. Maybe it would be better to remove unused categories from groups too? Or should the one-dimensional case be special here?

@phofl
Copy link
Member

phofl commented Dec 22, 2020

@mroeschke no that was not the reason. I think this was caused by c4226d4

@mroeschke
Copy link
Member

Thanks for confirming @phofl

@phofl
Copy link
Member

phofl commented Dec 22, 2020

Addition: We are no longer running through there since #36842

@jorisvandenbossche
Copy link
Member Author

@phofl thanks for looking at it!

I looked into this, I think the new case is more consistent maybe?

Indeed for multiple keys, we seem to not include unobserved categories. But, here both .indices as .groups (and GroupBy.__iter__ do that), so they are also consistent with each other.
While for a single key, while maybe inconsistent or not with the multiple key case, it is now inconsistent between .indices and .groups.

So to fully make it consistent, then for example also .groups and GroupBy.__iter__ should change for the single key case. But that would also be a breaking change ..
And I am not fully sure that is necessarily the good change, since we actually include the unobserved categories in the output if you do eg an aggregation on the groupby object, so then I would expect to include those empty groups as well when iterating over the groupby object?

@jsignell
Copy link
Contributor

Passing observed has no impact on .indices, but maybe it should? I think this behavior would not be surprising:

>>>df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})
>>> gb = df.groupby("key", observed=True)
>>> list(gb.indices)
['b']
>>> gb = df.groupby("key", observed=False)
>>> list(gb.indices)
['a', b', 'c', 'd']

@phofl
Copy link
Member

phofl commented Dec 22, 2020

Have to correct myself, this was changed by #36842

@jorisvandenbossche When testing this on 1.1.0 and 1.1.5 I get

{'a': [], 'b': [0, 1, 2, 3, 4], 'c': [5], 'd': []}
{'b': array([0, 1, 2, 3, 4]), 'c': array([5])}

both times.

Edit: Changed the example a bit.

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "col": range(6)})

@jorisvandenbossche
Copy link
Member Author

I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

And unlike what I said earlier (I thought it was only working on 1.0, and not in 1.1.x), this actually only changed from 1.1.3 to 1.1.4.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 23, 2020
@simonjayhawkins
Copy link
Member

I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

can confirm, first bad commit: [345efdd] BUG: RollingGroupby not respecting sort=False (#36911)

@phofl
Copy link
Member

phofl commented Dec 23, 2020

Hm just looked at the pr numbers, not when they were merged.

Nevertheless, we have to change both commits to get the original result, because the code path from #36911 is currently not used on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants