DataFrame.groupby on category broken #1426

Closed
mcooperman opened this issue May 6, 2020 · 6 comments · Fixed by #1802

@mcooperman

System information

  • Linux ip-172-16-74-185 4.14.171-105.231.amzn1.x86_64 #1 SMP Thu Feb 27 23:49:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Modin version (modin.__version__): 0.7.3
  • Python version: 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
  • Code we can use to reproduce:

# test dataframe
import numpy as np
# Assumed alias: the report targets Modin 0.7.3, so this is taken to be modin.pandas
import modin.pandas as pandas_impl_module

data1 = ['Marc', 'Lori', 'Sam', 'Alexandra',
         None, np.nan,
         'Marc', 'Lori', 'Sam', 'Alexandra',
         'Marc', 'Lori', 'Sam', 'Alexandra']
data2 = [1, 1, 1, 1,
         0, 0,
         1, 1, 1, 1,
         1, 1, 1, 1]
data3 = [1.0, 1.0, 1.0, 1.0,
         None, np.nan,
         1.0, 1.0, 1.0, 1.0,
         1.0, 1.0, 1.0, 1.0]
data4 = ['Same', 'Same', 'Same', 'Same',
         None, np.nan,
         'Same', 'Same', 'Same', 'Same',
         'Same', 'Same', 'Same', 'Same']
data5 = ['Diff1', 'Diff2', 'Diff3', 'Diff4',
         None, np.nan,
         'Diff5', 'Diff6', 'Diff7', 'Diff8',
         'Diff1', 'Diff2', 'Diff3', 'Diff4']

df = pandas_impl_module.DataFrame(data={
    'col1': data1,
    'col2': data2,
    'col3': data3,
    'col4': data4,
    'col5': data5
})
#print("col2 input data type = {}".format(type(data2[0])))
df.info()

# Kept as in the original report: astype returns a new DataFrame,
# and the result is not assigned back to df here.
df.astype({'col1': 'category'}, copy=False)
df.info()

grpby = df.groupby(by=['col1'], as_index=True, sort=False, observed=False)

# Accessing either attribute raises the AttributeError shown below
grpby.groups

grpby.indices

Describe the problem

When working with category dtypes in the DataFrame, something appears to go wrong in groupby.
Group by a category column, then try accessing the groups or indices attribute of the resulting groupby object: an exception is raised.
Converting the column back to strings and redoing the groupby appears to work around the problem.
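The workaround described above could look roughly like the sketch below. It uses plain pandas and made-up data so that it runs without Modin installed; in the reporter's setup the import would presumably be modin.pandas instead. Casting to object rather than str is an assumption made here so that missing values stay missing instead of becoming the string "nan".

```python
import pandas as pd

# Made-up data, loosely modelled on col1/col2 from the report
df = pd.DataFrame({"col1": ["Marc", "Lori", None, "Marc"], "col2": [1, 1, 0, 1]})
df["col1"] = df["col1"].astype("category")

# Workaround sketch: cast the categorical column back to plain Python objects
# (strings) before grouping. astype returns a new Series, so assign it back.
df["col1"] = df["col1"].astype(object)

grpby = df.groupby(by="col1", sort=False)
groups = grpby.groups  # maps each group label to the row labels in that group
print(groups)
```

Grouping by the scalar "col1" rather than the list ['col1'] is also a choice made for this sketch, because the shape of the group keys for a one-element list differs between pandas versions.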

Source code / logs


grpby.groups

AttributeError Traceback (most recent call last)
in
----> 1 grpby.groups
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
99 if key in self._columns:
100 return self._default_to_pandas(lambda df: df.__getitem__(key))
--> 101 raise e
102
103 _index_grouped_cache = None

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
95 """
96 try:
---> 97 return object.__getattribute__(self, key)
98 except AttributeError as e:
99 if key in self._columns:

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in groups(self)
202 @property
203 def groups(self):
--> 204 return self._index_grouped
205
206 def min(self, **kwargs):

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
99 if key in self._columns:
100 return self._default_to_pandas(lambda df: df.__getitem__(key))
--> 101 raise e
102
103 _index_grouped_cache = None

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
95 """
96 try:
---> 97 return object.__getattribute__(self, key)
98 except AttributeError as e:
99 if key in self._columns:

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in _index_grouped(self)
130 by = self._by
131 if self._axis == 0:
--> 132 self._index_grouped_cache = self._index.groupby(by)
133 else:
134 self._index_grouped_cache = self._columns.groupby(by)

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/pandas/core/indexes/base.py in groupby(self, values)
4533 values = values.values
4534 values = ensure_categorical(values)
-> 4535 result = values._reverse_indexer()
4536
4537 # map to the label

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute '_reverse_indexer'

@mcooperman mcooperman added the bug 🦗 Something isn't working label May 6, 2020
@gshimansky
Collaborator

There are some other fixes to categorical dtypes in progress in PR #1382, but they are not finished yet because some tests still fail.

@aregm aregm added this to the 0.7.4 milestone May 7, 2020
@gshimansky
Collaborator

I reproduced the problem on the PR #1382 branch, so the changes in it don't resolve this bug.

@gshimansky
Collaborator

This is a simple standalone reproducer

import modin.pandas as pd

df = pd.DataFrame({'col1': ['Marc', 'Lori'], 'col2': [1, 1]})
df.astype({'col1': 'category'}, copy=False)
grpby = df.groupby(by=['col1'], as_index=True, sort=False, observed=False)

print(grpby.groups)
print(grpby.indices)

@gshimansky
Collaborator

This patch adds a test for this bug

diff --git a/modin/pandas/test/test_groupby.py b/modin/pandas/test/test_groupby.py
index dd3ea96..a690d75 100644
--- a/modin/pandas/test/test_groupby.py
+++ b/modin/pandas/test/test_groupby.py
@@ -126,10 +126,11 @@ def test_mixed_dtypes_groupby(as_index):


 @pytest.mark.parametrize(
-    "by", [[1, 2, 1, 2], lambda x: x % 3, "col1", ["col1", "col2"]]
+    "by", [[1, 2, 1, 2], lambda x: x % 3, "col1", ["col1"], ["col1", "col2"]]
 )
 @pytest.mark.parametrize("as_index", [True, False])
-def test_simple_row_groupby(by, as_index):
+@pytest.mark.parametrize("col1_categories", [True, False])
+def test_simple_row_groupby(by, as_index, col1_categories):
     pandas_df = pandas.DataFrame(
         {
             "col1": [0, 1, 2, 3],
@@ -140,7 +141,14 @@ def test_simple_row_groupby(by, as_index):
         }
     )

+    if col1_categories:
+        pandas_df.astype({"col1": "category"}, copy=False)
+
     ray_df = from_pandas(pandas_df)
+
+    if col1_categories:
+        ray_df.astype({"col1": "category"}, copy=False)
+
     n = 1
     ray_groupby = ray_df.groupby(by=by, as_index=as_index)
     pandas_groupby = pandas_df.groupby(by=by, as_index=as_index)

I had to add a separate conversion for ray_df because the DataFrame returned from from_pandas(pandas_df) didn't have categorical dtypes. Not sure whether that is a bug or the intended behavior.
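One detail that may bear on the question above (a general pandas fact, not something this thread confirms as the cause): DataFrame.astype returns a new DataFrame and leaves the original untouched, and copy=False does not turn it into an in-place operation. A call whose result is not assigned back therefore leaves col1 with its original dtype. A minimal sketch with plain pandas and made-up values:

```python
import pandas as pd

pandas_df = pd.DataFrame({"col1": [0, 1, 2, 3], "col2": [4, 5, 6, 7]})

# Result discarded: pandas_df itself is not modified.
pandas_df.astype({"col1": "category"})
dtype_without_assignment = str(pandas_df["col1"].dtype)

# Result assigned back: col1 is now categorical.
pandas_df = pandas_df.astype({"col1": "category"})
dtype_with_assignment = str(pandas_df["col1"].dtype)

print(dtype_without_assignment)
print(dtype_with_assignment)
```

If that is what happened in the patch, assigning the result of astype before calling from_pandas might make the second conversion unnecessary, but that would need to be checked against the actual test.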

@dchigarev
Collaborator

As part of this issue, #1460, #1461 and #1462 were opened. Once those are fixed, the investigation into the broken groupby on categories will continue.

@dchigarev
Collaborator

Current state:
It seems that the bug with groupby on categories appears only when grouping by multiple columns, and only in the groupby_reduce operations. For example, here is the output of the Modin and pandas any operations when grouping on multiple categorical columns:

Modin:
            col2   col3   col4
col1 col5
0    -7    False  False  False
     -6    False  False  False
     -5    False  False  False
     -4     True  False   True
1    -7    False  False  False
     -6    False  False  False
     -5     True  False   True
     -4    False  False  False
2    -7    False  False  False
     -6    False   True   True
     -5    False  False  False
     -4    False  False  False
3    -7     True   True   True
     -6    False  False  False
     -5    False  False  False
     -4    False  False  False


Pandas:
            col2   col3  col4
col1 col5
0    -7      NaN    NaN   NaN
     -6      NaN    NaN   NaN
     -5      NaN    NaN   NaN
     -4     True  False  True
1    -7      NaN    NaN   NaN
     -6      NaN    NaN   NaN
     -5     True  False  True
     -4      NaN    NaN   NaN
2    -7      NaN    NaN   NaN
     -6    False   True  True
     -5      NaN    NaN   NaN
     -4      NaN    NaN   NaN
3    -7     True   True  True
     -6      NaN    NaN   NaN
     -5      NaN    NaN   NaN
     -4      NaN    NaN   NaN

This behaviour can be controlled with the observed parameter of groupby.
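As a rough illustration of that parameter, the sketch below uses plain pandas with made-up values for col2 (the col1 and col5 values mirror the index labels in the output above). With observed=False, pandas produces one row for every combination of categories across the grouping columns; with observed=True, it keeps only the combinations that actually occur in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [0, 1, 2, 3],
    "col5": [-4, -5, -6, -7],
    "col2": [4, 5, 6, 7],  # made-up values, for illustration only
})
df = df.astype({"col1": "category", "col5": "category"})

# observed=False: one row per combination of categories (4 x 4)
all_combos = df.groupby(["col1", "col5"], observed=False)["col2"].sum()

# observed=True: only the combinations present in the data
seen_only = df.groupby(["col1", "col5"], observed=True)["col2"].sum()

print(len(all_combos))
print(len(seen_only))
```

How the unobserved rows are filled depends on the aggregation and the pandas version, so the sketch compares only the number of result rows.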

Investigating...

dchigarev added a commit to dchigarev/modin that referenced this issue Jul 28, 2020
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
devin-petersohn pushed a commit that referenced this issue Jul 29, 2020
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>