DataFrame.groupby on category broken #1426

mcooperman · 2020-05-06T14:40:51Z

System information

Linux ip-172-16-74-185 4.14.171-105.231.amzn1.x86_64 #1 SMP Thu Feb 27 23:49:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Modin version (modin.__version__): 0.7.3
Python version: 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
Code we can use to reproduce:

import numpy as np
import modin.pandas as pandas_impl_module  # assumed alias; the report targets Modin 0.7.3

# test dataframe

data1 = ['Marc', 'Lori', 'Sam', 'Alexandra',
         None, np.nan,
         'Marc', 'Lori', 'Sam', 'Alexandra',
         'Marc', 'Lori', 'Sam', 'Alexandra']
data2 = [1, 1, 1, 1,
         0, 0,
         1, 1, 1, 1,
         1, 1, 1, 1]
data3 = [1.0, 1.0, 1.0, 1.0,
         None, np.nan,
         1.0, 1.0, 1.0, 1.0,
         1.0, 1.0, 1.0, 1.0]
data4 = ['Same', 'Same', 'Same', 'Same',
         None, np.nan,
         'Same', 'Same', 'Same', 'Same',
         'Same', 'Same', 'Same', 'Same']
data5 = ['Diff1', 'Diff2', 'Diff3', 'Diff4',
         None, np.nan,
         'Diff5', 'Diff6', 'Diff7', 'Diff8',
         'Diff1', 'Diff2', 'Diff3', 'Diff4']

df = pandas_impl_module.DataFrame(data={
    'col1': data1,
    'col2': data2,
    'col3': data3,
    'col4': data4,
    'col5': data5
})
#print("col2 input data type = {}".format(type(data2[0])))
df.info()

df.astype({'col1': 'category'}, copy=False)
df.info()

grpby = df.groupby(by=['col1'], as_index=True, sort=False, observed=False)

grpby.groups

grpby.indices

Describe the problem

Working with category dtypes in a DataFrame appears to throw something off in groupby.
Grouping by a category column and then accessing the groups or indices attribute of the resulting groupby object raises an exception.
Converting the column back to strings and redoing the groupby appears to work around the problem.
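
The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual code: it uses plain pandas as a stand-in (with Modin the import would presumably be modin.pandas), the three-row frame is made up for the example, and it groups by the scalar key "col1" rather than a one-element list.

```python
import pandas as pd  # stand-in for modin.pandas (assumption for this sketch)

# Small made-up frame with a category column, loosely following the report.
df = pd.DataFrame({"col1": ["Marc", "Lori", "Marc"], "col2": [1, 1, 1]})
df["col1"] = df["col1"].astype("category")

# Workaround from the report: cast the category column back to strings
# before grouping, so the grouping key is no longer categorical.
df["col1"] = df["col1"].astype(str)

grpby = df.groupby(by="col1", sort=False)

# groups maps each key to the index labels of its rows.
groups = {key: list(labels) for key, labels in grpby.groups.items()}
print(groups)
```

The cast discards the categorical metadata (category order, unobserved categories), so it is only suitable when that information is not needed downstream.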

Source code / logs

grpby.groups

AttributeError Traceback (most recent call last)
in
----> 1 grpby.groups
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
99 if key in self._columns:
100 return self._default_to_pandas(lambda df: df.__getitem__(key))
--> 101 raise e
102
103 _index_grouped_cache = None

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
95 """
96 try:
---> 97 return object.__getattribute__(self, key)
98 except AttributeError as e:
99 if key in self._columns:

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in groups(self)
202 @property
203 def groups(self):
--> 204 return self._index_grouped
205
206 def min(self, **kwargs):

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
99 if key in self._columns:
100 return self._default_to_pandas(lambda df: df.__getitem__(key))
--> 101 raise e
102
103 _index_grouped_cache = None

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
95 """
96 try:
---> 97 return object.__getattribute__(self, key)
98 except AttributeError as e:
99 if key in self._columns:

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in _index_grouped(self)
130 by = self._by
131 if self._axis == 0:
--> 132 self._index_grouped_cache = self._index.groupby(by)
133 else:
134 self._index_grouped_cache = self._columns.groupby(by)

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/pandas/core/indexes/base.py in groupby(self, values)
4533 values = values.values
4534 values = ensure_categorical(values)
-> 4535 result = values._reverse_indexer()
4536
4537 # map to the label

~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute '_reverse_indexer'

gshimansky · 2020-05-07T18:57:44Z

There are some other fixes for categorical dtypes in progress in PR #1382, but they are not finished yet because some tests still fail.

gshimansky · 2020-05-07T21:36:20Z

I reproduced the problem on the PR #1382 branch, so the changes in it don't resolve this bug.

gshimansky · 2020-05-07T22:06:20Z

This is a simple standalone reproducer

import modin.pandas as pd

df = pd.DataFrame({'col1': ['Marc', 'Lori'], 'col2': [1, 1]})
df.astype({'col1': 'category'}, copy=False)
grpby = df.groupby(by=['col1'], as_index=True, sort=False, observed=False)

print(grpby.groups)
print(grpby.indices)

gshimansky · 2020-05-07T22:27:55Z

This patch adds a test for this bug

diff --git a/modin/pandas/test/test_groupby.py b/modin/pandas/test/test_groupby.py
index dd3ea96..a690d75 100644
--- a/modin/pandas/test/test_groupby.py
+++ b/modin/pandas/test/test_groupby.py
@@ -126,10 +126,11 @@ def test_mixed_dtypes_groupby(as_index):


 @pytest.mark.parametrize(
-    "by", [[1, 2, 1, 2], lambda x: x % 3, "col1", ["col1", "col2"]]
+    "by", [[1, 2, 1, 2], lambda x: x % 3, "col1", ["col1"], ["col1", "col2"]]
 )
 @pytest.mark.parametrize("as_index", [True, False])
-def test_simple_row_groupby(by, as_index):
+@pytest.mark.parametrize("col1_categories", [True, False])
+def test_simple_row_groupby(by, as_index, col1_categories):
     pandas_df = pandas.DataFrame(
         {
             "col1": [0, 1, 2, 3],
@@ -140,7 +141,14 @@ def test_simple_row_groupby(by, as_index):
         }
     )

+    if col1_categories:
+        pandas_df.astype({"col1": "category"}, copy=False)
+
     ray_df = from_pandas(pandas_df)
+
+    if col1_categories:
+        ray_df.astype({"col1": "category"}, copy=False)
+
     n = 1
     ray_groupby = ray_df.groupby(by=by, as_index=as_index)
     pandas_groupby = pandas_df.groupby(by=by, as_index=as_index)

I had to add a separate conversion for ray_df because the DataFrame returned from from_pandas(pandas_df) didn't have categorical dtypes. Not sure whether that is a bug or the intended behaviour.
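
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in plain pandas, astype returns a new object and leaves the caller unchanged, so a call whose result is discarded never makes pandas_df categorical in the first place. The sketch below shows only that pandas behaviour; the frame is made up, the copy keyword is left out on purpose, and nothing here is claimed about how Modin's astype or from_pandas behave.

```python
import pandas as pd

pandas_df = pd.DataFrame({"col1": [0, 1, 2, 3], "col2": [4, 5, 6, 7]})

# The result is discarded, so pandas_df itself keeps its integer dtype.
pandas_df.astype({"col1": "category"})
dtype_discarded = str(pandas_df["col1"].dtype)

# Keeping the returned frame is what yields a category column.
converted = pandas_df.astype({"col1": "category"})
dtype_assigned = str(converted["col1"].dtype)

print(dtype_discarded, dtype_assigned)
```

If this is what happened in the test, assigning the result (pandas_df = pandas_df.astype(...)) before calling from_pandas would be the thing to try.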

dchigarev · 2020-05-25T10:51:32Z

As part of this issue, #1460, #1461 and #1462 were opened. Investigation of the broken groupby on categories will continue after those issues are fixed.

dchigarev · 2020-07-23T14:01:25Z

Current state:
It seems that the bug with groupby on categories appears only when grouping by multiple columns, and only in groupby_reduce operations. For example, here is the output of the Modin and pandas any operations when grouping on multiple categorical columns:

Modin:
            col2   col3   col4
col1 col5
0    -7    False  False  False
     -6    False  False  False
     -5    False  False  False
     -4     True  False   True
1    -7    False  False  False
     -6    False  False  False
     -5     True  False   True
     -4    False  False  False
2    -7    False  False  False
     -6    False   True   True
     -5    False  False  False
     -4    False  False  False
3    -7     True   True   True
     -6    False  False  False
     -5    False  False  False
     -4    False  False  False


Pandas:
            col2   col3  col4
col1 col5
0    -7      NaN    NaN   NaN
     -6      NaN    NaN   NaN
     -5      NaN    NaN   NaN
     -4     True  False  True
1    -7      NaN    NaN   NaN
     -6      NaN    NaN   NaN
     -5     True  False  True
     -4      NaN    NaN   NaN
2    -7      NaN    NaN   NaN
     -6    False   True  True
     -5      NaN    NaN   NaN
     -4      NaN    NaN   NaN
3    -7     True   True  True
     -6      NaN    NaN   NaN
     -5      NaN    NaN   NaN
     -4      NaN    NaN   NaN

This behaviour can be controlled with the observed parameter of groupby.
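
A small pandas-only sketch of what the observed parameter changes when grouping by several categorical columns. The two-row frame is made up for illustration, and the fill value shown for unobserved combinations depends on the pandas version, so only the shape of the result is discussed here.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": pd.Categorical([0, 1]),
    "col5": pd.Categorical([-7, -6]),
    "col2": [True, False],
})

# observed=False: one row per combination of categories (2 x 2 = 4),
# including combinations that never occur in the data.
all_combos = df.groupby(["col1", "col5"], observed=False)["col2"].any()

# observed=True: only the combinations actually present in the data.
seen_only = df.groupby(["col1", "col5"], observed=True)["col2"].any()

print(len(all_combos), len(seen_only))
```

The extra rows in the output above are the unobserved combinations that observed=False adds.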

Investigating...

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

mcooperman added the bug 🦗 label May 6, 2020

anmyachev assigned dchigarev May 7, 2020

aregm added this to the 0.7.4 milestone May 7, 2020

dchigarev mentioned this issue Jul 24, 2020

FIX-#1426: Groupby on categories fixed #1802

Merged

dchigarev added a commit to dchigarev/modin that referenced this issue Jul 28, 2020

FIX-modin-project#1426: Groupby on categories fixed

b0f83fe

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

devin-petersohn closed this as completed in #1802 Jul 29, 2020

devin-petersohn pushed a commit that referenced this issue Jul 29, 2020

FIX-#1426: Groupby on categories fixed (#1802)

538bd9c

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020

FIX-modin-project#1426: Groupby on categories fixed (modin-project#1802)

f5ecc5d

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
