DataFrame.groupby on category broken #1426
Comments
Other fixes for categorical dtypes are in progress in PR #1382, but they are not finished yet because some tests still fail.
I reproduced the problem on the PR #1382 branch, so the changes in it do not resolve this bug.
This is a simple standalone reproducer:
import modin.pandas as pd
df = pd.DataFrame({'col1': ['Marc', 'Lori'], 'col2': [1, 1]})
df.astype({'col1': 'category'}, copy=False)
grpby = df.groupby(by=['col1'], as_index=True, sort=False, observed=False)
print(grpby.groups)
print(grpby.indices)
This patch adds a test for this bug:
diff --git a/modin/pandas/test/test_groupby.py b/modin/pandas/test/test_groupby.py
index dd3ea96..a690d75 100644
--- a/modin/pandas/test/test_groupby.py
+++ b/modin/pandas/test/test_groupby.py
@@ -126,10 +126,11 @@ def test_mixed_dtypes_groupby(as_index):
 @pytest.mark.parametrize(
-    "by", [[1, 2, 1, 2], lambda x: x % 3, "col1", ["col1", "col2"]]
+    "by", [[1, 2, 1, 2], lambda x: x % 3, "col1", ["col1"], ["col1", "col2"]]
 )
 @pytest.mark.parametrize("as_index", [True, False])
-def test_simple_row_groupby(by, as_index):
+@pytest.mark.parametrize("col1_categories", [True, False])
+def test_simple_row_groupby(by, as_index, col1_categories):
     pandas_df = pandas.DataFrame(
         {
             "col1": [0, 1, 2, 3],
@@ -140,7 +141,14 @@ def test_simple_row_groupby(by, as_index):
         }
     )
+    if col1_categories:
+        pandas_df.astype({"col1": "category"}, copy=False)
+
     ray_df = from_pandas(pandas_df)
+
+    if col1_categories:
+        ray_df.astype({"col1": "category"}, copy=False)
+
     n = 1
     ray_groupby = ray_df.groupby(by=by, as_index=as_index)
     pandas_groupby = pandas_df.groupby(by=by, as_index=as_index)
I had to add a separate conversion for
Current state:
That behaviour could be controlled with
Investigating...
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
System information
modin.__version__: 0.7.3
test dataframe
import numpy as np
# Assumption: the report does not define this alias; it is taken to be Modin's pandas API.
import modin.pandas as pandas_impl_module

data1 = ['Marc', 'Lori', 'Sam', 'Alexandra',
         None, np.NaN,
         'Marc', 'Lori', 'Sam', 'Alexandra',
         'Marc', 'Lori', 'Sam', 'Alexandra']
data2 = [1, 1, 1, 1,
         0, 0,
         1, 1, 1, 1,
         1, 1, 1, 1]
data3 = [1.0, 1.0, 1.0, 1.0,
         None, np.NaN,
         1.0, 1.0, 1.0, 1.0,
         1.0, 1.0, 1.0, 1.0]
data4 = ['Same', 'Same', 'Same', 'Same',
         None, np.NaN,
         'Same', 'Same', 'Same', 'Same',
         'Same', 'Same', 'Same', 'Same']
data5 = ['Diff1', 'Diff2', 'Diff3', 'Diff4',
         None, np.NaN,
         'Diff5', 'Diff6', 'Diff7', 'Diff8',
         'Diff1', 'Diff2', 'Diff3', 'Diff4']
df = pandas_impl_module.DataFrame(data={
    'col1': data1,
    'col2': data2,
    'col3': data3,
    'col4': data4,
    'col5': data5
})
#print("col2 input data type = {}".format(type(data2[0])))
df.info()
# Convert col1 to a category dtype, then inspect the dtypes again.
df.astype({'col1': 'category'}, copy=False)
df.info()
grpby = df.groupby(by=['col1'], as_index=True, sort=False, observed=False)
# Accessing either attribute raises the AttributeError shown in the logs.
grpby.groups
grpby.indices
Describe the problem
Category dtypes in the dataframe appear to break groupby. Steps to reproduce:
Group by a category column.
Access the groups or indices attribute of the resulting groupby object.
An AttributeError is raised (see the traceback in the logs).
Converting the column back to strings and redoing the groupby appears to work around the problem.
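The workaround mentioned above can be sketched roughly as follows. This is a minimal illustration on plain pandas, assuming the same calls behave equivalently under `modin.pandas`; `groups_via_object_cast` is a hypothetical helper name, and casting to `object` is one way of getting the original string values back.

```python
import pandas as pd


def groups_via_object_cast(frame, column):
    """Group by `column` after casting it from category back to object dtype."""
    plain = frame.copy()
    # Casting to object restores the original string values.
    plain[column] = plain[column].astype(object)
    grouped = plain.groupby(by=column, sort=False)
    # Return plain lists of row labels per group key.
    return {key: list(labels) for key, labels in grouped.groups.items()}


df = pd.DataFrame({"col1": ["Marc", "Lori", "Marc"], "col2": [1, 1, 1]})
df["col1"] = df["col1"].astype("category")
print(groups_via_object_cast(df, "col1"))
```

The helper works on a copy, so the caller's frame keeps its category dtype.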
Source code / logs
AttributeError Traceback (most recent call last)
in
----> 1 grpby.groups
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
99 if key in self._columns:
100 return self._default_to_pandas(lambda df: df.__getitem__(key))
--> 101 raise e
102
103 _index_grouped_cache = None
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
95 """
96 try:
---> 97 return object.__getattribute__(self, key)
98 except AttributeError as e:
99 if key in self._columns:
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in groups(self)
202 @property
203 def groups(self):
--> 204 return self._index_grouped
205
206 def min(self, **kwargs):
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
99 if key in self._columns:
100 return self._default_to_pandas(lambda df: df.__getitem__(key))
--> 101 raise e
102
103 _index_grouped_cache = None
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in __getattr__(self, key)
95 """
96 try:
---> 97 return object.__getattribute__(self, key)
98 except AttributeError as e:
99 if key in self._columns:
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/modin/pandas/groupby.py in _index_grouped(self)
130 by = self._by
131 if self._axis == 0:
--> 132 self._index_grouped_cache = self._index.groupby(by)
133 else:
134 self._index_grouped_cache = self._columns.groupby(by)
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/pandas/core/indexes/base.py in groupby(self, values)
4533 values = values.values
4534 values = ensure_categorical(values)
-> 4535 result = values._reverse_indexer()
4536
4537 # map to the label
~/anaconda3/envs/run-nsf/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute '_reverse_indexer'
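The last two frames suggest that `Index.groupby` received a pandas `Series` where it expected an object exposing `_reverse_indexer`, which is a private method of `pandas.Categorical`. The sketch below, on plain pandas, hands `Index.groupby` the underlying `Categorical` (obtained through `Series.array`) instead of the `Series` itself. It only illustrates the type mismatch and is not necessarily how Modin resolves the bug; `index_groups` is a hypothetical helper name.

```python
import pandas as pd


def index_groups(index, by):
    """Map each category in `by` to the index labels that carry it."""
    # For a category-dtype Series, .array is the underlying Categorical.
    values = by.array if isinstance(by, pd.Series) else by
    return {key: list(labels) for key, labels in index.groupby(values).items()}


col1 = pd.Series(["Marc", "Lori", "Marc"], dtype="category")
print(index_groups(col1.index, col1))
```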