Groupby reduction operations insert 'by' columns in all column partitions when `as_index=False` #2512

dchigarev · 2020-12-04T18:20:45Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
Modin version (modin.__version__): c2e7f9e
Python version: 3.7.5
Code we can use to reproduce:

import modin.pandas as pd
import numpy as np
import pandas

# Set the default number of partitions to 2
pd.DEFAULT_NPARTITIONS = 2

from modin.pandas.test.utils import test_data_values, create_test_dfs, df_equals

# Build matching Modin and pandas frames from Modin's test data
data = test_data_values[0]
md_df, pd_df = create_test_dfs(data)

# Group by the first two columns
by = [pd_df.columns[0], pd_df.columns[1]]

md_res = md_df.groupby(by, as_index=False).sum()
pd_res = pd_df.groupby(by, as_index=False).sum()

# Compare only the 'by' columns of both results
print(f"by in modin result:\n{md_res[by]}")
print(f"by in pandas result:\n{pd_res[by]}")

df_equals(md_res, pd_res)  # AssertionError: DataFrame shape mismatch

Output

by in modin result:
     col33  col33  col34  col34
0        0      0      0      0
1        0      0      1      1
2        0      0     29     29
3        0      0     51     51
4        1      1     18     18
..     ...    ...    ...    ...
246     97     97     50     50
247     98     98     57     57
248     98     98     85     85
249     99     99     85     85
250     99     99     95     95

[251 rows x 4 columns]
by in pandas result:
     col33  col34
0        0      0
1        0      1
2        0     29
3        0     51
4        1     18
..     ...    ...
246     97     50
247     98     57
248     98     85
249     99     85
250     99     95

[251 rows x 2 columns]
Traceback (most recent call last):
  File "../rofl.py", line 22, in <module>
    df_equals(md_res, pd_res) # AssertionError: DataFrame shape mismatch
  File "/localdisk/dchigare/repos/modin_bp/modin/pandas/test/utils.py", line 525, in df_equals
    check_categorical=False,
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/_testing.py", line 1562, in assert_frame_equal
    obj, f"{obj} shape mismatch", f"{repr(left.shape)}", f"{repr(right.shape)}",
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/_testing.py", line 1036, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame are different

DataFrame shape mismatch
[left]:  (251, 66)
[right]: (251, 64)

Describe the problem

During the reduce phase of a groupby reduction, Modin resets the index of the frame, which inserts the index levels at the beginning of the frame (this matches pandas behavior). However, when the frame does not fit into a single column partition, the index levels are inserted at the beginning of every column partition, so the 'by' columns are duplicated in the result.
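
To illustrate the mechanism without Modin, here is a minimal pandas-only sketch. It assumes a toy frame whose reduced result is split by hand into two column partitions; the column names and the manual split are made up for illustration and are not Modin's actual partitioning code.

```python
import pandas as pd

# Toy frame: two 'by' columns and four value columns (hypothetical data).
df = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": [0, 1, 0, 0],
    "v1": [1, 2, 3, 4],
    "v2": [5, 6, 7, 8],
    "v3": [9, 10, 11, 12],
    "v4": [13, 14, 15, 16],
})
by = ["a", "b"]

# Reduction with the 'by' values kept in the index.
reduced = df.groupby(by).sum()

# Hand-made split into two column partitions; both share the same index.
partitions = [reduced.iloc[:, :2], reduced.iloc[:, 2:]]

# Resetting the index in every partition re-inserts 'a' and 'b' each time.
naive = pd.concat([part.reset_index() for part in partitions], axis=1)

# What pandas itself returns for as_index=False.
expected = df.groupby(by, as_index=False).sum()

print(list(naive.columns))
print(list(expected.columns))
```

The naive result carries the 'by' columns once per partition, which is the same kind of shape mismatch reported above.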

P.S. groupby_agg from #2461 doesn't have this issue because it checks whether partition_idx == 0 before resetting the index. Maybe we should apply the same approach to reduction operations as well.
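
A rough sketch of that guard, again in plain pandas. The helper `finalize_partition`, the toy data, and the manual column split are hypothetical and only mimic the idea; this is not the code from #2461.

```python
import pandas as pd


def finalize_partition(part, partition_idx):
    """Hypothetical per-partition step applied after the reduction."""
    if partition_idx == 0:
        # Only the first column partition turns the index levels into columns.
        return part.reset_index()
    # Every other partition drops the index, so 'by' columns appear only once.
    return part.reset_index(drop=True)


# Toy frame: two 'by' columns and four value columns (hypothetical data).
df = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": [0, 1, 0, 0],
    "v1": [1, 2, 3, 4],
    "v2": [5, 6, 7, 8],
    "v3": [9, 10, 11, 12],
    "v4": [13, 14, 15, 16],
})
by = ["a", "b"]

reduced = df.groupby(by).sum()

# Hand-made split into two column partitions.
partitions = [reduced.iloc[:, :2], reduced.iloc[:, 2:]]

fixed = pd.concat(
    [finalize_partition(part, idx) for idx, part in enumerate(partitions)],
    axis=1,
)
expected = df.groupby(by, as_index=False).sum()

# Raises AssertionError if the guarded result differs from pandas.
pd.testing.assert_frame_equal(fixed, expected)
```

With the guard in place, the concatenated partitions match the pandas `as_index=False` result for this toy input.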


dchigarev added the bug 🦗 Something isn't working label Dec 4, 2020

dchigarev mentioned this issue Dec 7, 2020

FEAT-#2375: implementation of multi-column groupby aggregation #2461

Merged

6 tasks

dchigarev mentioned this issue Dec 14, 2020

FEAT-#2491: optimized groupby dictionary aggregation #2534

Merged

6 tasks

dchigarev linked a pull request Dec 14, 2020 that will close this issue

FEAT-#2491: optimized groupby dictionary aggregation #2534

Merged

6 tasks

anmyachev closed this as completed in #2534 Dec 17, 2020
