Groupby reduction operations insert 'by' columns in all column partitions when as_index=False #2512

Closed
dchigarev opened this issue Dec 4, 2020 · 0 comments · Fixed by #2534
Labels
bug 🦗 Something isn't working

Comments

@dchigarev
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__): c2e7f9e
  • Python version: 3.7.5
  • Code we can use to reproduce:
import modin.pandas as pd
import numpy as np
import pandas
pd.DEFAULT_NPARTITIONS = 2

from modin.pandas.test.utils import test_data_values, create_test_dfs, df_equals

data = test_data_values[0]
md_df, pd_df = create_test_dfs(data)

by = [pd_df.columns[0], pd_df.columns[1]]

md_res = md_df.groupby(by, as_index=False).sum()
pd_res = pd_df.groupby(by, as_index=False).sum()

print(f"by in modin result:\n{md_res[by]}")
print(f"by in pandas result:\n{pd_res[by]}")

df_equals(md_res, pd_res) # AssertionError: DataFrame shape mismatch
Output
by in modin result:
     col33  col33  col34  col34
0        0      0      0      0
1        0      0      1      1
2        0      0     29     29
3        0      0     51     51
4        1      1     18     18
..     ...    ...    ...    ...
246     97     97     50     50
247     98     98     57     57
248     98     98     85     85
249     99     99     85     85
250     99     99     95     95

[251 rows x 4 columns]
by in pandas result:
     col33  col34
0        0      0
1        0      1
2        0     29
3        0     51
4        1     18
..     ...    ...
246     97     50
247     98     57
248     98     85
249     99     85
250     99     95

[251 rows x 2 columns]
Traceback (most recent call last):
  File "../rofl.py", line 22, in <module>
    df_equals(md_res, pd_res) # AssertionError: DataFrame shape mismatch
  File "/localdisk/dchigare/repos/modin_bp/modin/pandas/test/utils.py", line 525, in df_equals
    check_categorical=False,
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/_testing.py", line 1562, in assert_frame_equal
    obj, f"{obj} shape mismatch", f"{repr(left.shape)}", f"{repr(right.shape)}",
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/_testing.py", line 1036, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame are different

DataFrame shape mismatch
[left]:  (251, 66)
[right]: (251, 64)

Describe the problem

During the reduce phase of a groupby reduction, Modin resets the index of the frame, which inserts the index levels at the beginning of the frame (pandas behavior). However, when the frame doesn't fit into a single column partition, the reset is applied to each column partition separately, so the index levels are inserted at the beginning of every column partition and end up duplicated in the result.

P.S. groupby_agg from #2461 doesn't have this issue because it checks whether partition_idx == 0 before resetting the index. Maybe we should apply the same approach to reduction operations.
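
To make the duplication easier to see, here is a minimal sketch that imitates the situation with plain pandas only. It does not use Modin's internals: the two "column partitions" are just column slices of the grouped result, and `reset_partition` is a hypothetical helper that applies the partition_idx == 0 check mentioned above. The toy data and column names are assumptions for illustration.

```python
import pandas as pd

# Toy data; column names are illustrative, not taken from the Modin test suite.
df = pd.DataFrame(
    {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [1, 2, 3, 4], "d": [5, 6, 7, 8]}
)
by = ["a", "b"]

# Grouped result with the 'by' columns still stored in the index.
reduced = df.groupby(by).sum()

# Imitate two column partitions by slicing the columns (assumption: a real
# partitioning would split the frame along the column axis in a similar way).
partitions = [reduced[["c"]], reduced[["d"]]]

# Problem: resetting the index in every partition inserts 'a' and 'b'
# at the beginning of each one, so they appear twice after concatenation.
buggy = pd.concat([part.reset_index() for part in partitions], axis=1)


def reset_partition(part, partition_idx):
    """Hypothetical helper: keep index levels only in the first partition."""
    # drop=True discards the index levels instead of inserting them as columns.
    return part.reset_index(drop=(partition_idx != 0))


fixed = pd.concat(
    [reset_partition(part, idx) for idx, part in enumerate(partitions)], axis=1
)

# Reference result computed by pandas directly.
expected = df.groupby(by, as_index=False).sum()

print(list(buggy.columns))
print(list(fixed.columns))
print(fixed.equals(expected))
```

With the guard in place, only the first partition carries the 'by' columns, so the concatenated frame has the same shape as the pandas result for this toy input.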
