Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple groupby causes TypeError #1770

Closed
ienkovich opened this issue Jul 22, 2020 · 6 comments · Fixed by #1842
Closed

Simple groupby causes TypeError #1770

ienkovich opened this issue Jul 22, 2020 · 6 comments · Fixed by #1842
Labels
bug 🦗 Something isn't working pandas.dataframe Related to pandas.dataframe module
Milestone

Comments

@ienkovich
Copy link
Collaborator

Here is a failing test:

import pandas as pd
import modin.pandas as mpd

data = {
    "a": [1, 1, 2],
    "b": [11, 11, 22],
    "c": [111, 111, 222]
}

df = pd.DataFrame(data)
df = pd.concat([df])
ref = df.groupby(["a", "b", df["c"]]).size()
print(ref)

df = mpd.DataFrame(data)
df = mpd.concat([df])
exp = df.groupby(["a", "b", df["c"]]).size()
print(exp)

The reason of fail is in DataFrame.groupby code. When we see that 'by' is a list, but it doesn't hold column labels only, we compare its length to index length. If lengths match, we assume, that 'by' holds key data, not a list of key columns/Series. In the test above this assumption is wrong.

Also, this length check triggers execution for lazy frame, which we want to avoid. So, we need to first check if 'by' is a list of key columns/Series and check index length only when we have to assume, that 'by' holds key data.

@devin-petersohn
Copy link
Collaborator

Is this an issue for OmniSci only? Please choose an appropriate label so we can classify this issue. Thanks!

@ienkovich
Copy link
Collaborator Author

This is a back-end agnostic issue. The problem is in front-end.

@ienkovich ienkovich added bug 🦗 Something isn't working pandas.dataframe Related to pandas.dataframe module labels Jul 27, 2020
@devin-petersohn devin-petersohn added this to the 0.8.1 milestone Jul 27, 2020
@devin-petersohn
Copy link
Collaborator

I see, thanks for clarifying. I have added to next release since there is not time to fix for this release.

@ienkovich
Copy link
Collaborator Author

I want to mention in advance, that while #1842 fixes error for this particular test case, it doesn't fix the original problem in the code. We can slightly change groupby statement to reproduce the original problem after #1842 is applied:

df.groupby(["a", "b", df["c"] + 1]).size()

@devin-petersohn
Copy link
Collaborator

@ienkovich Good catch, that example is significantly more complicated because the grouping values come from a transformation on the data rather than the data itself. We should track that particular case separately because it will require special care.

@itamarst
Copy link
Contributor

OK, I will open a separate issue for the additional case.

itamarst added a commit to itamarst/modin that referenced this issue Jul 29, 2020
… by list.

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
devin-petersohn pushed a commit that referenced this issue Jul 29, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020
… by list.

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working pandas.dataframe Related to pandas.dataframe module
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants