Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant .fillna(0) at the end of groupby.size()/.count() #6126

Closed
dchigarev opened this issue May 15, 2023 · 0 comments · Fixed by #6127
Closed

Remove redundant .fillna(0) at the end of groupby.size()/.count() #6126

dchigarev opened this issue May 15, 2023 · 0 comments · Fixed by #6127
Assignees
Labels
P1 Important tasks that we should complete soon Performance 🚀 Performance related issues and pull requests.

Comments

@dchigarev
Copy link
Collaborator

Currently, each groupby.size() call is finished with .fillna(0):

return result.fillna(0)

The behavior was originally introduced in #1802 in order to match pandas in the sense of not having NaN values in the result of .size()/.count(). The NaNs appeared at the reduction phase of groupby due to some specific pandas behavior:

>>> import pandas as pd # pandas 1.0.5
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [4, 3]}).astype({"a": "category"})
>>> df.groupby(["a", "b"]).sum()
       c
a b
1 3  4.0
  4  NaN
2 3  NaN
  4  3.0

Running the same code on the latest pandas gives the expected behavior (not producing NaNs), meaning that we now can remove the finalizing .fillna(0) in our modin's code

>>> import pandas as pd # pandas 1.5.3
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [4, 3]}).astype({"a": "category"})
>>> df.groupby(["a", "b"]).sum()
     c
a b
1 3  4
  4  0
2 3  0
  4  3
@dchigarev dchigarev added Performance 🚀 Performance related issues and pull requests. P1 Important tasks that we should complete soon labels May 15, 2023
@dchigarev dchigarev self-assigned this May 15, 2023
dchigarev added a commit to dchigarev/modin that referenced this issue May 15, 2023
…'.size()' and '.count()'

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
anmyachev pushed a commit that referenced this issue May 30, 2023
… '.count()' (#6127)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 Important tasks that we should complete soon Performance 🚀 Performance related issues and pull requests.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant