-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged groupby_agg and groupby_dict_agg to implement dictionary functions aggregations #2317
Merged
Merged
Changes from 9 commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
97e6fd0
FIX-#2254: Added dictionary functions to groupby aggregate tests
gshimansky 67e95fd
FIX-#2254: Initial implementation of dictionary functions aggregation
gshimansky 90efde0
FIX-#2254: Remove lambda wrapper to allow dictionary to go to backend
gshimansky f0c6300
FIX-#2254: Fixed AttributeError not being thrown from getattr
gshimansky 12debef
FIX-#2254: Lint fixes
gshimansky 18249a7
FEAT-#2363: fix index name setter in OmniSci backend
ienkovich c1dc213
FIX-#2254: Removed obsolete groupby_dict_agg API function
gshimansky 9eea77c
FIX-#2254: Fixed dict aggregate for base backend
gshimansky a99ca6c
FIX-#2254: Address reformatting comments
gshimansky 412c1c9
FIX-#2254: Remove whitespace
gshimansky 6917382
FIX-#2254: Removed redundant argument conversion
gshimansky File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -357,6 +357,8 @@ def aggregate(self, func=None, *args, **kwargs): | |
# This is not implemented in pandas, | ||
# so we throw a different message | ||
raise NotImplementedError("axis other than 0 is not supported") | ||
|
||
relabeling_required = False | ||
if isinstance(func, dict) or func is None: | ||
|
||
def _reconstruct_func(func, **kwargs): | ||
|
@@ -380,50 +382,33 @@ def _reconstruct_func(func, **kwargs): | |
from pandas.core.base import SpecificationError | ||
|
||
raise SpecificationError("nested renamer is not supported") | ||
if isinstance(self._by, type(self._query_compiler)): | ||
by = list(self._by.columns) | ||
else: | ||
by = self._by | ||
|
||
subset_cols = list(func_dict.keys()) + ( | ||
list(self._by.columns) | ||
if isinstance(self._by, type(self._query_compiler)) | ||
and all(c in self._df.columns for c in self._by.columns) | ||
else [] | ||
) | ||
result = type(self._df)( | ||
query_compiler=self._df[subset_cols]._query_compiler.groupby_dict_agg( | ||
by=by, | ||
func_dict=func_dict, | ||
groupby_args=self._kwargs, | ||
agg_args=kwargs, | ||
drop=self._drop, | ||
) | ||
) | ||
|
||
if relabeling_required: | ||
result = result.iloc[:, order] | ||
result.columns = new_columns | ||
|
||
return result | ||
|
||
if is_list_like(func): | ||
func = func_dict | ||
elif is_list_like(func): | ||
return self._default_to_pandas( | ||
lambda df, *args, **kwargs: df.aggregate(func, *args, **kwargs), | ||
*args, | ||
**kwargs, | ||
) | ||
if isinstance(func, str): | ||
agg_func = getattr(self, func, None) | ||
elif isinstance(func, str): | ||
# Using "getattr" here masks possible AttributeError which we throw | ||
# in __getattr__, so we should call __getattr__ directly instead. | ||
agg_func = self.__getattr__(func) | ||
if callable(agg_func): | ||
return agg_func(*args, **kwargs) | ||
return self._apply_agg_function( | ||
lambda df, *args, **kwargs: df.aggregate(func, *args, **kwargs), | ||
|
||
result = self._apply_agg_function( | ||
func, | ||
drop=self._as_index, | ||
*args, | ||
**kwargs, | ||
) | ||
|
||
if relabeling_required: | ||
result = result.iloc[:, order] | ||
result.columns = new_columns | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: remove vertical whitespace. |
||
return result | ||
|
||
agg = aggregate | ||
|
||
def last(self, **kwargs): | ||
|
@@ -888,7 +873,9 @@ def _apply_agg_function(self, f, drop=True, *args, **kwargs): | |
------- | ||
A new combined DataFrame with the result of all groups. | ||
""" | ||
assert callable(f), "'{0}' object is not callable".format(type(f)) | ||
assert callable(f) or isinstance( | ||
f, dict | ||
), "'{0}' object is not callable and not a dict".format(type(f)) | ||
|
||
# For aggregations, pandas behavior does this for the result. | ||
# For other operations it does not, so we wait until there is an aggregation to | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible case when
_apply_agg_function
returnsSeries
andrelabeling_required
is True? If yes, lines are belowresult = result.iloc[:, order]
andresult.columns = new_columns
will throw an exception.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide an example when
DataFrameGroupBy.agg
returns aSeries
? We could add this case to tests.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of an example right now. However, return value depends on type of member
DataFrameGroupBy._df
, which can beSeries
in some case, but I am not sure regarding this operation. It is not critical for me to merge the PR. We can wait till anyone (users) will be able to face with that case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going by documentation pandas groupby aggregation always returns
DataFrame
, so there is two options:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just found out that we're already converting aggregation result to the DataFrame explicitly, so our current code about reordering will work correct
modin/modin/pandas/groupby.py
Line 913 in 2b0b755
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._df
can be eitherDataFrame
orSeries
and in depend of that different objects can be created aftertype(self._df)(query_compiler=new_manager)
. This is my concern.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, currently I don't see a case when
self._df
isDataFrame
, but it was some time ago. At the same time we can getSeries
from_apply_agg_function
via this line.Since pandas is going to remove.squeeze
parameter in future then we can keep the code without converting toDataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right, I just forgot about that aggregation for Series can also obtain dictionary function...
Then it seems that Modin has a lack of testing Series dictionary aggregation, because yes, it can happen that Series aggregation will return DataFrame:
Then I believe, that because of the line above we will get incorrect result of aggregation (Series instead of DataFrame), however it should be tested.
There is also another problem that we can face, pandas processes dictionary functions differently for
Series
andDataFrame
groupby aggregation, since we use general implementation for DataFrame, this also can cause some problems.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we definitely should add tests for Series groupby dict aggregation, and if them fails and it will be difficult to fix, create an issue and fix that in another PR, since this PR is just about merging dict aggregation with regular one