[FEA] Use _get_numeric_types in numeric reductions #2067

Closed
mrocklin opened this issue Jun 21, 2019 · 3 comments · Fixed by #10629
Labels
feature request New feature or request good first issue Good for newcomers Python Affects Python cuDF API.

Comments

@mrocklin
Collaborator

It would be great if the following worked:

import cudf
df = cudf.datasets.timeseries()
df.mean()

Currently we get this

AttributeError: 'CategoricalColumn' object has no attribute 'mean'

In pandas, the typical solution is to strip out the non-numeric columns first:

df._get_numeric_data().mean()

We should consider adding this method call to all DataFrame reductions that expect numeric data, such as sum, mean, prod, std, var, and the cum* methods.
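The "strip out the non-numeric columns first" step can be sketched in plain pandas. This is only an illustration: it uses the public `select_dtypes(include="number")` in place of the private `_get_numeric_data()` (the two may differ for some dtypes, such as booleans), and the small frame and its column names are made up for the example.

```python
import pandas as pd

# Small frame mixing numeric columns with a categorical one
# (illustrative data, not the output of cudf.datasets.timeseries()).
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": pd.Categorical(["a", "b", "a"]),
        "x": [0.5, 1.5, 2.5],
    }
)

# Keep only the numeric columns, then reduce over them.
numeric = df.select_dtypes(include="number")
means = numeric.mean()
print(means)
```

The categorical `name` column is dropped before the reduction, so `mean` only ever sees `id` and `x`.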

@mrocklin mrocklin added feature request New feature or request Needs Triage Need team to review and classify good first issue Good for newcomers Python Affects Python cuDF API. labels Jun 21, 2019
@mrocklin
Collaborator Author

mrocklin commented Jun 21, 2019

So, to be fully explicit, we would want changes like the following:

def mean(self, **kwargs):
-    return self._apply_support_method('mean', **kwargs)
+    return self._get_numeric_data()._apply_support_method('mean', **kwargs)

We would also want tests confirming that this works on a small dataframe with both numeric and text dtypes.

@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Jun 27, 2019
@beckernick
Member

beckernick commented Jan 6, 2022

I believe the behavior requested in this issue is now considered deprecated by pandas based on the following example:

import cudf
df = cudf.datasets.timeseries()
pdf = df.to_pandas()
pdf.mean()
/tmp/ipykernel_47228/2065928329.py:4: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  pdf.mean()
id    999.979244
x       0.000081
y       0.000099
dtype: float64

df.mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_47228/3698961737.py in <module>
----> 1 df.mean()

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/frame.py in mean(self, axis, skipna, level, numeric_only, **kwargs)
   4304         dtype: float64
   4305         """
-> 4306         return self._reduce(
   4307             "mean",
   4308             axis=axis,

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/dataframe.py in _reduce(self, op, axis, level, numeric_only, **kwargs)
   5326 
   5327         if axis == 0:
-> 5328             result = [
   5329                 getattr(self._data[col], op)(**kwargs)
   5330                 for col in self._data.names

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/dataframe.py in <listcomp>(.0)
   5327         if axis == 0:
   5328             result = [
-> 5329                 getattr(self._data[col], op)(**kwargs)
   5330                 for col in self._data.names
   5331             ]

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/column/column.py in mean(self, skipna, dtype)
   1178 
   1179     def mean(self, skipna: bool = None, dtype: Dtype = None):
-> 1180         raise TypeError(f"cannot perform mean with type {self.dtype}")
   1181 
   1182     def std(self, skipna: bool = None, ddof=1, dtype: Dtype = np.float64):

TypeError: cannot perform mean with type category

This is consistent with pandas-dev/pandas#41480. Rather than match deprecated functionality, I think our current behavior of raising a TypeError should be considered correct.

Separately, we have a gap in our numeric_only=True behavior, and I'll file an issue and cross-link.
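For reference, the explicit opt-in on the pandas side can be sketched as follows. This is a minimal example that assumes a reasonably recent pandas; the frame and its column names are invented for illustration.

```python
import pandas as pd

# Illustrative frame with one non-numeric (categorical) column.
pdf = pd.DataFrame(
    {
        "id": [10, 20, 30],
        "name": pd.Categorical(["a", "b", "c"]),
        "y": [1.0, 2.0, 3.0],
    }
)

# Explicitly request a numeric-only reduction instead of relying on
# the deprecated silent dropping of nuisance columns.
result = pdf.mean(numeric_only=True)
print(result)
```

Passing `numeric_only=True` makes the column selection explicit, so the result covers only `id` and `y`.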

@beckernick
Member

Rather than file a new issue, I'll add context here.

In Python, we provide passthrough support for the numeric_only parameter in most dataframe reductions for Dask compatibility. For DataFrame.rank, we implement the numeric_only parameter.

Implementing support for numeric_only in the DataFrame._reduce machinery would likely close this issue.
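To make the idea concrete, here is a toy, pure-Python sketch of how a `_reduce`-style loop could honor `numeric_only`. It is not cuDF's implementation: `ToyColumn`, `ToyFrame`, and the `is_numeric` check are hypothetical stand-ins that loosely mirror the per-column `getattr(column, op)` loop visible in the traceback above.

```python
import numbers
from statistics import fmean


class ToyColumn:
    """Hypothetical stand-in for a column; not a cuDF class."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def is_numeric(self):
        return all(isinstance(v, numbers.Number) for v in self.values)

    def mean(self):
        # Non-numeric columns refuse the reduction, as in the traceback.
        if not self.is_numeric:
            raise TypeError("cannot perform mean on a non-numeric column")
        return fmean(self.values)


class ToyFrame:
    """Hypothetical stand-in for a dataframe; not a cuDF class."""

    def __init__(self, data):
        self._data = {name: ToyColumn(vals) for name, vals in data.items()}

    def _reduce(self, op, numeric_only=False):
        # With numeric_only=True, filter the columns before reducing;
        # otherwise every column is reduced and non-numeric ones raise.
        names = [
            name
            for name, col in self._data.items()
            if col.is_numeric or not numeric_only
        ]
        return {name: getattr(self._data[name], op)() for name in names}

    def mean(self, numeric_only=False):
        return self._reduce("mean", numeric_only=numeric_only)


frame = ToyFrame({"id": [1, 2, 3], "name": ["a", "b", "c"], "x": [0.5, 1.5, 2.5]})
print(frame.mean(numeric_only=True))

try:
    frame.mean()
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Filtering inside the shared `_reduce` helper means every reduction routed through it would pick up `numeric_only` at once, rather than each method needing its own handling.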

rapids-bot bot pushed a commit that referenced this issue Apr 14, 2022
Add support for numeric_only in DataFrame._reduce, so that calls such as df.mean(numeric_only=True) work. Resolves #2067. Also partially addresses #9009.

Authors:
  - https://github.com/martinfalisse

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10629