[FEA] Use `_get_numeric_types` in numeric reductions #2067

mrocklin · 2019-06-21T11:07:27Z

It would be great if the following worked:

import cudf
df = cudf.datasets.timeseries()
df.mean()

Currently we get this:

AttributeError: 'CategoricalColumn' object has no attribute 'mean'

Typically the solution in pandas is to strip out the non-numeric columns first:

df._get_numeric_data().mean()

We should consider adding this method call in all dataframe reductions that expect numeric data like sum, mean, prod, std, var, cum* and so on.

mrocklin · 2019-06-21T11:16:14Z

So to be fully explicit we would want changes like the following:

def mean(self, **kwargs):
-    return self._apply_support_method('mean', **kwargs)
+    return self._get_numeric_data()._apply_support_method('mean', **kwargs)

Along with tests that this works when we have a small dataframe with both numeric and text dtypes.

beckernick · 2022-01-06T23:15:27Z

I believe the behavior requested in this issue is now considered deprecated by pandas based on the following example:

import cudf
df = cudf.datasets.timeseries()
pdf = df.to_pandas()
pdf.mean()
/tmp/ipykernel_47228/2065928329.py:4: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  pdf.mean()
id    999.979244
x       0.000081
y       0.000099
dtype: float64

df.mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_47228/3698961737.py in <module>
----> 1 df.mean()

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/frame.py in mean(self, axis, skipna, level, numeric_only, **kwargs)
   4304         dtype: float64
   4305         """
-> 4306         return self._reduce(
   4307             "mean",
   4308             axis=axis,

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/dataframe.py in _reduce(self, op, axis, level, numeric_only, **kwargs)
   5326 
   5327         if axis == 0:
-> 5328             result = [
   5329                 getattr(self._data[col], op)(**kwargs)
   5330                 for col in self._data.names

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/dataframe.py in <listcomp>(.0)
   5327         if axis == 0:
   5328             result = [
-> 5329                 getattr(self._data[col], op)(**kwargs)
   5330                 for col in self._data.names
   5331             ]

~/conda/envs/rapids-22.02/lib/python3.8/site-packages/cudf/core/column/column.py in mean(self, skipna, dtype)
   1178 
   1179     def mean(self, skipna: bool = None, dtype: Dtype = None):
-> 1180         raise TypeError(f"cannot perform mean with type {self.dtype}")
   1181 
   1182     def std(self, skipna: bool = None, ddof=1, dtype: Dtype = np.float64):

TypeError: cannot perform mean with type category

This is consistent with pandas-dev/pandas#41480. Rather than match deprecated functionality, I think our current behavior of raising a TypeError should be considered correct.

Separately, we have a ~~bug~~ gap with our numeric_only=True behavior, and I'll file an issue and cross-link.
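For reference, the pandas warning quoted above asks callers to select valid columns before reducing. A minimal sketch of that, using plain pandas and a small made-up frame (the column names and values here are illustrative assumptions, not the `timeseries()` dataset):

```python
import pandas as pd

# Small frame with numeric and non-numeric columns (made-up data).
pdf = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": pd.Series(["Alice", "Bob", "Carol"], dtype="category"),
        "x": [0.5, 1.5, 2.5],
    }
)

# Select the numeric columns explicitly, then reduce.
numeric_means = pdf.select_dtypes(include="number").mean()

# Alternatively, ask the reduction itself to skip non-numeric columns.
same_means = pdf.mean(numeric_only=True)
```

Both forms return a Series indexed by `id` and `x` only. cuDF also exposes `select_dtypes`, so the first form should carry over, though that has not been verified here.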

beckernick · 2022-01-06T23:22:10Z

Rather than file a new issue, I'll add context here.

In Python, we provide passthrough support for the numeric_only parameter in most dataframe reductions for Dask compatibility. For DataFrame.rank, we implement the numeric_only parameter.

Implementing support for numeric_only in the DataFrame._reduce machinery would likely close this issue.
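As a rough illustration of what that could look like, here is a toy sketch in plain Python. `ToyFrame` and its helpers are hypothetical stand-ins written for this example; this is not cuDF's actual `_reduce` code, only the general shape of filtering columns when `numeric_only=True` and raising otherwise.

```python
from statistics import mean


class ToyFrame:
    """Hypothetical stand-in for a dataframe: column name -> list of values."""

    def __init__(self, data):
        self._data = dict(data)

    def _is_numeric(self, name):
        # Treat a column as numeric if every value is an int or float.
        return all(isinstance(v, (int, float)) for v in self._data[name])

    def _reduce(self, op, numeric_only=False):
        names = list(self._data)
        if numeric_only:
            # Drop non-numeric columns up front instead of failing on them.
            names = [n for n in names if self._is_numeric(n)]
        result = {}
        for name in names:
            if not self._is_numeric(name):
                raise TypeError(
                    f"cannot perform {op.__name__} on column {name!r}"
                )
            result[name] = op(self._data[name])
        return result

    def mean(self, numeric_only=False):
        return self._reduce(mean, numeric_only=numeric_only)


frame = ToyFrame(
    {"id": [1, 2, 3], "name": ["a", "b", "c"], "x": [0.5, 1.5, 2.5]}
)
means = frame.mean(numeric_only=True)  # reduces only "id" and "x"
```

Calling `frame.mean()` without `numeric_only=True` raises a TypeError on the `name` column, which mirrors the default behavior described above.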

Add support for numeric_only in DataFrame._reduce, so that df.mean(numeric_only=True), etc. can be used. Resolves #2067. Also partially addresses #9009.

Authors:
- https://github.com/martinfalisse

Approvers:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10629

mrocklin added the feature request, Needs Triage, good first issue, and Python labels Jun 21, 2019

kkraus14 removed the Needs Triage label Jun 27, 2019

brandon-b-miller mentioned this issue Jan 13, 2020

[FEA] Reductions over string columns #3771

Closed

martinfalisse mentioned this issue Apr 10, 2022

Add support for numeric_only in DataFrame._reduce #10629

Merged

rapids-bot bot closed this as completed in #10629 Apr 14, 2022
