New string data type aggregations (min, max, sum) work for DataFrames but not Series #31746

Closed
tdpetrou opened this issue Feb 6, 2020 · 7 comments · Fixed by #45505
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. good first issue Needs Tests Unit test(s) needed to prevent regressions Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data

Comments

@tdpetrou
Contributor

tdpetrou commented Feb 6, 2020

>>> s = pd.Series(['a', 'b'], dtype='string')
>>> s.max()
TypeError: Cannot perform reduction 'max' with string dtype
>>> df = s.to_frame()
>>> df.max()
0    b
dtype: object

Problem description

I assume this isn't supposed to work on DataFrames if it doesn't work on Series. min and sum behave the same way.

Expected Output

Consistent behavior for both DataFrames and Series.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0+untagged.1.gce8af21.dirty
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@jorisvandenbossche
Member

Thanks for the report!

Reductions on DataFrames are, in general, a bit messy and inconsistent with Series right now. A DataFrame first tries the reduction on the 2D values and then column-wise (though still not the same way a Series does it).
A Series, on the other hand, dispatches to the underlying array, and StringArray currently doesn't implement reductions via _reduce.

But I think we can actually add those reductions, so the series case works as well?
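
The difference between the two paths described above can be illustrated with a toy model. This is only a sketch: ToyStringArray, ToySeries, and ToyFrame are hypothetical stand-ins written for this illustration, not pandas classes, and the real dispatch logic in pandas is considerably more involved.

```python
class ToyStringArray:
    """Hypothetical stand-in for a string extension array (not pandas code)."""

    def __init__(self, values):
        self.values = list(values)

    def _reduce(self, name):
        # Models the behaviour described above: every reduction is refused.
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")


class ToySeries:
    """One column: the reduction is dispatched to the backing array."""

    def __init__(self, array):
        self.array = array

    def max(self):
        return self.array._reduce("max")


class ToyFrame:
    """Several columns: the reduction runs on the plain object values."""

    def __init__(self, columns):
        self.columns = columns  # dict mapping column label -> ToyStringArray

    def max(self):
        # The array's _reduce is never consulted on this path.
        return {label: max(arr.values) for label, arr in self.columns.items()}


arr = ToyStringArray(["a", "b"])

try:
    ToySeries(arr).max()
    series_outcome = "ok"
except TypeError as exc:
    series_outcome = str(exc)

frame_outcome = ToyFrame({0: arr}).max()

print(series_outcome)  # Cannot perform reduction 'max' with string dtype
print(frame_outcome)   # {0: 'b'}
```

In this model the same data gives an error through one entry point and a result through the other, which matches the inconsistency in the report.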

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Feb 6, 2020
@dsaxton
Member

dsaxton commented Feb 6, 2020

@jorisvandenbossche It looks like these particular reductions are "sort of" working (they don't seem to be NA-aware at the moment) for StringArray already?

In [2]: arr                                                                                                                                                           
Out[2]: 
<StringArray>
['x', 'y', 'z']
Length: 3, dtype: string

In [3]: arr.min()                                                                                                                                                     
Out[3]: 'x'

In [4]: arr.max()                                                                                                                                                     
Out[4]: 'z'

In [5]: arr.sum()                                                                                                                                                     
Out[5]: 'xyz'

In [6]: arr[0] = pd.NA                                                                                                                                                

In [7]: arr.min()                                                                                                                                                     
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-8f52f7f0ded7> in <module>
----> 1 arr.min()

~/pandas/pandas/core/arrays/numpy_.py in min(self, axis, out, keepdims, skipna)
    356     def min(self, axis=None, out=None, keepdims=False, skipna=True):
    357         nv.validate_min((), dict(out=out, keepdims=keepdims))
--> 358         return nanops.nanmin(self._ndarray, axis=axis, skipna=skipna)
    359 
    360     def max(self, axis=None, out=None, keepdims=False, skipna=True):

~/pandas/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    126                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    127             else:
--> 128                 result = alt(values, axis=axis, skipna=skipna, **kwds)
    129 
    130             return result

~/pandas/pandas/core/nanops.py in reduction(values, axis, skipna, mask)
    869                 result = np.nan
    870         else:
--> 871             result = getattr(values, meth)(axis)
    872 
    873         result = _wrap_results(result, dtype, fill_value)

~/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims, initial, where)
     32 def _amin(a, axis=None, out=None, keepdims=False,
     33           initial=_NoValue, where=True):
---> 34     return umr_minimum(a, axis, None, out, keepdims, initial, where)
     35 
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: '<=' not supported between instances of 'float' and 'str'

@jorisvandenbossche
Member

Yeah, I noticed that as well, but that is the implementation StringArray inherits from the NumPy-backed PandasArray. So it only works "by accident" on the actual array, I think, since _reduce purposefully raises right now. To support it properly, we need to implement that specifically for StringArray.

So I think we certainly would like to get this working, but it will require some custom code (although for now, falling back to a conversion to a NumPy array with None instead of NA and using nanops is probably fine, and not much work).
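
As a rough illustration of what an NA-aware fallback needs to do, here is a minimal sketch in plain Python. It deliberately swaps the NumPy array and nanops for a list and built-in functions, uses None as the missing-value marker, and the helper name reduce_strings is hypothetical. The return values for empty input are a choice made for this sketch, not a statement of what pandas does.

```python
MISSING = None  # assumption: None stands in for pd.NA after conversion


def reduce_strings(values, name, skipna=True):
    """Hypothetical helper: min, max, or sum over strings with missing values."""
    if name not in ("min", "max", "sum"):
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

    present = [v for v in values if v is not MISSING]

    # With skipna=False, a single missing value makes the whole result missing.
    if len(present) < len(values) and not skipna:
        return MISSING

    # Nothing left to reduce: empty string for sum, missing for min and max
    # (a design choice of this sketch).
    if not present:
        return "" if name == "sum" else MISSING

    if name == "sum":
        return "".join(present)
    return min(present) if name == "min" else max(present)


# A naive reduction fails once a float NaN sits among the strings,
# which is the kind of TypeError shown in the traceback above.
try:
    min([float("nan"), "y", "z"])
    naive_failed = False
except TypeError:
    naive_failed = True

values = [MISSING, "y", "z"]
print(naive_failed)                                  # True
print(reduce_strings(values, "min"))                 # y
print(reduce_strings(values, "max"))                 # z
print(reduce_strings(values, "sum"))                 # yz
print(reduce_strings(values, "min", skipna=False))   # None
```

Dropping the missing entries before comparing is what avoids the float-versus-str comparison.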

@TomAugspurger
Contributor

No one is currently working on this. Moving to 1.1.

@TomAugspurger TomAugspurger modified the milestones: 1.0.2, 1.1 Mar 11, 2020
@mroeschke mroeschke added the Bug label May 3, 2020
@jreback jreback modified the milestones: 1.1, Contributions Welcome Jul 10, 2020
@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Sep 23, 2020
@jbrockmendel
Member

Same underlying issue as #36076

@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.2 Nov 11, 2020
@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020
@mroeschke
Member

This looks okay on master now. It could use a test.

In [4]: >>> s = pd.Series(['a', 'b'], dtype='string')
   ...: >>> s.max()
Out[4]: 'b'

In [5]: >>> df = s.to_frame()
   ...: >>> df.max()
Out[5]:
0    b
dtype: object
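
A regression test for the behaviour shown above might look roughly like the following. This is a sketch, not the test that was eventually merged: the test name is hypothetical, it assumes a pandas version in which min and max are supported for the string dtype, and it leaves out sum, whose behaviour may differ between versions.

```python
import pandas as pd


def test_string_dtype_min_max_series_matches_frame():
    # Hypothetical test name; the data mirrors the example from the report.
    ser = pd.Series(["a", "b"], dtype="string")
    df = ser.to_frame()

    for op in ("min", "max"):
        series_result = getattr(ser, op)()
        frame_result = getattr(df, op)()
        # The single-column frame should agree with the Series it came from.
        assert frame_result.iloc[0] == series_result

    assert ser.min() == "a"
    assert ser.max() == "b"
```

Comparing the DataFrame result against the Series result, rather than only against literals, targets the inconsistency this issue is about.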

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Jul 28, 2021
@NumberPiOso
Contributor

take

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 20, 2022
@jreback jreback removed this from the Contributions Welcome milestone Jan 21, 2022
@jreback jreback added this to the 1.5 milestone Jan 21, 2022
NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 21, 2022
NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 22, 2022
9 participants