New string data type aggregations (min, max, sum) work for DataFrames but not Series #31746

tdpetrou · 2020-02-06T13:40:53Z

>>> s = pd.Series(['a', 'b'], dtype='string')
>>> s.max()
TypeError: Cannot perform reduction 'max' with string dtype
>>> df = s.to_frame()
>>> df.max()
0    b
dtype: object

Problem description

I assume this isn't supposed to work on DataFrames if it doesn't work on Series. `min` and `sum` behave the same way.

Expected Output

Consistent behavior for both dataframes and series

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0+untagged.1.gce8af21.dirty
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


jorisvandenbossche · 2020-02-06T16:02:57Z

Thanks for the report!

Reductions on DataFrames are, in general, a bit messy and inconsistent with Series right now. A DataFrame first tries the reduction on the 2D values, and then column-wise (though still not the same way a Series does it).
A Series, by contrast, dispatches to the underlying array, and StringArray currently doesn't implement reductions via _reduce.

But I think we can actually add those reductions, so the series case works as well?
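
To make the two code paths described above concrete, here is a toy model in plain Python. It is not pandas code: `ToyStringArray`, `series_like_reduce`, and `frame_like_reduce` are hypothetical names invented for illustration, and the real dispatch logic is more involved. It only shows how a reduction can succeed when it runs on the raw values and fail when it goes through an array-level `_reduce` hook that raises.

```python
import builtins


class ToyStringArray:
    """Hypothetical stand-in for an extension array; not pandas code."""

    def __init__(self, values):
        self._values = list(values)

    def _reduce(self, name):
        # Mirrors the report: the array-level hook refuses every reduction.
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")


def series_like_reduce(array, name):
    # Series-style path (assumed): hand the reduction to the array's own hook.
    return array._reduce(name)


def frame_like_reduce(array, name):
    # DataFrame-style path (assumed): reduce the raw values, bypassing the hook.
    return getattr(builtins, name)(array._values)


arr = ToyStringArray(["a", "b"])
frame_result = frame_like_reduce(arr, "max")

try:
    series_like_reduce(arr, "max")
    series_error = None
except TypeError as exc:
    series_error = str(exc)

print(frame_result, "|", series_error)
```

In this sketch the inconsistency disappears as soon as `_reduce` is given a real implementation, which is the direction suggested in the comment above.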

dsaxton · 2020-02-06T16:32:23Z

@jorisvandenbossche It looks like these particular reductions are "sort of" working (they don't seem to be NA-aware at the moment) for StringArray already?

In [2]: arr
Out[2]:
<StringArray>
['x', 'y', 'z']
Length: 3, dtype: string

In [3]: arr.min()
Out[3]: 'x'

In [4]: arr.max()
Out[4]: 'z'

In [5]: arr.sum()
Out[5]: 'xyz'

In [6]: arr[0] = pd.NA

In [7]: arr.min()                                                                                                                                                     
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-8f52f7f0ded7> in <module>
----> 1 arr.min()

~/pandas/pandas/core/arrays/numpy_.py in min(self, axis, out, keepdims, skipna)
    356     def min(self, axis=None, out=None, keepdims=False, skipna=True):
    357         nv.validate_min((), dict(out=out, keepdims=keepdims))
--> 358         return nanops.nanmin(self._ndarray, axis=axis, skipna=skipna)
    359 
    360     def max(self, axis=None, out=None, keepdims=False, skipna=True):

~/pandas/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    126                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    127             else:
--> 128                 result = alt(values, axis=axis, skipna=skipna, **kwds)
    129 
    130             return result

~/pandas/pandas/core/nanops.py in reduction(values, axis, skipna, mask)
    869                 result = np.nan
    870         else:
--> 871             result = getattr(values, meth)(axis)
    872 
    873         result = _wrap_results(result, dtype, fill_value)

~/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims, initial, where)
     32 def _amin(a, axis=None, out=None, keepdims=False,
     33           initial=_NoValue, where=True):
---> 34     return umr_minimum(a, axis, None, out, keepdims, initial, where)
     35 
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: '<=' not supported between instances of 'float' and 'str'
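
The final error is an ordinary Python ordering failure between a float and a str. A minimal sketch in plain Python, with no pandas involved, assuming the missing entry is being compared as a float NaN (which the error message suggests but the traceback does not show directly):

```python
# Ordering comparisons between a float (such as NaN) and a str are
# undefined in Python 3, so a naive reduction over mixed values fails.
values = [float("nan"), "y", "z"]

try:
    min(values)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

A reduction that is NA-aware has to drop or mask the missing entries before any comparison takes place.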

jorisvandenbossche · 2020-02-06T16:39:32Z

Yeah, I noticed that as well, but that is the implementation StringArray inherits from the NumPy-backed PandasArray. So it only works "by accident" on the actual array, I think, since _reduce purposefully raises right now. To support it properly, we need to implement that specifically for StringArray.

So I think we certainly would like to get this working, but it will require some custom code (although for now, falling back to a conversion to a NumPy array with None instead of NA and using nanops is probably fine, and not much work).
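
The fallback idea above can be sketched in plain Python. This is only an illustration of the "treat missing as None and skip it" approach, not the pandas nanops implementation: `reduce_skipping_missing` is a hypothetical helper, `None` stands in for `pd.NA`, and the `skipna` behavior shown (propagate the missing value when `skipna=False`) is an assumption.

```python
from typing import Callable, Optional, Sequence


def reduce_skipping_missing(
    values: Sequence[Optional[str]],
    func: Callable[[Sequence[str]], str],
    skipna: bool = True,
) -> Optional[str]:
    """Apply func (e.g. min or max) to the non-missing entries of values."""
    present = [v for v in values if v is not None]
    has_missing = len(present) < len(values)
    if has_missing and not skipna:
        # Assumed semantics: a missing value propagates when not skipped.
        return None
    if not present:
        # Nothing left to reduce, so the result is itself missing.
        return None
    return func(present)


data = [None, "y", "z"]
smallest = reduce_skipping_missing(data, min)
largest = reduce_skipping_missing(data, max)
propagated = reduce_skipping_missing(data, min, skipna=False)
print(smallest, largest, propagated)
```

Filtering before reducing avoids ever comparing a missing value with a string, which is the comparison that fails in the traceback earlier in the thread.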

TomAugspurger · 2020-03-11T14:23:57Z

No one is currently working on this. Moving to 1.1.

jbrockmendel · 2020-09-23T21:06:56Z

Same underlying issue as #36076

mroeschke · 2021-07-28T04:54:52Z

This looks okay on master now. Could use a test.

In [4]: s = pd.Series(['a', 'b'], dtype='string')
   ...: s.max()
Out[4]: 'b'

In [5]: df = s.to_frame()
   ...: df.max()
Out[5]:
0    b
dtype: object

NumberPiOso · 2022-01-18T21:26:40Z

take

jorisvandenbossche added the ExtensionArray and Strings labels Feb 6, 2020

dsaxton mentioned this issue Feb 6, 2020

Implement some reductions for string Series #31757

Closed

5 tasks

jorisvandenbossche added this to the 1.0.2 milestone Feb 6, 2020

TomAugspurger modified the milestones: 1.0.2, 1.1 Mar 11, 2020

dsaxton mentioned this issue Apr 7, 2020

ENH: Implement StringArray.min / max #33351

Merged

4 tasks

mroeschke added the Bug label May 3, 2020

jreback modified the milestones: 1.1, Contributions Welcome Jul 10, 2020

jbrockmendel added the Reduction Operations and Nuisance Columns labels Sep 23, 2020

simonjayhawkins modified the milestones: Contributions Welcome, 1.2 Nov 11, 2020

jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020

mroeschke added the good first issue and Needs Tests labels and removed the Bug label Jul 28, 2021

github-actions bot assigned NumberPiOso Jan 18, 2022

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 20, 2022

TST: Add tests string series min max

70ef8e2

In line with pandas-dev#31746

NumberPiOso mentioned this issue Jan 20, 2022

TST: Add tests string series min max #45505

Merged

3 tasks

jreback removed this from the Contributions Welcome milestone Jan 21, 2022

jreback added this to the 1.5 milestone Jan 21, 2022

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 21, 2022

TST: Parametrize test using functions pandas-dev#31746

dfdddfa

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 22, 2022

TST: Parametrize expected output pandas-dev#31746

4b7be33

jreback closed this as completed in #45505 Jan 23, 2022
