API: Change the sum of all-NA / all-Empty sum / prod
TomAugspurger committed Dec 23, 2017
1 parent 0ed1a13 commit c213aeb
Showing 19 changed files with 382 additions and 94 deletions.
175 changes: 171 additions & 4 deletions doc/source/whatsnew/v0.22.0.txt
@@ -3,12 +3,179 @@
v0.22.0
-------

This is a major release from 0.21.1 and includes a number of API changes,
deprecations, new features, enhancements, and performance improvements along
with a large number of bug fixes. We recommend that all users upgrade to this
version.
This is a major release from 0.21.1 and includes a single, API breaking change.
We recommend that all users upgrade to this version after carefully reading the
release note (singular!).

.. _whatsnew_0220.api_breaking:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The
summary is that

* The sum of an all-NA or empty series is now 0
* The product of an all-NA or empty series is now 1
* We've added a ``min_count`` parameter to ``.sum`` and ``.prod`` to control
the minimum number of valid values for the result to be valid. If fewer than
``min_count`` valid values are present, the result is NA. The default is
``0``. To restore the 0.21 behavior, use ``min_count=1``.

Some background: In pandas 0.21.1, we fixed a long-standing inconsistency
in the return value of all-NA series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
time, we changed the sum and prod of an empty Series to also be ``NaN``.

Based on feedback, we've partially reverted those changes. The default sum
for all-NA and empty series is now 0 (1 for ``prod``).

*pandas 0.21*

.. code-block:: ipython

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: pd.Series([]).sum()
Out[3]: nan

In [4]: pd.Series([np.nan]).sum()
Out[4]: nan

*pandas 0.22.0*

.. ipython:: python

pd.Series([]).sum()
pd.Series([np.nan]).sum()

To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` of an all-NA
series is conceptually the same as the ``.sum`` of an empty one. The
``min_count`` parameter refers to the minimum number of *valid* values
required for a non-NA sum or product.

.. ipython:: python

pd.Series([]).sum(min_count=1)
pd.Series([np.nan]).sum(min_count=1)
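
``prod`` follows the same rules. A short sketch of the new defaults (expected
results shown as comments, mirroring the summary above):

.. code-block:: python

   pd.Series([]).prod()                    # 1.0 -- the empty product is 1
   pd.Series([np.nan]).prod()              # 1.0 -- all-NA behaves like empty
   pd.Series([np.nan]).prod(min_count=1)   # nan -- restores the 0.21 behavior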

Note that this affects some other places in the library:

1. Grouping by a Categorical with some unobserved categories

*pandas 0.21*

.. code-block:: ipython

In [3]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [4]: pd.Series([1, 2]).groupby(grouper).sum()
Out[4]:
a 3.0
b NaN
dtype: float64

*pandas 0.22*

.. ipython:: python

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)

2. Resampling

The output for an all-NaN bin will change:

*pandas 0.21.0*

.. code-block:: ipython

In [1]: import pandas as pd; import numpy as np;

In [2]: s = pd.Series([1, 1, np.nan, np.nan],
...: index=pd.date_range('2017', periods=4))
...:

In [3]: s
Out[3]:
2017-01-01 1.0
2017-01-02 1.0
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64

In [4]: s.resample('2d').sum()
Out[4]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([1, 1, np.nan, np.nan],
index=pd.date_range('2017', periods=4))
s.resample('2d').sum()

To restore the 0.21 behavior, use ``min_count>=1``.

.. ipython:: python

s.resample('2d').sum(min_count=1)

Upsampling in particular is affected, as it introduces all-NaN bins even if
your original series was entirely valid.

*pandas 0.21.0*

.. code-block:: ipython

In [5]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [6]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[6]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64

*pandas 0.22.0*

.. ipython:: python

idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)

3. Rolling / Expanding

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is when doing a rolling
or expanding sum on an all-NaN series with ``min_periods=0``.

*pandas 0.21.1*

.. code-block:: ipython

In [7]: s = pd.Series([np.nan, np.nan])

In [8]: s.rolling(2, min_periods=0).sum()
Out[8]:
0 NaN
1 NaN
dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([np.nan, np.nan])
s.rolling(2, min_periods=0).sum()
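
The ``expanding`` analogue should change in the same way; a minimal sketch,
assuming ``min_periods=0`` is accepted by ``.expanding`` just as it is by
``.rolling``:

.. code-block:: python

   s = pd.Series([np.nan, np.nan])
   s.expanding(min_periods=0).sum()   # 0.0 for each entry (was NaN in 0.21)
   s.expanding(min_periods=1).sum()   # still NaN: one valid value is required
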
4 changes: 2 additions & 2 deletions pandas/_libs/groupby_helper.pxi.in
@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
Py_ssize_t min_count=1):
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
Py_ssize_t min_count=1):
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
11 changes: 11 additions & 0 deletions pandas/_libs/window.pyx
@@ -443,10 +443,17 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
double val, prev_x, sum_x = 0
int64_t s, e
int64_t nobs = 0, i, j, N
int64_t minp2 = -1
bint is_variable
ndarray[int64_t] start, end
ndarray[double_t] output

if minp == 0:
    # in get_window_indexer, we ensure that minp >= 1. That's fine for
    # all cases except nobs = 0 (all missing values) and minp=0. For
    # any other minp, the sum will be NA. For minp=0, the sum will be 0.
    # So we track that here and pass it later if needed.
    minp2 = 0
start, end, N, win, minp, is_variable = get_window_indexer(input, win,
minp, index,
closed)
@@ -483,6 +490,8 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
for j in range(end[i - 1], e):
add_sum(input[j], &nobs, &sum_x)

if minp2 == 0:
    minp = 0
output[i] = calc_sum(minp, nobs, sum_x)

else:
@@ -503,6 +512,8 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
prev_x = input[i - win]
remove_sum(prev_x, &nobs, &sum_x)

if minp2 == 0:
    minp = 0
output[i] = calc_sum(minp, nobs, sum_x)

return output
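
The user-visible effect of the ``minp2`` bookkeeping above is the rolling case
described in the release note; a rough sketch of the expected 0.22.0 behaviour:

    import numpy as np
    import pandas as pd

    s = pd.Series([np.nan, np.nan])
    s.rolling(2, min_periods=0).sum()  # 0.0 per window: zero valid values are now allowed
    s.rolling(2, min_periods=1).sum()  # NaN: at least one valid value is still required
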
34 changes: 17 additions & 17 deletions pandas/core/generic.py
@@ -7619,48 +7619,48 @@ def _doc_parms(cls):
_sum_examples = """\
Examples
--------
By default, the sum of an empty series is ``NaN``.
By default, the sum of an empty series is ``0``.
>>> pd.Series([]).sum() # min_count=1 is the default
nan
>>> pd.Series([]).sum() # min_count=0 is the default
0.0
This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be 0, pass ``min_count=0``.
you'd like the sum of an empty series to be NaN, pass ``min_count=1``.
>>> pd.Series([]).sum(min_count=0)
0.0
>>> pd.Series([]).sum(min_count=1)
nan
Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.
>>> pd.Series([np.nan]).sum()
nan
>>> pd.Series([np.nan]).sum(min_count=0)
0.0
>>> pd.Series([np.nan]).sum(min_count=1)
nan
"""

_prod_examples = """\
Examples
--------
By default, the product of an empty series is ``NaN``
By default, the product of an empty series is ``1``
>>> pd.Series([]).prod()
nan
1.0
This can be controlled with the ``min_count`` parameter
>>> pd.Series([]).prod(min_count=0)
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.
>>> pd.Series([np.nan]).prod()
nan
>>> pd.Series([np.nan]).prod(min_count=0)
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
"""


@@ -7683,7 +7683,7 @@ def _make_min_count_stat_function(cls, name, name1, name2, axis_descr, desc,
examples=examples)
@Appender(_num_doc)
def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
min_count=1,
min_count=0,
**kwargs):
nv.validate_stat_func(tuple(), kwargs, fname=name)
if skipna is None:
4 changes: 2 additions & 2 deletions pandas/core/groupby.py
@@ -1286,8 +1286,8 @@ def last(x):
else:
    return last(x)

cls.sum = groupby_function('sum', 'add', np.sum, min_count=1)
cls.prod = groupby_function('prod', 'prod', np.prod, min_count=1)
cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)
cls.prod = groupby_function('prod', 'prod', np.prod, min_count=0)
cls.min = groupby_function('min', 'min', np.min, numeric_only=False)
cls.max = groupby_function('max', 'max', np.max, numeric_only=False)
cls.first = groupby_function('first', 'first', first_compat,
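
With the default flipped to ``min_count=0``, an all-NA group now sums to 0
rather than NaN. A sketch of the expected behaviour (``min_count=1`` restores
the old result):

    import numpy as np
    import pandas as pd

    s = pd.Series([np.nan, np.nan, 1.0], index=['a', 'a', 'b'])
    s.groupby(level=0).sum()             # group 'a' (all-NA) -> 0.0, group 'b' -> 1.0
    s.groupby(level=0).sum(min_count=1)  # group 'a' -> NaN, group 'b' -> 1.0
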
4 changes: 2 additions & 2 deletions pandas/core/nanops.py
@@ -308,7 +308,7 @@ def nanall(values, axis=None, skipna=True):

@disallow('M8')
@bottleneck_switch()
def nansum(values, axis=None, skipna=True, min_count=1):
def nansum(values, axis=None, skipna=True, min_count=0):
values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
dtype_sum = dtype_max
if is_float_dtype(dtype):
@@ -645,7 +645,7 @@ def nankurt(values, axis=None, skipna=True):


@disallow('M8', 'm8')
def nanprod(values, axis=None, skipna=True, min_count=1):
def nanprod(values, axis=None, skipna=True, min_count=0):
mask = isna(values)
if skipna and not is_any_int_dtype(values):
    values = values.copy()
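
``nansum`` and ``nanprod`` are the low-level reducers behind ``Series.sum`` and
``Series.prod``. A sketch of the new defaults, calling the internal helpers
directly (internal API, so subject to change):

    import numpy as np
    from pandas.core import nanops

    values = np.array([np.nan, np.nan])
    nanops.nansum(values)               # 0.0 under the new min_count=0 default
    nanops.nansum(values, min_count=1)  # nan, matching the 0.21 behavior
    nanops.nanprod(values)              # 1.0
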
2 changes: 1 addition & 1 deletion pandas/core/resample.py
@@ -605,7 +605,7 @@ def size(self):
# downsample methods
for method in ['sum', 'prod']:

def f(self, _method=method, min_count=1, *args, **kwargs):
def f(self, _method=method, min_count=0, *args, **kwargs):
nv.validate_resampler_func(_method, args, kwargs)
return self._downsample(_method, min_count=min_count)
f.__doc__ = getattr(GroupBy, method).__doc__