Breaking changes for sum / prod of empty / all-NA (#18921)
* API: Change the sum of all-NA / all-Empty sum / prod

* Max, not min

* Update whatsnew

* Parametrize test

* Minor cleanups

* Refactor skipna_alternative

* Split test

* Added issue

* More updates

* linting

* linting

* Added skips

* Doc fixup

* DOC: More whatsnew
TomAugspurger authored Dec 29, 2017
1 parent fae7920 commit dedfce9
Showing 20 changed files with 547 additions and 159 deletions.
214 changes: 210 additions & 4 deletions doc/source/whatsnew/v0.22.0.txt
@@ -3,12 +3,218 @@
v0.22.0
-------

-This is a major release from 0.21.1 and includes a number of API changes,
-deprecations, new features, enhancements, and performance improvements along
-with a large number of bug fixes. We recommend that all users upgrade to this
-version.
+This is a major release from 0.21.1 and includes a single, API-breaking change.
+We recommend that all users upgrade to this version after carefully reading the
+release note (singular!).

.. _whatsnew_0220.api_breaking:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
summary is that

* The sum of an empty or all-*NA* ``Series`` is now ``0``
* The product of an empty or all-*NA* ``Series`` is now ``1``
* We've added a ``min_count`` parameter to ``.sum()`` and ``.prod()`` controlling
the minimum number of valid values for the result to be valid. If fewer than
``min_count`` non-*NA* values are present, the result is *NA*. The default is
``0``. To return ``NaN``, the 0.21 behavior, use ``min_count=1``.
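The three bullets above can be checked directly; here is a quick sketch in plain Python (``dtype=float`` is passed explicitly so the empty series stays float on newer pandas versions, an assumption beyond this release note):

```python
import numpy as np
import pandas as pd

# Empty and all-NA reductions now return the operation's identity.
print(pd.Series([], dtype=float).sum())   # 0.0
print(pd.Series([np.nan]).sum())          # 0.0
print(pd.Series([], dtype=float).prod())  # 1.0

# min_count=1 restores the 0.21 behavior of returning NaN.
print(pd.Series([], dtype=float).sum(min_count=1))  # nan
```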

Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-*NA* series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
time, we changed the sum and prod of an empty ``Series`` to also be ``NaN``.

Based on feedback, we've partially reverted those changes.

Arithmetic Operations
^^^^^^^^^^^^^^^^^^^^^

The default sum for empty or all-*NA* ``Series`` is now ``0``.

*pandas 0.21.x*

.. code-block:: ipython

In [1]: pd.Series([]).sum()
Out[1]: nan

In [2]: pd.Series([np.nan]).sum()
Out[2]: nan

*pandas 0.22.0*

.. ipython:: python

pd.Series([]).sum()
pd.Series([np.nan]).sum()

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
also matches the behavior of NumPy's ``np.nansum`` on empty and all-*NA* arrays.

To have the sum of an empty series return ``NaN`` (the default behavior of
pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the ``min_count``
keyword.

.. ipython:: python

pd.Series([]).sum(min_count=1)

Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
series is conceptually the same as the ``.sum`` of an empty one with
``skipna=True`` (the default).

.. ipython:: python

pd.Series([np.nan]).sum(min_count=1) # skipna=True by default

The ``min_count`` parameter refers to the minimum number of *non-null* values
required for a non-NA sum or product.
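Nothing restricts ``min_count`` to 0 or 1; a sketch with a larger threshold (the data is chosen purely for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan])
print(s.sum(min_count=1))  # 1.0 -- one non-NA value satisfies the threshold
print(s.sum(min_count=2))  # nan -- fewer than two non-NA values present
```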

:meth:`Series.prod` has been updated to behave the same as :meth:`Series.sum`,
returning ``1`` for an empty or all-*NA* series instead of ``NaN``.

.. ipython:: python

pd.Series([]).prod()
pd.Series([np.nan]).prod()
pd.Series([]).prod(min_count=1)

These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well.
Finally, a few less obvious places in pandas are affected by this change.
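For the ``DataFrame`` case, a minimal sketch (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, np.nan], "b": [1.0, 2.0]})
print(df.sum())             # column a sums to 0.0, column b to 3.0
print(df.sum(min_count=1))  # column a is NaN again, as in 0.21
```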

Grouping by a Categorical
^^^^^^^^^^^^^^^^^^^^^^^^^

Grouping by a ``Categorical`` and summing now returns ``0`` instead of
``NaN`` for categories with no observations. The product now returns ``1``
instead of ``NaN``.

*pandas 0.21.x*

.. code-block:: ipython

In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a 3.0
b NaN
dtype: float64

*pandas 0.22*

.. ipython:: python

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
use ``min_count>=1``.

.. ipython:: python

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
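The grouped product behaves analogously; a sketch (``observed=False`` is passed explicitly because later pandas versions changed the default for categorical groupers, and pandas 0.22 had no such keyword):

```python
import numpy as np
import pandas as pd

grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
s = pd.Series([2.0, 3.0])
print(s.groupby(grouper, observed=False).prod())             # a: 6.0, b: 1.0
print(s.groupby(grouper, observed=False).prod(min_count=1))  # a: 6.0, b: NaN
```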

Resample
^^^^^^^^

The sum and product of all-*NA* bins have changed: resampling now returns
``0`` for the sum and ``1`` for the product instead of ``NaN``.

*pandas 0.21.x*

.. code-block:: ipython

In [11]: s = pd.Series([1, 1, np.nan, np.nan],
...: index=pd.date_range('2017', periods=4))
...: s
Out[11]:
2017-01-01 1.0
2017-01-02 1.0
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64

In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([1, 1, np.nan, np.nan],
index=pd.date_range('2017', periods=4))
s.resample('2d').sum()

To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.

.. ipython:: python

s.resample('2d').sum(min_count=1)

In particular, upsampling and taking the sum or product is affected, as
upsampling introduces missing values even if the original series was
entirely valid.

*pandas 0.21.x*

.. code-block:: ipython

In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[15]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64

*pandas 0.22.0*

.. ipython:: python

idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.

.. ipython:: python

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)

Rolling and Expanding
^^^^^^^^^^^^^^^^^^^^^

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is a rolling
or expanding sum with ``min_periods=0``. Previously this returned ``NaN``
when fewer than ``min_periods`` non-*NA* values were in the window. Now it
returns ``0``.

*pandas 0.21.1*

.. code-block:: ipython

In [17]: s = pd.Series([np.nan, np.nan])

In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0 NaN
1 NaN
dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([np.nan, np.nan])
s.rolling(2, min_periods=0).sum()

The default behavior of ``min_periods=None``, implying that ``min_periods``
equals the window size, is unchanged.
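A sketch contrasting the changed ``min_periods=0`` case with the unchanged default:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.rolling(2, min_periods=0).sum())  # 0.0 in every window (changed)
print(s.rolling(2).sum())                 # NaN everywhere (default unchanged)
```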
4 changes: 2 additions & 2 deletions pandas/_libs/groupby_helper.pxi.in
@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
-Py_ssize_t min_count=1):
+Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
-Py_ssize_t min_count=1):
+Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
21 changes: 13 additions & 8 deletions pandas/_libs/window.pyx
@@ -220,14 +220,16 @@ cdef class VariableWindowIndexer(WindowIndexer):
right_closed: bint
right endpoint closedness
True if the right endpoint is closed, False if open
floor: optional
    lower bound that ``minp`` is floored to
"""
def __init__(self, ndarray input, int64_t win, int64_t minp,
-bint left_closed, bint right_closed, ndarray index):
+bint left_closed, bint right_closed, ndarray index,
+object floor=None):

self.is_variable = 1
self.N = len(index)
-self.minp = _check_minp(win, minp, self.N)
+self.minp = _check_minp(win, minp, self.N, floor=floor)

self.start = np.empty(self.N, dtype='int64')
self.start.fill(-1)
@@ -342,7 +344,7 @@ def get_window_indexer(input, win, minp, index, closed,

if index is not None:
indexer = VariableWindowIndexer(input, win, minp, left_closed,
-right_closed, index)
+right_closed, index, floor)
elif use_mock:
indexer = MockFixedWindowIndexer(input, win, minp, left_closed,
right_closed, index, floor)
@@ -441,15 +443,16 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
object index, object closed):
cdef:
double val, prev_x, sum_x = 0
-int64_t s, e
+int64_t s, e, range_endpoint
int64_t nobs = 0, i, j, N
bint is_variable
ndarray[int64_t] start, end
ndarray[double_t] output

start, end, N, win, minp, is_variable = get_window_indexer(input, win,
minp, index,
-closed)
+closed,
+floor=0)
output = np.empty(N, dtype=float)

# for performance we are going to iterate
@@ -489,13 +492,15 @@

# fixed window

range_endpoint = int_max(minp, 1) - 1

with nogil:

-for i in range(0, minp - 1):
+for i in range(0, range_endpoint):
add_sum(input[i], &nobs, &sum_x)
output[i] = NaN

-for i in range(minp - 1, N):
+for i in range(range_endpoint, N):
val = input[i]
add_sum(val, &nobs, &sum_x)

34 changes: 17 additions & 17 deletions pandas/core/generic.py
@@ -7619,48 +7619,48 @@ def _doc_parms(cls):
_sum_examples = """\
Examples
--------
-By default, the sum of an empty series is ``NaN``.
+By default, the sum of an empty or all-NA Series is ``0``.

->>> pd.Series([]).sum()  # min_count=1 is the default
-nan
+>>> pd.Series([]).sum()  # min_count=0 is the default
+0.0

This can be controlled with the ``min_count`` parameter. For example, if
-you'd like the sum of an empty series to be 0, pass ``min_count=0``.
+you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

->>> pd.Series([]).sum(min_count=0)
-0.0
+>>> pd.Series([]).sum(min_count=1)
+nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
-nan
+0.0

->>> pd.Series([np.nan]).sum(min_count=0)
-0.0
+>>> pd.Series([np.nan]).sum(min_count=1)
+nan
"""

_prod_examples = """\
Examples
--------
-By default, the product of an empty series is ``NaN``
+By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([]).prod()
-nan
+1.0

This can be controlled with the ``min_count`` parameter

->>> pd.Series([]).prod(min_count=0)
-1.0
+>>> pd.Series([]).prod(min_count=1)
+nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
-nan
+1.0

->>> pd.Series([np.nan]).prod(min_count=0)
-1.0
+>>> pd.Series([np.nan]).prod(min_count=1)
+nan
"""


@@ -7683,7 +7683,7 @@ def _make_min_count_stat_function(cls, name, name1, name2, axis_descr, desc,
examples=examples)
@Appender(_num_doc)
def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
-min_count=1,
+min_count=0,
**kwargs):
nv.validate_stat_func(tuple(), kwargs, fname=name)
if skipna is None:
4 changes: 2 additions & 2 deletions pandas/core/groupby.py
@@ -1363,8 +1363,8 @@ def last(x):
else:
return last(x)

-cls.sum = groupby_function('sum', 'add', np.sum, min_count=1)
-cls.prod = groupby_function('prod', 'prod', np.prod, min_count=1)
+cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)
+cls.prod = groupby_function('prod', 'prod', np.prod, min_count=0)
cls.min = groupby_function('min', 'min', np.min, numeric_only=False)
cls.max = groupby_function('max', 'max', np.max, numeric_only=False)
cls.first = groupby_function('first', 'first', first_compat,
