
Breaking changes for sum / prod of empty / all-NA #18921

Merged
merged 14 commits on Dec 29, 2017
214 changes: 210 additions & 4 deletions doc/source/whatsnew/v0.22.0.txt
@@ -3,12 +3,218 @@
v0.22.0
-------

This is a major release from 0.21.1 and includes a number of API changes,
deprecations, new features, enhancements, and performance improvements along
with a large number of bug fixes. We recommend that all users upgrade to this
version.
This is a major release from 0.21.1 and includes a single, API-breaking change.
We recommend that all users upgrade to this version after carefully reading the
release note (singular!).

.. _whatsnew_0220.api_breaking:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
summary is that

* The sum of an empty or all-*NA* ``Series`` is now ``0``
* The product of an empty or all-*NA* ``Series`` is now ``1``
* We've added a ``min_count`` parameter to ``.sum()`` and ``.prod()`` that
  controls the minimum number of non-*NA* values required for the result to be
  non-*NA*. If fewer than ``min_count`` non-*NA* values are present, the result
  is *NA*. The default is ``0``. To return ``NaN`` (the 0.21 behavior), use
  ``min_count=1``.

Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-*NA* series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
time, we changed the sum and prod of an empty ``Series`` to also be ``NaN``.

Based on feedback, we've partially reverted those changes.

Arithmetic Operations
^^^^^^^^^^^^^^^^^^^^^

The default sum for empty or all-*NA* ``Series`` is now ``0``.

*pandas 0.21.x*

.. code-block:: ipython

In [1]: pd.Series([]).sum()
Out[1]: nan

In [2]: pd.Series([np.nan]).sum()
Out[2]: nan

*pandas 0.22.0*

.. ipython:: python

pd.Series([]).sum()
pd.Series([np.nan]).sum()

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
Contributor review comment: say matches numpy? (I know you are saying np.nansum, but can't hurt to actually say numpy)

also matches the behavior of NumPy's ``np.nansum`` on empty and all-*NA* arrays.
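
For comparison, ``np.nansum`` gives the same results on the equivalent NumPy arrays:

.. ipython:: python

np.nansum(np.array([]))
np.nansum(np.array([np.nan]))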

To have the sum of an empty series return ``NaN`` (the default behavior of
pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the ``min_count``
keyword.

.. ipython:: python

pd.Series([]).sum(min_count=1)

Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
series is conceptually the same as the ``.sum`` of an empty one with
``skipna=True`` (the default).

.. ipython:: python

pd.Series([np.nan]).sum(min_count=1) # skipna=True by default

The ``min_count`` parameter refers to the minimum number of *non-null* values
required for a non-NA sum or product.
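
``min_count`` is not limited to ``0`` or ``1``; for example, requiring at least
two non-*NA* values:

.. ipython:: python

pd.Series([1.0, np.nan]).sum(min_count=2)  # only one valid value, so NaN
pd.Series([1.0, 2.0]).sum(min_count=2)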

:meth:`Series.prod` has been updated to behave the same as :meth:`Series.sum`,
returning ``1`` rather than ``NaN`` for an empty or all-*NA* series.

.. ipython:: python

pd.Series([]).prod()
pd.Series([np.nan]).prod()
pd.Series([]).prod(min_count=1)

These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well.
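For instance, a column that is entirely *NA* now sums to ``0`` rather than
``NaN``, and ``min_count=1`` restores the previous result:

.. ipython:: python

df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan]})
df.sum()
df.sum(min_count=1)
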
Finally, a few less obvious places in pandas are affected by this change.

Grouping by a Categorical
^^^^^^^^^^^^^^^^^^^^^^^^^

Grouping by a ``Categorical`` and summing now returns ``0`` instead of
``NaN`` for categories with no observations. The product now returns ``1``
instead of ``NaN``.

*pandas 0.21.x*

.. code-block:: ipython

In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a 3.0
b NaN
dtype: float64

*pandas 0.22*

.. ipython:: python

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
use ``min_count>=1``.

.. ipython:: python

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
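
The product behaves analogously, yielding ``1`` for the unobserved category:

.. ipython:: python

pd.Series([1, 2]).groupby(grouper).prod()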

Resample
^^^^^^^^

The sum of an all-*NA* bin has changed from ``NaN`` to ``0``, and the product
from ``NaN`` to ``1``.

*pandas 0.21.x*

.. code-block:: ipython

In [11]: s = pd.Series([1, 1, np.nan, np.nan],
...: index=pd.date_range('2017', periods=4))
...: s
Out[11]:
2017-01-01 1.0
2017-01-02 1.0
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64

In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([1, 1, np.nan, np.nan],
index=pd.date_range('2017', periods=4))
s.resample('2d').sum()

To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.

.. ipython:: python

s.resample('2d').sum(min_count=1)
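
The product of an all-*NA* bin changes in the same way, returning ``1``:

.. ipython:: python

s.resample('2d').prod()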

In particular, upsampling and taking the sum or product is affected, as
upsampling introduces missing values even if the original series was
entirely valid.

*pandas 0.21.x*

.. code-block:: ipython

In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[15]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64

*pandas 0.22.0*

.. ipython:: python

idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.

.. ipython:: python

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)

Rolling and Expanding
^^^^^^^^^^^^^^^^^^^^^

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is a rolling or
expanding sum with ``min_periods=0``: previously this returned ``NaN`` when
fewer than ``min_periods`` non-*NA* values were in the window; now it returns
``0``.

*pandas 0.21.1*

.. code-block:: ipython

In [17]: s = pd.Series([np.nan, np.nan])

In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0 NaN
1 NaN
dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([np.nan, np.nan])
s.rolling(2, min_periods=0).sum()

The default behavior of ``min_periods=None``, implying that ``min_periods``
equals the window size, is unchanged.
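
For instance, with the default ``min_periods`` a window of all-*NA* values
still yields ``NaN``:

.. ipython:: python

s.rolling(2).sum()  # min_periods defaults to the window size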
4 changes: 2 additions & 2 deletions pandas/_libs/groupby_helper.pxi.in
@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
Py_ssize_t min_count=1):
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
Py_ssize_t min_count=1):
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
21 changes: 13 additions & 8 deletions pandas/_libs/window.pyx
@@ -220,14 +220,16 @@ cdef class VariableWindowIndexer(WindowIndexer):
right_closed: bint
right endpoint closedness
True if the right endpoint is closed, False if open

floor: optional
unit for flooring the unit
"""
def __init__(self, ndarray input, int64_t win, int64_t minp,
bint left_closed, bint right_closed, ndarray index):
bint left_closed, bint right_closed, ndarray index,
object floor=None):

self.is_variable = 1
self.N = len(index)
self.minp = _check_minp(win, minp, self.N)
self.minp = _check_minp(win, minp, self.N, floor=floor)

self.start = np.empty(self.N, dtype='int64')
self.start.fill(-1)
@@ -342,7 +344,7 @@ def get_window_indexer(input, win, minp, index, closed,

if index is not None:
indexer = VariableWindowIndexer(input, win, minp, left_closed,
right_closed, index)
right_closed, index, floor)
elif use_mock:
indexer = MockFixedWindowIndexer(input, win, minp, left_closed,
right_closed, index, floor)
@@ -441,15 +443,16 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
object index, object closed):
cdef:
double val, prev_x, sum_x = 0
int64_t s, e
int64_t s, e, range_endpoint
int64_t nobs = 0, i, j, N
bint is_variable
ndarray[int64_t] start, end
ndarray[double_t] output

start, end, N, win, minp, is_variable = get_window_indexer(input, win,
minp, index,
closed)
closed,
floor=0)
output = np.empty(N, dtype=float)

# for performance we are going to iterate
Expand Down Expand Up @@ -489,13 +492,15 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,

# fixed window

range_endpoint = int_max(minp, 1) - 1
Contributor review comment: nice!


with nogil:

for i in range(0, minp - 1):
for i in range(0, range_endpoint):
add_sum(input[i], &nobs, &sum_x)
output[i] = NaN

for i in range(minp - 1, N):
for i in range(range_endpoint, N):
val = input[i]
add_sum(val, &nobs, &sum_x)

34 changes: 17 additions & 17 deletions pandas/core/generic.py
@@ -7619,48 +7619,48 @@ def _doc_parms(cls):
_sum_examples = """\
Examples
--------
By default, the sum of an empty series is ``NaN``.
By default, the sum of an empty or all-NA Series is ``0``.

>>> pd.Series([]).sum() # min_count=1 is the default
nan
>>> pd.Series([]).sum() # min_count=0 is the default
0.0

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be 0, pass ``min_count=0``.
you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

>>> pd.Series([]).sum(min_count=0)
0.0
>>> pd.Series([]).sum(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
nan

>>> pd.Series([np.nan]).sum(min_count=0)
0.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
"""

_prod_examples = """\
Examples
--------
By default, the product of an empty series is ``NaN``
By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([]).prod()
nan
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([]).prod(min_count=0)
1.0
>>> pd.Series([]).prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
nan

>>> pd.Series([np.nan]).sum(min_count=0)
1.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
"""


@@ -7683,7 +7683,7 @@ def _make_min_count_stat_function(cls, name, name1, name2, axis_descr, desc,
examples=examples)
@Appender(_num_doc)
def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
min_count=1,
min_count=0,
**kwargs):
nv.validate_stat_func(tuple(), kwargs, fname=name)
if skipna is None:
4 changes: 2 additions & 2 deletions pandas/core/groupby.py
@@ -1363,8 +1363,8 @@ def last(x):
else:
return last(x)

cls.sum = groupby_function('sum', 'add', np.sum, min_count=1)
cls.prod = groupby_function('prod', 'prod', np.prod, min_count=1)
cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)
cls.prod = groupby_function('prod', 'prod', np.prod, min_count=0)
cls.min = groupby_function('min', 'min', np.min, numeric_only=False)
cls.max = groupby_function('max', 'max', np.max, numeric_only=False)
cls.first = groupby_function('first', 'first', first_compat,