Breaking changes for sum / prod of empty / all-NA (#18921)
* API: Change the sum of all-NA / all-Empty sum / prod

* Max, not min

* Update whatsnew

* Parametrize test

* Minor cleanups

* Refactor skipna_alternative

* Split test

* Added issue

* More updates

* linting

* linting

* Added skips

* Doc fixup

* DOC: More whatsnew
TomAugspurger authored Dec 29, 2017
1 parent fae7920 commit dedfce9
Showing 20 changed files with 547 additions and 159 deletions.
214 changes: 210 additions & 4 deletions doc/source/whatsnew/v0.22.0.txt
@@ -3,12 +3,218 @@
v0.22.0
-------

-This is a major release from 0.21.1 and includes a number of API changes,
-deprecations, new features, enhancements, and performance improvements along
-with a large number of bug fixes. We recommend that all users upgrade to this
-version.
+This is a major release from 0.21.1 and includes a single, API-breaking change.
+We recommend that all users upgrade to this version after carefully reading the
+release note (singular!).

.. _whatsnew_0220.api_breaking:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
summary is that

* The sum of an empty or all-*NA* ``Series`` is now ``0``
* The product of an empty or all-*NA* ``Series`` is now ``1``
* We've added a ``min_count`` parameter to ``.sum()`` and ``.prod()`` controlling
the minimum number of valid values for the result to be valid. If fewer than
``min_count`` non-*NA* values are present, the result is *NA*. The default is
``0``. To return ``NaN``, the 0.21 behavior, use ``min_count=1``.
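The three bullets above can be checked directly; here is a quick sketch in plain Python (``dtype=float`` is passed explicitly so the empty series stays float on newer pandas versions, an assumption beyond this release note):

```python
import numpy as np
import pandas as pd

# Empty and all-NA reductions now return the operation's identity.
print(pd.Series([], dtype=float).sum())   # 0.0
print(pd.Series([np.nan]).sum())          # 0.0
print(pd.Series([], dtype=float).prod())  # 1.0

# min_count=1 restores the 0.21 behavior of returning NaN.
print(pd.Series([], dtype=float).sum(min_count=1))  # nan
```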

Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-*NA* series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
time, we changed the sum and prod of an empty ``Series`` to also be ``NaN``.

Based on feedback, we've partially reverted those changes.

Arithmetic Operations
^^^^^^^^^^^^^^^^^^^^^

The default sum for empty or all-*NA* ``Series`` is now ``0``.

*pandas 0.21.x*

.. code-block:: ipython

In [1]: pd.Series([]).sum()
Out[1]: nan

In [2]: pd.Series([np.nan]).sum()
Out[2]: nan

*pandas 0.22.0*

.. ipython:: python

pd.Series([]).sum()
pd.Series([np.nan]).sum()

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
also matches the behavior of NumPy's ``np.nansum`` on empty and all-*NA* arrays.

To have the sum of an empty series return ``NaN`` (the default behavior of
pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the ``min_count``
keyword.

.. ipython:: python

pd.Series([]).sum(min_count=1)

Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
series is conceptually the same as the ``.sum`` of an empty one with
``skipna=True`` (the default).

.. ipython:: python

pd.Series([np.nan]).sum(min_count=1) # skipna=True by default

The ``min_count`` parameter refers to the minimum number of *non-null* values
required for a non-NA sum or product.
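Nothing restricts ``min_count`` to 0 or 1; a sketch with a larger threshold (the data is chosen purely for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan])
print(s.sum(min_count=1))  # 1.0 -- one non-NA value satisfies the threshold
print(s.sum(min_count=2))  # nan -- fewer than two non-NA values present
```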

:meth:`Series.prod` has been updated to behave the same as :meth:`Series.sum`,
returning ``1`` for an empty or all-*NA* series instead of ``NaN``.

.. ipython:: python

pd.Series([]).prod()
pd.Series([np.nan]).prod()
pd.Series([]).prod(min_count=1)

These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well.
Finally, a few less obvious places in pandas are affected by this change.
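For the ``DataFrame`` case, a minimal sketch (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, np.nan], "b": [1.0, 2.0]})
print(df.sum())             # column a sums to 0.0, column b to 3.0
print(df.sum(min_count=1))  # column a is NaN again, as in 0.21
```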

Grouping by a Categorical
^^^^^^^^^^^^^^^^^^^^^^^^^

Grouping by a ``Categorical`` and summing now returns ``0`` instead of
``NaN`` for categories with no observations. The product now returns ``1``
instead of ``NaN``.

*pandas 0.21.x*

.. code-block:: ipython

In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a 3.0
b NaN
dtype: float64

*pandas 0.22*

.. ipython:: python

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
use ``min_count>=1``.

.. ipython:: python

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
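The grouped product behaves analogously; a sketch (``observed=False`` is passed explicitly because later pandas versions changed the default for categorical groupers, and pandas 0.22 had no such keyword):

```python
import numpy as np
import pandas as pd

grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
s = pd.Series([2.0, 3.0])
print(s.groupby(grouper, observed=False).prod())             # a: 6.0, b: 1.0
print(s.groupby(grouper, observed=False).prod(min_count=1))  # a: 6.0, b: NaN
```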

Resample
^^^^^^^^

The sum and product of all-*NA* bins have changed: resampling now returns
``0`` for the sum and ``1`` for the product instead of ``NaN``.

*pandas 0.21.x*

.. code-block:: ipython

In [11]: s = pd.Series([1, 1, np.nan, np.nan],
...: index=pd.date_range('2017', periods=4))
...: s
Out[11]:
2017-01-01 1.0
2017-01-02 1.0
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64

In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([1, 1, np.nan, np.nan],
index=pd.date_range('2017', periods=4))
s.resample('2d').sum()

To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.

.. ipython:: python

s.resample('2d').sum(min_count=1)

In particular, upsampling and taking the sum or product is affected, as
upsampling introduces missing values even if the original series was
entirely valid.

*pandas 0.21.x*

.. code-block:: ipython

In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[15]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64

*pandas 0.22.0*

.. ipython:: python

idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.

.. ipython:: python

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)

Rolling and Expanding
^^^^^^^^^^^^^^^^^^^^^

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is a rolling
or expanding sum with ``min_periods=0``. Previously this returned ``NaN``
when fewer than ``min_periods`` non-*NA* values were in the window. Now it
returns ``0``.

*pandas 0.21.1*

.. code-block:: ipython

In [17]: s = pd.Series([np.nan, np.nan])

In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0 NaN
1 NaN
dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([np.nan, np.nan])
s.rolling(2, min_periods=0).sum()

The default behavior of ``min_periods=None``, implying that ``min_periods``
equals the window size, is unchanged.
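A sketch contrasting the changed ``min_periods=0`` case with the unchanged default:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.rolling(2, min_periods=0).sum())  # 0.0 in every window (changed)
print(s.rolling(2).sum())                 # NaN everywhere (default unchanged)
```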
4 changes: 2 additions & 2 deletions pandas/_libs/groupby_helper.pxi.in
@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
-Py_ssize_t min_count=1):
+Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
-Py_ssize_t min_count=1):
+Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
21 changes: 13 additions & 8 deletions pandas/_libs/window.pyx
@@ -220,14 +220,16 @@ cdef class VariableWindowIndexer(WindowIndexer):
right_closed: bint
right endpoint closedness
True if the right endpoint is closed, False if open
floor: optional
    lower bound that ``minp`` is floored to
"""
def __init__(self, ndarray input, int64_t win, int64_t minp,
-bint left_closed, bint right_closed, ndarray index):
+bint left_closed, bint right_closed, ndarray index,
+object floor=None):

self.is_variable = 1
self.N = len(index)
-self.minp = _check_minp(win, minp, self.N)
+self.minp = _check_minp(win, minp, self.N, floor=floor)

self.start = np.empty(self.N, dtype='int64')
self.start.fill(-1)
@@ -342,7 +344,7 @@ def get_window_indexer(input, win, minp, index, closed,

if index is not None:
indexer = VariableWindowIndexer(input, win, minp, left_closed,
-right_closed, index)
+right_closed, index, floor)
elif use_mock:
indexer = MockFixedWindowIndexer(input, win, minp, left_closed,
right_closed, index, floor)
@@ -441,15 +443,16 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
object index, object closed):
cdef:
double val, prev_x, sum_x = 0
-int64_t s, e
+int64_t s, e, range_endpoint
int64_t nobs = 0, i, j, N
bint is_variable
ndarray[int64_t] start, end
ndarray[double_t] output

start, end, N, win, minp, is_variable = get_window_indexer(input, win,
minp, index,
-closed)
+closed,
+floor=0)
output = np.empty(N, dtype=float)

# for performance we are going to iterate
@@ -489,13 +492,15 @@

# fixed window

range_endpoint = int_max(minp, 1) - 1

with nogil:

-for i in range(0, minp - 1):
+for i in range(0, range_endpoint):
add_sum(input[i], &nobs, &sum_x)
output[i] = NaN

-for i in range(minp - 1, N):
+for i in range(range_endpoint, N):
val = input[i]
add_sum(val, &nobs, &sum_x)

34 changes: 17 additions & 17 deletions pandas/core/generic.py
@@ -7619,48 +7619,48 @@ def _doc_parms(cls):
_sum_examples = """\
Examples
--------
-By default, the sum of an empty series is ``NaN``.
+By default, the sum of an empty or all-NA Series is ``0``.

->>> pd.Series([]).sum()  # min_count=1 is the default
-nan
+>>> pd.Series([]).sum()  # min_count=0 is the default
+0.0

This can be controlled with the ``min_count`` parameter. For example, if
-you'd like the sum of an empty series to be 0, pass ``min_count=0``.
+you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

->>> pd.Series([]).sum(min_count=0)
-0.0
+>>> pd.Series([]).sum(min_count=1)
+nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
-nan
+0.0

->>> pd.Series([np.nan]).sum(min_count=0)
-0.0
+>>> pd.Series([np.nan]).sum(min_count=1)
+nan
"""

_prod_examples = """\
Examples
--------
-By default, the product of an empty series is ``NaN``
+By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([]).prod()
-nan
+1.0

This can be controlled with the ``min_count`` parameter

->>> pd.Series([]).prod(min_count=0)
-1.0
+>>> pd.Series([]).prod(min_count=1)
+nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
-nan
+1.0

->>> pd.Series([np.nan]).prod(min_count=0)
-1.0
+>>> pd.Series([np.nan]).prod(min_count=1)
+nan
"""


@@ -7683,7 +7683,7 @@ def _make_min_count_stat_function(cls, name, name1, name2, axis_descr, desc,
examples=examples)
@Appender(_num_doc)
def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
-min_count=1,
+min_count=0,
**kwargs):
nv.validate_stat_func(tuple(), kwargs, fname=name)
if skipna is None:
4 changes: 2 additions & 2 deletions pandas/core/groupby.py
@@ -1363,8 +1363,8 @@ def last(x):
else:
return last(x)

-cls.sum = groupby_function('sum', 'add', np.sum, min_count=1)
-cls.prod = groupby_function('prod', 'prod', np.prod, min_count=1)
+cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)
+cls.prod = groupby_function('prod', 'prod', np.prod, min_count=0)
cls.min = groupby_function('min', 'min', np.min, numeric_only=False)
cls.max = groupby_function('max', 'max', np.max, numeric_only=False)
cls.first = groupby_function('first', 'first', first_compat,
