
Breaking changes for sum / prod of empty / all-NA #18921

Merged (14 commits) on Dec 29, 2017

Conversation

TomAugspurger (Contributor) opened:
Changes the defaults for min_count so that sum([]) and sum([np.nan]) are 0 by default, and NaN with min_count>=1.
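Concretely, a small sketch of the new defaults described above:

```python
import numpy as np
import pandas as pd

# New defaults: empty and all-NA sums are 0 (prod: 1)
empty_sum = pd.Series([], dtype=float).sum()       # 0.0
all_na_sum = pd.Series([np.nan]).sum()             # 0.0
empty_prod = pd.Series([], dtype=float).prod()     # 1.0

# min_count >= 1 restores NaN for empty / all-NA input
all_na_sum_mc = pd.Series([np.nan]).sum(min_count=1)  # nan
```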

I'd recommend looking at only the latest commit until #18876 is merged. I'll probably force push changes here to keep all the relevant changes in the last commit until #18876 is in, rebase on that, and then start pushing changes regularly.

cc @jreback @jorisvandenbossche @shoyer @wesm

@TomAugspurger TomAugspurger added API Design Numeric Operations Arithmetic, Comparison, and Logical operations labels Dec 23, 2017
@TomAugspurger TomAugspurger added this to the 0.22.0 milestone Dec 23, 2017
idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
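For illustration, a runnable sketch of the resample example above (using the lowercase `'12h'` offset alias accepted by recent pandas; the middle 12-hour bin is empty):

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
s = pd.Series([1, 2], index=idx)

# The empty middle bin sums to 0 by default...
default = s.resample('12h').sum()
# ...and to NaN when min_count=1 is requested
restored = s.resample('12h').sum(min_count=1)
```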
TomAugspurger (author):

TODO: Add .rolling and .expanding with all-NA and min_count=0.

def wrapper(x):
return alternative(x.values)

if skipna_alternative:
TomAugspurger (author):

This seemed like the easiest way of getting these tests to pass. skipna_alternative is only used in sum and prod, and is nansum or nanprod.

Reviewer (Contributor):

this is ok

@@ -813,8 +813,6 @@ def test__cython_agg_general(self):
('mean', np.mean),
('median', lambda x: np.median(x) if len(x) > 0 else np.nan),
('var', lambda x: np.var(x, ddof=1)),
('add', lambda x: np.sum(x) if len(x) > 0 else np.nan),
TomAugspurger (author):

I moved these to their own tests in e390d1c#diff-1c9535bf6ad763b307a780e37cc4185cR833 due to #18869

@@ -147,7 +147,7 @@ def test_empty_multi(self, method, unit):
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize(
"method", ['sum', 'mean', 'median', 'std', 'var'])
TomAugspurger (author):

TODO: make sure I have coverage for this elsewhere... I don't remember why I removed it here.

TomAugspurger (author):

These are handled up above in test_empty.

def test_nanops(self):
@pytest.mark.parametrize('klass', [Index, Series])
@pytest.mark.parametrize('op', ['max', 'min'])
def test_nanops_max_min(self, op, klass):
TomAugspurger (author):

Hmm, I thought I did this parametrization in #18876. I can move it there if you want.

TomAugspurger (author):

Actually, these seem to be a remnant of an earlier version where I was adding min_count everywhere. I'll just remove it.

for axis in list(range(targarval.ndim)) + [None]:
for skipna in [False, True]:
targartempval = targarval if skipna else targarnanval
try:
if skipna and empty_targfunc and pd.isna(targartempval).all():
TomAugspurger (author):

I'm reasonably sure I refactored these correctly, but would appreciate a close review here. empty_targfunc should be like nansum or nanprod, in which case we want to call it on targartempval. I should double check why just passing nansum for targfunc didn't work though...

Reviewer (Contributor):

use the imported isna

@@ -1358,9 +1390,10 @@ def get_result(arr, window, min_periods=None, center=False):
assert notna(result[4])

# min_periods=0
result0 = get_result(arr, 20, min_periods=0)
TomAugspurger (author):

I had to disable this subset of tests, since I don't think it's true that min_periods=0 is the same as min_periods=1.
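For illustration, a minimal example of why the two differ under the new behavior (a sketch of current pandas semantics, not code from this PR):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan])

# min_periods=0: zero valid observations satisfy the threshold -> sum is 0
zero_minp = s.rolling(window=2, min_periods=0).sum()

# min_periods=1: at least one valid value is required -> all NaN
one_minp = s.rolling(window=2, min_periods=1).sum()
```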

Reviewer (Contributor):

so make new ones that make this diff clear (i agree they should be different).

TomAugspurger (author):

Done, added a keyword to control this check, since it is true for most operations. sum is tested up above in test_missing_minp_zero.

bint is_variable
ndarray[int64_t] start, end
ndarray[double_t] output

if minp == 0:
# in get_window_indexer, we ensure that minp >= 1. That's fine for
# all cases except nobs = 0 (all missing values) and minp=0. For
Reviewer (Contributor):

all of the minp logic is handled externally to window.pyx (in window.py)

TomAugspurger (author):

Are you saying to move this there? I don't think that's possible without rewriting roll_sum. All of its logic assumes that minp is at least 1. The only thing I need the user-provided minp for is passing it to calc_sum later.

Reviewer (Contributor):

then it should be changed. This is way too hacky. ALL logic related to minp is passed in.

result.fill(fill_value)
return result
if values.size == 0 and kwds.get('min_count') is None:
# We are empty, returning NA for our type
Reviewer (Contributor):

isn’t min_count always passed?

TomAugspurger (author):

No, just when the caller is sum or prod. Methods like mean don't pass it.

Reviewer (Contributor):

I find this very error prone. The internal functions can always pass min_count I think, so it will be non-None. Or at the very least, comment on this.

TomAugspurger (author):

The issue here is passing through the correct arguments to the nanop. pandas.core.nanops.sum takes min_count, while mean doesn't. The function calling f knows that, but f doesn't. So I would either

  1. refactor f to not do this special casing, and have each nanop properly handle size-0 arrays
  2. add an unused min_count to each nanop, and always pass it through
  3. have f add min_count to kwds when it's not None.

I think 1 is the best option, but will take a bit of time. 3 doesn't strike me as any better than the current method.
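A minimal, hypothetical sketch of the dispatch being discussed; `reduce_empty_aware` and its signature are invented for illustration (the real logic lives in pandas/core/nanops.py):

```python
import numpy as np

def reduce_empty_aware(values, func, min_count=None):
    # min_count is only meaningful for sum/prod-like callers;
    # mean-like callers leave it as None.
    if values.size == 0:
        if min_count is None:
            return np.nan  # mean-like: empty input reduces to NA
        # sum/prod-like: empty input is valid only when min_count == 0
        return func(values) if min_count == 0 else np.nan
    return func(values)

empty_sum = reduce_empty_aware(np.array([]), np.sum, min_count=0)  # 0.0
empty_mean = reduce_empty_aware(np.array([]), np.mean)             # nan
```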

Reviewer (Contributor):

I think you could just pass it thru in make_stat_function? e.g. you always just pass

        # not sum/prod (I know you have a separate function for this)
        min_count=-1
        return self._reduce(f, name, axis=axis, skipna=skipna,
                            numeric_only=numeric_only, min_count=min_count)

TomAugspurger (author):

That'd be option 2. I'd have to add a min_count keyword to each of our methods in pandas.core.nanops, which I don't think we want.

('prod', 1),
])
def test_sum_prod_nanops(self, method, unit):
idx = ['a', 'b', 'c']
Reviewer (Contributor):

you could add tests for Timedelta here as well (should work)
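For example, a sketch of such a timedelta case under the new defaults (names here are illustrative, not the test that was added):

```python
import numpy as np
import pandas as pd

# An all-NaT timedelta series follows the same min_count rules as floats
s = pd.Series(pd.to_timedelta([np.nan, np.nan]))

total = s.sum()               # Timedelta('0 days')
total_mc = s.sum(min_count=1) # NaT
```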

("sum", 0.0),
("prod", 1.0)
])
def test_empty(self, method, unit, use_bottleneck):
with pd.option_context("use_bottleneck", use_bottleneck):
# GH 9422
Reviewer (Contributor):

add this issue number

pad = DataFrame([[np.nan, np.nan, np.nan, np.nan]], index=[3],

if func == 'sum':
fill_value = 0
Reviewer (Contributor):

can u parametrize on func name and fill_value


TomAugspurger commented Dec 23, 2017

This turned up a slight inconsistency between:

a) groupby([pd.Grouper(key1, freq), key2]).agg
b) groupby(key2).resample(freq).agg

The first doesn't upsample to include "unobserved" values, while the latter does. An example:

        df = pd.DataFrame({
            'A': [1] * 6 + [2] * 6,
            'B': pd.to_datetime(['2017'] * 3 +
                                ['2019'] * 3 +
                                ['2017'] * 3 +
                                ['2019'] * 3),
            'C': [1] * 12
        })
        a = df.groupby([pd.Grouper(key='B', freq='AS'), 'A']).C.count()
        b = df.set_index("B").sort_index().groupby("A").resample("AS").C.count()
In [53]: a  # groupby([grouper, key]).agg
Out[53]:
B           A
2017-01-01  1    3
            2    3
2019-01-01  1    3
            2    3
Name: C, dtype: int64

In [54]: b  # groupby(key).resample().agg
Out[54]:
A  B
1  2017-01-01    3
   2018-01-01    0
   2019-01-01    3
2  2017-01-01    3
   2018-01-01    0
   2019-01-01    3
Name: C, dtype: int64

I've used count here, but the same thing is affecting sum and prod. Do we want these to be the same?

pep8speaks commented Dec 23, 2017

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 29, 2017 at 12:25 Hours UTC

shoyer commented Dec 23, 2017

RE: Grouper/resample: that does look inconsistent. I think the resample behavior should be used for both?

jorisvandenbossche (Member) commented:

I could give an explanation for the difference in behaviour between the groupby and resample examples (resample: conform to a new index given the frequency -> full time series; groupby: make groups of my data based on the given frequency -> only those groups contained in the data). But of course there are also good reasons to make this consistent.
That said: does this need to be fixed here? (Is keeping it as-is complicating the PR?) If not, I would open a separate issue for it.

jorisvandenbossche (Member) left a review comment:

Reviewed only the whatsnew note.

In general, for the "changes" part of the whatsnew, should we also note somehow that whether this changed for you compared to pre-0.21 behaviour depends on whether bottleneck was installed or not (although that maybe only applies to the main sum, and not to groupby or resample)?

* We've added a ``min_count`` parameter to ``.sum`` and ``.prod`` to control
the minimum number of valid values for the result to be valid. If fewer than
``min_count`` valid values are present, the result is NA. The default is
``0``. To restore the 0.21 behavior, use ``min_count=1``.
Reviewer (Member):

I would explicitly say in the summary here that this result is np.nan

``min_count`` valid values are present, the result is NA. The default is
``0``. To restore the 0.21 behavior, use ``min_count=1``.

Some background: In pandas 0.21.1, we fixed a long-standing inconsistency
Reviewer (Member):

0.21.1 -> 0.21.0 ?

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
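For illustration, a runnable sketch of the categorical-grouper example above; `observed=False` is passed explicitly here (an assumption, to pin down the historical default of keeping unobserved categories in the result):

```python
import numpy as np
import pandas as pd

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

# The unobserved category 'b' sums to 0 by default...
by_default = pd.Series([1, 2]).groupby(grouper, observed=False).sum()

# ...and to NaN with min_count=1
restored = pd.Series([1, 2]).groupby(grouper, observed=False).sum(min_count=1)
```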
Reviewer (Member):

groupuer -> grouper


.. code-block:: ipython

In [1]: import pandas as pd; import numpy as np;
Reviewer (Member):

this line is not needed


s.resample('2d').sum(min_count=1)

Upsampling in particular is affected, as this will introduce all-NaN series even
Reviewer (Member):

Should we emphasize here somewhere that this change is only for sum/prod, and not for all other aggregation functions? (I know it is already said above, but I think we have to stress enough that those examples only apply in specific cases.)

TomAugspurger (author):

But therefore: does this need to be fixed here? (is keeping it as it is complicating the PR?) If not, I would open a separate issue for it.

I'm not sure. We have two tests that relied on that equivalence to construct the expected result.

@TomAugspurger TomAugspurger force-pushed the sum-mincount-breaking branch 2 times, most recently from 46590dd to d5e475f Compare December 27, 2017 14:52
ndarray[int64_t] start, end
ndarray[double_t] output

if minp == 0:
TomAugspurger (author):

@jreback thoughts on this approach? no_min is a boolean that indicates whether or not to consider minp when choosing if the result is valid in calc_sum

Reviewer (Contributor):

I really don't like this. Why can't you just dispatch on minp==0 above

e.g. line 416

if nobs >= minp or minp==0

?

TomAugspurger (author):

That's what I had earlier with the minp2 stuff.

The problem there was that in roll_sum, minp gets set to min(minp, 1), so we don't know if the user specified min_periods=0.

Reviewer (Contributor):

hmm I don't see that. maybe can just remove that?

TomAugspurger (author):

@jreback, a25e793 should be better.

TomAugspurger (author):

It looks bigger than it is because I had to ensure VariableWindowIndexer passes through floor. The main change is just using floor=0 in roll_sum and iterating like range(min(1, minp), ...) instead of just range(minp, ...).

@TomAugspurger TomAugspurger force-pushed the sum-mincount-breaking branch 3 times, most recently from a034a70 to b8d1f2b Compare December 27, 2017 15:28
TomAugspurger (author):

The only failing tests are the equivalence tests between df.groupby(key).resample and df.groupby([key, Grouper]).

Somehow I could give an explanation for the difference in behaviour between the groupby and resample example

@jorisvandenbossche could you explain a bit further? I don't quite understand. IMO, saying Grouper(key, freq='AS') necessarily implies a resample.

Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The
summary is that

* The sum of an all-NA or empty series is now 0
Reviewer (Contributor):

maybe capitalize Series and wrap it in double-backticks.

Reviewer (Contributor):

possibly also use double-backticks around all-NA (or maybe italics)

Based on feedback, we've partially reverted those changes. The default sum for
all-NA and empty series is now 0 (1 for ``prod``).

*pandas 0.21*
Reviewer (Contributor):

pandas 0.21.x

Returning ``NaN`` was the default behavior for pandas 0.20.3 without bottleneck
installed.

Note that this affects some other places in the library:
Reviewer (Contributor):

I would break out these to separate sub-sections I think (e.g the numbered ones)

_sum_examples = """\
Examples
--------
By default, the sum of an empty series is ``0``.
Reviewer (Contributor):

add all-NaN as well

@@ -292,6 +282,22 @@ def _wrap_results(result, dtype):
return result


def _na_for_min_count(values, axis):
# we either return np.nan or pd.NaT
Reviewer (Contributor):

can you add doc-string
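A hypothetical sketch of what such a helper with a docstring might look like; the name and the float-only handling here are assumptions (per the code comment, the real helper also returns pd.NaT for datetime-like input):

```python
import numpy as np

def na_for_min_count(values, axis):
    """Return an NA value (or array of NA values) shaped like the
    reduction result, for use when fewer than ``min_count`` valid
    values are present.
    """
    if values.ndim == 1 or axis is None:
        return np.nan
    # Drop the reduced axis to get the result shape
    result_shape = values.shape[:axis] + values.shape[axis + 1:]
    return np.full(result_shape, np.nan)

scalar_na = na_for_min_count(np.array([1.0, 2.0]), None)  # nan
array_na = na_for_min_count(np.ones((2, 3)), axis=0)      # three NaNs
```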


@@ -273,6 +273,21 @@ def test_timegrouper_with_reg_groups(self):
'whole_cost'].sum()
assert_series_equal(result2, expected)

def test_groupby_resample_equivalence(self):
# Currently failing
Reviewer (Contributor):

this is failing?

TomAugspurger (author):

It's a simplification of test_timegrouper_with_reg_groups, they're failing for the same reason (#18921 (comment))

columns=['A', 'B', 'C', 'D'])
normal_result = getattr(normal_grouped, func)()
dt_result = getattr(dt_grouped, func)()

Reviewer (Contributor):

was this test moved above?

TomAugspurger (author):

I don't think so, just parametrized.



TomAugspurger commented Dec 27, 2017

Returning to #18921 (comment), it's only an issue for when there are multiple groupers:

In [14]: df = pd.DataFrame({"A": pd.to_datetime(['2015', '2017']), "B": [1, 1]})

In [15]: df.groupby(pd.Grouper(key="A", freq="AS")).B.count()
Out[15]:
A
2015-01-01    1
2016-01-01    0
2017-01-01    1
Freq: AS-JAN, Name: B, dtype: int64

In [16]: df.groupby([pd.Grouper(key="A", freq="AS"), [0, 0]]).B.count()
Out[16]:
A
2015-01-01  0    1
2017-01-01  0    1
Name: B, dtype: int64

Still trying to track down the appropriate spot for the fix. Will we want to include this bugfix in 0.22, or save it for 0.23.0?

TomAugspurger (author):

Fixing this is turning into quite the rabbit hole :/

TomAugspurger (author):

IIUC, the issue is in how the new groups are determined

In [9]: df = pd.DataFrame({"A": pd.to_datetime(['2015', '2017']), "B": [1, 1]})

In [10]: gr1 = df.groupby(pd.Grouper(key="A", freq="AS")).B

In [11]: gr2 = df.groupby([pd.Grouper(key="A", freq="AS"), [0, 0]])

In [12]: gr1.grouper.result_index
Out[12]: DatetimeIndex(['2015-01-01', '2016-01-01', '2017-01-01'], dtype='datetime64[ns]', name='A', freq='AS-JAN')

In [13]: gr2.grouper.result_index
Out[13]:
MultiIndex(levels=[[2015-01-01 00:00:00, 2016-01-01 00:00:00, 2017-01-01 00:00:00], [0]],
           labels=[[0, 2], [0, 0]],
           names=['A', None])

So gr2.grouper.result_index has a level with no observations, the 2016-01-01. Then when we go to construct the groups it's dropped:

In [21]: gr2.grouper.groups
Out[21]:
{(Timestamp('2015-01-01 00:00:00'), 0): Int64Index([0], dtype='int64'),
 (Timestamp('2017-01-01 00:00:00'), 0): Int64Index([1], dtype='int64')}

I don't know if that's the root problem yet, or just a symptom.

codecov bot commented Dec 28, 2017

Codecov Report

Merging #18921 into master will decrease coverage by 0.04%.
The diff coverage is 93.33%.


@@            Coverage Diff             @@
##           master   #18921      +/-   ##
==========================================
- Coverage    91.6%   91.56%   -0.05%     
==========================================
  Files         150      150              
  Lines       48939    48977      +38     
==========================================
+ Hits        44833    44845      +12     
- Misses       4106     4132      +26
Flag Coverage Δ
#multiple 89.92% <93.33%> (-0.05%) ⬇️
#single 41.72% <40%> (+0.54%) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 95.95% <ø> (ø) ⬆️
pandas/core/groupby.py 92.14% <100%> (ø) ⬆️
pandas/core/resample.py 96.39% <100%> (ø) ⬆️
pandas/core/nanops.py 96.68% <100%> (ø) ⬆️
pandas/util/testing.py 84.95% <90%> (+0.05%) ⬆️
pandas/plotting/_converter.py 65.22% <0%> (-1.74%) ⬇️
pandas/core/indexes/interval.py 92.61% <0%> (-1.21%) ⬇️
pandas/core/sparse/array.py 91.82% <0%> (-0.47%) ⬇️
pandas/core/dtypes/cast.py 88.42% <0%> (-0.18%) ⬇️
pandas/core/indexes/datetimelike.py 97.04% <0%> (-0.16%) ⬇️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jreback (Contributor) left a review comment:

just a couple of tiny comments

Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-*NA* series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`_. At the same
time, we changed the sum and prod of an empty Series to also be ``NaN``.
Reviewer (Contributor):

double backticks around Series.

pd.Series([]).sum()
pd.Series([np.nan]).sum()

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
Reviewer (Contributor):

say matches numpy? (I know you are saying np.nansum, but it can't hurt to actually say numpy)

also matches the behavior of ``np.nansum`` on empty and all-*NA* arrays.

To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
Reviewer (Contributor):

skipna=True

To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
series is conceptually the same as on an empty. The ``min_count`` parameter
refers to the minimum number of *valid* values required for a non-NA sum
Reviewer (Contributor):

valid -> not-null

pd.Series([1, 2]).groupby(grouper).sum()

To restore the 0.21 behavior of returning ``NaN`` of unobserved groups,
use ``min_count>=1``.
Reviewer (Contributor):

do we actually tests min_count > 1?

TomAugspurger (author):

Yeah, I added a few more in frame/test_analytics.py, test_resample.py, and test_groupby.py.
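For illustration, a minimal min_count > 1 case (a sketch, not one of the tests referenced above):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

one_ok = s.sum(min_count=1)      # 1.0: one valid value suffices
two_needed = s.sum(min_count=2)  # nan: fewer than two valid values
```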

^^^^^^^^^^^^^^^^^^^^^

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is when doing a rolling
Reviewer (Contributor):

similar

@@ -1,14 +1,213 @@
.. _whatsnew_0220:
Reviewer (Contributor):

don't u need this?


jreback commented Dec 28, 2017

tiny comment. ok merging on green.

TomAugspurger (author):

OK, all green.

I'm going to

  1. go through this diff one more time
  2. Merge it to master
  3. Push a 0.23.0.dev0 tag to master
  4. branch 0.22.x off 0.21.x
  5. backport these two PRs to 0.22.x

TomAugspurger (author):

All green. Here we go.

@TomAugspurger TomAugspurger merged commit dedfce9 into pandas-dev:master Dec 29, 2017
@TomAugspurger TomAugspurger deleted the sum-mincount-breaking branch December 29, 2017 13:05
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 29, 2017
* API: Change the sum of all-NA / all-Empty sum / prod

* Max, not min

* Update whatsnew

* Parametrize test

* Minor cleanups

* Refactor skipna_alternative

* Split test

* Added issue

* More updates

* linting

* linting

* Added skips

* Doc fixup

* DOC: More whatsnew

(cherry picked from commit dedfce9)
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 29, 2017

(same squashed commit message as above, cherry picked from commit dedfce9)
TomAugspurger added a commit that referenced this pull request Dec 30, 2017
(same squashed commit message as above, cherry picked from commit dedfce9)

topper-123 commented Dec 30, 2017

I'm not sure I understand the rationale for making Series([np.nan]).prod() equal 1. E.g. for percent-change calculations, a calculation on an all-NaN series should logically give NaN, IMO, but with this change it will give 0.

>>> s = pd.Series([0, 0.05, 0.12])
>>> (1+s).prod() -1
0.17600000000000016  # ok
>>> s = pd.Series([0, 0.0, 0.0])
>>> (1+s).prod() -1
0.0  # ok
>>> s = pd.Series([np.nan, np.nan, np.nan])
>>> (1+s).prod() -1
0.0  # how do you interpret this?

IMO percent calculations will be difficult to interpret when the product of NaNs equals 1, as "no change" and "not defined" give back the same result of 0.0.

Is there a write-up on this, in case I'm misunderstanding this?


shoyer commented Dec 30, 2017

I'm not sure I understand the rationale for making Series([np.nan]).prod() equal 1.

The rationale is that this is consistent with how sum() works: 0 is the additive identity, and 1 is the multiplicative identity. It would be confusing to use a different rule.

There are absolutely use cases where these definitions for sum/prod are not desired, but it's impossible for the default behavior to work in all cases. Your use case of calculating percent change sounds like a good fit for min_count=1 or even skipna=False.
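The same identity convention shows up in Python's builtins:

```python
import math

empty_sum = sum([])         # 0, the additive identity
empty_prod = math.prod([])  # 1, the multiplicative identity
```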

hexgnu pushed a commit to hexgnu/pandas that referenced this pull request Jan 1, 2018
(same squashed commit message as above)
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
Version 0.22.0

* tag 'v0.22.0': (777 commits)
  RLS: v0.22.0
  DOC: Fix min_count docstring (pandas-dev#19005)
  DOC: More 0.22.0 updates (pandas-dev#19002)
  TST: Remove pow test in expressions
  COMPAT: Avoid td.skip decorator
  DOC: 0.22.0 release docs (pandas-dev#18983)
  DOC: Include 0.22.0 whatsnew
  Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
  ENH: Added a min_count keyword to stat funcs (pandas-dev#18876)
  RLS: v0.21.1
  DOC: Add date to whatsnew (pandas-dev#18740)
  DOC: Include 0.21.1 whatsnew
  DOC: Update relase notes (pandas-dev#18739)
  CFG: Ignore W503
  DOC: fix options table (pandas-dev#18730)
  ENH: support non default indexes in writing to Parquet (pandas-dev#18629)
  BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960)
  Parquet: Add error message for no engine found (pandas-dev#18717)
  BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652)
  DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
* releases: (777 commits), same commit list as the tag merge above