
ENH: Added a min_count keyword to stat funcs #18876

Merged 1 commit into pandas-dev:master on Dec 28, 2017

Conversation

@TomAugspurger (Contributor)

The current default is 1, reproducing the behavior of pandas 0.21. The current
test suite should pass. I'll add additional commits here changing the default to be 0.

Currently, only nansum and nanprod actually do anything with min_count. It
will not be hard to adjust the other nan* methods to use it if we want; this was
just the simplest approach for now.

Additional tests for the new behavior have been added.

closes #18678
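The min_count rule described above can be sketched in plain NumPy (an illustrative helper, not the pandas implementation):

```python
import numpy as np

def nansum_min_count(values, min_count=1):
    # Illustrative helper (not the pandas code): return NaN when fewer
    # than min_count non-NA values are present, else the NaN-skipping sum.
    n_valid = values.size - np.isnan(values).sum()
    if n_valid < min_count:
        return np.nan
    return np.nansum(values)

nansum_min_count(np.array([]))               # NaN: empty, default min_count=1
nansum_min_count(np.array([]), min_count=0)  # 0.0
nansum_min_count(np.array([1.0, np.nan]))    # 1.0
```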

@TomAugspurger TomAugspurger added this to the 0.22.0 milestone Dec 20, 2017
@TomAugspurger TomAugspurger added API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Dec 20, 2017
if len(self.kwargs) > 0:
for k, v in compat.iteritems(self.kwargs):
if k not in kwds:
kwds[k] = v
try:
if values.size == 0:

if values.size < min_count:
# we either return np.nan or pd.NaT
Contributor Author

Whoops, I moved this logic to _na_for_min_count but forgot to call it here.

@TomAugspurger (Contributor Author)

5630d9d passed all the CI tests, so that's good.

I just pushed another commit that fixes a few things, but is still backwards compatible with 0.21.x.

I'll push my breaking changes once I get things passing locally.

@jorisvandenbossche (Member)

I think we should for now only add min_count to sum and prod. Or does that complicate the implementation?

@TomAugspurger (Contributor Author)

Or does that complicate the implementation?

Not much. I mainly did it for all of them since it was easier, and would possibly be useful for other reductions (haven't thought it through). I'll push a commit that does it just for sum and prod.

@jorisvandenbossche (Member)

For other reductions like min, max, std, .. we already return NaN for empty or all-NA series, so not sure what min_count would be for

@TomAugspurger (Contributor Author)

Pushed another backwards-compatible commit, including docstrings, written from the point of view of a hypothetical 0.21.2 release. I haven't thought through whether it'd be worth backporting the non-breaking changes (adding min_count=1) to 0.21.x, but in case we do, it'll just be the current set of commits:

Signature: pd.Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=1, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result
    will be NA
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for Series.
min_count : int, default 1
    The required number of valid values to perform the operation. If fewer than
    ``min_count`` non-NA values are present the result will be NA.

    .. versionadded:: 0.21.2

       Added with the default being 1. This means the sum or product
       of an all-NA or empty series is ``NaN``.

Returns
-------
sum : scalar or Series (if level specified)

Examples
--------
By default, the sum of an empty series is ``NaN``.

>>> pd.Series([]).sum()  # min_count=1 is the default
nan

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be 0, pass ``min_count=0``.

>>> pd.Series([]).sum(min_count=0)
0.0

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
nan

>>> pd.Series([np.nan]).sum(min_count=0)
0.0

@TomAugspurger (Contributor Author)

For reviewing, do people have a preference for

a.) 2 PRs: the first adding a backwards compatible min_count=1, the second changing min_count=0 and updating all the tests
b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate).

result = np.empty(result_shape, dtype=values.dtype)
result.fill(fill_value)
return result
if values.size < min_count:
Contributor

values could be 2d here, so in theory should handle this on a per-column basis (but not 100% sure if it is ever practically 2d).

@@ -304,15 +308,18 @@ def nanall(values, axis=None, skipna=True):

@disallow('M8')
@bottleneck_switch()
def nansum(values, axis=None, skipna=True):
def nansum(values, axis=None, skipna=True, min_count=1):
if len(values) < min_count:
Contributor

can have a ndim > 1 here

@@ -1759,6 +1759,48 @@ def test_value_counts_categorical_not_ordered(self):
tm.assert_series_equal(idx.value_counts(normalize=True), exp)


class TestMinCount():
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)])
Contributor

so for sum we never use bottleneck, I think for prod we DO use bottleneck. So maybe should just fix that here as well.

Contributor Author

In _bn_ok_dtype

        # GH 9422
        # further we also want to preserve NaN when all elements
        # are NaN, unlinke bottleneck/numpy which consider this
        # to be 0
        if name in ['nansum', 'nanprod']:
            return False

so we don't use bottleneck. And apparently bottleneck doesn't have a nanprod, at least in my version installed locally.

@pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)])
def test_min_count_empty(self, method, use_bottleneck):
method, unit = method
s = pd.Series()
Contributor

all of these tests should be at the very top, right after (or maybe) replacing

class TestSeriesAnalytics(TestData):

    @pytest.mark.parametrize("use_bottleneck", [True, False])
    @pytest.mark.parametrize("method", ["sum", "prod"])
    def test_empty(self, method, use_bottleneck):

        with pd.option_context("use_bottleneck", use_bottleneck):
            # GH 9422
            # treat all missing as NaN

those tests need to be changed, no?

@jreback (Contributor)

jreback commented Dec 21, 2017

For reviewing, do people have a preference for

a.) 2 PRs: the first adding a backwards compatible min_count=1, the second changing min_count=0 and updating all the tests
b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate).

a) maybe that is better, then we can see both passing

@TomAugspurger (Contributor Author)

TomAugspurger commented Dec 21, 2017

I think this is ready for review (cc @shoyer). I haven't written a whatsnew since I'll be doing that in the followup PR.

Just to verify, the current plan is to release 0.22.0 as 0.21.1 + this PR and my follow-up PR with the breaking changes?

@@ -32,36 +32,64 @@ class TestSeriesAnalytics(TestData):
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", ["sum", "prod"])
def test_empty(self, method, use_bottleneck):
if method == "sum":
Contributor

you should just pass in unit with method

@pytest.mark.parametrize("method, unit", [("sum", 0.0), ("prod", 1.0)])


with pd.option_context("use_bottleneck", use_bottleneck):
# GH 9422
# treat all missing as NaN
s = Series([])
result = getattr(s, method)()
assert isna(result)
result = getattr(s, method)(min_count=0)
Contributor

I would explicitly pass min_count=1 in these tests as well, and then add another entry without passing min_count (i.e., the default).

@TomAugspurger (Contributor Author)

FYI, I have a min_count keyword working for groupby on my next branch:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": pd.Categorical(['a', 'a'], categories=['a', 'b']), "B": [1, 1]})

In [3]: df.groupby("A").B.sum()
Out[3]:
A
a    2
b    0
Name: B, dtype: int64

In [4]: df.groupby("A").B.sum(min_count=1)
Out[4]:
A
a    2.0
b    NaN
Name: B, dtype: float64

I think I'll include that in this PR as well (with the default being 1).

@@ -88,7 +89,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,

for i in range(ncounts):
for j in range(K):
if nobs[i, j] == 0:
if nobs[i, j] < min_count:
Contributor

does this work for min_count==0?

Contributor Author

Yeah, sumx starts out as zeros, so we just have to avoid setting it to NaN. Same for prod, but with ones.
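The cython logic being discussed can be sketched in pure Python (illustrative names; the real group_add is a generated cython routine). Sums start out as zeros, so with min_count=0 nothing ever overwrites them with NaN:

```python
import numpy as np

def group_sum_min_count(values, labels, ngroups, min_count=0):
    # Pure-Python sketch of the group_add idea: accumulate per-group sums
    # and counts of non-NA observations, then null out groups whose count
    # falls below min_count. With min_count=0 the zero-initialized sums
    # survive untouched.
    sums = np.zeros(ngroups)
    nobs = np.zeros(ngroups, dtype=np.int64)
    for lab, val in zip(labels, values):
        if not np.isnan(val):
            sums[lab] += val
            nobs[lab] += 1
    sums[nobs < min_count] = np.nan
    return sums

vals = np.array([1.0, 1.0, np.nan])
labs = np.array([0, 0, 1])
group_sum_min_count(vals, labs, ngroups=2, min_count=0)  # array([2., 0.])
group_sum_min_count(vals, labs, ngroups=2, min_count=1)  # array([2., nan])
```

This mirrors the df.groupby("A").B.sum() / sum(min_count=1) outputs shown earlier in the thread.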

@@ -2322,9 +2326,16 @@ def _aggregate(self, result, counts, values, comp_ids, agg_func,
for i, chunk in enumerate(values.transpose(2, 0, 1)):

chunk = chunk.squeeze()
agg_func(result[:, :, i], counts, chunk, comp_ids)
if min_count is None:
Contributor

I would always pass it to cython as an int. it can be -1 if its None somehow

Contributor Author

I mainly did this to avoid having to add an unused min_count keyword to all the group_* methods. If I added min_count to each of those, then I'd also need to raise if the non-default value is passed. I'll add a comment here that it's only expected for count and prod though.

Contributor

I would just add it and assert == -1. these are all internal routines anyhow. an if is very ugly and error prone here. better to make them have the same uniform signature.

Contributor Author

Ah, ok that sounds easy enough.

agg_func(result, counts, values, comp_ids)
if min_count is None:
agg_func(result, counts, values, comp_ids)
else:
Contributor

same

@@ -332,7 +343,8 @@ def get_dispatch(dtypes):
def group_last_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels):
ndarray[int64_t] labels,
Py_ssize_t min_count):
Contributor

you can set == -1 here (and actually all routines) i think

Contributor Author

OK, will do (had to for OHLC). I couldn't remember if cython took a perf hit for checking default values.

@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
grps = range(0, 55, 5)

for op, targop in ops:
result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
# calling _cython_agg_general directly, instead of via the user API
Contributor Author

This bit was unfortunate, but I think necessary since we're calling _cython_agg_general directly.

I'll need change min_count to 0 in the followup PR.

@TomAugspurger (Contributor Author)

Added min_count to resample in the last push.

In [4]: s.resample('30T').sum()  # default min_count=1 for now, will change to 0
Out[4]:
2017-01-01 00:00:00    1.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
Freq: 30T, dtype: float64

In [5]: s.resample('30T').sum(min_count=0)
Out[5]:
2017-01-01 00:00:00    1
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
Freq: 30T, dtype: int64

I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with min_periods?

I believe that min_periods specifies the required length of the window. If the window is shorter than that, the output is NA regardless of min_count.

Given a window of at least min_periods length, min_count specifies the required number of non-NA observations for a non-NA output.
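That proposed interaction can be sketched as follows (hypothetical semantics under discussion, not shipped behavior; the helper name is illustrative):

```python
import numpy as np

def rolling_sum_sketch(arr, window, min_periods, min_count):
    # Hypothetical semantics from the discussion above: a window shorter
    # than min_periods yields NaN regardless of min_count; otherwise at
    # least min_count non-NA observations are required for a non-NA sum.
    out = np.full(arr.shape, np.nan)
    for i in range(len(arr)):
        w = arr[max(0, i - window + 1):i + 1]
        if len(w) < min_periods:
            continue
        if np.count_nonzero(~np.isnan(w)) >= min_count:
            out[i] = np.nansum(w)
    return out

t = np.array([1.0, 1.0, np.nan, 1.0])
rolling_sum_sketch(t, window=2, min_periods=2, min_count=1)
# array([nan,  2.,  1.,  1.])
```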

@Appender(_local_template)
def f(self, **kwargs):
return f_(self, **kwargs)
else:
Contributor

Not sure you need this; shouldn't the template just work for this?

Contributor Author

It's not required, but with it I get the min_count in the signature.

In [5]: gr.sum?
Signature: gr.sum(min_count=1, **kwargs)
Docstring:
Compute sum of group values

Without the if / else, things work, but then the signature is just **kwargs. I have a slight preference for including it in the signature, but will revert if you want.

Contributor

I would revert it for now. we don't include this for anything else (IOW any parameters now).

Contributor

IIRC there is an issue about this; it needs a comprehensive solution.

Contributor Author

Yeah, I forgot about numeric_only, etc. OK.

@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
grps = range(0, 55, 5)

for op, targop in ops:
result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
Contributor

should parametrize this

@jreback (Contributor)

jreback commented Dec 21, 2017

I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with min_periods?

I believe that min_periods specifies the required length of the window. If the window is shorter than that, the output is NA regardless of min_count.

Given a window of at least min_periods length, min_count specifies the required number of non-NA observations for a non-NA output.

min_count is de facto the same as min_periods, or at least it performs the same function. I don't think it is necessary to add it, and doing so would add much complexity and user confusion. We could just rename min_periods to min_count, I suppose (that would need a deprecation cycle).

@TomAugspurger (Contributor Author)

Ah, is min_periods the minimum number of observations, or the minimum number of non-NA observations? e.g.

In [2]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan
   ...:

In [3]: s
Out[3]:
2017-01-01    1
2017-01-02    1
2017-01-03    1
2017-01-04    1
2017-01-05    1
2017-01-06    1
Freq: D, dtype: int64

In [4]: t
Out[4]:
2017-01-01    1.0
2017-01-02    1.0
2017-01-03    1.0
2017-01-04    NaN
2017-01-05    NaN
2017-01-06    1.0
Freq: D, dtype: float64

In [5]: s.rolling(2).sum()
Out[5]:
2017-01-01    NaN
2017-01-02    2.0
2017-01-03    2.0
2017-01-04    2.0
2017-01-05    2.0
2017-01-06    2.0
Freq: D, dtype: float64

In [6]: t.rolling(2).sum()
Out[6]:
2017-01-01    NaN   # (1,)
2017-01-02    2.0   # (1, 1)
2017-01-03    2.0   # (1, 1)
2017-01-04    NaN   # (1, nan)
2017-01-05    NaN   # (nan, nan)
2017-01-06    NaN   # (nan, 1)
Freq: D, dtype: float64

So will Out[6] be changing at all?

@jorisvandenbossche (Member) left a comment

Great work!

Do we need to add some tests with higher values for min_count? This 'works', so we should either test it or disallow it.

As also commented inline somewhere, the sum(level=) does not seem to be implemented yet:

In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))

In [16]: s
Out[16]: 
1  a    1.0
   b    NaN
2  a    NaN
   b    NaN
dtype: float64

In [17]: s.sum(level=0)
Out[17]: 
1    1.0
2    0.0
dtype: float64

In [18]: s.sum(level=0, min_count=0)
Out[18]: 
1    1.0
2    0.0
dtype: float64

In [19]: s.sum(level=0, min_count=1)
Out[19]: 
1    1.0
2    0.0
dtype: float64

(and it also changed compared to 0.21.0, there it gave 1.0, NaN in this case).

Groupby seems to have a wrong default (for now to be backwards compatible), but can't directly spot the error in the code:

In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))

In [21]: s.groupby(level=0).sum()
Out[21]: 
1    1.0
2    0.0
dtype: float64

In [22]: s.groupby(level=0).sum(min_count=0)
Out[22]: 
1    1.0
2    0.0
dtype: float64

In [23]: s.groupby(level=0).sum(min_count=1)
Out[23]: 
1    1.0
2    NaN
dtype: float64

(I get the same when grouping by array of values or grouping by column in dataframe)

The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.

.. versionadded :: 0.21.2
Member

probably it will become 0.22 ? (but can change later)

axis = self._stat_axis_number
if level is not None:
return self._agg_by_level(name, axis=axis, level=level,
skipna=skipna)
Member

Does this one also need to handle min_count ?

if axis is not None and getattr(result, 'ndim', False):
null_mask = (mask.shape[axis] - mask.sum(axis)) == 0
null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
Member

Can you just subtract something here if null_mask is later on used for indexing into result?

Contributor Author

I'm not sure what you mean here. null_mask will be a boolean array of which values in the output should be null because they have too few valid values.
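For illustration, the null_mask formula from the diff can be evaluated directly in NumPy (the example values and axis are chosen here, not taken from the PR):

```python
import numpy as np

# The null_mask formula from the diff, on a small 2-D example: an output
# position is marked null when its number of valid (non-NA) values along
# the reduction axis falls below min_count.
values = np.array([[1.0, np.nan],
                   [2.0, np.nan]])
mask = np.isnan(values)
axis, min_count = 0, 1
null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
# null_mask == array([False, True]): only the all-NaN column is nulled
```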

@TomAugspurger (Contributor Author)

Groupby seems to have a wrong default (for now to be backwards compatible), but can't directly spot the error in the code:

I just pushed a commit fixing that. min_count wasn't being passed through.

In [4]: s.groupby(level=0).sum()
Out[4]:
1    1.0
2    NaN
dtype: float64

As also commented inline somewhere, the sum(level=) does not seem to be implemented yet:

This seems to be fixed (presumably level= is using groupby and was hit by the same issue)

In [10]: s.sum(level=0)
Out[10]:
1    1.0
2    NaN
dtype: float64

@jorisvandenbossche (Member)

Groupby is indeed fixed!

This seems to be fixed (presumably level= is using groupy and was hit by the same issue)

The default is indeed fixed, but min_count has not yet any effect, so that still needs to be passed through as well in some way I think

@codecov

codecov bot commented Dec 22, 2017

Codecov Report

Merging #18876 into master will increase coverage by 0.07%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #18876      +/-   ##
==========================================
+ Coverage   91.57%   91.65%   +0.07%     
==========================================
  Files         150      154       +4     
  Lines       48913    51394    +2481     
==========================================
+ Hits        44794    47107    +2313     
- Misses       4119     4287     +168
Flag Coverage Δ
#multiple 89.52% <100%> (-0.42%) ⬇️
#single 40.84% <49.12%> (-0.33%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.95% <100%> (+0.03%) ⬆️
pandas/core/nanops.py 96.68% <100%> (+0.01%) ⬆️
pandas/core/resample.py 96.38% <100%> (+0.01%) ⬆️
pandas/core/groupby.py 92.13% <100%> (+0.04%) ⬆️
pandas/io/json/json.py 92.08% <0%> (-0.48%) ⬇️
pandas/tseries/offsets.py 96.95% <0%> (-0.12%) ⬇️
pandas/core/indexes/period.py 92.93% <0%> (-0.07%) ⬇️
pandas/core/indexes/datetimelike.py 97.13% <0%> (-0.07%) ⬇️
pandas/core/indexes/base.py 96.44% <0%> (-0.01%) ⬇️
... and 16 more


@TomAugspurger (Contributor Author)

Ok, we should be good on series/frame, groupby, and resample. Does anyone have thoughts on expanding / window? Does min_periods mean

  1. The minimum number of observations in the window
  2. The minimum number of non-NA observations in the window

If it's 1, then the current behavior (on master) is buggy:

In [6]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan
   ...:

In [7]: t.rolling(2, min_periods=2).apply(lambda x: np.nansum(x))
Out[7]:
2017-01-01    NaN  # (1,)
2017-01-02    2.0  # (1, 1)
2017-01-03    2.0  # (1, 1)
2017-01-04    NaN  # (1, nan)
2017-01-05    NaN  # (nan, nan)
2017-01-06    NaN  # (nan, 1)
Freq: D, dtype: float64

@jreback (Contributor)

jreback commented Dec 22, 2017

why would you think 1)? it has always meant non-NA

@jorisvandenbossche (Member)

I agree it currently means non-NA rows (although that is not explicit in the docstring).

I am thinking about whether there are possible reasons we would want to change this, and to have both min_periods in rolling and min_count in sum. In principle that would give the ability to distinguish between actual NaNs in the data and no data in the period ("introduced" NaNs).
(And for all other functions apart from sum and prod, it wouldn't change anything, as NaN values always return NaN anyway; so whether that is because of a NaN in the data or because of min_periods not including those NaN data does not matter.)

I personally wouldn't rename min_periods to min_count, but just leave the name as is (also if we don't add a min_count to the rolling sum). It is in a different function (rolling, not sum) and is also relevant for other aggregations that don't have such a min_count in their normal dataframe counterpart.

@TomAugspurger (Contributor Author)

Agreed with @jorisvandenbossche.

There may be a slight issue with the current implementation of rolling.sum() when min_periods=0.

On my other branch:

In [9]: x = pd.Series([np.nan])

In [10]: x.rolling(1, min_periods=0).sum()

Out[10]:
0   NaN
dtype: float64

In [11]: x.rolling(1, min_periods=0).apply(lambda v: pd.Series(v).sum())

Out[11]:
0    0.0
dtype: float64

I assume we want those to be consistent?

@TomAugspurger (Contributor Author)

Here's master:

In [1]: import pandas as pd; import numpy as np;

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0   NaN
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

Here's my branch

In [1]: import pandas as pd; import numpy as np;

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0    0.0
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

In [5]: x.rolling(1).sum()
Out[5]:
0   NaN
dtype: float64

The fix in window.pyx is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".

@jorisvandenbossche (Member)

The fix in window.pyx is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".

To be clear: so for now no change, which is keep min_periods as it was, and for now no min_count added to rolling sum?
(which is fine for me)

@TomAugspurger (Contributor Author)

TomAugspurger commented Dec 23, 2017 via email

@jreback (Contributor)

jreback commented Dec 28, 2017

@TomAugspurger why don't you rebase on master? We can merge this on green (as I think you are taking care of any remaining comments in the other PR), and then backport (maybe create the new branch as well)?

@TomAugspurger (Contributor Author)

Yeah, sounds good, this one should be OK.

I'll create a 0.22.x branch that starts at 0.21.x and backport this + the other PR once that's finished.

@jreback (Contributor)

jreback commented Dec 28, 2017

@TomAugspurger I also think we need to fix the version number? (Or I suppose once you tag 0.22, then we can fix that.)

@TomAugspurger (Contributor Author)

Yeah, how should we do that? I suppose we could tag 0.23.dev0 on master any time right? So then it stays ahead of 0.22.x.

@jreback (Contributor)

jreback commented Dec 28, 2017

yeah, I think we can change the tag on master. Merge away when ready.

@TomAugspurger TomAugspurger merged commit dbec3c9 into pandas-dev:master Dec 28, 2017
@TomAugspurger TomAugspurger deleted the sum-mincount branch December 28, 2017 14:44
@jreback (Contributor)

jreback commented Dec 29, 2017

thanks @TomAugspurger

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 29, 2017
(cherry picked from commit dbec3c9)
TomAugspurger added a commit that referenced this pull request Dec 30, 2017
(cherry picked from commit dbec3c9)
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
Version 0.22.0

* tag 'v0.22.0': (777 commits)
  RLS: v0.22.0
  DOC: Fix min_count docstring (pandas-dev#19005)
  DOC: More 0.22.0 updates (pandas-dev#19002)
  TST: Remove pow test in expressions
  COMPAT: Avoid td.skip decorator
  DOC: 0.22.0 release docs (pandas-dev#18983)
  DOC: Include 0.22.0 whatsnew
  Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
  ENH: Added a min_count keyword to stat funcs (pandas-dev#18876)
  RLS: v0.21.1
  DOC: Add date to whatsnew (pandas-dev#18740)
  DOC: Include 0.21.1 whatsnew
  DOC: Update relase notes (pandas-dev#18739)
  CFG: Ignore W503
  DOC: fix options table (pandas-dev#18730)
  ENH: support non default indexes in writing to Parquet (pandas-dev#18629)
  BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960)
  Parquet: Add error message for no engine found (pandas-dev#18717)
  BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652)
  DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
Successfully merging this pull request may close these issues.

Sum of all NA and empty should be 0