
ENH: Added a min_count keyword to stat funcs #18876

Merged 1 commit into pandas-dev:master on Dec 28, 2017

Conversation

@TomAugspurger (Contributor)

The current default is 1, reproducing the behavior of pandas 0.21. The current
test suite should pass. I'll add additional commits here changing the default to be 0.

Currently, only nansum and nanprod actually do anything with min_count. It
will not be hard to adjust the other nan* methods to use it if we want; this was
just the simplest approach for now.

Additional tests for the new behavior have been added.

closes #18678
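The min_count rule described above can be sketched in plain NumPy (an illustrative helper, not the pandas implementation):

```python
import numpy as np

def nansum_min_count(values, min_count=1):
    # Illustrative helper (not the pandas code): return NaN when fewer
    # than min_count non-NA values are present, else the NaN-skipping sum.
    n_valid = values.size - np.isnan(values).sum()
    if n_valid < min_count:
        return np.nan
    return np.nansum(values)

nansum_min_count(np.array([]))               # NaN: empty, default min_count=1
nansum_min_count(np.array([]), min_count=0)  # 0.0
nansum_min_count(np.array([1.0, np.nan]))    # 1.0
```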

@TomAugspurger TomAugspurger added this to the 0.22.0 milestone Dec 20, 2017
@TomAugspurger TomAugspurger added API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Dec 20, 2017
if len(self.kwargs) > 0:
for k, v in compat.iteritems(self.kwargs):
if k not in kwds:
kwds[k] = v
try:
if values.size == 0:

if values.size < min_count:
# we either return np.nan or pd.NaT
Contributor Author

Whoops, I moved this logic to _na_for_min_count but forgot to call it here.

@TomAugspurger (Contributor Author)

5630d9d passed all the CI tests, so that's good.

I just pushed another commit that fixes a few things, but is still backwards compatible with 0.21.x.

I'll push my breaking changes once I get things passing locally.

@jorisvandenbossche (Member)

I think we should for now only add min_count to sum and prod. Or does that complicate the implementation?

@TomAugspurger (Contributor Author)

Or does that complicate the implementation?

Not much. I mainly did it for all of them since it was easier, and would possibly be useful for other reductions (haven't thought it through). I'll push a commit that does it just for sum and prod.

@jorisvandenbossche (Member)

For other reductions like min, max, std, .. we already return NaN for empty or all-NA series, so not sure what min_count would be for

@TomAugspurger (Contributor Author)

Pushed another backwards-compatible commit, including docstrings, written from the point of view of a hypothetical 0.21.2 release. I haven't thought through whether it'd be worth backporting the non-breaking changes (adding min_count=1) to 0.21.x, but in case we do, it'll just be the current set of commits:

Signature: pd.Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=1, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result
    will be NA
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for Series.
min_count : int, default 1
    The required number of valid values to perform the operation. If fewer than
    ``min_count`` non-NA values are present the result will be NA.

    .. versionadded:: 0.21.2

       Added with the default being 1. This means the sum or product
       of an all-NA or empty series is ``NaN``.

Returns
-------
sum : scalar or Series (if level specified)

Examples
--------
By default, the sum of an empty series is ``NaN``.

>>> pd.Series([]).sum()  # min_count=1 is the default
nan

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be 0, pass ``min_count=0``.

>>> pd.Series([]).sum(min_count=0)
0.0

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
nan

>>> pd.Series([np.nan]).sum(min_count=0)
0.0

@TomAugspurger (Contributor Author)

For reviewing, do people have a preference for

a.) 2 PRs: the first adding a backwards compatible min_count=1, the second changing min_count=0 and updating all the tests
b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate).

result = np.empty(result_shape, dtype=values.dtype)
result.fill(fill_value)
return result
if values.size < min_count:
Contributor

values could be 2d here, so in theory should handle this on a per-column basis (but not 100% sure if it is ever practically 2d).

@@ -304,15 +308,18 @@ def nanall(values, axis=None, skipna=True):

@disallow('M8')
@bottleneck_switch()
def nansum(values, axis=None, skipna=True):
def nansum(values, axis=None, skipna=True, min_count=1):
if len(values) < min_count:
Contributor

can have a ndim > 1 here

@@ -1759,6 +1759,48 @@ def test_value_counts_categorical_not_ordered(self):
tm.assert_series_equal(idx.value_counts(normalize=True), exp)


class TestMinCount():
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)])
Contributor

so for sum we never use bottleneck, I think for prod we DO use bottleneck. So maybe should just fix that here as well.

Contributor Author

In _bn_ok_dtype

        # GH 9422
        # further we also want to preserve NaN when all elements
        # are NaN, unlinke bottleneck/numpy which consider this
        # to be 0
        if name in ['nansum', 'nanprod']:
            return False

so we don't use bottleneck. And apparently bottleneck doesn't have a nanprod, at least in my version installed locally.

@pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)])
def test_min_count_empty(self, method, use_bottleneck):
method, unit = method
s = pd.Series()
Contributor

all of these tests should be at the very top, right after (or maybe) replacing

class TestSeriesAnalytics(TestData):

    @pytest.mark.parametrize("use_bottleneck", [True, False])
    @pytest.mark.parametrize("method", ["sum", "prod"])
    def test_empty(self, method, use_bottleneck):

        with pd.option_context("use_bottleneck", use_bottleneck):
            # GH 9422
            # treat all missing as NaN

those tests need to be changed, no?

@jreback (Contributor)

jreback commented Dec 21, 2017

For reviewing, do people have a preference for

a.) 2 PRs: the first adding a backwards compatible min_count=1, the second changing min_count=0 and updating all the tests
b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate).

a) maybe that is better, then we can see both passing

@TomAugspurger (Contributor Author)

TomAugspurger commented Dec 21, 2017

I think this is ready for review (cc @shoyer). I haven't written a whatsnew since I'll be doing that in the followup PR.

Just to verify, the current plan is to release 0.22.0 as 0.21.1 + this PR and my follow-up PR with the breaking changes?

@@ -32,36 +32,64 @@ class TestSeriesAnalytics(TestData):
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", ["sum", "prod"])
def test_empty(self, method, use_bottleneck):
if method == "sum":
Contributor

you should just pass in unit with method

@pytest.mark.parametrize("method, unit", [("sum", 0.0), ("prod", 1.0)])


with pd.option_context("use_bottleneck", use_bottleneck):
# GH 9422
# treat all missing as NaN
s = Series([])
result = getattr(s, method)()
assert isna(result)
result = getattr(s, method)(min_count=0)
Contributor

I would explicitly pass min_count=1 in these tests as well, and then add another entry without passing min_count (i.e., the default).

@TomAugspurger (Contributor Author)

FYI, I have a min_count keyword working for groupby on my next branch:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": pd.Categorical(['a', 'a'], categories=['a', 'b']), "B": [1, 1]})

In [3]: df.groupby("A").B.sum()
Out[3]:
A
a    2
b    0
Name: B, dtype: int64

In [4]: df.groupby("A").B.sum(min_count=1)
Out[4]:
A
a    2.0
b    NaN
Name: B, dtype: float64

I think I'll include that in this PR as well (with the default being 1).

@@ -88,7 +89,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,

for i in range(ncounts):
for j in range(K):
if nobs[i, j] == 0:
if nobs[i, j] < min_count:
Contributor

does this work for min_count==0?

Contributor Author

Yeah, sumx starts out as zeros, so we just have to avoid setting it to NaN. Same for prod, but with ones.
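The cython logic being discussed can be sketched in pure Python (illustrative names; the real group_add is a generated cython routine). Sums start out as zeros, so with min_count=0 nothing ever overwrites them with NaN:

```python
import numpy as np

def group_sum_min_count(values, labels, ngroups, min_count=0):
    # Pure-Python sketch of the group_add idea: accumulate per-group sums
    # and counts of non-NA observations, then null out groups whose count
    # falls below min_count. With min_count=0 the zero-initialized sums
    # survive untouched.
    sums = np.zeros(ngroups)
    nobs = np.zeros(ngroups, dtype=np.int64)
    for lab, val in zip(labels, values):
        if not np.isnan(val):
            sums[lab] += val
            nobs[lab] += 1
    sums[nobs < min_count] = np.nan
    return sums

vals = np.array([1.0, 1.0, np.nan])
labs = np.array([0, 0, 1])
group_sum_min_count(vals, labs, ngroups=2, min_count=0)  # array([2., 0.])
group_sum_min_count(vals, labs, ngroups=2, min_count=1)  # array([2., nan])
```

This mirrors the df.groupby("A").B.sum() / sum(min_count=1) outputs shown earlier in the thread.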

@@ -2322,9 +2326,16 @@ def _aggregate(self, result, counts, values, comp_ids, agg_func,
for i, chunk in enumerate(values.transpose(2, 0, 1)):

chunk = chunk.squeeze()
agg_func(result[:, :, i], counts, chunk, comp_ids)
if min_count is None:
Contributor

I would always pass it to cython as an int. it can be -1 if its None somehow

Contributor Author

I mainly did this to avoid having to add an unused min_count keyword to all the group_* methods. If I added min_count to each of those, then I'd also need to raise if the non-default value is passed. I'll add a comment here that it's only expected for count and prod though.

Contributor

I would just add it and assert == -1. these are all internal routines anyhow. an if is very ugly and error prone here. better to make them have the same uniform signature.

Contributor Author

Ah, ok that sounds easy enough.

agg_func(result, counts, values, comp_ids)
if min_count is None:
agg_func(result, counts, values, comp_ids)
else:
Contributor

same

@@ -332,7 +343,8 @@ def get_dispatch(dtypes):
def group_last_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels):
ndarray[int64_t] labels,
Py_ssize_t min_count):
Contributor

you can set == -1 here (and actually all routines) i think

Contributor Author

OK, will do (had to for OHLC). I couldn't remember if cython took a perf hit for checking default values.

@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
grps = range(0, 55, 5)

for op, targop in ops:
result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
# calling _cython_agg_general directly, instead of via the user API
Contributor Author

This bit was unfortunate, but I think necessary since we're calling _cython_agg_general directly.

I'll need change min_count to 0 in the followup PR.

@TomAugspurger (Contributor Author)

Added min_count to resample in the last push.

In [4]: s.resample('30T').sum()  # default min_count=1 for now, will change to 0
Out[4]:
2017-01-01 00:00:00    1.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
Freq: 30T, dtype: float64

In [5]: s.resample('30T').sum(min_count=0)
Out[5]:
2017-01-01 00:00:00    1
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
Freq: 30T, dtype: int64

I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with min_periods?

I believe that min_periods specifies the required length of the window. If the window is shorter than that, the output is NA regardless of min_count.

Given a window of at least min_periods length, min_count specifies the required number of non-NA observations for a non-NA output.
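That proposed interaction can be sketched as follows (hypothetical semantics under discussion, not shipped behavior; the helper name is illustrative):

```python
import numpy as np

def rolling_sum_sketch(arr, window, min_periods, min_count):
    # Hypothetical semantics from the discussion above: a window shorter
    # than min_periods yields NaN regardless of min_count; otherwise at
    # least min_count non-NA observations are required for a non-NA sum.
    out = np.full(arr.shape, np.nan)
    for i in range(len(arr)):
        w = arr[max(0, i - window + 1):i + 1]
        if len(w) < min_periods:
            continue
        if np.count_nonzero(~np.isnan(w)) >= min_count:
            out[i] = np.nansum(w)
    return out

t = np.array([1.0, 1.0, np.nan, 1.0])
rolling_sum_sketch(t, window=2, min_periods=2, min_count=1)
# array([nan,  2.,  1.,  1.])
```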

@Appender(_local_template)
def f(self, **kwargs):
return f_(self, **kwargs)
else:
Contributor

Not sure you need this; shouldn't the template just work for this?

Contributor Author

It's not required, but with it I get the min_count in the signature.

In [5]: gr.sum?
Signature: gr.sum(min_count=1, **kwargs)
Docstring:
Compute sum of group values

Without the if / else, things work, but then the signature is just **kwargs. I have a slight preference for including it in the signature, but will revert if you want.

Contributor

I would revert it for now. we don't include this for anything else (IOW any parameters now).

Contributor

IIRC there is an issue about this; it needs a comprehensive solution.

Contributor Author

Yeah, I forgot about numeric_only, etc. OK.

@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
grps = range(0, 55, 5)

for op, targop in ops:
result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
Contributor

should parametrize this

@jreback (Contributor)

jreback commented Dec 21, 2017

I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with min_periods?

I believe that min_periods specifies the required length of the window. If the window is shorter than that, the output is NA regardless of min_count.

Given a window of at least min_periods length, min_count specifies the required number of non-NA observations for a non-NA output.

min_count is de facto the same as min_periods, or at least it performs the same function. I don't think it is necessary to add it, and doing so would add much complexity and user confusion. We could just rename min_periods to min_count, I suppose (that would need a deprecation cycle).

@TomAugspurger (Contributor Author)

Ah, is min_periods the minimum number of observations, or the minimum number of non-NA observations? e.g.

In [2]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan
   ...:

In [3]: s
Out[3]:
2017-01-01    1
2017-01-02    1
2017-01-03    1
2017-01-04    1
2017-01-05    1
2017-01-06    1
Freq: D, dtype: int64

In [4]: t
Out[4]:
2017-01-01    1.0
2017-01-02    1.0
2017-01-03    1.0
2017-01-04    NaN
2017-01-05    NaN
2017-01-06    1.0
Freq: D, dtype: float64

In [5]: s.rolling(2).sum()
Out[5]:
2017-01-01    NaN
2017-01-02    2.0
2017-01-03    2.0
2017-01-04    2.0
2017-01-05    2.0
2017-01-06    2.0
Freq: D, dtype: float64

In [6]: t.rolling(2).sum()
Out[6]:
2017-01-01    NaN   # (1,)
2017-01-02    2.0   # (1, 1)
2017-01-03    2.0   # (1, 1)
2017-01-04    NaN   # (1, nan)
2017-01-05    NaN   # (nan, nan)
2017-01-06    NaN   # (nan, 1)
Freq: D, dtype: float64

So will Out[6] be changing at all?

@jorisvandenbossche (Member) left a comment

Great work!

Do we need to add some tests with higher values for min_count? This 'works', so we should either test it or disallow it.

As also commented inline somewhere, the sum(level=) does not seem to be implemented yet:

In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))

In [16]: s
Out[16]: 
1  a    1.0
   b    NaN
2  a    NaN
   b    NaN
dtype: float64

In [17]: s.sum(level=0)
Out[17]: 
1    1.0
2    0.0
dtype: float64

In [18]: s.sum(level=0, min_count=0)
Out[18]: 
1    1.0
2    0.0
dtype: float64

In [19]: s.sum(level=0, min_count=1)
Out[19]: 
1    1.0
2    0.0
dtype: float64

(and it also changed compared to 0.21.0, there it gave 1.0, NaN in this case).

Groupby seems to have a wrong default (for now to be backwards compatible), but can't directly spot the error in the code:

In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))

In [21]: s.groupby(level=0).sum()
Out[21]: 
1    1.0
2    0.0
dtype: float64

In [22]: s.groupby(level=0).sum(min_count=0)
Out[22]: 
1    1.0
2    0.0
dtype: float64

In [23]: s.groupby(level=0).sum(min_count=1)
Out[23]: 
1    1.0
2    NaN
dtype: float64

(I get the same when grouping by array of values or grouping by column in dataframe)

The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.

.. versionadded :: 0.21.2
Member

probably it will become 0.22 ? (but can change later)

axis = self._stat_axis_number
if level is not None:
return self._agg_by_level(name, axis=axis, level=level,
skipna=skipna)
Member

Does this one also need to handle min_count ?

if axis is not None and getattr(result, 'ndim', False):
null_mask = (mask.shape[axis] - mask.sum(axis)) == 0
null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
Member

Can you just subtract something here if null_mask is later on used for indexing into result?

Contributor Author

I'm not sure what you mean here. null_mask will be a boolean array of which values in the output should be null because they have too few valid values.
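For illustration, the null_mask formula from the diff can be evaluated directly in NumPy (the example values and axis are chosen here, not taken from the PR):

```python
import numpy as np

# The null_mask formula from the diff, on a small 2-D example: an output
# position is marked null when its number of valid (non-NA) values along
# the reduction axis falls below min_count.
values = np.array([[1.0, np.nan],
                   [2.0, np.nan]])
mask = np.isnan(values)
axis, min_count = 0, 1
null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
# null_mask == array([False, True]): only the all-NaN column is nulled
```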

@TomAugspurger (Contributor Author)

Groupby seems to have a wrong default (for now to be backwards compatible), but can't directly spot the error in the code:

I just pushed a commit fixing that. min_count wasn't being passed through.

In [4]: s.groupby(level=0).sum()
Out[4]:
1    1.0
2    NaN
dtype: float64

As also commented inline somewhere, the sum(level=) does not seem to be implemented yet:

This seems to be fixed (presumably level= is using groupby and was hit by the same issue)

In [10]: s.sum(level=0)
Out[10]:
1    1.0
2    NaN
dtype: float64

@jorisvandenbossche (Member)

Groupby is indeed fixed!

This seems to be fixed (presumably level= is using groupy and was hit by the same issue)

The default is indeed fixed, but min_count has not yet any effect, so that still needs to be passed through as well in some way I think

@codecov

codecov bot commented Dec 22, 2017

Codecov Report

Merging #18876 into master will increase coverage by 0.07%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #18876      +/-   ##
==========================================
+ Coverage   91.57%   91.65%   +0.07%     
==========================================
  Files         150      154       +4     
  Lines       48913    51394    +2481     
==========================================
+ Hits        44794    47107    +2313     
- Misses       4119     4287     +168
Flag Coverage Δ
#multiple 89.52% <100%> (-0.42%) ⬇️
#single 40.84% <49.12%> (-0.33%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.95% <100%> (+0.03%) ⬆️
pandas/core/nanops.py 96.68% <100%> (+0.01%) ⬆️
pandas/core/resample.py 96.38% <100%> (+0.01%) ⬆️
pandas/core/groupby.py 92.13% <100%> (+0.04%) ⬆️
pandas/io/json/json.py 92.08% <0%> (-0.48%) ⬇️
pandas/tseries/offsets.py 96.95% <0%> (-0.12%) ⬇️
pandas/core/indexes/period.py 92.93% <0%> (-0.07%) ⬇️
pandas/core/indexes/datetimelike.py 97.13% <0%> (-0.07%) ⬇️
pandas/core/indexes/base.py 96.44% <0%> (-0.01%) ⬇️
... and 16 more


@TomAugspurger (Contributor Author)

Ok, we should be good on series/frame, groupby, and resample. Does anyone have thoughts on expanding / window? Does min_periods mean

  1. The minimum number of observations in the window
  2. The minimum number of non-NA observations in the window

If it's 1, then the current behavior (on master) is buggy:

In [6]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan
   ...:

In [7]: t.rolling(2, min_periods=2).apply(lambda x: np.nansum(x))
Out[7]:
2017-01-01    NaN  # (1,)
2017-01-02    2.0  # (1, 1)
2017-01-03    2.0  # (1, 1)
2017-01-04    NaN  # (1, nan)
2017-01-05    NaN  # (nan, nan)
2017-01-06    NaN  # (nan, 1)
Freq: D, dtype: float64

@jreback (Contributor)

jreback commented Dec 22, 2017

why would you think 1)? it has always meant non-NA

@jorisvandenbossche (Member)

I agree it currently means non-NA rows (although that is not explicit in the docstring).

I am thinking about whether there are possible reasons we would want to change this, and to have both min_periods in rolling and min_count in sum. In principle that would give the ability to distinguish between actual NaNs in the data and no data in the period ("introduced" NaNs).
(And for all other functions apart from sum and prod, it wouldn't change anything, as NaN values always return NaN anyway; so whether that is because of a NaN in the data or because of min_periods not including those NaN data does not matter.)

I personally wouldn't rename min_periods to min_count, but just leave the name as is (also if we don't add a min_count to the rolling sum). It is in a different function (rolling, not sum) and is also relevant for other aggregations that don't have such a min_count in their normal dataframe counterpart.

@TomAugspurger (Contributor Author)

Agreed with @jorisvandenbossche.

There may be a slight issue with the current implementation of rolling.sum() when min_periods=0.

On my other branch:

In [9]: x = pd.Series([np.nan])

In [10]: x.rolling(1, min_periods=0).sum()

Out[10]:
0   NaN
dtype: float64

In [11]: x.rolling(1, min_periods=0).apply(lambda v: pd.Series(v).sum())

Out[11]:
0    0.0
dtype: float64

I assume we want those to be consistent?

@TomAugspurger (Contributor Author)

Here's master:

In [1]: import pandas as pd; import numpy as np;

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0   NaN
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

Here's my branch

In [1]: import pandas as pd; import numpy as np;

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0    0.0
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

In [5]: x.rolling(1).sum()
Out[5]:
0   NaN
dtype: float64

The fix in window.pyx is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".

@jorisvandenbossche (Member)

The fix in window.pyx is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".

To be clear: so for now no change, which is keep min_periods as it was, and for now no min_count added to rolling sum?
(which is fine for me)

@TomAugspurger (Contributor Author)

TomAugspurger commented Dec 23, 2017 via email

@jreback (Contributor)

jreback commented Dec 28, 2017

@TomAugspurger why don't you rebase on master? We can merge this on green (as I think you are taking care of any remaining comments in the other PR), and then backport (maybe create the new branch as well)?

@TomAugspurger (Contributor Author)

Yeah, sounds good, this one should be OK.

I'll create a 0.22.x branch that starts at 0.21.x and backport this + the other PR once that's finished.

@jreback (Contributor)

jreback commented Dec 28, 2017

@TomAugspurger I also think we need to fix the version number? (Or I suppose once you tag 0.22, then we can fix that.)

@TomAugspurger (Contributor Author)

Yeah, how should we do that? I suppose we could tag 0.23.dev0 on master any time right? So then it stays ahead of 0.22.x.

@jreback (Contributor)

jreback commented Dec 28, 2017

yeah, I think we can change the tag on master. Merge away when ready.

@TomAugspurger TomAugspurger merged commit dbec3c9 into pandas-dev:master Dec 28, 2017
@TomAugspurger TomAugspurger deleted the sum-mincount branch December 28, 2017 14:44
@jreback (Contributor)

jreback commented Dec 29, 2017

thanks @TomAugspurger

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 29, 2017
(cherry picked from commit dbec3c9)
TomAugspurger added a commit that referenced this pull request Dec 30, 2017
(cherry picked from commit dbec3c9)
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
Version 0.22.0

* tag 'v0.22.0': (777 commits)
  RLS: v0.22.0
  DOC: Fix min_count docstring (pandas-dev#19005)
  DOC: More 0.22.0 updates (pandas-dev#19002)
  TST: Remove pow test in expressions
  COMPAT: Avoid td.skip decorator
  DOC: 0.22.0 release docs (pandas-dev#18983)
  DOC: Include 0.22.0 whatsnew
  Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
  ENH: Added a min_count keyword to stat funcs (pandas-dev#18876)
  RLS: v0.21.1
  DOC: Add date to whatsnew (pandas-dev#18740)
  DOC: Include 0.21.1 whatsnew
  DOC: Update relase notes (pandas-dev#18739)
  CFG: Ignore W503
  DOC: fix options table (pandas-dev#18730)
  ENH: support non default indexes in writing to Parquet (pandas-dev#18629)
  BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960)
  Parquet: Add error message for no engine found (pandas-dev#18717)
  BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652)
  DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
Successfully merging this pull request may close these issues.

Sum of all NA and empty should be 0