ENH: numerically stable rolling_skew and rolling_kurt #8424

Closed
wants to merge 1 commit (from the stable-roll-skew-kurt branch)

Conversation

@jaimefrio (Contributor)

Closes #6929. Hat tip to @behzadnouri for the z-scores idea in #8270, and to @seth-p for the general structure of the code, which separates the removal and addition of observations.

@jaimefrio (Contributor, Author)

Some rationale behind the design choices:

  • I have joined roll_skew and roll_kurt into a single roll_higher_moment, since they shared most of their code.
  • Since both skewness and kurtosis require a division by the variance, the function detects exactly zero variance, so that those windows come out as NaN exactly.
  • The actual computations of the skewness and kurtosis from the moments may look weird and inefficient, but they trade extra multiplications for fewer divisions and/or square roots. This is the fastest I could manage, and it doesn't seem to affect numerical stability.
  • Rather than explicitly coding the four cases of val and prev being NaN or not, I have used @seth-p's idea of handling the removal and addition of an observation separately. This is slightly less efficient for the general case where neither is NaN, but I felt there was already enough complication, and the performance of the functions is either improved or maintained relative to current master (more details below). There may be an extra 15% of performance left on the table there, though.
  • Both kurtosis and skewness have terrible numerical stability when the third or fourth central moment and the second central moment are all very small. To minimize the effect in some pathological cases, I have implemented @behzadnouri's idea of using z-scores as part of _rolling_func and _rolling_moment: they now take center_data and norm_data flags that trigger subtracting the mean and dividing by the standard deviation (a sketch of this idea follows the timings below).
  • To avoid unnecessarily slowing down execution, rolling_skew only centers the data, while rolling_kurt centers and normalizes it; see the timings below.
  • The loss of stability seems to be related to the removal of observations, so neither expanding_skew nor expanding_kurt does any centering or normalizing, although it would be very easy to implement.

Timings below are on x = np.random.rand(1e6), calling rolling_xxx(x, 100, 0). The chosen implementation is marked in bold. Performance is unchanged for rolling_skew and almost doubled for rolling_kurt:

|                 | rolling_skew | rolling_kurt |
|-----------------|--------------|--------------|
| master          | 55 ms        | 129 ms       |
| data as-is      | 50 ms        | 47 ms        |
| data centered   | **56 ms**    | 54 ms        |
| data normalized | 71 ms        | **69 ms**    |
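
To make the z-score idea concrete, here is a minimal NumPy sketch (not the PR's Cython code) of why standardizing the input helps: skewness and kurtosis are invariant under shifting and positive rescaling, so the moment sums can be accumulated on the standardized data without changing the result, while avoiding the catastrophic cancellation that raw power sums suffer when the data sit on a large offset. The rolling_kurt_naive helper below is purely illustrative.

```python
import numpy as np
from scipy import stats

def rolling_kurt_naive(x, window):
    """Rolling biased excess kurtosis from raw power sums (illustration only)."""
    out = np.full(len(x), np.nan)
    for i in range(window - 1, len(x)):
        w = x[i - window + 1:i + 1]
        n = float(window)
        m = w.sum() / n
        m2 = (w ** 2).sum() / n - m ** 2
        m4 = ((w ** 4).sum() / n - 4 * m * (w ** 3).sum() / n
              + 6 * m ** 2 * (w ** 2).sum() / n - 3 * m ** 4)
        # zero-variance windows are reported as NaN, as in the PR
        out[i] = m4 / m2 ** 2 - 3 if m2 > 0 else np.nan
    return out

x = np.random.rand(10_000) + 1e6       # large offset -> power sums cancel badly
z = (x - x.mean()) / x.std()           # what center_data + norm_data do

unstable = rolling_kurt_naive(x, 100)  # garbage results on the offset data
stable = rolling_kurt_naive(z, 100)    # same statistic, computed on z-scores
reference = stats.kurtosis(x[:100])    # direct check against the first window
```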

@seth-p (Contributor) commented Sep 30, 2014

Would it make sense to add a ddof argument?


@seth-p (Contributor) commented Sep 30, 2014

Also, I'd suggest adding consistency checks on rolling/expanding_skew/kurt (both ddof=0 and ddof=1) to _test_moments_consistency(), test_expanding_consistency(), and test_rolling_consistency().

@jaimefrio (Contributor, Author)

I don't think it's well defined what ddof would mean here... We could add a bias=True/False flag following scipy, but everything in pandas is unbiased by default, so I'm not sure it would make much sense.

@seth-p (Contributor) commented Sep 30, 2014

I'm not sure I agree that inf values should be converted to NaN. I'm not sure they shouldn't be, but my initial reaction is that they should be left as is.

Also, rather than explicitly masking out NaN values, why not use np.nanmean() and np.nanstd()? (I don't recall in which version of numpy they were introduced.) I suspect that will be quite a bit faster.

@seth-p (Contributor) commented Sep 30, 2014

Everything is unbiased by default, but (now, as of 0.15.0) that is specifiable. Seems like these should be too.

@jaimefrio (Contributor, Author)

np.nanmean and np.nanstd were included in numpy 1.8, but they mostly do the same as what is done here; you can look at the source here. And if we called nanmean and nanstd we would be creating (almost) the same mask three times, which is what I was trying to avoid.

Also, if you don't mask out the np.infs, they mess up the rolling-ness of the calculation, and everything will be np.inf or np.nan thereafter. This was already in there; I just refactored it to reuse the mask. It can be changed, but we should change it for all the rolling functions, not just for these two, so I don't think it belongs in this PR.
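
A rough illustration of the mask-reuse point (hypothetical helper code, not the PR's Cython): the finite-value mask is built once and shared between the mean, the standard deviation, and the later z-score pass, instead of letting np.nanmean and np.nanstd each rescan the data, and it also drops infs so they cannot contaminate the rolling sums.

```python
import numpy as np

# Hypothetical sketch of the approach described above.
values = np.array([1.0, np.nan, 3.0, np.inf, 5.0])

mask = np.isfinite(values)   # built once, reused for every pass below
finite = values[mask]

mean = finite.mean()         # np.nanmean/np.nanstd would each rebuild
std = finite.std()           # their own NaN mask internally

# Non-finite entries (NaN and +/-inf) are treated as missing so a single inf
# cannot turn every subsequent rolling window into inf/NaN.
z = np.where(mask, (values - mean) / std, np.nan)
```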

@jreback changed the title to "ENH: numerically stable rolling_skew and rolling_kurt" on Sep 30, 2014
@jreback added the Algos and Enhancement labels on Sep 30, 2014
@jreback added this to the 0.15.1 milestone on Sep 30, 2014
@@ -374,6 +375,11 @@ def _rolling_moment(arg, window, func, minp, axis=0, freq=None, center=False,
Passed on to func
kwargs : dict
Passed on to func
center_data : bool

Review comment (Contributor):

put the defaults here (after bool), e.g. center_data : bool, default False

@seth-p (Contributor) commented Oct 1, 2014

@jaimefrio, I'm not sure how ddof would figure into the calculation of skewness and kurtosis, but I think it would be easy to support a bias parameter analogous to that in ewmvar(). Following the notation on the second page of http://www.jstor.org/discover/10.2307/2988433 (page 184), I think the existing (bias=False) behavior is to calculate skew and kurtosis as G1 = K3 / K2^(3/2) and G2 = K4 / (K2^2), respectively. I'd then suggest that the bias=True behavior would be to calculate skew and kurtosis as m3 / m2^(3/2) and m4 / (m2^2) - 3, respectively. Though perhaps bias=True isn't the right way to refer to the latter, as m3 / m2^(3/2) isn't actually biased. At any rate, with these "biased" estimators you could then have simple consistency tests of moments along the lines of the existing check that biased var(x) == mean(x^2) - mean(x)^2 in _test_moments_consistency():

            mean_x = mean(x)
            ...
            for (std, var, cov, skew, kurt) in [
                (std_biased, var_biased, cov_biased, skew_biased, kurt_biased),
                (std_unbiased, var_unbiased, cov_unbiased, skew_unbiased, kurt_unbiased)]:
                var_x = var(x)
                std_x = std(x)
                skew_x = skew(x)
                kurt_x = kurt(x)

                ...
                if var is var_biased:
                    # check that biased var(x) == mean(x^2) - mean(x)^2
                    mean_x2 = mean(x * x)
                    assert_equal(var_x, mean_x2 - (mean_x * mean_x))
                    # check that "biased" skew(x) == third_moment / second_moment^(3/2)
                    third_moment_x = mean(x**3) - 3 * mean_x2 * mean_x + 2 * mean_x**3
                    assert_equal(skew_x, third_moment_x / (std_x**3))
                    # check that "biased" kurt(x) == fourth_moment / second_moment^2 - 3
                    fourth_moment_x = mean(x**4) - 4 * mean(x**3) * mean_x + 6 * mean_x2 * mean_x**2 - 3 * mean_x**4
                    assert_equal(kurt_x, fourth_moment_x / (var_x**2) - 3.)

There are probably typos in the above code, so caveat emptor.

@jaimefrio (Contributor, Author)

I'll just replicate what scipy.stats does, which is mostly what you describe.
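
For reference, scipy's convention (which the comment above proposes to replicate) exposes a bias flag on both statistics: bias=True gives the plain moment ratios, and bias=False applies the sample-size correction (the G1/G2 estimators mentioned earlier). A minimal usage sketch:

```python
import numpy as np
from scipy import stats

x = np.random.rand(1000)

skew_biased = stats.skew(x, bias=True)                     # m3 / m2**1.5
skew_g1 = stats.skew(x, bias=False)                        # G1, sample-size corrected
kurt_biased = stats.kurtosis(x, fisher=True, bias=True)    # m4 / m2**2 - 3
kurt_g2 = stats.kurtosis(x, fisher=True, bias=False)       # G2
```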

@jreback (Contributor) commented Jan 18, 2015

@jaimefrio can you rebase/revisit?

@jreback (Contributor) commented Mar 3, 2015

@jaimefrio can you rebase and add a release note, lgtm otherwise.

@jreback modified the milestones: 0.16.1, 0.16.0 on Mar 5, 2015
@jreback (Contributor) commented Apr 4, 2015

@jaimefrio can you rebase / revisit?

@jaimefrio (Contributor, Author)

Yes, will do, @jreback. Sorry I missed your earlier pings!

@jreback modified the milestones: 0.17.0, 0.16.1 on Apr 28, 2015
@jreback (Contributor) commented May 9, 2015

@jaimefrio can you update for 0.17.0

@jreback (Contributor) commented Jul 28, 2015

can you update?

@jreback (Contributor) commented Aug 15, 2015

@jaimefrio can you rebase?

@jreback modified the milestones: Next Major Release, 0.17.0 on Aug 20, 2015
@jreback (Contributor) commented Aug 20, 2015

@jaimefrio can you rebase

@jaimefrio force-pushed the stable-roll-skew-kurt branch from 4afb45c to 015d509 on August 21, 2015 04:42
@jaimefrio (Contributor, Author)

I've rebased, but I don't remember anymore what was missing for this to be ready...

@@ -374,6 +375,11 @@ def _rolling_moment(arg, window, func, minp, axis=0, freq=None, center=False,
Passed on to func
kwargs : dict
Passed on to func
center_data : bool
If True, subtract the mean of the data from the values
norm_data: bool

Review comment (Contributor):

can you add a versionadded directive here

@jreback (Contributor) commented Aug 21, 2015

can you add the examples from the top of issue #6929, e.g. validating the problematic skew/kurt results that we were getting before (IOW, calculate and compare to what they should be)?
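
A hypothetical sketch of the kind of check being asked for (the exact examples from #6929 are not reproduced here, and the data with a large constant offset is only assumed to trigger the instability): compute rolling kurtosis and compare each window against a direct scipy computation. With an unstable implementation this difference blows up; a stable one should keep it tiny.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(0)
arr = np.random.rand(500) + 1e6   # hypothetical pathological input: big offset
window = 20

result = pd.Series(arr).rolling(window).kurt()   # pd.rolling_kurt(...) in the 0.17-era API
expected = np.array([
    stats.kurtosis(arr[i - window + 1:i + 1], fisher=True, bias=False)
    for i in range(window - 1, len(arr))
])

# Maximum deviation from the window-by-window reference values.
print(np.max(np.abs(result.values[window - 1:] - expected)))
```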

@jreback (Contributor) commented Aug 21, 2015

can you add a release note in v0.17.0 (maybe a separate mini-section), to show what has changed? (e.g. you can use the same examples that show the stability problem).

@jreback modified the milestones: 0.17.0, Next Major Release on Aug 21, 2015
@jreback (Contributor) commented Sep 1, 2015

@jaimefrio can you update

@jreback modified the milestones: Next Major Release, 0.17.0 on Sep 2, 2015
@jreback (Contributor) commented Oct 18, 2015

can you add a release note to 0.17.1

@jreback (Contributor) commented Nov 18, 2015

@jaimefrio can you rebase/update?

@jaimefrio force-pushed the stable-roll-skew-kurt branch from 015d509 to d68300a on November 20, 2015 03:11
@jreback (Contributor) commented Jan 2, 2016

closing, but pls reopen when you can update

@jreback closed this on Jan 2, 2016
@jorisvandenbossche modified the milestones: No action, Next Major Release on Jul 11, 2016
Successfully merging this pull request may close the following issue:

PERF: Poor numerical stability of rolling_kurt and rolling_skew (#6929)