ENH: numerically stable rolling_skew and rolling_kurt #8424

Closed
wants to merge 1 commit (from the stable-roll-skew-kurt branch)

Conversation

@jaimefrio (Contributor)

Closes #6929. Hat tip to @behzadnouri for the z-scores idea in #8270, and to @seth-p for the general structure of the code, which separates the removal and addition of observations.

@jaimefrio (Contributor, Author)

Some rationale behind the design choices:

  • I have joined roll_skew and roll_kurt into a single roll_higher_moment, since they shared most of their code.
  • Since both skewness and kurtosis require a division by the variance, the function detects exactly zero variance, so that those windows come out as NaN exactly.
  • The actual computations of the skewness and kurtosis from the moments may look weird and inefficient, but they trade extra multiplications for fewer divisions and/or square roots. This is the fastest I could manage, and it doesn't seem to affect numerical stability.
  • Rather than explicitly coding the four cases of val and prev being NaN or not, I have used @seth-p's idea of handling the removal and addition of an observation separately. This is slightly less efficient for the general case where neither is NaN, but I felt there was already enough complication, and the performance of the functions is either improved or maintained relative to current master (more details below). There may be an extra 15% of performance left on the table there, though.
  • Both kurtosis and skewness have terrible numerical stability when the third or fourth central moment and the second central moment are all very small. To minimize the effect in some pathological cases, I have implemented @behzadnouri's idea of using z-scores as part of _rolling_func and _rolling_moment: they now take center_data and norm_data flags that trigger subtracting the mean and dividing by the standard deviation (a sketch of this idea follows the timings below).
  • To avoid unnecessarily slowing down execution, rolling_skew only centers the data, while rolling_kurt centers and normalizes it; see the timings below.
  • The loss of stability seems to be related to the removal of observations, so neither expanding_skew nor expanding_kurt does any centering or normalizing, although it would be very easy to implement.

Timings below are on x = np.random.rand(1e6), calling rolling_xxx(x, 100, 0). The chosen implementation is marked in bold. Performance is unchanged for rolling_skew and almost doubled for rolling_kurt:

|                 | rolling_skew | rolling_kurt |
|-----------------|--------------|--------------|
| master          | 55 ms        | 129 ms       |
| data as-is      | 50 ms        | 47 ms        |
| data centered   | **56 ms**    | 54 ms        |
| data normalized | 71 ms        | **69 ms**    |
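
To make the z-score idea concrete, here is a minimal NumPy sketch (not the PR's Cython code) of why standardizing the input helps: skewness and kurtosis are invariant under shifting and positive rescaling, so the moment sums can be accumulated on the standardized data without changing the result, while avoiding the catastrophic cancellation that raw power sums suffer when the data sit on a large offset. The rolling_kurt_naive helper below is purely illustrative.

```python
import numpy as np
from scipy import stats

def rolling_kurt_naive(x, window):
    """Rolling biased excess kurtosis from raw power sums (illustration only)."""
    out = np.full(len(x), np.nan)
    for i in range(window - 1, len(x)):
        w = x[i - window + 1:i + 1]
        n = float(window)
        m = w.sum() / n
        m2 = (w ** 2).sum() / n - m ** 2
        m4 = ((w ** 4).sum() / n - 4 * m * (w ** 3).sum() / n
              + 6 * m ** 2 * (w ** 2).sum() / n - 3 * m ** 4)
        # zero-variance windows are reported as NaN, as in the PR
        out[i] = m4 / m2 ** 2 - 3 if m2 > 0 else np.nan
    return out

x = np.random.rand(10_000) + 1e6       # large offset -> power sums cancel badly
z = (x - x.mean()) / x.std()           # what center_data + norm_data do

unstable = rolling_kurt_naive(x, 100)  # garbage results on the offset data
stable = rolling_kurt_naive(z, 100)    # same statistic, computed on z-scores
reference = stats.kurtosis(x[:100])    # direct check against the first window
```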

@seth-p (Contributor) commented Sep 30, 2014

Would it make sense to add a ddof argument?


@seth-p (Contributor) commented Sep 30, 2014

Also, I'd suggest adding consistency checks on rolling/expanding_skew/kurt (both ddof=0 and ddof=1) to _test_moments_consistency(), test_expanding_consistency(), and test_rolling_consistency().

@jaimefrio (Contributor, Author)

I don't think it's well defined what ddof would mean here... We could add a bias=True/False flag following scipy, but everything in pandas is unbiased by default, so I'm not sure it would make much sense.

@seth-p (Contributor) commented Sep 30, 2014

I'm not sure I agree that inf values should be converted to NaN. I'm not sure they shouldn't be, but my initial reaction is that they should be left as is.

Also, rather than explicitly masking out NaN values, why not use np.nanmean() and np.nanstd()? (I don't recall in which version of numpy they were introduced.) I suspect that will be quite a bit faster.

@seth-p (Contributor) commented Sep 30, 2014

Everything is unbiased by default, but (now, as of 0.15.0) that is specifiable. Seems like these should be too.

@jaimefrio (Contributor, Author)

np.nanmean and np.nanstd were included in numpy 1.8, but they mostly do the same as what is done here; you can look at the source here. And if we called nanmean and nanstd we would be creating (almost) the same mask three times, which is what I was trying to avoid.

Also, if you don't mask out the np.infs, they mess up the rolling-ness of the calculation, and everything will be np.inf or np.nan thereafter. This was already in there; I just refactored it to reuse the mask. It can be changed, but we should change it for all the rolling functions, not just for these two, so I don't think it belongs in this PR.
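
A rough illustration of the mask-reuse point (hypothetical helper code, not the PR's Cython): the finite-value mask is built once and shared between the mean, the standard deviation, and the later z-score pass, instead of letting np.nanmean and np.nanstd each rescan the data, and it also drops infs so they cannot contaminate the rolling sums.

```python
import numpy as np

# Hypothetical sketch of the approach described above.
values = np.array([1.0, np.nan, 3.0, np.inf, 5.0])

mask = np.isfinite(values)   # built once, reused for every pass below
finite = values[mask]

mean = finite.mean()         # np.nanmean/np.nanstd would each rebuild
std = finite.std()           # their own NaN mask internally

# Non-finite entries (NaN and +/-inf) are treated as missing so a single inf
# cannot turn every subsequent rolling window into inf/NaN.
z = np.where(mask, (values - mean) / std, np.nan)
```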

@jreback changed the title to "ENH: numerically stable rolling_skew and rolling_kurt" on Sep 30, 2014
@jreback added the Algos and Enhancement labels on Sep 30, 2014
@jreback added this to the 0.15.1 milestone on Sep 30, 2014
@@ -374,6 +375,11 @@ def _rolling_moment(arg, window, func, minp, axis=0, freq=None, center=False,
Passed on to func
kwargs : dict
Passed on to func
center_data : bool

Review comment (Contributor):

put the defaults here (after bool), e.g. center_data : bool, default False

@seth-p (Contributor) commented Oct 1, 2014

@jaimefrio, I'm not sure how ddof would figure into the calculation of skewness and kurtosis, but I think it would be easy to support a bias parameter analogous to that in ewmvar(). Following the notation on the second page of http://www.jstor.org/discover/10.2307/2988433 (page 184), I think the existing (bias=False) behavior is to calculate skew and kurtosis as G1 = K3 / K2^(3/2) and G2 = K4 / (K2^2), respectively. I'd then suggest that the bias=True behavior would be to calculate skew and kurtosis as m3 / m2^(3/2) and m4 / (m2^2) - 3, respectively. Though perhaps bias=True isn't the right way to refer to the latter, as m3 / m2^(3/2) isn't actually biased. At any rate, with these "biased" estimators you could then have simple consistency tests of moments along the lines of the existing check that biased var(x) == mean(x^2) - mean(x)^2 in _test_moments_consistency():

            mean_x = mean(x)
            ...
            for (std, var, cov, skew, kurt) in [
                (std_biased, var_biased, cov_biased, skew_biased, kurt_biased),
                (std_unbiased, var_unbiased, cov_unbiased, skew_unbiased, kurt_unbiased)]:
                var_x = var(x)
                std_x = std(x)
                skew_x = skew(x)
                kurt_x = kurt(x)

                ...
                if var is var_biased:
                    # check that biased var(x) == mean(x^2) - mean(x)^2
                    mean_x2 = mean(x * x)
                    assert_equal(var_x, mean_x2 - (mean_x * mean_x))
                    # check that "biased" skew(x) == third_moment / second_moment^(3/2)
                    third_moment_x = mean(x**3) - 3 * mean_x2 * mean_x + 2 * mean_x**3
                    assert_equal(skew_x, third_moment_x / (std_x**3))
                    # check that "biased" kurt(x) == fourth_moment / second_moment^2 - 3
                    fourth_moment_x = mean(x**4) - 4 * mean(x**3) * mean_x + 6 * mean_x2 * mean_x**2 - 3 * mean_x**4
                    assert_equal(kurt_x, fourth_moment_x / (var_x**2) - 3.)

There are probably typos in the above code, so caveat emptor.

@jaimefrio (Contributor, Author)

I'll just replicate what scipy.stats does, which is mostly what you describe.
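
For reference, scipy's convention (which the comment above proposes to replicate) exposes a bias flag on both statistics: bias=True gives the plain moment ratios, and bias=False applies the sample-size correction (the G1/G2 estimators mentioned earlier). A minimal usage sketch:

```python
import numpy as np
from scipy import stats

x = np.random.rand(1000)

skew_biased = stats.skew(x, bias=True)                     # m3 / m2**1.5
skew_g1 = stats.skew(x, bias=False)                        # G1, sample-size corrected
kurt_biased = stats.kurtosis(x, fisher=True, bias=True)    # m4 / m2**2 - 3
kurt_g2 = stats.kurtosis(x, fisher=True, bias=False)       # G2
```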

@jreback (Contributor) commented Jan 18, 2015

@jaimefrio can you rebase/revisit?

@jreback (Contributor) commented Mar 3, 2015

@jaimefrio can you rebase and add a release note, lgtm otherwise.

@jreback modified the milestones: 0.16.1, 0.16.0 on Mar 5, 2015
@jreback (Contributor) commented Apr 4, 2015

@jaimefrio can you rebase / revisit?

@jaimefrio (Contributor, Author)

Yes, will do, @jreback. Sorry I missed your earlier pings!

@jreback modified the milestones: 0.17.0, 0.16.1 on Apr 28, 2015
@jreback (Contributor) commented May 9, 2015

@jaimefrio can you update for 0.17.0

@jreback (Contributor) commented Jul 28, 2015

can you update?

@jreback (Contributor) commented Aug 15, 2015

@jaimefrio can you rebase?

@jreback modified the milestones: Next Major Release, 0.17.0 on Aug 20, 2015
@jreback (Contributor) commented Aug 20, 2015

@jaimefrio can you rebase

@jaimefrio force-pushed the stable-roll-skew-kurt branch from 4afb45c to 015d509 on August 21, 2015 04:42
@jaimefrio (Contributor, Author)

I've rebased, but I don't remember anymore what was missing for this to be ready...

@@ -374,6 +375,11 @@ def _rolling_moment(arg, window, func, minp, axis=0, freq=None, center=False,
Passed on to func
kwargs : dict
Passed on to func
center_data : bool
If True, subtract the mean of the data from the values
norm_data: bool

Review comment (Contributor):

can you add a versionadded directive here

@jreback (Contributor) commented Aug 21, 2015

can you add the examples from the top of issue #6929, e.g. validating the problematic skew/kurt results that we were getting before (IOW, calculate and compare to what they should be)?
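
A hypothetical sketch of the kind of check being asked for (the exact examples from #6929 are not reproduced here, and the data with a large constant offset is only assumed to trigger the instability): compute rolling kurtosis and compare each window against a direct scipy computation. With an unstable implementation this difference blows up; a stable one should keep it tiny.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(0)
arr = np.random.rand(500) + 1e6   # hypothetical pathological input: big offset
window = 20

result = pd.Series(arr).rolling(window).kurt()   # pd.rolling_kurt(...) in the 0.17-era API
expected = np.array([
    stats.kurtosis(arr[i - window + 1:i + 1], fisher=True, bias=False)
    for i in range(window - 1, len(arr))
])

# Maximum deviation from the window-by-window reference values.
print(np.max(np.abs(result.values[window - 1:] - expected)))
```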

@jreback (Contributor) commented Aug 21, 2015

can you add a release note in v0.17.0 (maybe a separate mini-section), to show what has changed? (e.g. you can use the same examples that show the stability problem).

@jreback modified the milestones: 0.17.0, Next Major Release on Aug 21, 2015
@jreback (Contributor) commented Sep 1, 2015

@jaimefrio can you update

@jreback modified the milestones: Next Major Release, 0.17.0 on Sep 2, 2015
@jreback (Contributor) commented Oct 18, 2015

can you add a release note to 0.17.1

@jreback (Contributor) commented Nov 18, 2015

@jaimefrio can you rebase/update?

@jaimefrio force-pushed the stable-roll-skew-kurt branch from 015d509 to d68300a on November 20, 2015 03:11
@jreback (Contributor) commented Jan 2, 2016

closing, but pls reopen when you can update

@jreback closed this on Jan 2, 2016
@jorisvandenbossche modified the milestones: No action, Next Major Release on Jul 11, 2016
Successfully merging this pull request may close the following issue:

PERF: Poor numerical stability of rolling_kurt and rolling_skew (#6929)