BUG: low variance arrays' kurtosis wrongfully set to zero #58176
Conversation
…is_low_value_cutoff
pandas/_libs/window/aggregations.pyx
Outdated
@@ -712,7 +712,8 @@ cdef float64_t calc_kurt(int64_t minp, int64_t nobs,
 # if the variance is less than 1e-14, it could be
 # treat as zero, here we follow the original
 # skew/kurt behaviour to check B <= 1e-14
-if B <= 1e-14:
+# #57972: for non-zero but low variance arrays the cutoff can be lowered
+if B <= 1e-281:
What is this constant? The minimum number of significant decimal digits that have guaranteed precision for a double is 15, which I assume is where 1e-14 came from.
It is the variance of the observations. The e-14 cutoff is too conservative and also sets numerically stable results to NaN, e.g. when the mean of the observations is very low.
I did some more testing after your comment and setting the cutoff as low as in nanops (e-281) prevents the false positives, but it also lets numerically unstable results pass, so I reverted it. Schemes that take into account the mean etc. of the observations in the cutoff were also not really satisfactory.
I'll look into this and make another PR if I find a satisfactory solution for the equation in the .pyx here. Numerically it behaves very differently from the one in nanops.
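For context, here is a rough plain-Python paraphrase of the guarded computation in calc_kurt (my own sketch for illustration, not a verbatim copy of the Cython code, which keeps x, xx, xxx, xxxx as running window sums):

```python
import numpy as np

def calc_kurt_like(window, cutoff=1e-14):
    # Approximate, non-rolling paraphrase: build central moments from raw
    # power sums, then apply the bias-corrected excess kurtosis formula.
    n = len(window)
    if n < 4:
        return np.nan  # kurtosis needs at least 4 observations
    s1, s2, s3, s4 = window.sum(), (window**2).sum(), (window**3).sum(), (window**4).sum()
    A = s1 / n
    B = s2 / n - A * A                            # biased variance, E[x^2] - E[x]^2
    C = s3 / n - A**3 - 3 * A * B                 # third central moment
    D = s4 / n - A**4 - 6 * B * A**2 - 4 * C * A  # fourth central moment
    if B <= cutoff:                               # the guard this PR changes
        return np.nan
    K = (n * n - 1.0) * D / (B * B) - 3.0 * (n - 1.0) ** 2
    return K / ((n - 2.0) * (n - 3.0))
```

Both the subtraction inside B and the division by B * B are sensitive to the scale of the inputs, which is why a single absolute cutoff on B does not cleanly separate stable from unstable cases.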
Since double precision is only guaranteed up to 15 significant decimal digits across implementations, choosing anything smaller than 1e-14 is not going to work.
The e-14 might make one think that it is about a float's number of significant digits, but B <= e-14 only checks for potential numerical instabilities irrespective of that.
E.g.,

```python
n = 1_000_0
scale = 1e12
data = np.array([2.3000001*scale, 2.3*scale]*n)
```

is numerically unstable, though the variance is larger than e-14 (rolling kurt = e15). On the other hand,

```python
n = 1_000_0
scale = 1e-15
data = np.array([2.4*scale, 2.3*scale]*n)
```

is numerically stable, but the variance is smaller than e-14 (rolling kurt = 2.5).
In nanops the equations become unstable only around e-281 in my tests, but here it's more complex.
Example code:
```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Large scale: the variance is far above e-14, yet the rolling result is unstable.
n = 1_000_0
scale = 1e12
data = np.array([2.3000001*scale, 2.3*scale]*n)
pdkurt = pd.Series(data).kurt()
scipykurt = st.kurtosis(data, bias=False)
print(pdkurt)
print(scipykurt)
print(pd.Series(data).rolling(10).kurt())

# Small scale: the variance is far below e-14, yet the result is stable.
n = 1_000_0
scale = 1e-15
data = np.array([2.4*scale, 2.3*scale]*n)
pdkurt = pd.Series(data).kurt()
scipykurt = st.kurtosis(data, bias=False)
print(pdkurt)
print(scipykurt)
print(pd.Series(data).rolling(10).kurt())
```
Is there a whitepaper that lays out what you are trying to accomplish? The problem with your local results is that they depend on your hardware, and floating point implementations can vary.
I don't believe that a statement like # scipy.kurt is nan at e-81 is generally True (NaN can be generated from quite a few different patterns, although technically platforms should be choosing one canonical pattern), and the e-72 and e-281 sentinels seem arbitrary.
I also agree adjusting this arbitrary limit is not ideal. IMO ideally we shouldn't have one in the first place. We removed a similar arbitrary limit for var and std a few releases ago in #40505.
I would support just removing this limit and documenting the floating point precision artifacts instead.
Good points @WillAyd, @mroeschke.
I was setting the limit(s) to better reflect the actual stability range of the calculations, but the limit is arbitrary due to dependencies on inputs as well as machine platforms.
For the nanops version I would agree with removing the check, since it was introduced before the equation there was stabilised. From my tests the kurt calculation there is about as stable as the scipy implementation and only becomes unstable for very extreme cases.
From my tests, the kurt implementation in aggregations.pyx here seems comparatively much more unstable. The equations involved have a lot of potential cancellations. I would suggest first stabilising the equations there; that's something I could do.
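As a self-contained illustration of the kind of cancellation meant here (my own sketch, not code from the PR), the effect already shows up in the variance term B when it is computed from power sums instead of centered data:

```python
import numpy as np

data = np.array([2.3000001e12, 2.3e12] * 10_000)

# Raw-moment form, as in the rolling code: E[x^2] - E[x]^2 subtracts two
# numbers of magnitude ~5.3e24 to recover a true value of ~2.5e9, so nearly
# all significant digits cancel.
A = data.mean()
B_raw = (data**2).mean() - A * A

# Centered form: subtracting the mean first keeps the intermediate values
# small, so the squaring and averaging stay accurate.
d = data - A
B_centered = (d * d).mean()

print(B_raw)       # can be off by a large factor, or even come out negative
print(B_centered)  # close to the true variance of ~2.5e9
```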
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
This fixes an issue with low variance arrays' kurtosis. It was previously set to zero due to a cutoff that was chosen too conservatively.