BUG: low variance arrays' kurtosis wrongfully set to zero #58176
Conversation
…is_low_value_cutoff
pandas/_libs/window/aggregations.pyx
Outdated
@@ -712,7 +712,8 @@ cdef float64_t calc_kurt(int64_t minp, int64_t nobs,
 # if the variance is less than 1e-14, it could be
 # treat as zero, here we follow the original
 # skew/kurt behaviour to check B <= 1e-14
-if B <= 1e-14:
+# #57972: for non-zero but low variance arrays the cutoff can be lowered
+if B <= 1e-281:
What is this constant? The minimum number of significant decimal digits that have guaranteed precision for a double is 15, which I assume is where 1e-14 came from.
It is the variance of the observations. The e-14 cutoff is too conservative and also sets numerically stable results to NaN, e.g. when the mean of the observations is very low.
I did some more testing after your comment and setting the cutoff as low as in nanops (e-281) prevents the false positives, but it also lets numerically unstable results pass, so I reverted it. Schemes that take into account the mean etc. of the observations in the cutoff were also not really satisfactory.
I'll look into this and make another PR if I find a satisfactory solution for the equation in the .pyx here. Numerically it behaves very differently from the one in nanops.
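For context, here is a rough plain-Python paraphrase of the guarded computation in calc_kurt (my own sketch for illustration, not a verbatim copy of the Cython code, which keeps x, xx, xxx, xxxx as running window sums):

```python
import numpy as np

def calc_kurt_like(window, cutoff=1e-14):
    # Approximate, non-rolling paraphrase: build central moments from raw
    # power sums, then apply the bias-corrected excess kurtosis formula.
    n = len(window)
    if n < 4:
        return np.nan  # kurtosis needs at least 4 observations
    s1, s2, s3, s4 = window.sum(), (window**2).sum(), (window**3).sum(), (window**4).sum()
    A = s1 / n
    B = s2 / n - A * A                            # biased variance, E[x^2] - E[x]^2
    C = s3 / n - A**3 - 3 * A * B                 # third central moment
    D = s4 / n - A**4 - 6 * B * A**2 - 4 * C * A  # fourth central moment
    if B <= cutoff:                               # the guard this PR changes
        return np.nan
    K = (n * n - 1.0) * D / (B * B) - 3.0 * (n - 1.0) ** 2
    return K / ((n - 2.0) * (n - 3.0))
```

Both the subtraction inside B and the division by B * B are sensitive to the scale of the inputs, which is why a single absolute cutoff on B does not cleanly separate stable from unstable cases.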
Since double precision is only guaranteed up to 15 significant decimal digits across implementations, choosing anything smaller than 1e-14 is not going to work.
The e-14 might make one think that it is about a float's number of significant digits, but B <= e-14 only checks for potential numerical instabilities irrespective of that.
E.g.,

```python
n = 1_000_0
scale = 1e12
data = np.array([2.3000001*scale, 2.3*scale]*n)
```

is numerically unstable, though the variance is larger than e-14 (rolling kurt = e15). On the other hand,

```python
n = 1_000_0
scale = 1e-15
data = np.array([2.4*scale, 2.3*scale]*n)
```

is numerically stable, but the variance is smaller than e-14 (rolling kurt = 2.5).
In nanops the equations become unstable only around e-281 in my tests, but here it's more complex.
Example code:
```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Large scale: the variance is far above e-14, yet the rolling result is unstable.
n = 1_000_0
scale = 1e12
data = np.array([2.3000001*scale, 2.3*scale]*n)
pdkurt = pd.Series(data).kurt()
scipykurt = st.kurtosis(data, bias=False)
print(pdkurt)
print(scipykurt)
print(pd.Series(data).rolling(10).kurt())

# Small scale: the variance is far below e-14, yet the result is stable.
n = 1_000_0
scale = 1e-15
data = np.array([2.4*scale, 2.3*scale]*n)
pdkurt = pd.Series(data).kurt()
scipykurt = st.kurtosis(data, bias=False)
print(pdkurt)
print(scipykurt)
print(pd.Series(data).rolling(10).kurt())
```
Is there a whitepaper that lays out what you are trying to accomplish? The problem with your local results is that they depend on your hardware, and floating point implementations can vary.
I don't believe that a statement like # scipy.kurt is nan at e-81 is generally True (NaN can be generated from quite a few different patterns, although technically platforms should be choosing one canonical pattern), and the e-72 and e-281 sentinels seem arbitrary.
I also agree adjusting this arbitrary limit is not ideal. IMO ideally we shouldn't have one in the first place. We removed a similar arbitrary limit for var and std a few releases ago in #40505.
I would support just removing this limit and documenting the floating point precision artifacts instead.
Good points @WillAyd, @mroeschke.
I was setting the limit(s) to better reflect the actual stability range of the calculations, but the limit is arbitrary due to dependencies on inputs as well as machine platforms.
For the nanops version I would agree with removing the check, since it was introduced before the equation there was stabilised. From my tests the kurt calculation there is about as stable as the scipy implementation and only becomes unstable for very extreme cases.
From my tests, the kurt implementation in aggregations.pyx here seems comparatively much more unstable. The equations involved have a lot of potential cancellations. I would suggest first stabilising the equations there; that's something I could do.
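As a self-contained illustration of the kind of cancellation meant here (my own sketch, not code from the PR), the effect already shows up in the variance term B when it is computed from power sums instead of centered data:

```python
import numpy as np

data = np.array([2.3000001e12, 2.3e12] * 10_000)

# Raw-moment form, as in the rolling code: E[x^2] - E[x]^2 subtracts two
# numbers of magnitude ~5.3e24 to recover a true value of ~2.5e9, so nearly
# all significant digits cancel.
A = data.mean()
B_raw = (data**2).mean() - A * A

# Centered form: subtracting the mean first keeps the intermediate values
# small, so the squaring and averaging stay accurate.
d = data - A
B_centered = (d * d).mean()

print(B_raw)       # can be off by a large factor, or even come out negative
print(B_centered)  # close to the true variance of ~2.5e9
```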
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
This fixes an issue with low variance arrays' kurtosis. It was previously set to zero due to a cutoff that was chosen too conservatively.