-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rolling skew and rolling kurt has wrong answer under all eq values #18044
Comments
@ogotaiking : Thanks for reporting this! For reference, do you mind posting the expected results too? |
@gfyoung The skew of a uniform distribution is zero. The kurtosis of the uniform distribution could be -6/5, depending on how you define it. While @ogotaiking says the kurtosis is calculated correctly in the non-rolling case, it's returning zero for me, so that might not be what's desired. |
The issue would be variance numerically unstable in "calc_skew" , the answer will divide by B , but the code only check B<=0, during rolling, add_skew/remove_skew may cause B in range(0,1e-9), so after that uniform distribution will get such large number in skew/kurt. #6817 from jaimefrio/rolling_var-welford resolve the roll_var stability , but seems it will also import to skew/kurt. # Rolling skewness
cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
double xxx) nogil:
cdef double result, dnobs
cdef double A, B, C, R
if nobs >= minp:
dnobs = <double>nobs
A = x / dnobs
B = xx / dnobs - A * A
C = xxx / dnobs - A * A * A - 3 * A * B
if B <= 0 or nobs < 3: <-- B's precision has some issue cause uniform distribution not be ZERO
result = NaN
else:
R = sqrt(B)
result = ((sqrt(dnobs * (dnobs - 1.)) * C) /
((dnobs - 2) * R * R * R)) <-- here will divde 1e-N
else:
result = NaN
return result |
in pandas/_libs/window.pyx xx += val * val
xx -= prev* prev but during rolling, it will cause numerically unstable in xx, |
@ogotaiking : Cool! Feel free to propose changes in a PR and add a test or two to confirm. We can look at it in more detail there. |
just did more test, the previous fix does not work in some particular numbers, and check core/nanops.py, during calculate m2 and m3 , it has the following function to validate the float issue. _zero_out_fperr(arg):
if isinstance(arg, np.ndarray):
with np.errstate(invalid='ignore'):
return np.where(np.abs(arg) < 1e-14, 0, arg)
else:
return arg.dtype.type(0) if np.abs(arg) < 1e-14 else arg but in window.pyx it's a cython code , i am not familiar with it to modify or do such validation. I just did some simple change diff --git a/pandas/_libs/window.pyx b/pandas/_libs/window.pyx
index b6bd6f9..64a1a40 100644
--- a/pandas/_libs/window.pyx
+++ b/pandas/_libs/window.pyx
@@ -788,7 +788,7 @@ cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
A = x / dnobs
B = xx / dnobs - A * A
C = xxx / dnobs - A * A * A - 3 * A * B
- if B <= 0 or nobs < 3:
+ if B <= 1e-14 or nobs < 3:
result = NaN
else:
R = sqrt(B)
@@ -915,7 +915,7 @@ cdef inline double calc_kurt(int64_t minp, int64_t nobs, double x, double xx,
R = R * A
D = xxxx / dnobs - R - 6 * B * A * A - 4 * C * A
- if B == 0 or nobs < 4:
+ if B <= 1e-14 or nobs < 4:
result = NaN
else:
K = (dnobs * dnobs - 1.) * D / (B * B) - 3 * ((dnobs - 1.) ** 2) and I did more test on random numbers with uniform distribution c=0.0
for i in range(0,10000):
a = pd.Series([np.random.rand(1)]*30)
b=max(a.rolling(10).skew().fillna(0))
if b > c :
c = b the result looks good , after try 10000 random numbers , c still eq zero. tc=0.0
for i in range(0,10000):
test=pd.Series(np.random.rand(500))
aa = test.rolling(10).skew().fillna(0)
bb = test.rolling(10).apply(lambda x: pd.Series(x).skew()).fillna(0)
cc = max(aa - bb)
if cc > tc:
tc = cc
tc the max diff between rolling_skew and nanops.nanskew is : 1.0452083643031074e-11 |
I encountered this bug again. import numpy as np
import pandas as pd
from scipy import stats
array_size = 20
zeros = pd.Series(np.zeros(array_size))
window_size = 5
rolling = zeros.rolling(window=window_size)
results_pandas = rolling.aggregate(['kurt', 'skew'])
assert results_pandas.isna().all().all()
results_scipy = rolling.aggregate([stats.kurtosis, stats.skew])
assert results_scipy.isna().sum().eq(window_size - 1).all() System information:I am using Python 3.8.9, Pandas 1.1.4, SciPy 1.4.1, and NumPy 1.20.2. |
Code Sample, a copy-pastable example if possible
Problem description
rolling skew() wrong result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 -1.056765e+08
10 -1.056765e+08
11 -1.056765e+08
12 -1.056765e+08
13 -1.056765e+08
14 -1.056765e+08
dtype: float64
rolling kurt() wrong result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 1.201335e+16
10 1.201335e+16
11 1.201335e+16
12 1.201335e+16
13 1.201335e+16
14 1.201335e+16
dtype: float64
But try pd.Series([1.1]*15).tail(10).skew() and pd.Series([1.1]*15).tail(10).kurt() could get the right result.
and also a reference workaround by using apply to call scipy:
seems it has bug in rolling_skew/rolling_kurt
Expected Output
Output of
pd.show_versions()
[paste the output of
pd.show_versions()
here below this line]INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-93-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.24
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.2.0-b1
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.5.0
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: