Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rolling skew and rolling kurt has wrong answer under all eq values #18044

Closed
zartbot opened this issue Oct 31, 2017 · 9 comments · Fixed by #18065
Closed

rolling skew and rolling kurt has wrong answer under all eq values #18044

zartbot opened this issue Oct 31, 2017 · 9 comments · Fixed by #18065
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@zartbot
Copy link
Contributor

zartbot commented Oct 31, 2017

Code Sample, a copy-pastable example if possible

pd.Series([1.1]*15).rolling(10).skew()
pd.Series([1.1]*15).rolling(10).kurt()

Problem description

rolling skew() wrong result:

pd.Series([1.1]*15).rolling(10).skew()

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 -1.056765e+08
10 -1.056765e+08
11 -1.056765e+08
12 -1.056765e+08
13 -1.056765e+08
14 -1.056765e+08
dtype: float64

rolling kurt() wrong result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 1.201335e+16
10 1.201335e+16
11 1.201335e+16
12 1.201335e+16
13 1.201335e+16
14 1.201335e+16
dtype: float64

But try pd.Series([1.1]*15).tail(10).skew() and pd.Series([1.1]*15).tail(10).kurt() could get the right result.
and also a reference workaround by using apply to call scipy:

pd.Series([1.1]*15).rolling(10).apply(lambda x: scipy.stats.skew(x))
pd.Series([1.1]*15).rolling(10).apply(lambda x: scipy.stats.kurt(x))

seems it has bug in rolling_skew/rolling_kurt

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-93-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.24
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.2.0-b1
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.5.0
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Copy link
Member

gfyoung commented Oct 31, 2017

@ogotaiking : Thanks for reporting this! For reference, do you mind posting the expected results too?

@selik
Copy link
Contributor

selik commented Oct 31, 2017

@gfyoung The skew of a uniform distribution is zero. The kurtosis of the uniform distribution could be -6/5, depending on how you define it. While @ogotaiking says the kurtosis is calculated correctly in the non-rolling case, it's returning zero for me, so that might not be what's desired.

@zartbot
Copy link
Contributor Author

zartbot commented Nov 1, 2017

The issue would be variance numerically unstable in "calc_skew" , the answer will divide by B , but the code only check B<=0, during rolling, add_skew/remove_skew may cause B in range(0,1e-9), so after that uniform distribution will get such large number in skew/kurt.

#6817 from jaimefrio/rolling_var-welford resolve the roll_var stability , but seems it will also import to skew/kurt.

# Rolling skewness

cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
                             double xxx) nogil:
    cdef double result, dnobs
    cdef double A, B, C, R

    if nobs >= minp:
        dnobs = <double>nobs
        A = x / dnobs
        B = xx / dnobs - A * A
        C = xxx / dnobs - A * A * A - 3 * A * B
        if B <= 0 or nobs < 3: <-- B's precision has some issue cause uniform distribution not be ZERO
            result = NaN
        else:
            R = sqrt(B)
            result = ((sqrt(dnobs * (dnobs - 1.)) * C) /
                      ((dnobs - 2) * R * R * R))  <-- here will divde 1e-N
    else:
        result = NaN

    return result

@zartbot
Copy link
Contributor Author

zartbot commented Nov 1, 2017

in pandas/_libs/window.pyx
RCA(Root Cause Analysis):
for uniform distribution, all values are eq.
function roll_skew will continues call add_skew / remove_skew.

xx  += val * val
xx -= prev* prev

but during rolling, it will cause numerically unstable in xx,
previously suggest is a bad fix

@gfyoung
Copy link
Member

gfyoung commented Nov 1, 2017

@ogotaiking : Cool! Feel free to propose changes in a PR and add a test or two to confirm. We can look at it in more detail there.

@zartbot
Copy link
Contributor Author

zartbot commented Nov 1, 2017

@gfyoung

just did more test, the previous fix does not work in some particular numbers, and check core/nanops.py,

during calculate m2 and m3 , it has the following function to validate the float issue.

_zero_out_fperr(arg):
    if isinstance(arg, np.ndarray):
        with np.errstate(invalid='ignore'):
            return np.where(np.abs(arg) < 1e-14, 0, arg)
    else:
        return arg.dtype.type(0) if np.abs(arg) < 1e-14 else arg

but in window.pyx it's a cython code , i am not familiar with it to modify or do such validation.

I just did some simple change

diff --git a/pandas/_libs/window.pyx b/pandas/_libs/window.pyx
index b6bd6f9..64a1a40 100644
--- a/pandas/_libs/window.pyx
+++ b/pandas/_libs/window.pyx
@@ -788,7 +788,7 @@ cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
         A = x / dnobs
         B = xx / dnobs - A * A
         C = xxx / dnobs - A * A * A - 3 * A * B
-        if B <= 0 or nobs < 3:
+        if B <= 1e-14 or nobs < 3:
             result = NaN
         else:
             R = sqrt(B)
@@ -915,7 +915,7 @@ cdef inline double calc_kurt(int64_t minp, int64_t nobs, double x, double xx,
         R = R * A
         D = xxxx / dnobs - R - 6 * B * A * A - 4 * C * A
 
-        if B == 0 or nobs < 4:
+        if B <= 1e-14 or nobs < 4:
             result = NaN
         else:
             K = (dnobs * dnobs - 1.) * D / (B * B) - 3 * ((dnobs - 1.) ** 2)

and I did more test on random numbers with uniform distribution

c=0.0
for i in range(0,10000):
    a = pd.Series([np.random.rand(1)]*30)
    b=max(a.rolling(10).skew().fillna(0))
    if b > c :
        c = b

the result looks good , after try 10000 random numbers , c still eq zero.
and also check random distribution skew:

tc=0.0
for i in range(0,10000):
    test=pd.Series(np.random.rand(500))
    aa = test.rolling(10).skew().fillna(0)
    bb = test.rolling(10).apply(lambda x: pd.Series(x).skew()).fillna(0)
    cc = max(aa - bb)
    if cc > tc:
        tc = cc
tc

the max diff between rolling_skew and nanops.nanskew is : 1.0452083643031074e-11

@enricogandini
Copy link

I encountered this bug again.
When applying the Pandas kurt and skew to a rolling window of a Series with all identical items (in this case, all items are zero),
the results are Series containing all NaN values. When applying the kurtosis and skew functions from scipy.stats, I get the expected results: Series with NaN values only in the first window_size - 1 elements.

import numpy as np
import pandas as pd
from scipy import stats

array_size = 20
zeros = pd.Series(np.zeros(array_size))

window_size = 5
rolling = zeros.rolling(window=window_size)

results_pandas = rolling.aggregate(['kurt', 'skew'])
assert results_pandas.isna().all().all()

results_scipy = rolling.aggregate([stats.kurtosis, stats.skew])
assert results_scipy.isna().sum().eq(window_size - 1).all()

System information:

I am using Python 3.8.9, Pandas 1.1.4, SciPy 1.4.1, and NumPy 1.20.2.

@alex7088
Copy link

alex7088 commented Jun 20, 2022

I have same proble. Rolling and tail skew() gives different result.
image

@baileymeta
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants