Possible overflow errors with pd.rolling(...).std() #28688

Closed
fjanoos opened this issue Sep 30, 2019 · 1 comment
Labels
Bug Window rolling, ewma, expanding


fjanoos commented Sep 30, 2019

I'm seeing a strange bug in pandas' rolling.std that so far shows up with only one specific dataset.

The setup:

  • I have a dataframe with one column stored as float32
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high    2186 non-null float32
dtypes: float32(1)
memory usage: 25.6 KB
ddf.head()

image

Here is a plot of the full time series (left column) along with the last 200 rows (right column):
image
Note the scale of the numbers: they range from about 1e8 down to about 1e1.

Next I compute the rolling standard deviations using rolling means, as per:

rs = np.sqrt((ddf.high ** 2).rolling(10).mean()  - (ddf.high.rolling(10).mean() ) ** 2)

This is what the rolling std computed this way looks like (it matches what I would expect):
image
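
For reference, the moment formula above yields the population standard deviation (ddof=0), whereas rolling(...).std() defaults to the sample standard deviation (ddof=1), so the two differ by a constant factor even when nothing is numerically wrong. A minimal sketch on synthetic, well-conditioned data (the series here is an assumption, not the data from this issue):

```python
import numpy as np
import pandas as pd

# Synthetic, well-conditioned data (an assumption; not the data from this issue).
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(100.0, 5.0, size=200))
window = 10

# Moment formula from the post: sqrt(E[x^2] - E[x]^2).
var_moments = (s ** 2).rolling(window).mean() - s.rolling(window).mean() ** 2
# Clip guards against tiny negative values caused by round-off.
std_moments = np.sqrt(var_moments.clip(lower=0))

std_pop = s.rolling(window).std(ddof=0)   # population std
std_sample = s.rolling(window).std()      # pandas default, ddof=1

# Constant factor relating the sample and population estimates.
bessel = np.sqrt(window / (window - 1))
```

With a window of 10 the factor is sqrt(10/9), roughly a 5% difference, which is small next to the corruption shown in the plots but worth accounting for when comparing the two curves.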

But if I use

rs = ddf.rolling(10).high.std()

this is what I get:
image
Something has been corrupted, as the last 200 rows (right panel) show.

However, if I rescale the data so the numbers sit in a smaller range, compute the rolling std, and then scale the result back up:

ddf = ddf.assign( high_rescaled=ddf.high / 1e8 )
rs = ddf.rolling(10).high_rescaled.std() * 1e8

this is what I get:
image
which matches the output computed using rolling means!
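
The rescaling trick relies on the standard deviation being scale-equivariant: std(x / c) * c equals std(x) up to rounding. A small sketch on a single hypothetical 10-value window (the values are made up for illustration) compares the 1e8 factor used above with a power-of-two factor. Dividing by a power of two changes only the exponent, so it adds no rounding error unless values underflow:

```python
import numpy as np

# One hypothetical 10-value window spanning several orders of magnitude.
x = np.array([9.1e7, 4.2e7, 1.3e7, 5.5e6, 8.0e5, 6.1e4, 7.7e3, 9.0e2, 8.5e1, 1.2e1])

direct = np.std(x, ddof=1)

# Scale factor used in the post; the division rounds each value slightly.
via_1e8 = np.std(x / 1e8, ddof=1) * 1e8

# Power-of-two factor of similar magnitude; the division is exact.
via_pow2 = np.std(x / 2.0**27, ddof=1) * 2.0**27
```

Since all three agree mathematically, a large difference between rescaled and unrescaled rolling results points at the rolling computation rather than at the data.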

Note that the original data was in np.float32 format, so I suspected an overflow or precision problem (which really should not happen, since the window is only 10 long).
To test this, I converted the data to float64:

ddf.high = ddf.high.astype(np.float64)
display( ddf.info() )
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high    2186 non-null float64
dtypes: float64(1)
memory usage: 34.2 KB

In this case, even applying rolling.std() to the rescaled version of the data (which worked for float32) is broken!

image
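
A window of 10 does not necessarily mean each result depends on only 10 values. To my understanding, pandas computes rolling variance with an online update that adds and removes one value per step, so round-off from earlier, much larger values can carry into later windows. Whether that is the cause here is not confirmed in this thread. A sketch of a history-free reference that recomputes every window from scratch (the helper name and the synthetic series are assumptions):

```python
import numpy as np
import pandas as pd

def rolling_std_reference(series, window, ddof=1):
    """Recompute the std from scratch for each window (slow, but history-free)."""
    return series.rolling(window).apply(np.std, raw=True, kwargs={"ddof": ddof})

# Hypothetical series decaying from about 1e8 to about 1e1, like the plot above.
n = 300
s = pd.Series(np.logspace(8, 1, n) * (1 + 0.1 * np.sin(np.arange(n))))

ref = rolling_std_reference(s, 10)
builtin = s.rolling(10).std()

# Put both side by side to inspect the tail, where the issue shows up.
comparison = pd.DataFrame({"builtin": builtin, "reference": ref}).tail(20)
print(comparison)
```

If the built-in result departs from the reference only after the large values have passed through the window, that would support the accumulated round-off explanation.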

Minimum Reproducible Example

Generating an MRE is the hard part.
The bug seems to depend on the numerics of this particular dataset, and I cannot reproduce it with random data (since I don't know which feature of the numbers causes it).

If I save the data as a pickle file (using df.to_pickle) and load it back in, I can reproduce these results exactly.

However, if I save it as a CSV file (for sharing here) and load it back in, the results get even worse: after the round trip they look bad in all cases. This suggests that something about the exact values in the dataset is triggering numerical problems in rolling.std.
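
One difference between the two round trips is that pickle preserves the column dtype while CSV stores only text, so read_csv infers the dtype again on load. The sketch below uses a small hypothetical float32 column and an in-memory buffer in place of the real file. It shows the inferred dtype and how to request float32 explicitly. Whether this accounts for the different results is an assumption, not something established in this thread:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical float32 column standing in for `high`.
df = pd.DataFrame({"high": np.array([1.5e8, 3.25e5, 12.75], dtype=np.float32)})

# Write to an in-memory CSV instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)

buf.seek(0)
default_read = pd.read_csv(buf)  # dtype is inferred from the text

buf.seek(0)
typed_read = pd.read_csv(buf, dtype={"high": np.float32})  # dtype requested explicitly

print(default_read["high"].dtype, typed_read["high"].dtype)
```

Passing dtype to read_csv keeps the CSV-loaded data closer to the pickled data when comparing the two runs.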

rs.zip

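from pylab import *  # supplies np, figure, subplot, plot, gcf


import pandas as pd

# Load the attached CSV (rs.zip); adjust the path as needed
ddf = pd.read_csv( '/path/to/rs.csv')

# The column keeps whatever dtype read_csv inferred; no conversion is applied here
ddf.info()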


figure()
subplot(121)
plot( ddf.high, '-r' )
subplot(122)
plot( ddf.high.tail(200), '-r' )
gcf().suptitle( 'original data' )

figure()
subplot(121)
rs = np.sqrt( (ddf.high ** 2).rolling(10).mean()  - (ddf.high.rolling(10).mean() ) ** 2 )
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed using rolling means on original data' )

figure()
subplot(121)
rs = ddf.rolling(10).high.std()
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed on original data' )

figure()
subplot(121)
ddf = ddf.assign( high_rescaled=ddf.high / 1e8 )
rs = ddf.rolling(10).high_rescaled.std() * 1e8
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed on rescaled data' )

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-9-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.2.3
setuptools: 41.2.0
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: 0.13.0
xarray: 0.13.0+6.g4617e68b
IPython: 7.8.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.4.1
bs4: 4.8.0
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.3.0

@jbrockmendel jbrockmendel added the Window rolling, ewma, expanding label Oct 16, 2019
@mroeschke mroeschke added the Bug label May 8, 2020
@mroeschke
Member

This may have been fixed by #37055.

If not, happy to reopen, but we would need a more slimmed down example.
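
One way to slim the example down might be to locate the first position where rolling(...).std() departs from a per-window recomputation, then share only the rows up to that point. The helper below is a hypothetical sketch (its name, tolerance, and test series are assumptions), shown on well-conditioned random data where no mismatch is expected:

```python
import numpy as np
import pandas as pd

def first_mismatch(series, window, rtol=1e-6):
    """Return the first index label where rolling().std() departs from
    a per-window np.std recomputation, or None if they agree everywhere."""
    builtin = series.rolling(window).std().values
    ref = series.rolling(window).apply(np.std, raw=True, kwargs={"ddof": 1}).values
    # equal_nan=True treats the leading NaNs (incomplete windows) as matching.
    bad = ~np.isclose(builtin, ref, rtol=rtol, atol=0.0, equal_nan=True)
    return series.index[bad][0] if bad.any() else None

# Well-conditioned synthetic data; not the data from this issue.
s = pd.Series(np.random.default_rng(1).normal(50.0, 2.0, size=100))
print(first_mismatch(s, 10))
```

Running the same helper on the pickled data would give a cut-off row for a smaller reproducer, assuming the discrepancy still appears on the truncated series.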
