Possible overflow errors with pd.rolling(...).std() #28688

fjanoos · 2019-09-30T15:13:45Z

I'm seeing a very strange bug with one specific dataset when I use pandas' rolling.std function on it.

The setup: I have a DataFrame with one column stored as float32.

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high    2186 non-null float32
dtypes: float32(1)
memory usage: 25.6 KB

ddf.head()

Here is a plot of the full timeseries (left column) along with a tail of 200 rows (right column)

Note the scale of the numbers here: they range from 1e8 down to 1e1.

Next I compute the rolling standard deviations using rolling means, as per:

rs = np.sqrt((ddf.high ** 2).rolling(10).mean()  - (ddf.high.rolling(10).mean() ) ** 2)
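
One caveat when comparing this against the built-in: the expression above is the population form (it divides by N), while `rolling(...).std()` defaults to `ddof=1`. For a window of 10 the two differ by a factor of sqrt(10/9), which is hard to see on a plot. A minimal sketch on made-up data (the series `s` is hypothetical, not the data from this report), showing that the two agree once `ddof=0` is passed:

```python
import numpy as np
import pandas as pd

# Hypothetical small-magnitude data, chosen so that rounding is not an issue.
s = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0])
window = 3

# Population std from rolling means: sqrt(E[x^2] - E[x]^2)
var_from_means = (s ** 2).rolling(window).mean() - s.rolling(window).mean() ** 2
# clip guards against tiny negative values caused by rounding
std_from_means = np.sqrt(var_from_means.clip(lower=0))

# Built-in rolling std with ddof=0 (population) and the default ddof=1 (sample)
std_pop = s.rolling(window).std(ddof=0)
std_sample = s.rolling(window).std()

# The two ddof variants differ by the constant factor sqrt(N / (N - 1))
ratio = (std_sample / std_pop).dropna()
```

Passing `ddof=0` to the built-in makes a numerical comparison against the rolling-means formula like-for-like.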

This is what the rolling std computed this way looks like (it matches what I would expect):

But if I use

rs = ddf.rolling(10).high.std()

this is what I get:

Something has gotten corrupted, as we can see in the tail of 200 rows on the right.

However, if I rescale the data so the numbers sit in a smaller range, compute the rolling std, and scale it back up:

ddf = ddf.assign( high_rescaled=ddf.high / 1e8 )
rs = ddf.rolling(10).high_rescaled.std() * 1e8

This is what I get

which matches the output computed using rolling means !
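
A way to check which of these results to trust is to recompute every window independently, so that no rounding error can carry over from earlier windows. A minimal sketch, assuming NumPy >= 1.20 for `sliding_window_view`; the helper name `windowed_std` and the sample values are hypothetical and only mimic the large-then-small shape of the data described above:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def windowed_std(series: pd.Series, window: int, ddof: int = 1) -> pd.Series:
    """Rolling std recomputed from scratch for every window (no running sums)."""
    values = series.to_numpy(dtype=np.float64)
    out = np.full(len(values), np.nan)
    if len(values) >= window:
        windows = sliding_window_view(values, window)
        out[window - 1:] = windows.std(axis=1, ddof=ddof)
    return pd.Series(out, index=series.index, name=series.name)

# Hypothetical data: large values followed by small ones, stored as float32
s = pd.Series([1e8, 2e8, 3e8, 10.0, 20.0, 30.0, 40.0, 50.0], dtype=np.float32)

reference = windowed_std(s, window=3)
builtin = s.rolling(3).std()

# Largest absolute disagreement between the built-in and the recomputation
max_abs_diff = (builtin - reference).abs().max()
```

No particular value of `max_abs_diff` is claimed here; it depends on the pandas version in use.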

Note that the original data was in np.float32 format, so I thought this bug might be caused by some overflow issue (which it really should not be, since the window is only 10 long!).
So I converted the data to float64 to test this:

ddf.high = ddf.high.astype(np.float64)
display( ddf.info() )

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high    2186 non-null float64
dtypes: float64(1)
memory usage: 34.2 KB

In this case, even applying rolling.std() to the rescaled version of the data (which worked for float32) gives broken results!
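
On the overflow hypothesis: values around 1e8 are far below the float32 maximum, even after squaring, so a literal overflow looks unlikely. What float32 does lose at that magnitude is resolution. The sketch below uses only NumPy to show both points; it does not by itself explain the float64 result above.

```python
import numpy as np

big = np.float32(1e8)

# Range: squaring 1e8 stays well inside float32
f32_max = np.finfo(np.float32).max
squared = big * big  # about 1e16, still finite in float32

# Resolution: distance from 1e8 to the next representable float32
gap = np.spacing(big)

# Adding something smaller than half the gap is lost entirely
unchanged = (big + np.float32(1.0)) == big
```

Here `gap` is 8.0, so any contribution below about 4 vanishes when added to a float32 value near 1e8.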

Minimum Reproducible Example

Regarding generating an MRE - here is the rub.
The bug seems to be a function of the numerics specific to this dataset - and I cannot reproduce it using random data (since I don't know what feature of the numerics is causing this).

Now, if I save the data as a pickle file (using df.to_pickle) and load it back in - I can reproduce these results exactly.

However, if I save it as a csv file (for sharing here) and load it back in, I get a whole new level of badness. The results look really bad for all cases after this round trip. This seems to indicate that something about the exact numbers in the dataset is triggering numerical problems in rolling.std.
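
One thing that does change in a CSV round trip is the dtype: `read_csv` parses a floating-point column as float64 unless told otherwise, whereas a pickle keeps float32. A minimal sketch with made-up values (the column name `high` follows this report; the data is hypothetical), passing `dtype` to get float32 back:

```python
import io

import numpy as np
import pandas as pd

original = pd.DataFrame(
    {"high": np.array([1e8, 12345.678, 10.5], dtype=np.float32)}
)

# Write to an in-memory CSV instead of a file on disk
buf = io.StringIO()
original.to_csv(buf, index=False)

# Default parse: the column comes back as float64
as_default = pd.read_csv(io.StringIO(buf.getvalue()))

# Explicit dtype: the column comes back as float32
as_float32 = pd.read_csv(io.StringIO(buf.getvalue()), dtype={"high": np.float32})
```

Whether this dtype change accounts for the different results after the round trip is not established here; it is only one difference worth ruling out.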

rs.zip

from pylab import *  # provides figure, subplot, plot, gcf

import numpy as np
import pandas as pd

ddf = pd.read_csv( '/path/to/rs.csv')

# info() prints the summary itself
ddf.info()


figure()
subplot(121)
plot( ddf.high, '-r' )
subplot(122)
plot( ddf.high.tail(200), '-r' )
gcf().suptitle( 'original data' )

figure()
subplot(121)
rs = np.sqrt( (ddf.high ** 2).rolling(10).mean()  - (ddf.high.rolling(10).mean() ) ** 2 )
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed using rolling means on original data' )

figure()
subplot(121)
rs = ddf.rolling(10).high.std()
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed on original data' )

figure()
subplot(121)
ddf = ddf.assign( high_rescaled=ddf.high / 1e8 )
rs = ddf.rolling(10).high_rescaled.std() * 1e8
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed on rescaled data' )

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-9-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.2.3
setuptools: 41.2.0
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: 0.13.0
xarray: 0.13.0+6.g4617e68b
IPython: 7.8.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.4.1
bs4: 4.8.0
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.3.0

@jreback


mroeschke · 2020-10-18T00:18:02Z

This may have been fixed by #37055.

If not, happy to reopen, but we would need a more slimmed down example.
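
One possible way to slim the data down is to locate the window where the built-in result deviates most from a from-scratch recomputation, and share only that slice plus some preceding rows (earlier rows may matter if the error is carried over from previous windows). A rough sketch; the helper name `worst_window`, its `context` parameter, and the stand-in data are all hypothetical:

```python
import numpy as np
import pandas as pd

def worst_window(series: pd.Series, window: int, context: int = 50) -> pd.Series:
    """Return the slice ending at the row where rolling std deviates most
    from a per-window recomputation, plus `context` preceding rows."""
    builtin = series.rolling(window).std()
    # raw=True passes plain ndarrays; each window is recomputed independently
    reference = series.rolling(window).apply(np.std, raw=True, kwargs={"ddof": 1})
    diff = (builtin - reference).abs()
    if diff.isna().all():
        return series.iloc[0:0]
    pos = int(np.nanargmax(diff.to_numpy()))
    start = max(0, pos - window + 1 - context)
    return series.iloc[start:pos + 1]

# Hypothetical stand-in for the real column: large values, then small ones
s = pd.Series(np.r_[np.full(20, 1e8), np.arange(1.0, 31.0)])
snippet = worst_window(s, window=10, context=5)
print(snippet.tolist())  # paste-able literal for a bug report
```

The printed list can be pasted into `pd.Series([...])` so that no file attachment is needed.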

jbrockmendel added the Window rolling, ewma, expanding label Oct 16, 2019

mroeschke added the Bug label May 8, 2020

mroeschke closed this as completed Oct 18, 2020
