I'm seeing a very odd bug with one specific dataset when I use pandas' rolling.std function on it.
The setup is as follows.
I have a dataframe with one column stored in float32 format:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high 2186 non-null float32
dtypes: float32(1)
memory usage: 25.6 KB
ddf.head()
Here is a plot of the full time series (left column) along with the last 200 rows (right column).
Note the scale of the numbers here: they range from 1e8 down to 1e1.
Next I compute the rolling standard deviation using rolling means, as per:
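For reference, a rolling standard deviation built from rolling means can be sketched as below. This is an assumption about the approach, not the exact code used here: the helper name `rolling_std_from_means` and its `ddof` parameter are hypothetical, and the sketch relies on the identity var = E[x^2] - E[x]^2 with a sample correction, evaluated in float64.

```python
import numpy as np
import pandas as pd


def rolling_std_from_means(s: pd.Series, window: int, ddof: int = 1) -> pd.Series:
    """Rolling std from rolling means (hypothetical helper, not a pandas API)."""
    # Work in float64 so that squaring values around 1e8 keeps enough digits.
    x = s.astype("float64")
    mean = x.rolling(window).mean()
    mean_sq = (x ** 2).rolling(window).mean()
    # Population variance, rescaled to the sample variance (ddof correction).
    var = (mean_sq - mean ** 2) * window / (window - ddof)
    # Rounding can push tiny variances slightly below zero; clip before the sqrt.
    return np.sqrt(var.clip(lower=0.0))
```

Note that this formula can itself lose precision when the mean is large relative to the spread, so it is best treated as a cross-check rather than a replacement.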
This is what the rolling std computed this way looks like (and it matches what I would expect):
But if I use rolling.std() directly, this is what I get:
Something has gotten corrupted, as we can see in the tail of 200 rows on the right.
However, if I rescale the data so the numbers sit in a smaller range, compute the rolling std and scale the result back up,
this is what I get,
which matches the output computed using rolling means!
Note that the original data was in np.float32 format, so I thought this bug might be caused by some overflow issue (which it really should not be, since the window is only 10 long!).
So I converted the data to float64 to test this:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high 2186 non-null float64
dtypes: float64(1)
memory usage: 34.2 KB
In this case, even applying rolling.std() to the rescaled version of the data (which worked for float32) is broken!
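To compare the variants side by side, something like the sketch below may help. The function names `rescaled_rolling_std` and `compare_rolling_std` are hypothetical, and dividing by the largest absolute value is only one possible choice of rescaling (the factor actually used is not stated here). The `reference` column recomputes each window independently with NumPy, so it should not carry any state over from earlier, much larger values.

```python
import numpy as np
import pandas as pd


def rescaled_rolling_std(s: pd.Series, window: int) -> pd.Series:
    """Scale down, take the rolling std, scale back up (std(x / c) * c == std(x))."""
    scale = float(s.abs().max())
    if not np.isfinite(scale) or scale == 0.0:
        scale = 1.0  # nothing sensible to rescale by
    return (s / scale).rolling(window).std() * scale


def compare_rolling_std(s: pd.Series, window: int = 10) -> pd.DataFrame:
    """Collect the rolling std computed several ways for inspection."""
    x64 = s.astype("float64")
    # Each window is evaluated from scratch, independent of earlier windows.
    reference = x64.rolling(window).apply(lambda w: np.std(w, ddof=1), raw=True)
    return pd.DataFrame({
        "reference": reference,
        "builtin": s.rolling(window).std(),
        "builtin_float64": x64.rolling(window).std(),
        "rescaled": rescaled_rolling_std(s, window),
        "rescaled_float64": rescaled_rolling_std(x64, window),
    })
```

With the dataframe described above, `compare_rolling_std(ddf["high"]).tail(200)` would show which columns drift away from `reference` in the low-valued tail.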
Minimum Reproducible Example
Regarding generating an MRE, here is the rub.
The bug seems to depend on the numerics specific to this dataset, and I cannot reproduce it using random data (since I don't know which feature of the numerics is causing it).
If I save the data as a pickle file (using df.to_pickle) and load it back in, I can reproduce these results exactly.
However, if I save it as a csv file (for sharing here) and load it back in, the results get even worse: after this round trip they look really bad in all cases. This seems to indicate that something about the exact numbers in the dataset is triggering numerical problems in rolling.std.
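One difference between the two round trips that is easy to check is the dtype: pickle preserves float32, while `read_csv` parses floats as float64 by default, and the values also pass through a decimal text representation. The sketch below is a hypothetical helper (`csv_round_trip_report` is not a pandas function) that reports what changes; it does not by itself explain the rolling.std results.

```python
import io

import numpy as np
import pandas as pd


def csv_round_trip_report(df: pd.DataFrame, column: str) -> dict:
    """Write df to CSV in memory, read it back, and summarise what changed."""
    buf = io.StringIO()
    df.to_csv(buf)
    buf.seek(0)
    back = pd.read_csv(buf, index_col=0, parse_dates=True)
    before = df[column].to_numpy(dtype="float64")
    after = back[column].to_numpy(dtype="float64")
    return {
        "dtype_before": str(df[column].dtype),
        "dtype_after": str(back[column].dtype),
        # Largest absolute change introduced by the text round trip.
        "max_abs_diff": float(np.nanmax(np.abs(before - after))),
    }
```

If the dtype turns out to matter, passing `dtype={"high": "float32"}` to `read_csv` should restore the original storage type for a closer comparison with the pickle.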
rs.zip
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-9-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: None
pip: 19.2.3
setuptools: 41.2.0
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: 0.13.0
xarray: 0.13.0+6.g4617e68b
IPython: 7.8.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.4.1
bs4: 4.8.0
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.3.0