DOC/PERF: Decide how to handle floating point artifacts during rolling calculations #37051
Comments
Would like to try working on that if possible. |
take |
@ukarroum Not really, my PR fixes problems with large numbers but not the problem mentioned above |
Oh my bad. Gonna retake it then. Thanks |
take |
It looks like (from PR #37055) that using Kahan summation doesn't solve the issue. |
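For reference, a minimal sketch of Kahan (compensated) summation, the technique tried in #37055; this standalone helper is illustrative only and is not pandas' actual implementation, which applies the compensation inside its Cython rolling aggregations:

```python
def kahan_sum(values):
    """Compensated summation: recover the low-order bits that plain
    floating point addition would otherwise discard."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - comp
        t = total + y
        # (t - total) is what was actually added; subtracting y leaves
        # the rounding error, which is fed back into the next iteration
        comp = (t - total) - y
        total = t
    return total
```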
To summarize the current situation: theoretically, our implementation is stable for small numbers, but it is not stable for cases like the one above.

The following explains why: we are using Welford's method (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) with Kahan summation. In the third add pass through we have the following values: the current ssqdm is …; ssqdm is …. The next pass through removes the following: …. Theoretically this should lead to the following result at pandas/pandas/_libs/window/aggregations.pyx, line 290 in 125441c: … |
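For readers following along, a plain-Python sketch of the Welford-style add/remove updates described above; the actual implementation is Cython in pandas/_libs/window/aggregations.pyx and additionally carries Kahan compensation terms, which are omitted here for brevity:

```python
def add_var(val, nobs, mean_x, ssqdm_x):
    """Welford update: fold one incoming value into the running mean
    and the sum of squared differences from the mean (ssqdm)."""
    nobs += 1
    delta = val - mean_x
    mean_x += delta / nobs
    ssqdm_x += (nobs - 1) * delta ** 2 / nobs
    return nobs, mean_x, ssqdm_x


def remove_var(val, nobs, mean_x, ssqdm_x):
    """Inverse Welford update: drop a value leaving the window. This
    subtraction step is where tiny floating point artifacts can be
    carried forward, as discussed above."""
    nobs -= 1
    if nobs:
        delta = val - mean_x
        mean_x -= delta / nobs
        ssqdm_x -= (nobs + 1) * delta ** 2 / nobs
    else:
        mean_x = 0.0
        ssqdm_x = 0.0
    return nobs, mean_x, ssqdm_x
```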
Thanks for the clear explanation @phofl. Since these floating point artifacts are unavoidable, we can either:

1. keep the artificial threshold check and document the behavior, or
2. remove the check and document that users are responsible for handling floating point artifacts in the results themselves. |
First one? Don't know. Both have their disadvantages unfortunately... |
Yeah, I can see that. I am also entertaining the second option, pushing the responsibility of handling floating point artifacts to the user (in the final result, though unfortunately not during the rolling calculation). |
It is pick your poison. In the case of the second alternative, we would have to adjust the docstrings at pandas/pandas/core/window/rolling.py, line 641 in 7f2a768, and pandas/pandas/core/window/rolling.py, line 701 in 7f2a768. My example was based on those; the doctests would fail otherwise. |
Crossposting from #39872 (comment), as I'm not sure that issue is followed any longer.

We are encountering a problem while calculating mean (and std) on top of crypto asset prices, which can be very low numbers (around 1e-7). The release logs for pandas 1.2 mention this change, but there is no mention of this side effect on low-value numbers. I don't think the example below should be impacted, as the expected results are around 1e-9, nowhere near the mentioned threshold of 1e-15.

A very simple example:

```python
import pandas as pd

print(pd.__version__)

df = pd.DataFrame(data={'data': [
    0.00000054,
    0.00000053,
    0.00000054,
]})
df['mean'] = df['data'].rolling(2).mean()
df['std'] = df['data'].rolling(2).std()
print(df)
```

With pandas < 1.2.0, the return is as follows: …

while 1.2.0 returns: …

The values are nowhere near the mentioned threshold of 1e-15. |
The relevant result is the variance, which is used to calculate the std. The variance is on the order of 1e-17, meaning the threshold is met.

Edit: This is also explained here: #39872 (comment) |
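To make the arithmetic concrete, here is the calculation for one two-value window, using the prices from the example above (for two values, the sample variance is simply half the squared difference):

```python
a, b = 0.00000054, 0.00000053
var = (a - b) ** 2 / 2   # ~5.0e-17, below the 1e-15 cutoff, so it gets zeroed
std = var ** 0.5         # ~7.07e-09, the rolling std that pandas < 1.2.0 returned
print(var, std)
```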
You're right, it's the variance that's this low (I did miss that part). However, the relevant part from a user perspective is the end result, which is the std in this case; the final error I receive from pandas is 5e-07, not the variance, even though the intermediate result is wrong by only 1e-17. I do still see this as a regression/bug in pandas, as the version update from 1.1.5 to 1.2.0 broke the result of a calculation that was correct beforehand. |
A hard threshold definitely seems like a bug. It seems that it has to be the case that … |
Currently we have a check that artificially handles a numerical precision issue in rolling.var and rolling.std, where our rolling variance calculation is carrying forward floating point artifacts (pandas/pandas/_libs/window/aggregations.pyx, line 305 in 601eff1). Ideally we should be using a more numerically stable algorithm (maybe Kahan summation) so this check isn't so arbitrary.
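A rough Python sketch of the shape of that check (the real code is Cython; the structure and the constant below follow the description in this thread but are illustrative, not copied from the source):

```python
import numpy as np

def calc_var(minp, ddof, nobs, ssqdm_x):
    """Finalize a rolling variance from the running statistics.
    A sketch of the check discussed above, not the exact source."""
    if nobs >= minp and nobs > ddof:
        if nobs == 1:  # pathological single-observation window
            return 0.0
        result = ssqdm_x / (nobs - ddof)
        # the hard cutoff in question: tiny results, assumed to be
        # floating point artifacts, are clamped to exactly zero
        if result < 1e-15:
            result = 0.0
        return result
    return np.nan
```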