
DOC/PERF: Decide how to handle floating point artifacts during rolling calculations #37051

Closed
mroeschke opened this issue Oct 11, 2020 · 16 comments · Fixed by #40505
Labels: Docs, Needs Discussion, Window (rolling, ewma, expanding)
@mroeschke (Member)

Currently we have a check here that artificially handles a numerical precision issue in rolling.var and rolling.std where our rolling variance calculation is carrying forward floating point artifacts. Ideally we should be using a more numerically stable algorithm (maybe Kahan summation) so this check isn't so arbitrary.

if result < 1e-15:
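
For reference, this is the kind of compensated (Kahan) summation being suggested. A generic sketch for illustration only, not the pandas implementation:

def kahan_sum(values):
    total = 0.0
    comp = 0.0                    # running compensation for lost low-order bits
    for x in values:
        y = x - comp              # apply the correction from the previous step
        t = total + y             # low-order bits of y are lost here...
        comp = (t - total) - y    # ...and recovered into comp
        total = t
    return total

vals = [1.0] + [1e-16] * 1_000_000
print(sum(vals))        # 1.0 -- every tiny term is rounded away
print(kahan_sum(vals))  # ~1.0000000001 -- close to the true sum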

@mroeschke added the Performance and Window labels on Oct 11, 2020
@ukarroum (Contributor)

Would like to try working on that if possible.

@ukarroum (Contributor)

take

@ukarroum (Contributor)

@phofl: It looks like you have a working PR: #37055, so I'm going to unassign myself.

@ukarroum removed their assignment on Oct 12, 2020
@phofl (Member) commented Oct 12, 2020

@ukarroum Not really, my PR fixes problems with large numbers but not the problem mentioned above

@ukarroum (Contributor)

Oh my bad.

Gonna retake it then.

Thanks

@ukarroum (Contributor)

take

@ukarroum (Contributor)

It looks like (from PR #37055) using Kahan summation doesn't solve the issue.
I couldn't find another way, so I'm just going to unassign myself.

@ukarroum removed their assignment on Oct 25, 2020
@jreback modified the milestones: 1.2 → Contributions Welcome on Nov 19, 2020
@mroeschke changed the title from "ENH: Implement a more numerically stable algorithm for rolling var" to "ENH: Implement a more numerically stable algorithm for rolling var for small values" on Jan 2, 2021
@mroeschke modified the milestones: Contributions Welcome → 1.2.1 on Jan 2, 2021
@phofl (Member) commented Jan 4, 2021

To summarize the current situation:

Theoretically our implementation is stable for small numbers.

Our implementation is not stable for cases like:

s = pd.Series([7, 5, 5, 5])
print(s.rolling(3).var())

The following explains why:

We are using Welford's method (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) with Kahan summation. In the third add pass we have the following values (the ssqdm going in is 2.0):

prev_mean = 6.0
new_mean = 5 + 2/3
val = 5.0
ssqdm = 2 + 2/3

so ssqdm is now 2.666666666666667.

The next pass removes the leading 7:

prev_mean = 5.666666666666667
new_mean = 5.0
val = 7.0
ssqdm = 2.666666666666667 - 2 - 2/3

Theoretically this should lead to 0, but because of floating point artifacts it comes out as 8.881784197001252e-16. So without the line quoted in the OP we would not return 0 here; that is why the check is needed.
The implementation can be found at

cdef inline void add_var(float64_t val, float64_t *nobs, float64_t *mean_x,

result:

0         NaN
1         NaN
2    1.333333
3    0.000000
dtype: float64
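
A pure-Python sketch of the add/remove steps traced above (simplified: no NaN handling and no Kahan compensation, so not the actual pandas code) reproduces the residual:

def add_var(val, nobs, mean_x, ssqdm_x):
    nobs += 1
    delta = val - mean_x
    mean_x += delta / nobs
    ssqdm_x += ((nobs - 1) * delta ** 2) / nobs
    return nobs, mean_x, ssqdm_x

def remove_var(val, nobs, mean_x, ssqdm_x):
    nobs -= 1
    delta = val - mean_x
    mean_x -= delta / nobs
    ssqdm_x -= ((nobs + 1) * delta ** 2) / nobs
    return nobs, mean_x, ssqdm_x

nobs, mean_x, ssqdm_x = 0, 0.0, 0.0
for val in (7.0, 5.0, 5.0):                  # first window [7, 5, 5]
    nobs, mean_x, ssqdm_x = add_var(val, nobs, mean_x, ssqdm_x)
print(ssqdm_x / (nobs - 1))                  # 1.3333333333333333

# slide to [5, 5, 5]: remove the leading 7, then add the trailing 5
nobs, mean_x, ssqdm_x = remove_var(7.0, nobs, mean_x, ssqdm_x)
nobs, mean_x, ssqdm_x = add_var(5.0, nobs, mean_x, ssqdm_x)
print(ssqdm_x)                               # ~8.881784197001252e-16 on a typical IEEE-754 setup, not 0.0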

@jreback modified the milestones: 1.2.1 → 1.3 on Jan 4, 2021
@mroeschke (Member, Author)

Thanks for the clear explanation @phofl.

Since these floating point artifacts are unavoidable, we can either:

  1. Just document in user_guide/window.rst that we round values less than 1e-15 to 0 due to floating point artifacts.
  2. Actually remove our artificial if result < 1e-15: check, let floating point artifacts be a part of our implementation, and document that.

@phofl (Member) commented Jan 4, 2021

First one? Don't know. Both have their disadvantages unfortunately...

@mroeschke (Member, Author)

Yeah, I can see that.

I am also entertaining the second option: pushing the responsibility of handling floating point artifacts to the user (in the final result, though unfortunately not during the rolling calculation itself).
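
To illustrate, a hypothetical user-side workaround if the internal cutoff were removed (not an agreed API, just post-processing of the final result):

import pandas as pd

s = pd.Series([7, 5, 5, 5])
var = s.rolling(3).var()
# a user who wants exact zeros would clip the floating point residue themselves
cleaned = var.mask(var.abs() < 1e-15, 0.0)
print(cleaned)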

@mroeschke removed the Performance label on Jan 4, 2021
@mroeschke added the Docs and Needs Discussion labels on Jan 4, 2021
@mroeschke changed the title from "ENH: Implement a more numerically stable algorithm for rolling var for small values" to "DOC/PERF: Decide how to handle floating point artifacts during rolling calculations" on Jan 4, 2021
@phofl (Member) commented Jan 4, 2021

It is pick your poison. In case of the second alternative, we would have to adjust the docstrings that currently contain

>>> s = pd.Series([5, 5, 6, 7, 5, 5, 5])

My example was based on that; this would cause the doctests to fail otherwise.

@xmatthias commented Feb 25, 2021

Crossposting from #39872 (comment) as I'm not sure that issue is followed any longer.

We are encountering a problem while calculating the mean (and std) of crypto asset prices, which can be very small numbers (around 1e-7).

The release notes for pandas 1.2 mention this change, but there is no mention of this side effect for small values.

I don't think the example below should be impacted by this, as the expected results are around 1e-9, so nowhere near the mentioned threshold of 1e-15.

A very simple example:

import pandas as pd

print(pd.__version__)

df = pd.DataFrame(data = {'data':
    [
        0.00000054,
        0.00000053,
        0.00000054,
     ]}
    )

df['mean'] = df['data'].rolling(2).mean()
df['std'] = df['data'].rolling(2).std()
print(df)

with pandas < 1.2.0, the return is as follows:

1.1.5
           data          mean           std
0  5.400000e-07           NaN           NaN
1  5.300000e-07  5.350000e-07  7.071068e-09
2  5.400000e-07  5.350000e-07  7.071068e-09

while 1.2.0 returns:

1.2.0
           data          mean  std
0  5.400000e-07           NaN  NaN
1  5.300000e-07  5.350000e-07  0.0
2  5.400000e-07  5.350000e-07  0.0

The values are nowhere near the mentioned threshold of 1e-15.

@phofl (Member) commented Feb 25, 2021

The relevant result is the variance, which is used to calculate the std. The variance here is on the order of 1e-17, so the threshold kicks in.

Edit: This is also explained here: #39872 (comment)
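
To make that concrete with the data from the report above (printed values are approximate):

import pandas as pd

df = pd.DataFrame({'data': [0.00000054, 0.00000053, 0.00000054]})

# std is derived from the rolling variance; the true variance of each
# two-value window is 2 * (5e-9) ** 2 / 1 = 5e-17, below the 1e-15 cutoff
print(df['data'].rolling(2).var())
# ~5e-17 on pandas 1.1.x (no cutoff); 0.0 on 1.2.0, so the std becomes 0.0 as well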

@xmatthias commented Feb 25, 2021

You're right, it's the variance that is that low (I missed that part). However, the relevant part from a user perspective is the end result, which is std in this case, so the final error I receive from pandas is 5e-07, not the variance, even though the intermediate result is off by only 1e-17.

I do still see this as a regression / bug in pandas, as the version update from 1.1.5 to 1.2.0 broke the result of a calculation that was correct beforehand.

@bashtage (Contributor)

> I do still see this as a regression / bug in pandas, as the version update from 1.1.5 to 1.2.0 broke the result of a calculation that was correct beforehand.

A hard threshold definitely seems like a bug. It seems that it has to be the case that df.rolling(3).var() is the same as (10**10) ** 2 * (df / 10**10).rolling(3).var() up to some rounding. The threshold should be relative to the previous value, I would think (or no threshold at all, which is what NumPy does).
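
A quick way to see the scale dependence, reusing the data from the report above (a sketch; printed values are approximate):

import numpy as np
import pandas as pd

s = pd.Series([0.00000054, 0.00000053, 0.00000054])

# under the absolute 1e-15 cutoff the small-scale variance is forced to 0
print(s.rolling(2).var())
# scaling the data up and rescaling the result keeps the ~5e-17 variance
print((s * 10**10).rolling(2).var() / (10**10) ** 2)
# NumPy applies no cutoff at all
print(np.var([0.00000054, 0.00000053], ddof=1))   # ~5e-17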
