Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Use Kahan summation and Welfords method to calculate rolling var and std #37055

Merged
merged 10 commits into from
Oct 12, 2020

Conversation

phofl
Copy link
Member

@phofl phofl commented Oct 11, 2020

As suggested by @mroeschke Kahan summation fixes the numerical problems. Additionally I used Welfords Method to calculate ssqdm, because previously the tests I have added would return

0             NaN
1    3.500000e+34
2    3.000000e+34
3    3.000000e+34
4    3.000000e+34
5    3.000000e+34
6    3.000000e+34
7    3.000000e+34
8    3.000000e+34
9    3.000000e+34
dtype: float64

for var(). I am running the asv and will post the results when available

@phofl phofl added the Window rolling, ewma, expanding label Oct 11, 2020
@jreback jreback added the Performance Memory or execution speed performance label Oct 11, 2020
@jreback
Copy link
Contributor

jreback commented Oct 11, 2020

great thanks @phofl

@pep8speaks
Copy link

pep8speaks commented Oct 11, 2020

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-12 19:04:44 UTC

@phofl
Copy link
Member Author

phofl commented Oct 11, 2020

Unfortunately I chose a bad example, which did miss one bug. Fixed it now and added the corresponding test.

@jreback
Copy link
Contributor

jreback commented Oct 11, 2020

xref #6817

i guess this was there a long time ago, but i dont' think had enough tests to lock it down.

@jreback
Copy link
Contributor

jreback commented Oct 11, 2020

and maybe some examples from here: #6929 (though that's obviously a separate issue)

@phofl
Copy link
Member Author

phofl commented Oct 11, 2020

Yes planned to Look into this in the future. Maybe we can improve this in a similar way.

Delta**2 was the problem with the modified version. Switching to regular welford fixes this

@phofl
Copy link
Member Author

phofl commented Oct 11, 2020

Looks like this will only help with large numbers.

@phofl
Copy link
Member Author

phofl commented Oct 11, 2020

Interestingly

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@jreback jreback added this to the 1.2 milestone Oct 12, 2020
@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

can you merge master, ping on green

� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/tests/window/test_rolling.py
@phofl
Copy link
Member Author

phofl commented Oct 12, 2020

@jreback green

@@ -353,7 +362,8 @@ def roll_var(ndarray[float64_t] values, ndarray[int64_t] start,
Numerically stable implementation using Welford's method.
"""
cdef:
float64_t mean_x = 0, ssqdm_x = 0, nobs = 0,
float64_t mean_x = 0, ssqdm_x = 0, nobs = 0, compensation_add = 0,
float64_t compensation_remove = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there a line in this func that you need to remove

eg the < 1e-14

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below

@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

@phofl this PR doesn't the close issue? can u show an example of when

@phofl
Copy link
Member Author

phofl commented Oct 12, 2020

Yeah thought so too initially, because I was not able to construct a counter example. But our docstrings do the job:
If we remove the line:

s = pd.Series([5, 5, 6, 7, 5, 5, 5])
print(s.rolling(3).var())

Returns

0             NaN
1             NaN
2    3.333333e-01
3    1.000000e+00
4    1.000000e+00
5    1.333333e+00
6    6.661338e-16
dtype: float64

Seems like Kahan summation and Welfords method only help for large numbers. Issues with numbers like 1/3 like here can't be fixed with that. Rounding issues after multiplication cause there problems.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

Yeah thought so too initially, because I was not able to construct a counter example. But our docstrings do the job:
If we remove the line:

s = pd.Series([5, 5, 6, 7, 5, 5, 5])
print(s.rolling(3).var())

Returns

0             NaN
1             NaN
2    3.333333e-01
3    1.000000e+00
4    1.000000e+00
5    1.333333e+00
6    6.661338e-16
dtype: float64

Seems like Kahan summation and Welfords method only help for large numbers. Issues with numbers like 1/3 like here can't be fixed with that. Rounding issues after multiplication cause there problems.

ok its prob worth adding an xfail test for that one. (followon ok)

@@ -192,6 +192,7 @@ Other enhancements
- Added methods :meth:`IntegerArray.prod`, :meth:`IntegerArray.min`, and :meth:`IntegerArray.max` (:issue:`33790`)
- Where possible :meth:`RangeIndex.difference` and :meth:`RangeIndex.symmetric_difference` will return :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`36564`)
- Added :meth:`Rolling.sem()` and :meth:`Expanding.sem()` to compute the standard error of mean (:issue:`26476`).
- :meth:`Rolling.var()` and :meth:`Rolling.std()` use Kahan summation and Welfords Method to avoid numerical issues (:issue:`37051`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not fully true, but its better so ok.

@jreback jreback merged commit 15d818b into pandas-dev:master Oct 12, 2020
@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

thanks @phofl

@phofl phofl deleted the 37051 branch October 14, 2020 20:03
@phofl
Copy link
Member Author

phofl commented Oct 14, 2020

@jreback with the line < 1e-14 this test would not fail. I could add a test which passes, but would fail, if somebody removes the line without fixing the underlying problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants