ENH: Use Kahan summation and Welfords method to calculate rolling var and std #37055

phofl · 2020-10-11T15:56:35Z

xref DOC/PERF: Decide how to handle floating point artifacts during rolling calculations #37051
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

As suggested by @mroeschke, Kahan summation fixes the numerical problems. Additionally, I used Welford's method to calculate ssqdm, because previously the tests I have added would return

0             NaN
1    3.500000e+34
2    3.000000e+34
3    3.000000e+34
4    3.000000e+34
5    3.000000e+34
6    3.000000e+34
7    3.000000e+34
8    3.000000e+34
9    3.000000e+34
dtype: float64

for var(). I am running the asv benchmarks and will post the results when available.
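For readers unfamiliar with the technique, here is a minimal pure-Python sketch of Kahan (compensated) summation. It is not the Cython code from this PR; the function names and the sample data are illustrative assumptions, chosen only to show how a running compensation term recovers low-order bits that a plain running sum drops.

```python
def naive_sum(values):
    """Plain left-to-right floating point sum."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Compensated sum: carry the rounding error of each addition forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation            # re-inject the previously lost part
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total


# Illustrative data (an assumption, not taken from the PR's tests):
# one large term followed by many terms too small to change it individually.
values = [1.0] + [1e-16] * 100_000

print(naive_sum(values))  # the small terms are absorbed one at a time
print(kahan_sum(values))  # close to the exact sum 1.0 + 1e-11
```

The compensation variable plays the same role as the `compensation_add` and `compensation_remove` variables that appear in the diff discussed later in this thread.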

jreback · 2020-10-11T16:37:58Z

great thanks @phofl

pep8speaks · 2020-10-11T17:03:58Z

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-12 19:04:44 UTC

phofl · 2020-10-11T17:06:37Z

Unfortunately I chose a bad example, which missed one bug. I have fixed it now and added the corresponding test.

jreback · 2020-10-11T17:18:45Z

xref #6817

I guess this was there a long time ago, but I don't think we had enough tests to lock it down.

jreback · 2020-10-11T17:20:09Z

and maybe some examples from here: #6929 (though that's obviously a separate issue)

phofl · 2020-10-11T17:24:31Z

Yes, I planned to look into this in the future. Maybe we can improve this in a similar way.

Delta**2 was the problem with the modified version. Switching to regular Welford fixes this.
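As a rough illustration of what "regular Welford" means for a rolling window, the sketch below updates the mean and the sum of squared differences from the mean (ssqdm) with the product `delta * (x - mean)` rather than a squared delta, both when a value enters the window and when one leaves it. This is a pure-Python approximation under stated assumptions: `rolling_var` is a hypothetical helper, it omits the Kahan compensation and the `< 1e-14` guard, and it is not the code in `pandas/_libs/window/aggregations.pyx`.

```python
import math


def rolling_var(values, window, ddof=1):
    """Rolling variance via Welford-style add/remove updates (sketch only)."""
    n = 0
    mean = 0.0
    ssqdm = 0.0  # sum of squared differences from the mean
    out = []
    for i, x in enumerate(values):
        # Add the incoming value.
        n += 1
        delta = x - mean
        mean += delta / n
        ssqdm += delta * (x - mean)

        # Remove the value that just left the window, if any.
        if i >= window:
            old = values[i - window]
            n -= 1
            if n > 0:
                delta = old - mean
                mean -= delta / n
                ssqdm -= delta * (old - mean)
            else:
                mean = 0.0
                ssqdm = 0.0

        # Emit NaN until the window is full, like the default min_periods.
        if n >= window and n > ddof:
            out.append(ssqdm / (n - ddof))
        else:
            out.append(math.nan)
    return out


result = rolling_var([5, 5, 6, 7, 5, 5, 5], window=3)
print(result)
```

On well-scaled inputs this agrees with the textbook variance up to rounding; the final window of identical values may leave a tiny residual instead of an exact zero.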

phofl · 2020-10-11T18:05:48Z

Looks like this will only help with large numbers.

phofl · 2020-10-11T23:43:15Z

Interestingly

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

jreback · 2020-10-12T15:30:37Z

can you merge master, ping on green


phofl · 2020-10-12T20:25:07Z

@jreback green

jreback · 2020-10-12T20:35:13Z

pandas/_libs/window/aggregations.pyx

@@ -353,7 +362,8 @@ def roll_var(ndarray[float64_t] values, ndarray[int64_t] start,
    Numerically stable implementation using Welford's method.
    """
    cdef:
-        float64_t mean_x = 0, ssqdm_x = 0, nobs = 0,
+        float64_t mean_x = 0, ssqdm_x = 0, nobs = 0, compensation_add = 0,
+        float64_t compensation_remove = 0,


Isn't there a line in this function that you need to remove, e.g. the < 1e-14 check?

jreback · 2020-10-12T20:38:58Z

@phofl this PR doesn't close the issue? can u show an example of when

phofl · 2020-10-12T20:46:05Z

Yeah thought so too initially, because I was not able to construct a counter example. But our docstrings do the job:
If we remove the line:

s = pd.Series([5, 5, 6, 7, 5, 5, 5])
print(s.rolling(3).var())

Returns

0             NaN
1             NaN
2    3.333333e-01
3    1.000000e+00
4    1.000000e+00
5    1.333333e+00
6    6.661338e-16
dtype: float64

Seems like Kahan summation and Welford's method only help for large numbers. Issues with numbers like 1/3, as here, can't be fixed with that. Rounding errors after multiplication cause these problems.
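To make the role of the `< 1e-14` line concrete, here is a small sketch of such a guard applied to the values printed above. It assumes the guard simply replaces any result below the threshold with zero; `clamp_small_variance` and its `tol` parameter are hypothetical names for illustration, not pandas API.

```python
def clamp_small_variance(result, tol=1e-14):
    """Treat results below tol as rounding noise and return exactly 0.0."""
    return 0.0 if result < tol else result


# Non-NaN values from the output shown above (without the guard).
reported = [3.333333e-01, 1.000000e+00, 1.000000e+00, 1.333333e+00, 6.661338e-16]
cleaned = [clamp_small_variance(v) for v in reported]
print(cleaned)  # the last entry becomes 0.0, the others are unchanged
```

The trade-off of a fixed threshold is that a genuinely tiny variance, for example one of order 1e-16 from data on a very small scale, is zeroed as well.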

jreback · 2020-10-12T22:52:48Z

OK, it's probably worth adding an xfail test for that one (a follow-on PR is fine).

jreback · 2020-10-12T22:53:25Z

doc/source/whatsnew/v1.2.0.rst

@@ -192,6 +192,7 @@ Other enhancements
 - Added methods :meth:`IntegerArray.prod`, :meth:`IntegerArray.min`, and :meth:`IntegerArray.max` (:issue:`33790`)
 - Where possible :meth:`RangeIndex.difference` and :meth:`RangeIndex.symmetric_difference` will return :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`36564`)
 - Added :meth:`Rolling.sem()` and :meth:`Expanding.sem()` to compute the standard error of mean (:issue:`26476`).
+- :meth:`Rolling.var()` and :meth:`Rolling.std()` use Kahan summation and Welfords Method to avoid numerical issues (:issue:`37051`)


This is not fully true, but it's better, so OK.

jreback · 2020-10-12T22:53:42Z

thanks @phofl

phofl · 2020-10-14T20:04:13Z

@jreback with the line < 1e-14 in place, this test would not fail. I could add a test which passes now but would fail if somebody removes the line without fixing the underlying problem?

phofl added 5 commits October 11, 2020 15:13

Implement bedfords algorithm

5066d54

Use kahan summation for var

9f83a4c

Add whatsnew and adjust comments to reflect new behavior

214d563

Add test

602443c

Delete old comment

a953f46

phofl added the Window rolling, ewma, expanding label Oct 11, 2020

Fix flake problems

3546935

jreback added the Performance Memory or execution speed performance label Oct 11, 2020

Adjust kahan summation and add new tests

d29e7c5

Run black

411ba68

Add old comment

95d652c

jreback added this to the 1.2 milestone Oct 12, 2020

ukarroum mentioned this pull request Oct 12, 2020

DOC/PERF: Decide how to handle floating point artifacts during rolling calculations #37051

Closed

Merge branch 'master' of https://github.com/pandas-dev/pandas into 37051

cfca750

Conflicts:
    doc/source/whatsnew/v1.2.0.rst
    pandas/tests/window/test_rolling.py

jreback requested changes Oct 12, 2020

jreback reviewed Oct 12, 2020

jreback approved these changes Oct 12, 2020

jreback merged commit 15d818b into pandas-dev:master Oct 12, 2020

phofl deleted the 37051 branch October 14, 2020 20:03

mroeschke mentioned this pull request Oct 18, 2020

Possible overflow errors with pd.rolling(...).std() #28688

Closed

simonjayhawkins mentioned this pull request Oct 25, 2020

test_rolling_var_numerical_issues on linux py_3.8_32 failing on MacPython.pandas-wheels #37398

Open

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020

ENH: Use Kahan summation and Welfords method to calculate rolling var…

f51cc99

… and std (pandas-dev#37055)

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

ENH: Use Kahan summation and Welfords method to calculate rolling var…

58ee2ed

… and std (pandas-dev#37055)

fangchenli mentioned this pull request Mar 17, 2021

COMPAT/BLD: rolling failed on Arm64 and ppc64le Linux #38921

Open

juanmpga mentioned this pull request Feb 18, 2022

BUG: Pandas rolling std precision error #46049

Closed
