Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bottleneck.move_std() produces nans, doesn't match pandas.rolling_std() #50

Closed
erg opened this issue Sep 4, 2012 · 21 comments
Closed

Comments

@erg
Copy link

erg commented Sep 4, 2012

https://gist.github.com/3624548

pandas-dev/pandas#1840

@kwgoodman
Copy link
Collaborator

I don't know what you mean by produces NaNs. Could you give a short example showing the input (x) and what you would like bottleneck to return?

@erg
Copy link
Author

erg commented Sep 4, 2012

Test.pyx has the x vector in the gist link with exact instructions to reproduce the results. The results should look like the factor version or the pandas version without the nans. Closed on accident...

@erg erg closed this as completed Sep 4, 2012
@erg erg reopened this Sep 4, 2012
@kwgoodman
Copy link
Collaborator

Too much work to go through your example. Perhaps you can make a brief example as below.

Are these the NaNs you don't like? (I can't change that behavior)

In [8]: a = [1,2,10,12]
In [9]: bn.move_std(a, 4)
Out[9]: array([        nan,         nan,         nan,  4.81534007])

Do you want to normalize by N:

In [10]: bn.move_std(a, 4)[-1]
Out[10]: 4.815340071064556
In [11]: np.std(a)
Out[11]: 4.815340071064556

Or N-1?

In [14]: bn.move_std(a, 4, ddof=1)[-1]
Out[14]: 5.5602757725374259
In [15]: np.std(a, ddof=1)
Out[15]: 5.5602757725374259

@erg
Copy link
Author

erg commented Sep 4, 2012

To reproduce it, just copy/paste this entire thing into the shell:

git clone git://gist.github.com/3624548.git && cd 3624548 && ipython2
import numpy as np
import bottleneck as bn

x = np.load('testx.npy')
bn.move_std(x,3)

@erg
Copy link
Author

erg commented Sep 4, 2012

The preceding nans are not what I'm talking about--there are nans in the middle of the data. However, pandas has a kwarg num_periods=1 to start calculating even from the start. It would be awesome if bottleneck supported this feature, too.

import pandas


In [8]: pandas.rolling_mean(np.array([1,2,3,4,5]),3)
Out[8]: array([ nan,  nan,   2.,   3.,   4.])

In [9]: pandas.rolling_mean(np.array([1,2,3,4,5]),3, min_periods=1)
Out[9]: array([ 1. ,  1.5,  2. ,  3. ,  4. ])

@erg
Copy link
Author

erg commented Sep 4, 2012

And finally, I'm aware of the sample standard deviation vs population standard deviation. I suppose that accounts for the difference in the non-nan values. Maybe the docs should say you're using the sample std?

Edit: I guess you're calculating the biased std, N=0. There's no description of ddof in the parameters list.

@kwgoodman
Copy link
Collaborator

Yes, that's a bug (good find!). The docstring should include the description of ddof, which we can copy from np.std:

ddof  int, optional
    Means Delta Degrees of Freedom.  The divisor used in calculations
    is ``N - ddof``, where ``N`` represents the number of elements.
    By default `ddof` is zero.

@kwgoodman
Copy link
Collaborator

I added ddof to the docstrings of move_std() and move_nanstd():

1abdfa7

@kwgoodman
Copy link
Collaborator

You might want to try bn.move_nanstd() instead of bn.move_std().

@erg
Copy link
Author

erg commented Sep 6, 2012

Same bug.

In [5]: bn.move_nanstd(x,3)
Out[5]: 
array([             nan,              nan,   1.55715950e-04,
         5.61672553e-04,   6.29558646e-04,   4.04292042e-04,
         4.04292042e-04,              nan,              nan,
         1.34956920e-04,   1.34956920e-04,   2.69682220e-04,
         2.69682220e-04,              nan,              nan,
         3.59747798e-04,   3.59747798e-04,   4.51234346e-05,
         6.10358168e-04,   6.31667620e-04,              nan,
         2.26028251e-04,   2.26028251e-04,   0.00000000e+00,
         2.71807335e-04,   5.10356704e-04,   2.96900396e-04,
         4.52446992e-05,   4.52229970e-05,   4.52229970e-05,
         3.61714576e-04,   3.61714576e-04,   9.06372853e-05,
         9.06372853e-05,              nan,   4.54059450e-04,
         4.54059450e-04,   2.72330746e-04,   2.72330746e-04,
                    nan,   1.81518876e-04,   1.81518876e-04,
                    nan,   1.81641276e-04,   1.81641276e-04,
         4.09126392e-04,   4.09126392e-04,   3.36062087e-11,
         2.27731653e-04,   2.27731653e-04,   4.55947887e-04,
         4.05285108e-04,   2.77122055e-04,   1.82255759e-04,
...

@kwgoodman
Copy link
Collaborator

If there are three nans in a row in x then what should the std be when the window is 3?

@erg
Copy link
Author

erg commented Sep 6, 2012

There are no nans in my x at all. But you need the full precision, hence loading it from the file as described above.

In [7]: print x
[  0.00000000e+00   1.90748689e-04  -1.90675932e-04   1.14481963e-03
   1.14481963e-03   2.87186699e-04   2.87186699e-04   2.87186699e-04
   2.87186699e-04   8.99839353e-07   8.99839353e-07   5.72982219e-04
   5.72982219e-04   5.72982219e-04   5.72982219e-04  -1.90158103e-04
  -1.90158103e-04  -9.44368435e-05  -1.43440622e-03  -1.43440622e-03
  -1.43440622e-03  -9.54927888e-04  -9.54927888e-04  -9.54927888e-04
  -3.78337459e-04   2.93963731e-04   1.97985231e-04   1.97985231e-04
   1.02052767e-04   1.02052767e-04  -6.65259721e-04  -6.65259721e-04
  -4.72989004e-04  -4.72989004e-04  -4.72989004e-04   4.90216544e-04
   4.90216544e-04   1.06791730e-03   1.06791730e-03   1.06791730e-03
   1.45297698e-03   1.45297698e-03   1.45297698e-03   1.83829631e-03
   1.83829631e-03   2.70618445e-03   2.70618445e-03   2.70618445e-03
   2.22309266e-03   2.22309266e-03   3.19030419e-03   2.90042111e-03
   2.51379826e-03   2.51379826e-03   3.86710904e-03   4.25324670e-03
   4.73573247e-03   4.73573247e-03   4.73573247e-03   4.73573247e-03
   4.73573247e-03   4.35022746e-03   4.35022746e-03   4.35022746e-03
   4.35022746e-03   3.96479674e-03   3.96479674e-03   3.96479674e-03
   3.96479674e-03   3.96479674e-03   3.96479674e-03   3.96479674e-03

@erg
Copy link
Author

erg commented Sep 6, 2012

If there were nans in the x input, I suppose I don't care what move_std returns at all. :)

@erg
Copy link
Author

erg commented Sep 6, 2012

It might be that you're taking the sqrt of a negative number.

From move_std.pyx:


            y[i0] = sqrt((a2sum - asum * asum / count) / (count - ddof))

@erg
Copy link
Author

erg commented Sep 6, 2012

Yes, it's taking the sqrt of negative numbers!

In [2]: import numpy as np
import bottleneck as bn

x = np.load('testx.npy')
bn.move_std(x,3)
0.00000024742859987833254792788557907545765601753374, 0.00086156009635718323012854025222395648597739636898, 3, 0
-0.00000000000000000000012352563814125214061742603683
0.00000024742859987833254792788557907545765601753374, 0.00086156009635718323012854025222395648597739636898, 3, 0
-0.00000000000000000000012352563814125214061742603683
0.00000098492587061058721337463046918703213350454462, 0.00171894665764582782810365735315372148761525750160, 3, 0
-0.00000000000000000000014117215787571671534142474463
0.00000098492587061058721337463046918703213350454462, 0.00171894665764582782810365735315372148761525750160, 3, 0
-0.00000000000000000000014117215787571671534142474463
0.00000617256357447203218380502726114755773778597359, -0.00430321864694510961002471560732374200597405433655, 3, 0
-0.00000000000000000000028234431575143343068284948926
0.00000067115579297346920246844980892375609471400821, -0.00141896701121640295140124976569495629519224166870, 3, 0
-0.00000000000000000000042351647362715016953416125034
0.00000342134204903411354143196246302416341222851770, 0.00320375188600839633235040082581690512597560882568, 3, 0
-0.00000000000000000000183523805238431727592863466376
0.00000633342630963094281159458942553897031757514924, 0.00435893093876157239341395666087919380515813827515, 3, 0
-0.00000000000000000000141172157875716729447356954499
Out[2]: 
array([             nan,              nan,   1.55715950e-04,
         5.61672553e-04,   6.29558646e-04,   4.04292042e-04,
         4.04292042e-04,              nan,              nan,
         1.34956920e-04,   1.34956920e-04,   2.69682220e-04,
         2.69682220e-04,              nan,              nan,
         3.59747798e-04,   3.59747798e-04,   4.51234346e-05,
         6.10358168e-04,   6.31667620e-04,              nan,
         2.26028251e-04,   2.26028251e-04,   0.00000000e+00,
         2.71807335e-04,   5.10356704e-04,   2.96900396e-04,
         4.52446992e-05,   4.52229970e-05,   4.52229970e-05,
         3.61714576e-04,   3.61714576e-04,   9.06372853e-05,
         9.06372853e-05,              nan,   4.54059450e-04,
         4.54059450e-04,   2.72330746e-04,   2.72330746e-04,
                    nan,   1.81518876e-04,   1.81518876e-04,
                    nan,   1.81641276e-04,   1.81641276e-04,
         4.09126392e-04,   4.09126392e-04,   3.36062087e-11,

Here's the print statements I added:

        if count == window:
            foo = (a2sum - asum * asum / count) / (count - ddof)
            if foo < 0.:
                print "%.50f, %.50f, %d, %d" % (a2sum, asum, count, ddof)
                print "%.50f" % foo
            y[i0] = sqrt(foo)
        else:
            y[i0] = NAN

@erg
Copy link
Author

erg commented Sep 6, 2012

Note that these negative numbers are also less than negative zero, or else sqrt could handle them.

@kwgoodman
Copy link
Collaborator

Yep, we found the same thing. We plan to set the output to 0 when a2sum - asum * asum / count is negative.

@erg
Copy link
Author

erg commented Sep 6, 2012

Awesome. What about adding a move_var and min_periods=1 functionality?

@erg
Copy link
Author

erg commented Sep 6, 2012

Pandas also has rolling_kurtosis and rolling_skew.

kwgoodman added a commit that referenced this issue Sep 6, 2012
BUG move_std and move_nanstd neg sqrt bugs fixed.

As reported in issue #50.
@kwgoodman
Copy link
Collaborator

The negative sqrt issue is fixed:

6fb6f73

@kwgoodman
Copy link
Collaborator

docstring is fixed and negative sqrt bug is fixed, so I'm closing this issue. You can open a new issue with the min_periods feature request if you like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants