ENH: Ensure that rolling_var of identical values is exactly zero #8271

Closed
wants to merge 1 commit

Conversation

jaimefrio
Contributor

Add a check to rolling_var for repeated observations, in order to
produce an exactly zero value of the variance when all entries are
identical. Related to the discussion in #7900
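
Roughly, the idea in pure Python (an illustrative sketch only; the actual change is to the streaming Cython loop in algos.roll_var): alongside the count of valid observations, track whether they are all identical, and in that case force the variance to exactly 0.0 instead of whatever the update formula left behind.

import numpy as np

def rolling_var_sketch(values, window, ddof=1):
    # Brute-force O(n * window) illustration; ignores min_periods and other
    # options the real pandas function supports.
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for i in range(window - 1, len(values)):
        obs = values[i - window + 1:i + 1]
        obs = obs[~np.isnan(obs)]
        if len(obs) <= ddof:
            continue
        if np.all(obs == obs[-1]):
            out[i] = 0.0                  # all non-NaN observations identical
        else:
            out[i] = obs.var(ddof=ddof)
    return out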

@jreback jreback added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Sep 15, 2014
@jreback jreback added this to the 0.15.0 milestone Sep 15, 2014
@jreback
Contributor

jreback commented Sep 15, 2014

looks fine

cc @seth-p ?

@jreback
Contributor

jreback commented Sep 15, 2014

@jaimefrio you are checking whether the most recent non-nan value is repeated. Is this TOO restrictive?
Is [1, 2, 1, 2] a problem as well? (e.g. isn't it the number of repeats that matters, and not their positioning?)

@behzadnouri
Contributor

You are making the code more complex, without adding any practical use.

var is by definition a floating point number. If your code depends on floating point numbers being exactly equal, then there is something wrong somewhere else.

Should there also be a check for when var is exactly 1.0? Or 2.0?

@jaimefrio
Contributor Author

Not sure I understand the question, do you mean the test or the code?

The only thing we are checking for in the code is whether all non-NaN observations in the window are identical (nobs == nrep), and then setting the variance to exactly 0 (ssqdm_x = 0; mean_x = rep). Your example [1, 2, 1, 2] does not fulfill that, so it will be calculated the same way it was before; I'm not sure what we could do differently for that particular case.

The test is designed to test several of the code paths that can lead to explicitly setting the variance to 0, to check they all work. The most relevant one is that things get properly reset if at some point there are no valid observations in the window. With the huge amount of testing for these functions already in place, it didn't seem necessary to add anything apart from this very specific one.
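
For instance, running the brute-force sketch above on a window where all valid observations leave (nobs drops to 0) and new, identical values then enter (the real test exercises rolling_var/algos.roll_var, of course):

>>> vals = np.array([5.0, 5.0, np.nan, np.nan, np.nan, 7.0, 7.0, 7.0])
>>> rolling_var_sketch(vals, window=3)
array([ nan,  nan,   0.,  nan,  nan,  nan,   0.,   0.])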

@behzadnouri
Contributor

If this patch goes in, I will submit mine for the case that var is exactly 1.0.

@jreback
Contributor

jreback commented Sep 15, 2014

@jaimefrio makes sense

@behzadnouri not sure of your point. This has to do with repeated observations, NOT a special case check on the actual values.

@jaimefrio
Contributor Author

@behzadnouri You may want to see the discussion in #7900 for some context. Not all floating point numbers are created equal, but even if they were, I'd argue that 0 is more equal than the others... ;-) And while the output is floating point, the input may very well be an integer sequence, or a fixed precision one, where exactly repeating values are not a theoretical construct, but the bread and butter of real world sequences.

The practical use of this is related to the rolling variance appearing as a factor in the denominator of several other expressions, namely the rolling correlation. Exactly detecting zero values in the denominator (and numerator) of those expressions is the only way of figuring out the right value for them, instead of returning a meaningless value by dividing by a value that is merely arbitrarily close to 0.

@behzadnouri
Contributor

@jaimefrio see this SO question. What is the variance of this array: [1.386 - 2.1*(0.66), 0, 0]? Did you know that 1.386 = 2.1*(0.66)? Does your patch fix this? Should we submit another patch for this as well?

@jreback the master branch already returns correct results for the added test, as far as floating point arithmetic is understood. [1.386 - 2.1*(0.66), 0, 0] is also a case of repeated observations.

@seth-p
Contributor

seth-p commented Sep 15, 2014

Overall I think the code looks good, but I have a few comments/questions:

  1. val is used both for each new input value (val = input[i]) as well as for the final result (val = ssqdm_x / (nobs - ddof), output[i] = val). This is slightly confusing. I'd suggest using different variables.
  2. I think the code is correct, but it's not immediately obvious that it properly handles the case where there are non-NaN values, then they all exit the window so that nobs decreases to 0, and then new values enter the window. The reason it's not immediately obvious to me is that when nobs decreases to 0, mean_x and ssqdm_x appear to be left in an imprecise state in lines 1315-1316 (maybe close to 0, but not necessarily identical?). I guess it's ok, since at this point nrep and nobs are both 0, so the next time a value is encountered both will be set to 1, and in lines 1296-1297 ssqdm_x and mean_x will be set appropriately. So I think the code is ok, but can you please double-check?
  3. Though I haven't checked, I think the code will still fail the commented-out test in _test_moments_consistency() for auto-correlations being either NaN or 1:
# self.assertTrue(_non_null_values(corr_x_x).issubset(set([1.]))) # restore once rolling_cov(x, x) is identically equal to var(x)

In order to pass this test, I think we need to implement algos.roll_cov() rather than algos.roll_var(), and implement rolling_var() and rolling_cov() in terms of algos.roll_cov()...
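
The identity being relied on here is var(x) == cov(x, x), so a single shared kernel would make rolling_var(x) and rolling_cov(x, x) bit-for-bit identical by construction. A brute-force stand-in to make that concrete (hypothetical names, nothing like the eventual Cython):

import numpy as np

def roll_cov_bruteforce(x, y, window, ddof=1):
    # Stand-in for a hypothetical algos.roll_cov: O(n * window), no NaN handling.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    out = np.full(len(x), np.nan)
    for i in range(window - 1, len(x)):
        out[i] = np.cov(x[i - window + 1:i + 1],
                        y[i - window + 1:i + 1], ddof=ddof)[0, 1]
    return out

def roll_var_via_cov(x, window, ddof=1):
    # var(x) == cov(x, x), so a single kernel keeps the two bit-for-bit identical.
    return roll_cov_bruteforce(x, x, window, ddof=ddof)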

@seth-p
Contributor

seth-p commented Sep 15, 2014

@behzadnouri, I think you're letting the perfect be the enemy of the good. No, this PR does not solve all floating-point inaccuracies. But there are certain basic identities that one wants to hold (see _test_moments_consistency() in test_moments.py), and if we can make those hold without overly complicating the code, then I think we should do so. As @jaimefrio points out, this is especially the case for variance calculations which (a) people expect to be non-negative (and which is addressed with a brute-force test); and (b) appear in the denominator of correlation calculations.

I'm in favor of including this PR, though as I mentioned above, think it would be even better to re-implement as algos.roll_cov().

@jreback
Contributor

jreback commented Sep 15, 2014

Is there a case where this DOES fail on current master? IOW, one that 'proves' (for the near-0 case) that this blows up because of numerical inaccuracies?

@seth-p
Contributor

seth-p commented Sep 15, 2014

Though I haven't checked, I presume the problem described in #7900 (comment) still remains in master until this PR is included.

@jaimefrio
Contributor Author

@seth-p

  1. Fair point, will change. It is tempting to just write directly to output[i], although the premature optimizer in me worries about the cost of the indexing in the subsequent check for non-negativeness. Guess I will go with an extra out variable...
  2. When nobs decreases to 0 so does nrep, so the path taken in the code is the first one, nobs == nrep, line 1294, which sets ssqdm_x = 0, and has a specific check for nobs == 0 (mean_x = rep if nobs > 0 else 0). So I really think this is working fine. That specific test case is part of the new test I added, although arguably I am only checking the exact zeros. Would it make you more comfortable if we also checked the values of the non-zero variances in that array?
  3. Rolling covariance, as calculated right now, is plagued by the same precision problems that rolling variance used to have:
>>> x = np.random.rand(10)
>>> y = np.random.rand(10)

>>> pd.rolling_cov(x, y, 5)
array([        nan,         nan,         nan,         nan, -0.03823918,
       -0.04136157,  0.00516019, -0.01712565, -0.00264345,  0.03213796])
>>> pd.rolling_cov(x+1e9, y+1e9, 5) # This is badly broken
array([  nan,   nan,   nan,   nan,    0., -320., -320., -160.,  160.,    0.])

>>> pd.rolling_var(x, 5)
array([        nan,         nan,         nan,         nan,  0.06955873,
        0.11229032,  0.1298793 ,  0.05767171,  0.10295796,  0.13571737])
>>> pd.rolling_var(x+1e9, 5) # This is still getting 6-7 decimal places right (!!!)
array([        nan,         nan,         nan,         nan,  0.06955874,
        0.11229033,  0.12987929,  0.05767168,  0.10295792,  0.13571733])

So yes, it makes plenty of sense to implement a proper rolling_cov function, regardless of the consistency issues. What I am not sure about is whether we can simply throw rolling_var out the window, as I think the performance impact will not be negligible: you need to independently check each array. I started looking into it a little last night and will try to follow up with a PR in the near future, although I am kind of busy lately. Not that you need my permission, but if you want to try your hand at it, you have my blessing; please go ahead!

I can also confirm that the commented-out test is failing, although the check may be a little too strict right now; here's a debugger session after the failure:

(Pdb) u
> c:\open_source\pandas\pandas\stats\tests\test_moments.py(733)_test_moments_consistency()
-> self.assertTrue(_non_null_values(corr_x_x).issubset(set([1.]))) # restore once rolling_cov(x, x) is identically equal to var(x)
(Pdb) p x
0    1
1    3
dtype: float64
(Pdb) p corr_x_x
0   NaN
1     1
dtype: float64
(Pdb) p corr_x_x[1]
0.99999999999999978

@seth-p
Contributor

seth-p commented Sep 15, 2014

@jaimefrio:

Re 2. That's fine. Just wanted to confirm, as it wasn't 100% obvious at first glance.

Re 3. This is what I did for ewm*(), i.e. implemented ewmcov() and ewmvar() in terms of algos.ewmcov(). I think the resulting cov() calculation is faster than the mean(x*y) - mean(x)*mean(y) method, and calculating var(x) = cov(x, x) is only negligibly slower than a dedicated var(x) calculation. Though obviously best to test...
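
(For anyone wondering why the one-pass formula is a problem: mean(x*y) - mean(x)*mean(y) cancels catastrophically for data with a large offset, which is exactly the x+1e9 failure shown above. A quick illustration, not taken from the PR:)

import numpy as np

x = np.random.rand(1000) + 1e9
y = np.random.rand(1000) + 1e9

# Both terms below are ~1e18, so their difference (the true covariance, of
# order 1e-3 here) is swamped by rounding error -- typically off by O(100).
one_pass = (x * y).mean() - x.mean() * y.mean()

# The centered (two-pass) form works on values of order 1 and keeps precision.
centered = ((x - x.mean()) * (y - y.mean())).mean()

print(one_pass, centered)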

@seth-p
Contributor

seth-p commented Sep 15, 2014

Re 3 continued. I don't think the test is too strict. I believe that ewmcorr() passes the test (after implementing algos.ewmcov()). Once you have var(x) == cov(x,x) identically, you should be guaranteed that corr(x,x) will be either NaN or 1...

@seth-p
Contributor

seth-p commented Sep 15, 2014

I may take a look at implementing algos.roll_cov(), though not sure I'll have time in the next week or two.

@jaimefrio
Contributor Author

Since rolling_var is already in place, it will be easy as pie to check whether rolling_var(x) is sufficiently faster than rolling_cov(x, x) once the latter is in place.

A question that may be relevant to the resulting performance of rolling_cov: once we have rolling_var and rolling_cov implemented using the same algorithm, will detecting exact zeros still be needed anywhere?

I think the relevant case to consider is something like:

>>> x = np.array([1,1,1,1,1])
>>> y = np.random.rand(5)

>>> pd.rolling_cov(x, y, 3)
array([ nan,  nan,   0.,   0.,   0.])
>>> pd.rolling_var(x, 3)
array([ nan,  nan,   0.,   0.,   0.])
>>> pd.rolling_corr(x, y, 3)
array([ nan,  nan,  nan,  nan,  nan])

If we remove the guarantee of exact zeros, I am not 100% sure that in a case like this rounding errors could not lead to a rolling_corr close to 1, but not exactly 1, even when the calculated value should be 0 / 0.
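
To make the worry concrete, the kind of guard being discussed would look roughly like this (an illustrative sketch, not the code that eventually went into #8326):

import numpy as np

def corr_from_moments(cov_xy, var_x, var_y):
    # With exact zeros guaranteed, a constant window shows up as var == 0 and
    # the 0/0 case can be reported honestly as NaN instead of noise / (almost 0).
    if var_x == 0.0 or var_y == 0.0:
        return np.nan
    return cov_xy / np.sqrt(var_x * var_y)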

@seth-p
Contributor

seth-p commented Sep 15, 2014

Yes, I think algos.roll_cov() will need to check if each of the series is constant (over the current window). Note that I do that in algos.ewmcov() -- though that's much easier since the window is only expanding (i.e. just if mean_x != cur_x and if mean_y != cur_y).
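
A rough sketch of that expanding-window trick (illustrative only, with made-up names): since an ewm window only grows, a series that has ever deviated from its first value can never become constant again, so a single flag per series suffices:

import numpy as np

def ewm_constant_flags(x):
    # flags[i] is True while every value seen up to and including i is identical.
    x = np.asarray(x, dtype=float)
    flags = np.empty(len(x), dtype=bool)
    is_constant = True
    for i, cur in enumerate(x):
        if i > 0 and cur != x[0]:
            is_constant = False           # once broken, it stays broken
        flags[i] = is_constant
    return flags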

@jaimefrio
Contributor Author

I have made the small style modification that @seth-p suggested and made another mostly style-related modification to the test. I think this is now ready to go.

@seth-p
Contributor

seth-p commented Sep 16, 2014

Looks fine to me.

@jaimefrio
Contributor Author

I have managed to put together a working Cython roll_cov function, but am not quite sure how to go about putting it into a PR. To avoid a rebase nightmare, it should wait until this and #8280 go in.

As a small teaser, I have also done some timings, and for large-ish arrays, doing roll_cov(x, x) is 1.5-2x slower than doing roll_var(x), so I vote for keeping roll_var around:

In [1]: x = np.random.rand(1e6)
In [2]: from pandas.algos import roll_var, roll_cov

In [4]: %timeit roll_var(x, 100, 1)
100 loops, best of 3: 19.2 ms per loop
In [5]: %timeit roll_cov(x, x, 100, 1)
10 loops, best of 3: 33.4 ms per loop

As implemented right now, roll_var and roll_cov are very, very similar, so the following nice result holds for random floating point arrays:

In [3]: np.array_equal(roll_var(x, 100, 1)[1:], roll_cov(x, x, 100, 1)[1:])
Out[3]: True

@behzadnouri
Contributor

@seth-p @jaimefrio instead of comforting each other, please do some reading.

You are making the code more complex and less efficient without any practical use because of your lack of understanding of floating point arithmetic.

@seth-p
Contributor

seth-p commented Sep 16, 2014

@behzadnouri, let's try to be constructive and not let the conversation degenerate to ad-hominem attacks.

What you seem not to acknowledge is that the difference between a variance of 0 and a variance of 10^-16 is materially/qualitatively different from the difference between a variance of 1 and a variance of 1 + 10^-16. If you do not agree, I'm happy to explain why I believe this is so.

If you do agree with my previous statement, perhaps you still think that properly handling the variance=0 case does not merit extra tests making the code slightly more complex. If this is the case, then I think we simply have an honest disagreement.

If you disagree with my belief that there's something special about a variance of 0 (whether or not it merits extra code), perhaps I can take you up on your offer to submit a PR "for the case that var is exactly 1.0." :-)
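
One concrete way to see the asymmetry (a small illustration, not taken from the PR): noise of 1e-16 around 0 turns an undefined 0/0 correlation into an arbitrary finite number, whereas the same absolute noise around 1 cannot even be represented in a double:

import numpy as np

# Noise of 1e-16 where the true variance is 0: the 0/0 correlation becomes an
# arbitrary finite number instead of NaN.
print(np.float64(1e-16) / np.sqrt(np.float64(1e-16) * 1.0))   # ~1e-08, pure noise
# With exact zeros, the same computation reports the undefined case honestly
# (numpy returns nan for 0.0/0.0, with a RuntimeWarning).
print(np.float64(0.0) / np.sqrt(np.float64(0.0) * 1.0))       # nan
# The same absolute noise around 1 is invisible: it is below half an ulp of 1.0.
print(1.0 + 1e-16 == 1.0)                                      # True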

@jreback
Contributor

jreback commented Sep 16, 2014

@behzadnouri

Pls keep this discussion civil; both @seth-p and @jaimefrio are smart guys. You are welcome to refute and criticize arguments and have honest disagreements. But no one wants personal attacks.

That said, I would encourage you to provide a counter-argument to @seth-p's points.

Python and pandas strive to be practical, efficient, simple, to have a nice API, and to give answers that are exact to the nth decimal place. Not all of these are possible at the same time.

We take different approaches to numerical stability, e.g. #6817 (for efficiency), and here #8002 to provide numerically stable values.

I think it is most practical to provide the user an exact 0 value when it is so close to 0 that the difference is immaterial. For 99.9% of use cases that will suffice. I am open to including a precision keyword to alleviate those cases where it doesn't.

@behzadnouri
Contributor

@jreback you are adding corner cases to the code. This will make it less maintainable, less efficient and more complex, and I still don't see any practical use.

for the record:

  • [1.386 - 2.1*(0.66), 0, .3 - .1 * 3] is repeated observations
  • links posted: 1, 2, 3

given that I had provided this example above:

>>> 1.386 - 2.1 * .66
-2.220446049250313e-16

but since we are still talking about a variance of 0 versus 10^-16, I will not comment on this PR further.

@jaimefrio
Contributor Author

@behzadnouri The problem we are trying to solve is the following, which I have just run with current master:

>>> x0 = np.array([0, 5, 5, 5])
>>> x1 = np.array([1, 5, 5, 5])
>>> y = np.random.rand(4)
>>> pd.rolling_corr(x0, y, 3)
array([        nan,         nan,  0.97657194,         nan])
>>> pd.rolling_corr(x1, y, 3)
array([        nan,         nan,  0.97657194,  0.        ])

I hope we agree that the last value should be "the same" in both cases. With floating point, there may be more than one valid definition of what "the same" exactly means, but I am not aware of any in which zero is the same as NaN! This huge instability in the computed values happens because the "true" result we are after is 0 / 0, and we (@seth-p and myself) don't think that "because floating point" is a sufficient justification for this function to have essentially undefined behavior, under a test case (rolling correlation with a sequence that happens to be constant over the window) that is far from a theoretical construct. The route we have chosen to fix it is to explicitly detect sequences that would produce exactly zero if computed with infinite precision, both in the denominator and the numerator. If you have a better idea, suggestions (or PRs!) are always welcome.

@jreback
Contributor

jreback commented Sep 18, 2014

@jaimefrio this is fine. Can you add a release note in v0.15.0.txt (API section), referencing this PR number (as there is no issue, I believe)? Also, do we need a note / warning in the docs and/or doc-string (maybe a note here?)

@jaimefrio
Contributor Author

I have mixed feelings about this... Although he had trouble presenting it, I kind of resonate with @behzadnouri's complaints about this approach. But regardless of philosophical principles, rolling_corr still needs a solution, so I have just sent #8326, which implements numerically stable versions of both roll_cov and roll_corr in algos.pyx. Neither roll_var nor roll_cov has exact zero detection over there, but roll_corr can both detect exact zeros in the denominator and detect exactly identical sequences.

Why vouch for this? Because it keeps roll_var and roll_cov simple and fast for their most general use cases, and only integrates the extra complexity where it is needed: in making roll_corr work properly under all circumstances. Looking at the branching mess going on in roll_corr and then looking at the infinitely cleaner roll_cov is I think the best argument.

So my vote is for closing this and making #8326 work. It is going to require some effort, mostly to get the center keyword to work on rolling_cov and rolling_corr without resorting to a unary rolling function's implementation of it. And the tests in place seem overly restrictive to me: we are testing for exact equality of floating point operations involving products, divisions and square roots. While we probably could make those tests pass (they currently don't), I don't think that is a goal we should aspire to.

I would really like to hear @jreback's and @seth-p's thoughts on this whole mess. Even if you think that the standalone roll_var and roll_cov functions should have exact zero detection built in, I'd rather put it in #8326, than do half the work here and the other half over there.

@jreback
Contributor

jreback commented Sep 19, 2014

closing this in favor of fixing up with #8326

@jreback jreback closed this Sep 19, 2014
@seth-p
Contributor

seth-p commented Sep 19, 2014

I don't think center is a problem, as that is handled in the rolling_* functions by appending the requisite number of NaN values.
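
For reference, that padding trick looks roughly like this (a hypothetical illustration of the approach, not the actual pandas wrapper code): run the trailing-window computation on input extended with NaNs, then drop the leading offset so each output lines up with the middle of its window:

import numpy as np

def center_rolling(roll_func, values, window):
    # roll_func computes a trailing-window statistic (e.g. the sketches above).
    offset = (window - 1) // 2
    padded = np.concatenate([np.asarray(values, dtype=float),
                             np.full(offset, np.nan)])
    return roll_func(padded, window)[offset:]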
