Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #37303 - Invalid variance fix #37384

Merged
merged 4 commits into from
Jan 25, 2019
Merged

Conversation

vishnugt
Copy link
Contributor

@colings86 colings86 added the :Analytics/Aggregations Aggregations label Jan 14, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

@polyfractal
Copy link
Contributor

Thanks @vishnugt! Just had a quick skim, would you mind adding a few tests that shows each aggregator does the right thing in this scenario? That way we don't accidentally break this functionality in the future.

After the tests are added I'll kick off a CI build and give it a more thorough review. Thanks again!

@vishnugt
Copy link
Contributor Author

vishnugt commented Jan 14, 2019

Hey @polyfractal, I have added a test case for the getVariance method in InternalExtendedStats class, which seems to be the core part.
The reason I've added the condition in other places is just to make sure the variance won't be negative.

Regarding the test case, I had to hardcode a couple of values, because I couldn't reliably generate the expected result in any particular range.

@polyfractal
Copy link
Contributor

Great, thanks @vishnugt! I think hard-coding the values is perfectly fine in this case, it's a very specific scenario that's being fixed. Kicking off a CI build!

@elasticmachine test this please

@polyfractal
Copy link
Contributor

@elasticmachine test this please

@polyfractal
Copy link
Contributor

Heya @vishnugt, we're having some test troubles in CI unrelated to this PR. I'll kick off another build once we resolve the CI stuff, because otherwise it's going to just keep failing for unrelated reasons. Sorry for the delay!

@vishnugt
Copy link
Contributor Author

Hey @polyfractal, thanks for letting me know that its an issue unrelated to this PR.

@polyfractal
Copy link
Contributor

@elasticmachine test this please

@polyfractal
Copy link
Contributor

Heya @vishnugt, CI has cleared up a bit. Would you mind merging master into the PR and I'll kick off another set of builds?

Thanks!

@vishnugt
Copy link
Contributor Author

Hey @polyfractal, I just merged the master into the PR, would you mind kicking off a CI build?

Thanks!

@polyfractal
Copy link
Contributor

@elasticmachine test this please

@polyfractal
Copy link
Contributor

Hmm I agree, I think it's related to #37792

Would you mind merging in most recent master once more? On that note, if you'd like I can help out merging in master to your branch in the future, since CI can be finicky at times. Would help expedite getting this in :)

@vishnugt
Copy link
Contributor Author

Hey @polyfractal, I just merged this with the recent master.

I would be more than happy to let you help me with merging, I sent you a collaborator request for my fork of elasticsearch, though I'm not quite sure if this is what you meant. Please let me know if I have to do anything else on my side to let you help merge.

@polyfractal
Copy link
Contributor

@elasticmachine test this please

I would be more than happy to let you help me with merging, I sent you a collaborator request for my fork of elasticsearch, though I'm not quite sure if this is what you meant. Please let me know if I have to do anything else on my side to let you help merge.

Ah oops, my fault. All that's needed is a checkbox on the PR itself that allows repo maintainers to push to the PR (it should be checked by default, I just like to double-check with contributors so they know what's going on, when new commits show up)

Will help with merging and getting this passed, thanks!

@vishnugt
Copy link
Contributor Author

Ah okay, thanks for clarifying!
And yay, it passed all the checks this time!

@polyfractal
Copy link
Contributor

adcc08a87d810d4d

Merging! Thanks for your patience with our CI, it's had a rough week :)

@polyfractal polyfractal merged commit 27c3fb8 into elastic:master Jan 25, 2019
polyfractal pushed a commit that referenced this pull request Jan 25, 2019
Due to floating point error, it was possible for variances to become negative which should never happen.  This bugfix sets variance to zero if it becomes negative as a result of fp error.
@vishnugt
Copy link
Contributor Author

Thank you so much, it's my first PR, hope to send more :)

Thanks for helping till the end!

@polyfractal
Copy link
Contributor

No problem, you did all the hard work :) Looking forward to future PRs!

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 25, 2019
* elastic/master: (68 commits)
  Fix potential IllegalCapacityException in LLRC when selecting nodes (elastic#37821)
  Mute CcrRepositoryIT#testFollowerMappingIsUpdated
  Fix S3 Repository ITs When Docker is not Available (elastic#37878)
  Pass distribution type through to docs tests (elastic#37885)
  Mute SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard
  SQL: Fix casting from date to numeric type to use millis (elastic#37869)
  Migrate o.e.i.r.RecoveryState to Writeable (elastic#37380)
  ML: removing unnecessary upgrade code (elastic#37879)
  Relax cluster metadata version check (elastic#37834)
  Mute TransformIntegrationTests#testSearchTransform
  Refactored GeoHashGrid unit tests (elastic#37832)
  Fixes for a few randomized agg tests that fail hasValue() checks
  Geo: replace intermediate geo objects with libs/geo (elastic#37721)
  Remove NOREPLACE for /etc/elasticsearch in rpm and deb (elastic#37839)
  Remove "reinstall" packaging tests (elastic#37851)
  Add unit tests for ShardStateAction's ShardStartedClusterStateTaskExecutor (elastic#37756)
  Exit batch files explictly using ERRORLEVEL (elastic#29583)
  TransportUnfollowAction should increase settings version (elastic#37859)
  AsyncTwoPhaseIndexerTests race condition fixed (elastic#37830)
  Do not allow negative variances (elastic#37384)
  ...
@colings86 colings86 removed the v7.0.0 label Feb 7, 2019
@mysticaltech
Copy link

mysticaltech commented Jun 17, 2020

Hey folks, I am sorry to say this is just a patch for a totally wrong variance calculation method. The sum of square method used here should not be used for floating point calculations as it introduces a phenomenon called catastrophic cancellation hence the negative values, instead you should perform calculation with the simple Welford algorithm as presented here https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

@polyfractal
Copy link
Contributor

Heya @mysticaltech, thanks for the note! I mentioned this in a different ticket (#50416 (comment)), but forgot to open a followup ticket to change the algo. Thanks for the ping, I'll open one to track this particular enhancement 👍

@mysticaltech
Copy link

Thanks @polyfractal, that's great, if I knew enough about Elastic Search I would try to help, the Welford algorithm is quite simple and very stable numerically. If it can help, here's a Javascript implementation adapted from the wikipedia page python one https://gist.github.com/mysticaltech/7ceb16eaa6ad00209e6fd87a55785991

@polyfractal
Copy link
Contributor

Thanks @mysticaltech :) I suspect the hard part will actually be unrelated to the algo entirely, it'll be around dealing with backwards compat with older versions (since we'll need to support the old algo serialization, etc). Cheers!

@mysticaltech
Copy link

mysticaltech commented Jun 18, 2020

@polyfractal Yes I understand, it makes sense. But the variance should end up being the same, unless for cases where there are too much error and folks adapted to that. That said, I fully agree that such a change would be too risky (even though theoretically it shouldn't be).

So I would suggest keeping all variance methods by default using the sum of squares, and adding an optional flag to switch to Welford. This way, folks will have to manually specify Welford.

And if the M2 moment aggregate calculation is added, this can be done in one-pass if I'm not mistaken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid variance computed in extended_stats aggregation
5 participants