[SPARK-4355][MLLIB] fix OnlineSummarizer.merge when other.mean is zero #3220

Closed
wants to merge 1 commit into from

Conversation

mengxr
Contributor

@mengxr mengxr commented Nov 12, 2014

See the inline comment about the bug. I also did some code clean-up. @dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause a performance regression, but it would be great if you have some time to test.

var i = 0
while (i < n) {
  // merge mean together
  if (other.currMean(i) != 0.0) {
Contributor Author

This check is wrong: even when other.currMean(i) is 0.0, we still need to take its weight into account when merging the means.
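
To make the failure mode concrete, here is a minimal sketch of a weighted mean merge for a single column. It is not the code from this patch: `ColumnMean`, `mergeBuggy`, and `mergeFixed` are hypothetical names, and `mergeBuggy` is only an assumed reconstruction of the behaviour the check above leads to.

```scala
object MergeMeanSketch {
  // Hypothetical stand-in for one column's state: mean of the values seen
  // so far and how many values contributed to it (the weight).
  final case class ColumnMean(mean: Double, nnz: Double)

  // Assumed reconstruction of the bug: the mean is left untouched when
  // other.mean == 0.0, even though other still carries weight.
  def mergeBuggy(a: ColumnMean, b: ColumnMean): ColumnMean = {
    val total = a.nnz + b.nnz
    if (b.mean != 0.0) ColumnMean((a.mean * a.nnz + b.mean * b.nnz) / total, total)
    else ColumnMean(a.mean, total)
  }

  // Weighted merge that only skips the update when there is no weight at all.
  def mergeFixed(a: ColumnMean, b: ColumnMean): ColumnMean = {
    val total = a.nnz + b.nnz
    if (total != 0.0) {
      val delta = b.mean - a.mean
      ColumnMean(a.mean + delta * b.nnz / total, total)
    } else a
  }
}
```

For example, merging a column that saw the values 1 and 3 (mean 2.0, weight 2) with one that saw -1 and 1 (mean 0.0, weight 2) should give a mean of 1.0. `mergeFixed` returns that, while `mergeBuggy` keeps the mean at 2.0.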

@SparkQA

SparkQA commented Nov 12, 2014

Test build #23252 has started for PR 3220 at commit 5ef601f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 12, 2014

Test build #23252 has finished for PR 3220 at commit 5ef601f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23252/

 * Adds input value to position i.
 */
private[this] def add(i: Int, value: Double) = {
  if (value != 0.0) {
Member

Sorry if this is a dumb question, and this isn't a change in this PR, but why can't a sample of value 0 be added to the summarizer?

Member

You can add it and get the same result. However, it's computationally cheaper not to add zeros to the summarizer.

Member

0 affects the mean, and could affect min/max, right?

Member

Yes. However, we know the total number of samples and the number of nonzeros in each column. If the two differ for a column, at least one zero was skipped, so when the observed min is some positive number, the actual min is zero.

The same logic applies to max: if the observed max is negative and zeros were skipped, the actual max is zero.

For mean, we can correct for the skipped zeros with realMean(i) = currMean(i) * (nnz(i) / totalCnt).

As a result, for a sparse dataset we only need to add the nonzeros to the summarizer, which costs O(\bar{n}) per sample instead of O(n), where \bar{n} is the average number of nonzero elements per sample.
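
The corrections described above can be sketched for a single column as follows. This is an illustration under stated assumptions rather than the actual `MultivariateOnlineSummarizer` code: `NonzeroStats` and the helper names are hypothetical, and `totalCnt` is assumed to be positive.

```scala
object SparseStatsSketch {
  // Hypothetical per-column accumulator that only ever sees nonzero values.
  final case class NonzeroStats(mean: Double, min: Double, max: Double, nnz: Long)

  val empty: NonzeroStats =
    NonzeroStats(0.0, Double.PositiveInfinity, Double.NegativeInfinity, 0L)

  // Zeros are skipped entirely; nonzeros update a running mean, min and max.
  def add(s: NonzeroStats, value: Double): NonzeroStats =
    if (value == 0.0) s
    else {
      val n = s.nnz + 1
      NonzeroStats(s.mean + (value - s.mean) / n,
        math.min(s.min, value), math.max(s.max, value), n)
    }

  // realMean(i) = currMean(i) * (nnz(i) / totalCnt); assumes totalCnt > 0.
  def realMean(s: NonzeroStats, totalCnt: Long): Double =
    s.mean * (s.nnz.toDouble / totalCnt)

  // If some samples were zeros, a positive observed min really means min = 0.
  def realMin(s: NonzeroStats, totalCnt: Long): Double =
    if (s.nnz < totalCnt && s.min > 0.0) 0.0 else s.min

  // Symmetrically, a negative observed max really means max = 0.
  def realMax(s: NonzeroStats, totalCnt: Long): Double =
    if (s.nnz < totalCnt && s.max < 0.0) 0.0 else s.max
}
```

With the column values 0, 2, 0, 4 the accumulator sees only 2 and 4 (mean 3.0, nnz 2), and the corrected statistics for totalCnt = 4 come out as mean 1.5, min 0.0, and max 4.0.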

Member

Ah right, I get it now.

@dbtsai
Member

dbtsai commented Nov 12, 2014

LGTM. Thanks.

@mengxr
Contributor Author

mengxr commented Nov 12, 2014

@dbtsai Thanks! I've merged this into master and branch-1.2. I will send patches for branch-1.0 and branch-1.1 later.

@dbtsai
Member

dbtsai commented Nov 12, 2014

Thanks.

@asfgit asfgit closed this in 84324fb Nov 12, 2014
asfgit pushed a commit that referenced this pull request Nov 12, 2014
See inline comment about the bug. I also did some code clean-up. dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause performance regression, but it would be great if you have some time to test.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3220 from mengxr/SPARK-4355 and squashes the following commits:

5ef601f [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero and some code clean-up

(cherry picked from commit 84324fb)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 14, 2014
andrewor14 This backports the bug fix in #3220. It would be good if we can get it in 1.1.1. But this is minor.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3251 from mengxr/SPARK-4355-1.1 and squashes the following commits:

33886b6 [Xiangrui Meng] Merge remote-tracking branch 'apache/branch-1.1' into SPARK-4355-1.1
91fe1a3 [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero
asfgit pushed a commit that referenced this pull request Mar 9, 2015
…n correctly

This backports the bug fix in #3220.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3850 from mengxr/SPARK-4355-1.0 and squashes the following commits:

ae9b94a [Xiangrui Meng] ColumnStatisticsAggregator doesn't merge mean correctly