
SPARK-1583: Fix a bug that using java.util.HashMap by mistake #500

Closed
wants to merge 1 commit

Conversation

@zsxwing (Member) commented Apr 23, 2014

JIRA: https://issues.apache.org/jira/browse/SPARK-1583

Does anyone know why `java.util.HashMap` is used here rather than `mutable.HashMap`? Some methods of `java.util.HashMap` are not generic, so the compiler cannot help us find similar problems.
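
A minimal sketch (not part of this patch; the map contents and names are purely illustrative) of the pitfall being described: `java.util.HashMap#get` takes `Object`, so a lookup with a wrongly typed key still compiles and silently returns `null`, while `mutable.HashMap` is fully generic and rejects the same mistake at compile time.

```scala
import scala.collection.mutable

object HashMapLookupSketch {
  def main(args: Array[String]): Unit = {
    // java.util.HashMap: get(Object) accepts a key of any type without complaint.
    val byStageId = new java.util.HashMap[Integer, String]()
    byStageId.put(1, "ShuffleMapStage")
    println(byStageId.get("1")) // compiles, prints null -- a silent lookup miss
    println(byStageId.get(1))   // prints ShuffleMapStage

    // mutable.HashMap: get(key: Int) is generic, so the same mistake fails to compile.
    val scalaByStageId = mutable.HashMap[Int, String](1 -> "ShuffleMapStage")
    // scalaByStageId.get("1")  // type mismatch: the compiler catches it
    println(scalaByStageId.get(1)) // prints Some(ShuffleMapStage)
  }
}
```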

@AmplabJenkins

Can one of the admins verify this patch?

@rxin (Contributor) commented Apr 23, 2014

Jenkins, test this please.

@rxin (Contributor) commented Apr 23, 2014

Mostly because Java's HashMap is faster than Scala's...

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14360/

@zsxwing (Member, Author) commented Apr 23, 2014

If I understand correctly, `SaveStageAndTaskInfo` in `SparkListenerSuite` lacks appropriate synchronization to guarantee memory visibility. That's why this test fails sometimes.
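
A hypothetical sketch of the visibility problem being described (not the actual `SparkListenerSuite` code; the field and method names are invented): the listener updates its state on the listener-bus thread while the test asserts from the test thread, so without a memory barrier the test thread may read stale state.

```scala
// Hypothetical illustration only.
class SaveStageAndTaskInfoSketch {
  // @volatile (or synchronizing both the write and the read) publishes the update
  // across threads; a plain var gives the test thread no visibility guarantee.
  @volatile private var stageToTaskCount = Map.empty[Int, Int]

  // Runs on the listener-bus thread.
  def onStageCompleted(stageId: Int, numTasks: Int): Unit =
    stageToTaskCount += (stageId -> numTasks)

  // Read from the test thread when asserting.
  def recordedStages: Map[Int, Int] = stageToTaskCount
}
```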

andy327 pushed a commit to tresata-opensource/spark that referenced this pull request Apr 23, 2014
I'm back with another less trivial suggestion for ALS:

In ALS for implicit feedback, input values are treated as weights on squared errors in a loss function (or rather, the weight is a simple function of the input r, like c = 1 + alpha*r). The paper on which it's based assumes that the input is positive. Indeed, if the input is negative, it will create a negative weight on squared errors, which causes things to go haywire. The optimization will try to make the error in a cell as large as possible, and the result is silently bogus.

There is a good use case for negative input values though. Implicit feedback is usually collected from signals of positive interaction like a view or like or buy, but equally, can come from "not interested" signals. The natural representation is negative values.

The algorithm can be extended quite simply to provide a sound interpretation of these values: negative values should encourage the factorization to come up with 0 for cells with large negative input values, just as much as positive values encourage it to come up with 1.

The implications for the algorithm are simple:
* the confidence function value must not be negative, and so can become 1 + alpha*|r|
* the matrix P should have a value 1 where the input R is _positive_, not merely where it is non-zero. Actually, that's what the paper already says, it's just that we can't assume P = 1 when a cell in R is specified anymore, since it may be negative

This in turn entails just a few lines of code change in `ALS.scala`:
* `rs(i)` becomes `abs(rs(i))`
* When constructing `userXy(us(i))`, it implicitly only adds where P is 1. That had been true before for any `us(i)` that is iterated over, since those were exactly the ones for which P is 1. But now P is zero where `rs(i) <= 0`, and those entries should not be added

I think it's a safe change because:
* It doesn't change any existing behavior (unless you're using negative values, in which case results are already borked)
* It's the simplest direct extension of the paper's algorithm
* (I've used it to good effect in production FWIW)

Tests included.

I tweaked minor things en route:
* `ALS.scala` javadoc writes "R = Xt*Y" where the paper and the rest of the code define it as "R = X*Yt"
* RMSE in the ALS tests uses a confidence-weighted mean, but the denominator is not actually the sum of the weights

Excuse my Scala style; I'm sure it needs tweaks.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#500 from srowen/ALSNegativeImplicitInput and squashes the following commits:

cf902a9 [Sean Owen] Support negative implicit input in ALS
953be1c [Sean Owen] Make weighted RMSE in ALS test actually weighted; adjust comment about R = X*Yt
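
A minimal sketch of the two changes described in the commit message above (not the actual `ALS.scala` code; all names here are illustrative): the confidence weight uses |r| so it never goes negative, the preference is 1 only where the raw input is strictly positive, and the weighted RMSE helper puts the sum of weights in the denominator, mirroring the test fix.

```scala
object ImplicitAlsSketch {
  // Confidence must stay non-negative, so it becomes 1 + alpha * |r|;
  // the preference P is 1 only where the raw input r is strictly positive.
  def confidenceAndPreference(r: Double, alpha: Double): (Double, Double) =
    (1.0 + alpha * math.abs(r), if (r > 0) 1.0 else 0.0)

  // Confidence-weighted RMSE with the sum of weights as the denominator.
  def weightedRmse(errors: Seq[Double], weights: Seq[Double]): Double = {
    val weightedSquares = errors.zip(weights).map { case (e, w) => w * e * e }.sum
    math.sqrt(weightedSquares / weights.sum)
  }
}

// Example: r = -3 with alpha = 1.0 gives confidence 4.0 pulling the prediction toward preference 0.0.
```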
@rxin (Contributor) commented Apr 23, 2014

Jenkins, retest this please.

@rxin (Contributor) commented Apr 23, 2014

Thanks @zsxwing. I've restarted the test. Do you have time to fix that flaky test?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14382/

@rxin (Contributor) commented Apr 23, 2014

Thanks. I've merged this.

@asfgit closed this in a664606 on Apr 23, 2014
asfgit pushed a commit that referenced this pull request Apr 23, 2014
JIRA: https://issues.apache.org/jira/browse/SPARK-1583

Does anyone know why `java.util.HashMap` is used rather than `mutable.HashMap`? Some methods of `java.util.HashMap` are not generic, so the compiler cannot help us find similar problems.

Author: zsxwing <zsxwing@gmail.com>

Closes #500 from zsxwing/SPARK-1583 and squashes the following commits:

7bfd74d [zsxwing] SPARK-1583: Fix a bug that using java.util.HashMap by mistake

(cherry picked from commit a664606)
Signed-off-by: Reynold Xin <rxin@apache.org>
@zsxwing (Member, Author) commented Apr 24, 2014

> Do you have time to fix that flaky test?

Sure. I need some time to confirm my guess.

@rxin (Contributor) commented Apr 24, 2014

Actually it's probably fixed here already: https://github.com/apache/spark/pull/516/files

@zsxwing (Member, Author) commented Apr 24, 2014

Looks great.

@zsxwing deleted the SPARK-1583 branch on May 18, 2014, 09:50
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
yifeih pushed a commit to yifeih/spark that referenced this pull request Mar 5, 2019
This PR reverts to using Scala 2.11

* Revert "Fix distribution publish to scala 2.12 apache#478"
* Revert "[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0"
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
* Refactor for logs and results arch directory

We now have multiple different custom log paths across
different OpenLab jobs.

This patch builds a consistent mechanism and usage in
order to avoid confusing end users and developers:

Add $LOGS_PATH and $RESULTS_PATH global env vars.
Prepare {{ ansible_user_dir }}/workspace/logs
and {{ ansible_user_dir }}/workspace/test_results.
All log files (like debug logs) should be stored in $LOGS_PATH,
and the final test results (like binaries and artifacts) should be
stored in $RESULTS_PATH.

Close: theopenlab/openlab#238
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020