-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator #17085
[SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator #17085
Conversation
Test build #73528 has finished for PR 17085 at commit
|
@sethah @Lewuathe @thunterdb @WeichenXu123 @jkbradley @actuaryzhang @srowen would you be able to take a look? I've split the larger pull request into three parts as suggested. |
ping @sethah @Lewuathe @thunterdb @WeichenXu123 @jkbradley @actuaryzhang @srowen could you please take a look? thank you! |
48800eb
to
d5acd46
Compare
Test build #89406 has finished for PR 17085 at commit
|
d5acd46
to
17c1626
Compare
Jenkins retest this please |
Test build #89414 has finished for PR 17085 at commit
|
17c1626
to
aca6255
Compare
ping @sethah @WeichenXu123 @jkbradley @actuaryzhang @srowen could you please take a look? I've updated the PR to latest and made it similar to the multiclass PR that was merged: #17086 |
Test build #99947 has finished for PR 17085 at commit
|
Test build #99948 has finished for PR 17085 at commit
|
Test build #99952 has finished for PR 17085 at commit
|
Test build #99946 has finished for PR 17085 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks OK to me. What about the classification evaluators? is there the same meaningful notion of weights? I'd imagine it's possible but not sure I've seen weighted accuracy, etc.
@@ -52,7 +52,7 @@ class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with S | |||
private var totalCnt: Long = 0 | |||
private var totalWeightSum: Double = 0.0 | |||
private var weightSquareSum: Double = 0.0 | |||
private var weightSum: Array[Double] = _ | |||
private var currWeightSum: Array[Double] = _ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I don't think the rename was necessary, but it is OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, it looks like the build failed because the private variable conflicts with the public variable that was defined:
/**
- Sum of weights.
*/
override def weightSum: Double = totalWeightSum
I think this may be the best name for the public variable so I would prefer to keep it. The private variable now follows the naming convention of the other private array variables so I think this makes sense.
@srowen yes, exactly, there is a third PR here for classification: #17084 The original PR had all three but it was recommended that I break it up into 3 parts so I closed it and opened three separate PRs: |
0cb2daf
to
f708edb
Compare
Test build #99982 has finished for PR 17085 at commit
|
Test build #99984 has finished for PR 17085 at commit
|
Merged to master |
…ed weight column for regression evaluator ## What changes were proposed in this pull request? The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data. I've closed the PR: apache#16557 as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update. The updates to the regression metrics were based on (and updated with new changes based on comments): https://issues.apache.org/jira/browse/SPARK-11520 ("RegressionMetrics should support instance weights") but the pull request was closed as the changes were never checked in. ## How was this patch tested? I added tests to the metrics class. Closes apache#17085 from imatiach-msft/ilmat/regression-evaluate. Authored-by: Ilya Matiach <ilmat@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
…ed weight column for regression evaluator ## What changes were proposed in this pull request? The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data. I've closed the PR: apache#16557 as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update. The updates to the regression metrics were based on (and updated with new changes based on comments): https://issues.apache.org/jira/browse/SPARK-11520 ("RegressionMetrics should support instance weights") but the pull request was closed as the changes were never checked in. ## How was this patch tested? I added tests to the metrics class. Closes apache#17085 from imatiach-msft/ilmat/regression-evaluate. Authored-by: Ilya Matiach <ilmat@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.
I've closed the PR: #16557
as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update.
The updates to the regression metrics were based on (and updated with new changes based on comments):
https://issues.apache.org/jira/browse/SPARK-11520
("RegressionMetrics should support instance weights")
but the pull request was closed as the changes were never checked in.
How was this patch tested?
I added tests to the metrics class.