Descale feature contribution for Linear Regression & Logistic Regression #345
Conversation
… & the features are standardized during training
Codecov Report
@@ Coverage Diff @@
## master #345 +/- ##
==========================================
+ Coverage 86.79% 86.79% +<.01%
==========================================
Files 336 336
Lines 10895 10921 +26
Branches 347 570 +223
==========================================
+ Hits 9456 9479 +23
- Misses 1439 1442 +3
@TuanNguyen27 can you also please explain what problem you are solving in the PR description?
Please add some more tests to cover the new scenarios.
case Continuous(_, _, _, variance) => math.sqrt(variance)
// for (binary) logistic regression we only need to multiply by feature standard deviation
case Discrete(domain, prob) =>
  def computeVariance(domain: Seq[String], prob: Seq[Double]): Double = {
you don't need the inner function and you can do it faster. Simply do:
case Discrete(domain, prob) =>
  val (weighted, sqweighted) = (domain zip prob).foldLeft((0.0, 0.0)) { case ((weightSum, sqweightSum), (d, p)) =>
    val floatD = d.toDouble
    val weight = floatD * p
    val sqweight = floatD * weight
    (weightSum + weight, sqweightSum + sqweight)
  }
  sqweighted - weighted * weighted
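As a standalone sketch of that single-pass variance computation (Var(X) = E[X^2] - E[X]^2; the function name and the sanity check below are mine, not from the PR):

```scala
// Single-pass variance of a discrete distribution over a string-typed
// domain: Var(X) = E[X^2] - E[X]^2, accumulated with one foldLeft.
def discreteVariance(domain: Seq[String], prob: Seq[Double]): Double = {
  val (mean, meanSq) = domain.zip(prob).foldLeft((0.0, 0.0)) {
    case ((m, m2), (d, p)) =>
      val x = d.toDouble // assumes every domain value parses as a number
      (m + x * p, m2 + x * x * p)
  }
  meanSq - mean * mean
}

// A fair coin over {0, 1}: mean 0.5, variance 0.25.
println(discreteVariance(Seq("0", "1"), Seq(0.5, 0.5)))  // 0.25
```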
Another question: since `domain` is `Seq[String]`, how can we be sure that `d.toDouble` won't throw an error?! Should we handle it gracefully?
For regression problems where there are too few unique labels (and they get treated as categoricals), `d.toDouble` should work fine. I'm not sure how it will behave on classification. For classification I assume that `domain = Array("0", "1")`, is this correct?
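If we did want to guard the parse, one graceful option (purely a sketch of mine, not what the PR does) is `scala.util.Try`, which turns non-numeric domain values into `None` instead of a thrown `NumberFormatException`:

```scala
import scala.util.Try

// Defensive parse of a string-typed label domain: numeric entries
// become Some(value), anything unparseable becomes None.
def parseDomain(domain: Seq[String]): Seq[Option[Double]] =
  domain.map(d => Try(d.toDouble).toOption)

println(parseDomain(Seq("0", "1", "")))  // List(Some(0.0), Some(1.0), None)
```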
Why did we decide to define `domain` as `Seq[String]` in the first place? @Jauntbox @leahmcguire
If you have the raw label rather than the indexed label you will get strings - this was designed to support that
Is it safe to do `toDouble` in this case?
Any value which is not numeric will throw an exception, e.g. `"".toDouble`.
But will there ever be a non-numeric string in the label field...? I thought they should be filtered out.
Just fix the error message and then LGTM
One small comparison thing I'd like you to check, then LGTM.
I don't want to hold this up, lgtm
Thanks for the contribution! Before we can merge this, we need @tovbinm to sign the Salesforce.com Contributor License Agreement.
Thanks for the contribution! It looks like @leahmcguire is an internal user so signing the CLA is not required. However, we need to confirm this.
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)
Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
Problem context
Spark returns feature contribution on the original scale of the feature, making it hard to compare relative importance between two features of different scales. During training for linear regression, Spark normalizes features and label to zero mean and unit variance, and we want to display feature contribution (which is the model's coefficient) on this normalized scale.
Note: Spark does the same standardization for logistic regression, but it is unclear if the same problem applies to feature contribution for logistic regression.
Describe the proposed solution
Descale feature contribution by multiplying Spark's returned weight by the respective feature's standard deviation and dividing by the label's standard deviation. This change only touches ModelInsights and does not affect scoring.
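The descaling itself amounts to a one-line rescaling of each coefficient. A minimal sketch for the linear-regression case (function and parameter names are mine; the actual change lives in ModelInsights):

```scala
// Rescale a coefficient Spark reports on the original scale so that
// contributions are comparable across features: multiply by the
// feature's standard deviation and divide by the label's.
def descaleWeight(rawWeight: Double, featureStd: Double, labelStd: Double): Double =
  rawWeight * featureStd / labelStd

// A weight of 2.0 on a feature with sd 0.5, against a label with sd 4.0:
println(descaleWeight(2.0, 0.5, 4.0))  // 0.25
```

For (binary) logistic regression there is no label standard deviation to divide by, which matches the code comment above about only multiplying by the feature standard deviation.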
Describe alternatives you've considered
N/A. The descaling needs access to the best trained model, label summary stats & feature summary stats.